This is a note to let you know that I've just added the patch titled
sched: Make resched_cpu() unconditional
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
sched-make-resched_cpu-unconditional.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney" <paulmck(a)linux.vnet.ibm.com>
Date: Mon, 18 Sep 2017 08:54:40 -0700
Subject: sched: Make resched_cpu() unconditional
From: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.
The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not. This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):
o CPU1 is waiting for expedited wait to complete:
    sync_rcu_exp_select_cpus
        rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
        IPI sent to CPU5
    synchronize_sched_expedited_wait
        ret = swait_event_timeout(rsp->expedited_wq,
                                  sync_rcu_preempt_exp_done(rnp_root),
                                  jiffies_stall);
    expmask = 0x20, CPU5 in idle path (in cpuidle_enter())
o CPU5 handles IPI and fails to acquire rq lock.
    Handles IPI
        sync_sched_exp_handler
            resched_cpu
                returns after its trylock fails to acquire rq->lock
            need_resched is not set
o CPU5 calls rcu_idle_enter() and, as need_resched is not set, goes to
  idle (schedule() is not called).
o CPU1 reports RCU stall.
Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.
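To illustrate the difference outside the kernel, here is a minimal
userspace sketch, not the kernel implementation. The names fake_rq,
resched_trylock() and resched_lock() are hypothetical, and a pthread
mutex stands in for rq->lock. It only shows that a trylock-based
request can be dropped silently under contention, while a blocking
lock always leaves the flag set.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a per-CPU runqueue. */
struct fake_rq {
	pthread_mutex_t lock;	/* plays the role of rq->lock */
	bool need_resched;	/* plays the role of the resched flag */
};

/* Old behaviour: give up silently when the lock is contended. */
void resched_trylock(struct fake_rq *rq)
{
	assert(rq);
	if (pthread_mutex_trylock(&rq->lock) != 0)
		return;		/* the reschedule request is lost */
	rq->need_resched = true;
	pthread_mutex_unlock(&rq->lock);
}

/* New behaviour: wait for the lock, so the flag is always set. */
void resched_lock(struct fake_rq *rq)
{
	assert(rq);
	pthread_mutex_lock(&rq->lock);
	rq->need_resched = true;
	pthread_mutex_unlock(&rq->lock);
}
```

Calling resched_trylock() while the lock is held leaves need_resched
false, which corresponds to the lost request in the analysis above.
resched_lock() sets the flag as soon as it obtains the lock.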
Reported-by: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Suggested-by: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
kernel/sched/core.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -507,8 +507,7 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
- if (!raw_spin_trylock_irqsave(&rq->lock, flags))
- return;
+ raw_spin_lock_irqsave(&rq->lock, flags);
resched_curr(rq);
raw_spin_unlock_irqrestore(&rq->lock, flags);
}
Patches currently in stable-queue which might be from paulmck(a)linux.vnet.ibm.com are
queue-4.9/sched-make-resched_cpu-unconditional.patch
This is a note to let you know that I've just added the patch titled
lib/mpi: call cond_resched() from mpi_powm() loop
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 1d9ddde12e3c9bab7f3d3484eb9446315e3571ca Mon Sep 17 00:00:00 2001
From: Eric Biggers <ebiggers(a)google.com>
Date: Tue, 7 Nov 2017 14:15:27 -0800
Subject: lib/mpi: call cond_resched() from mpi_powm() loop
From: Eric Biggers <ebiggers(a)google.com>
commit 1d9ddde12e3c9bab7f3d3484eb9446315e3571ca upstream.
On a non-preemptible kernel, if KEYCTL_DH_COMPUTE is called with the
largest permitted inputs (16384 bits), the kernel spends 10+ seconds
doing modular exponentiation in mpi_powm() without rescheduling. If all
threads do it, it locks up the system. Moreover, it can cause
rcu_sched-stall warnings.
Notwithstanding the insanity of doing this calculation in kernel mode
rather than in userspace, fix it by calling cond_resched() as each bit
from the exponent is processed. It's still noninterruptible, but at
least it's preemptible now.
Do the cond_resched() once per bit rather than once per MPI limb because
each limb might still easily take 100+ milliseconds on slow CPUs.
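As a rough illustration of the per-bit placement, the following
userspace sketch does left-to-right square-and-multiply on 32-bit
values and calls a yield hook once per exponent bit. It is not the
mpi_powm() code: powm_u32(), maybe_yield() and yield_calls are
hypothetical names, and sched_yield() merely stands in for
cond_resched().

```c
#include <assert.h>
#include <sched.h>
#include <stdint.h>

/* Counts how often the yield hook ran, for demonstration only. */
unsigned long yield_calls;

/* Hypothetical stand-in for cond_resched(). */
void maybe_yield(void)
{
	yield_calls++;
	sched_yield();
}

/* Compute base^exp mod mod, scanning the exponent from the top bit down. */
uint64_t powm_u32(uint32_t base, uint32_t exp, uint32_t mod)
{
	uint64_t result, b;
	int bit;

	assert(mod != 0);
	result = 1 % mod;
	b = base % mod;

	for (bit = 31; bit >= 0; bit--) {
		/* Square for every bit; operands stay below 2^32, so no overflow. */
		result = (result * result) % mod;
		/* Multiply only when the current exponent bit is set. */
		if ((exp >> bit) & 1)
			result = (result * b) % mod;
		/* Offer the CPU once per bit, as the patch does. */
		maybe_yield();
	}
	return result;
}
```

With a 32-bit exponent the hook runs 32 times per call. In the real
code each bit step works on multi-limb numbers, which is why the
per-bit granularity matters there.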
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: Herbert Xu <herbert(a)gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
lib/mpi/mpi-pow.c | 2 ++
1 file changed, 2 insertions(+)
--- a/lib/mpi/mpi-pow.c
+++ b/lib/mpi/mpi-pow.c
@@ -26,6 +26,7 @@
* however I decided to publish this code under the plain GPL.
*/
+#include <linux/sched.h>
#include <linux/string.h>
#include "mpi-internal.h"
#include "longlong.h"
@@ -256,6 +257,7 @@ int mpi_powm(MPI res, MPI base, MPI exp,
}
e <<= 1;
c--;
+ cond_resched();
}
i--;
Patches currently in stable-queue which might be from ebiggers(a)google.com are
queue-4.9/lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
This is a note to let you know that I've just added the patch titled
sched: Make resched_cpu() unconditional
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
sched-make-resched_cpu-unconditional.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney" <paulmck(a)linux.vnet.ibm.com>
Date: Mon, 18 Sep 2017 08:54:40 -0700
Subject: sched: Make resched_cpu() unconditional
From: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.
The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not. This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):
o CPU1 is waiting for expedited wait to complete:
    sync_rcu_exp_select_cpus
        rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
        IPI sent to CPU5
    synchronize_sched_expedited_wait
        ret = swait_event_timeout(rsp->expedited_wq,
                                  sync_rcu_preempt_exp_done(rnp_root),
                                  jiffies_stall);
    expmask = 0x20, CPU5 in idle path (in cpuidle_enter())
o CPU5 handles IPI and fails to acquire rq lock.
    Handles IPI
        sync_sched_exp_handler
            resched_cpu
                returns after its trylock fails to acquire rq->lock
            need_resched is not set
o CPU5 calls rcu_idle_enter() and, as need_resched is not set, goes to
  idle (schedule() is not called).
o CPU1 reports RCU stall.
Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.
Reported-by: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Suggested-by: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
kernel/sched/core.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -600,8 +600,7 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
- if (!raw_spin_trylock_irqsave(&rq->lock, flags))
- return;
+ raw_spin_lock_irqsave(&rq->lock, flags);
resched_curr(rq);
raw_spin_unlock_irqrestore(&rq->lock, flags);
}
Patches currently in stable-queue which might be from paulmck(a)linux.vnet.ibm.com are
queue-4.4/sched-make-resched_cpu-unconditional.patch
This is a note to let you know that I've just added the patch titled
lib/mpi: call cond_resched() from mpi_powm() loop
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 1d9ddde12e3c9bab7f3d3484eb9446315e3571ca Mon Sep 17 00:00:00 2001
From: Eric Biggers <ebiggers(a)google.com>
Date: Tue, 7 Nov 2017 14:15:27 -0800
Subject: lib/mpi: call cond_resched() from mpi_powm() loop
From: Eric Biggers <ebiggers(a)google.com>
commit 1d9ddde12e3c9bab7f3d3484eb9446315e3571ca upstream.
On a non-preemptible kernel, if KEYCTL_DH_COMPUTE is called with the
largest permitted inputs (16384 bits), the kernel spends 10+ seconds
doing modular exponentiation in mpi_powm() without rescheduling. If all
threads do it, it locks up the system. Moreover, it can cause
rcu_sched-stall warnings.
Notwithstanding the insanity of doing this calculation in kernel mode
rather than in userspace, fix it by calling cond_resched() as each bit
from the exponent is processed. It's still noninterruptible, but at
least it's preemptible now.
Do the cond_resched() once per bit rather than once per MPI limb because
each limb might still easily take 100+ milliseconds on slow CPUs.
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: Herbert Xu <herbert(a)gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
lib/mpi/mpi-pow.c | 2 ++
1 file changed, 2 insertions(+)
--- a/lib/mpi/mpi-pow.c
+++ b/lib/mpi/mpi-pow.c
@@ -26,6 +26,7 @@
* however I decided to publish this code under the plain GPL.
*/
+#include <linux/sched.h>
#include <linux/string.h>
#include "mpi-internal.h"
#include "longlong.h"
@@ -256,6 +257,7 @@ int mpi_powm(MPI res, MPI base, MPI exp,
}
e <<= 1;
c--;
+ cond_resched();
}
i--;
Patches currently in stable-queue which might be from ebiggers(a)google.com are
queue-4.4/lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
This is a note to let you know that I've just added the patch titled
serdev: fix registration of second slave
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
serdev-fix-registration-of-second-slave.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 08fcee289f341786eb3b44e5f2d1dc850943238e Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan(a)kernel.org>
Date: Tue, 10 Oct 2017 18:09:49 +0200
Subject: serdev: fix registration of second slave
From: Johan Hovold <johan(a)kernel.org>
commit 08fcee289f341786eb3b44e5f2d1dc850943238e upstream.
Serdev currently only supports a single slave device, but the required
sanity checks to prevent further registration attempts were missing.
If a serial-port node has two child nodes with compatible properties,
the OF code would try to register two slave devices using the same id
and name. Driver core will not allow this (and there will be loud
complaints), but the controller's slave pointer would already have been
set to address of the soon to be deallocated second struct
serdev_device. As the first slave device remains registered, this can
lead to later use-after-free issues when the slave callbacks are
accessed.
Note that while the serdev registration helpers are exported, they are
typically only called by serdev core. Any other (out-of-tree) callers
must serialise registration and deregistration themselves.
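The claim-then-roll-back pattern described above can be sketched in
plain C as follows. This is an illustration under stated assumptions,
not the serdev code: fake_controller, fake_device, fake_core_add(),
fake_device_add() and fake_device_remove() are hypothetical names,
and fake_core_add() only imitates a device_add() that can fail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct fake_device {
	int id;				/* negative ids make the core add fail */
};

struct fake_controller {
	struct fake_device *slave;	/* the single supported slave slot */
};

/* Hypothetical stand-in for device_add(). */
int fake_core_add(struct fake_device *dev)
{
	return dev->id < 0 ? -EINVAL : 0;
}

int fake_device_add(struct fake_controller *ctrl, struct fake_device *dev)
{
	int err;

	assert(ctrl && dev);

	/* Refuse a second slave before touching the controller state. */
	if (ctrl->slave)
		return -EBUSY;
	ctrl->slave = dev;

	err = fake_core_add(dev);
	if (err < 0) {
		/* Roll back so no stale pointer is left behind. */
		ctrl->slave = NULL;
		return err;
	}
	return 0;
}

void fake_device_remove(struct fake_controller *ctrl, struct fake_device *dev)
{
	assert(ctrl && dev);
	/* Mirrors the patch: the slot is cleared on removal. */
	ctrl->slave = NULL;
}
```

The point of the ordering is that the controller's pointer is only
ever set for a device that either registers successfully or is
cleared again on the error path. As noted above, callers are assumed
to serialise add and remove themselves.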
Fixes: cd6484e1830b ("serdev: Introduce new bus for serial attached devices")
Cc: Rob Herring <robh(a)kernel.org>
Signed-off-by: Johan Hovold <johan(a)kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/serdev/core.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
--- a/drivers/tty/serdev/core.c
+++ b/drivers/tty/serdev/core.c
@@ -65,21 +65,32 @@ static int serdev_uevent(struct device *
*/
int serdev_device_add(struct serdev_device *serdev)
{
+ struct serdev_controller *ctrl = serdev->ctrl;
struct device *parent = serdev->dev.parent;
int err;
dev_set_name(&serdev->dev, "%s-%d", dev_name(parent), serdev->nr);
+ /* Only a single slave device is currently supported. */
+ if (ctrl->serdev) {
+ dev_err(&serdev->dev, "controller busy\n");
+ return -EBUSY;
+ }
+ ctrl->serdev = serdev;
+
err = device_add(&serdev->dev);
if (err < 0) {
dev_err(&serdev->dev, "Can't add %s, status %d\n",
dev_name(&serdev->dev), err);
- goto err_device_add;
+ goto err_clear_serdev;
}
dev_dbg(&serdev->dev, "device %s registered\n", dev_name(&serdev->dev));
-err_device_add:
+ return 0;
+
+err_clear_serdev:
+ ctrl->serdev = NULL;
return err;
}
EXPORT_SYMBOL_GPL(serdev_device_add);
@@ -90,7 +101,10 @@ EXPORT_SYMBOL_GPL(serdev_device_add);
*/
void serdev_device_remove(struct serdev_device *serdev)
{
+ struct serdev_controller *ctrl = serdev->ctrl;
+
device_unregister(&serdev->dev);
+ ctrl->serdev = NULL;
}
EXPORT_SYMBOL_GPL(serdev_device_remove);
@@ -295,7 +309,6 @@ struct serdev_device *serdev_device_allo
return NULL;
serdev->ctrl = ctrl;
- ctrl->serdev = serdev;
device_initialize(&serdev->dev);
serdev->dev.parent = &ctrl->dev;
serdev->dev.bus = &serdev_bus_type;
Patches currently in stable-queue which might be from johan(a)kernel.org are
queue-4.14/serdev-fix-registration-of-second-slave.patch
This is a note to let you know that I've just added the patch titled
sched: Make resched_cpu() unconditional
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
sched-make-resched_cpu-unconditional.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney" <paulmck(a)linux.vnet.ibm.com>
Date: Mon, 18 Sep 2017 08:54:40 -0700
Subject: sched: Make resched_cpu() unconditional
From: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.
The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not. This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):
o CPU1 is waiting for expedited wait to complete:
    sync_rcu_exp_select_cpus
        rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
        IPI sent to CPU5
    synchronize_sched_expedited_wait
        ret = swait_event_timeout(rsp->expedited_wq,
                                  sync_rcu_preempt_exp_done(rnp_root),
                                  jiffies_stall);
    expmask = 0x20, CPU5 in idle path (in cpuidle_enter())
o CPU5 handles IPI and fails to acquire rq lock.
    Handles IPI
        sync_sched_exp_handler
            resched_cpu
                returns after its trylock fails to acquire rq->lock
            need_resched is not set
o CPU5 calls rcu_idle_enter() and, as need_resched is not set, goes to
  idle (schedule() is not called).
o CPU1 reports RCU stall.
Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.
Reported-by: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Suggested-by: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
kernel/sched/core.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -505,8 +505,7 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
- if (!raw_spin_trylock_irqsave(&rq->lock, flags))
- return;
+ raw_spin_lock_irqsave(&rq->lock, flags);
resched_curr(rq);
raw_spin_unlock_irqrestore(&rq->lock, flags);
}
Patches currently in stable-queue which might be from paulmck(a)linux.vnet.ibm.com are
queue-4.14/sched-make-resched_cpu-unconditional.patch
This is a note to let you know that I've just added the patch titled
lib/mpi: call cond_resched() from mpi_powm() loop
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 1d9ddde12e3c9bab7f3d3484eb9446315e3571ca Mon Sep 17 00:00:00 2001
From: Eric Biggers <ebiggers(a)google.com>
Date: Tue, 7 Nov 2017 14:15:27 -0800
Subject: lib/mpi: call cond_resched() from mpi_powm() loop
From: Eric Biggers <ebiggers(a)google.com>
commit 1d9ddde12e3c9bab7f3d3484eb9446315e3571ca upstream.
On a non-preemptible kernel, if KEYCTL_DH_COMPUTE is called with the
largest permitted inputs (16384 bits), the kernel spends 10+ seconds
doing modular exponentiation in mpi_powm() without rescheduling. If all
threads do it, it locks up the system. Moreover, it can cause
rcu_sched-stall warnings.
Notwithstanding the insanity of doing this calculation in kernel mode
rather than in userspace, fix it by calling cond_resched() as each bit
from the exponent is processed. It's still noninterruptible, but at
least it's preemptible now.
Do the cond_resched() once per bit rather than once per MPI limb because
each limb might still easily take 100+ milliseconds on slow CPUs.
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: Herbert Xu <herbert(a)gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
lib/mpi/mpi-pow.c | 2 ++
1 file changed, 2 insertions(+)
--- a/lib/mpi/mpi-pow.c
+++ b/lib/mpi/mpi-pow.c
@@ -26,6 +26,7 @@
* however I decided to publish this code under the plain GPL.
*/
+#include <linux/sched.h>
#include <linux/string.h>
#include "mpi-internal.h"
#include "longlong.h"
@@ -256,6 +257,7 @@ int mpi_powm(MPI res, MPI base, MPI exp,
}
e <<= 1;
c--;
+ cond_resched();
}
i--;
Patches currently in stable-queue which might be from ebiggers(a)google.com are
queue-4.14/lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
This is a note to let you know that I've just added the patch titled
cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
cpufreq-schedutil-reset-cached_raw_freq-when-not-in-sync-with-next_freq.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 07458f6a5171d97511dfbdf6ce549ed2ca0280c7 Mon Sep 17 00:00:00 2001
From: Viresh Kumar <viresh.kumar(a)linaro.org>
Date: Wed, 8 Nov 2017 20:23:55 +0530
Subject: cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq
From: Viresh Kumar <viresh.kumar(a)linaro.org>
commit 07458f6a5171d97511dfbdf6ce549ed2ca0280c7 upstream.
'cached_raw_freq' is used to get the next frequency quickly but should
always be in sync with sg_policy->next_freq. There is a case where it is
not and in such cases it should be reset to avoid switching to incorrect
frequencies.
Consider this case for example:
- policy->cur is 1.2 GHz (Max)
- New request comes for 780 MHz and we store that in cached_raw_freq.
- Based on 780 MHz, we calculate the effective frequency as 800 MHz.
- We then see the CPU wasn't idle recently and choose to keep the next
freq as 1.2 GHz.
- Now cached_raw_freq is 780 MHz and sg_policy->next_freq is
  1.2 GHz.
- Now if the utilization doesn't change in the next request, then the
  next target frequency will still be 780 MHz and it will match
  cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.
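The sequence above can be replayed with a small userspace model. This
is a simplified sketch, not the schedutil code: fake_policy,
fake_resolve_freq(), fake_get_next_freq() and fake_update() are
hypothetical names, the three-entry frequency table is invented for
the example, and the reset_cache argument exists only to compare the
behaviour with and without the fix. Frequencies are in kHz.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_policy {
	unsigned int cached_raw_freq;	/* last raw request seen */
	unsigned int next_freq;		/* last frequency actually chosen */
};

/* Round a raw request up to the next entry of an invented table. */
unsigned int fake_resolve_freq(unsigned int raw)
{
	static const unsigned int table[] = { 400000, 800000, 1200000 };
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (table[i] >= raw)
			return table[i];
	return table[sizeof(table) / sizeof(table[0]) - 1];
}

/* Fast path: reuse next_freq when the raw request is unchanged. */
unsigned int fake_get_next_freq(struct fake_policy *p, unsigned int raw)
{
	if (raw == p->cached_raw_freq)
		return p->next_freq;
	p->cached_raw_freq = raw;
	return fake_resolve_freq(raw);
}

unsigned int fake_update(struct fake_policy *p, unsigned int raw,
			 bool busy, bool reset_cache)
{
	unsigned int next_f;

	assert(p);
	next_f = fake_get_next_freq(p, raw);

	/* Do not reduce the frequency of a CPU that was busy recently. */
	if (busy && next_f < p->next_freq) {
		next_f = p->next_freq;
		/* With the fix, the cache no longer matches next_freq. */
		if (reset_cache)
			p->cached_raw_freq = 0;
	}
	p->next_freq = next_f;
	return next_f;
}
```

Starting from next_freq = 1200000, a busy request for 780000 followed
by a non-busy request for 780000 stays at 1200000 without the reset,
and drops to 800000 with it.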
Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
kernel/sched/cpufreq_schedutil.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -282,8 +282,12 @@ static void sugov_update_single(struct u
* Do not reduce the frequency if the CPU has not been idle
* recently, as the reduction is likely to be premature then.
*/
- if (busy && next_f < sg_policy->next_freq)
+ if (busy && next_f < sg_policy->next_freq) {
next_f = sg_policy->next_freq;
+
+ /* Reset cached freq as next_freq has changed */
+ sg_policy->cached_raw_freq = 0;
+ }
}
sugov_update_commit(sg_policy, time, next_f);
}
Patches currently in stable-queue which might be from viresh.kumar(a)linaro.org are
queue-4.14/cpufreq-schedutil-reset-cached_raw_freq-when-not-in-sync-with-next_freq.patch
This is a note to let you know that I've just added the patch titled
sched: Make resched_cpu() unconditional
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
sched-make-resched_cpu-unconditional.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney" <paulmck(a)linux.vnet.ibm.com>
Date: Mon, 18 Sep 2017 08:54:40 -0700
Subject: sched: Make resched_cpu() unconditional
From: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.
The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not. This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):
o CPU1 is waiting for expedited wait to complete:
    sync_rcu_exp_select_cpus
        rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
        IPI sent to CPU5
    synchronize_sched_expedited_wait
        ret = swait_event_timeout(rsp->expedited_wq,
                                  sync_rcu_preempt_exp_done(rnp_root),
                                  jiffies_stall);
    expmask = 0x20, CPU5 in idle path (in cpuidle_enter())
o CPU5 handles IPI and fails to acquire rq lock.
    Handles IPI
        sync_sched_exp_handler
            resched_cpu
                returns after its trylock fails to acquire rq->lock
            need_resched is not set
o CPU5 calls rcu_idle_enter() and, as need_resched is not set, goes to
  idle (schedule() is not called).
o CPU1 reports RCU stall.
Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.
Reported-by: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Suggested-by: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
kernel/sched/core.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -632,8 +632,7 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
- if (!raw_spin_trylock_irqsave(&rq->lock, flags))
- return;
+ raw_spin_lock_irqsave(&rq->lock, flags);
resched_curr(rq);
raw_spin_unlock_irqrestore(&rq->lock, flags);
}
Patches currently in stable-queue which might be from paulmck(a)linux.vnet.ibm.com are
queue-3.18/sched-make-resched_cpu-unconditional.patch
This is a note to let you know that I've just added the patch titled
lib/mpi: call cond_resched() from mpi_powm() loop
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 1d9ddde12e3c9bab7f3d3484eb9446315e3571ca Mon Sep 17 00:00:00 2001
From: Eric Biggers <ebiggers(a)google.com>
Date: Tue, 7 Nov 2017 14:15:27 -0800
Subject: lib/mpi: call cond_resched() from mpi_powm() loop
From: Eric Biggers <ebiggers(a)google.com>
commit 1d9ddde12e3c9bab7f3d3484eb9446315e3571ca upstream.
On a non-preemptible kernel, if KEYCTL_DH_COMPUTE is called with the
largest permitted inputs (16384 bits), the kernel spends 10+ seconds
doing modular exponentiation in mpi_powm() without rescheduling. If all
threads do it, it locks up the system. Moreover, it can cause
rcu_sched-stall warnings.
Notwithstanding the insanity of doing this calculation in kernel mode
rather than in userspace, fix it by calling cond_resched() as each bit
from the exponent is processed. It's still noninterruptible, but at
least it's preemptible now.
Do the cond_resched() once per bit rather than once per MPI limb because
each limb might still easily take 100+ milliseconds on slow CPUs.
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: Herbert Xu <herbert(a)gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
lib/mpi/mpi-pow.c | 2 ++
1 file changed, 2 insertions(+)
--- a/lib/mpi/mpi-pow.c
+++ b/lib/mpi/mpi-pow.c
@@ -26,6 +26,7 @@
* however I decided to publish this code under the plain GPL.
*/
+#include <linux/sched.h>
#include <linux/string.h>
#include "mpi-internal.h"
#include "longlong.h"
@@ -256,6 +257,7 @@ int mpi_powm(MPI res, MPI base, MPI exp,
}
e <<= 1;
c--;
+ cond_resched();
}
i--;
Patches currently in stable-queue which might be from ebiggers(a)google.com are
queue-3.18/lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
Include the OF-based modalias in the uevent sent when registering devices
on the sunxi RSB bus, so that user space has a chance to autoload the
kernel module for the device.
Fixes a regression caused by commit 3f241bfa60bd ("arm64: allwinner: a64:
pine64: Use dcdc1 regulator for mmc0"). When the axp20x-rsb module for
the AXP803 PMIC is built as a module, it is not loaded and the system
ends up with a dysfunctional MMC controller.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Stefan Brüns <stefan.bruens(a)rwth-aachen.de>
---
drivers/bus/sunxi-rsb.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/bus/sunxi-rsb.c b/drivers/bus/sunxi-rsb.c
index 328ca93781cf..37cb57244cbe 100644
--- a/drivers/bus/sunxi-rsb.c
+++ b/drivers/bus/sunxi-rsb.c
@@ -173,11 +173,24 @@ static int sunxi_rsb_device_remove(struct device *dev)
return drv->remove(to_sunxi_rsb_device(dev));
}
+static int sunxi_rsb_device_uevent(struct device *dev,
+ struct kobj_uevent_env *env)
+{
+ int ret;
+
+ ret = of_device_uevent_modalias(dev, env);
+ if (ret != -ENODEV)
+ return ret;
+
+ return 0;
+}
+
static struct bus_type sunxi_rsb_bus = {
.name = RSB_CTRL_NAME,
.match = sunxi_rsb_device_match,
.probe = sunxi_rsb_device_probe,
.remove = sunxi_rsb_device_remove,
+ .uevent = sunxi_rsb_device_uevent,
};
static void sunxi_rsb_dev_release(struct device *dev)
--
2.15.0
This is a note to let you know that I've just added the patch titled
ACPI / APEI: Remove arch_apei_flush_tlb_one()
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
acpi-apei-remove-arch_apei_flush_tlb_one.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 4a75aeacda3c2455954596593d89187df5420d0a Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:27 +0000
Subject: ACPI / APEI: Remove arch_apei_flush_tlb_one()
From: James Morse <james.morse(a)arm.com>
commit 4a75aeacda3c2455954596593d89187df5420d0a upstream.
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_pte_vaddr() to do the invalidation when called from clear_fixmap().
Remove arch_apei_flush_tlb_one().
Signed-off-by: James Morse <james.morse(a)arm.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kernel/acpi/apei.c | 5 -----
include/acpi/apei.h | 1 -
2 files changed, 6 deletions(-)
--- a/arch/x86/kernel/acpi/apei.c
+++ b/arch/x86/kernel/acpi/apei.c
@@ -55,8 +55,3 @@ void arch_apei_report_mem_error(int sev,
apei_mce_report_mem_error(sev, mem_err);
#endif
}
-
-void arch_apei_flush_tlb_one(unsigned long addr)
-{
- __flush_tlb_one(addr);
-}
--- a/include/acpi/apei.h
+++ b/include/acpi/apei.h
@@ -44,7 +44,6 @@ int erst_clear(u64 record_id);
int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data);
void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err);
-void arch_apei_flush_tlb_one(unsigned long addr);
#endif
#endif
Patches currently in stable-queue which might be from james.morse(a)arm.com are
queue-4.9/acpi-apei-remove-arch_apei_flush_tlb_one.patch
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3f5fe9fef5b2da06b6319fab8123056da5217c3f Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Wed, 22 Nov 2017 13:05:48 +0100
Subject: [PATCH] sched/debug: Fix task state recording/printout
The recent conversion of the task state recording to use task_state_index()
broke the sched_switch tracepoint task state output.
task_state_index() returns, surprisingly, an index (0-7), which is then
printed with __print_flags() applying bitmasks. That does not work and
results in weird states like 'prev_state=t' instead of 'prev_state=I'.
Use TASK_REPORT_MAX instead of TASK_STATE_MAX to report preemption. Build a
bitmask from the return value of task_state_index() and store it in
entry->prev_state, which makes __print_flags() work as expected.
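The index-versus-bitmask mismatch can be shown with a generic decoder.
This is only a sketch: decode_flags() imitates the general idea of
__print_flags() but is not its implementation, and the fake_flags
table with labels "A" to "D" is invented and does not reflect the
kernel's task state values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct flag_name {
	unsigned int mask;
	const char *name;
};

/* Invented table: one label per bit. */
static const struct flag_name fake_flags[] = {
	{ 0x1, "A" }, { 0x2, "B" }, { 0x4, "C" }, { 0x8, "D" },
};

/* Join the labels of all set bits with '|' into buf. */
const char *decode_flags(unsigned int value, char *buf, size_t len)
{
	size_t i;

	assert(buf && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(fake_flags) / sizeof(fake_flags[0]); i++) {
		if (!(value & fake_flags[i].mask))
			continue;
		/* Separate labels once something has been written. */
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, fake_flags[i].name, len - strlen(buf) - 1);
	}
	return buf;
}
```

Feeding the raw index 3 to this decoder yields "A|B", because 3 has
two bits set, while feeding 1 << 3 yields the single label "D". That
is the kind of mismatch the tracepoint showed when an index was
handed to a bitmask printer.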
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: stable(a)vger.kernel.org
Fixes: efb40f588b43 ("sched/tracing: Fix trace_sched_switch task-state printing")
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1711221304180.1751@nanos
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 306b31de5194..bc01e06bc716 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -116,9 +116,9 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
* RUNNING (we will not have dequeued if state != RUNNING).
*/
if (preempt)
- return TASK_STATE_MAX;
+ return TASK_REPORT_MAX;
- return task_state_index(p);
+ return 1 << task_state_index(p);
}
#endif /* CREATE_TRACE_POINTS */
@@ -164,7 +164,7 @@ TRACE_EVENT(sched_switch,
{ 0x40, "P" }, { 0x80, "I" }) :
"R",
- __entry->prev_state & TASK_STATE_MAX ? "+" : "",
+ __entry->prev_state & TASK_REPORT_MAX ? "+" : "",
__entry->next_comm, __entry->next_pid, __entry->next_prio)
);
The patch below does not apply to the 3.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b6366f048e0caff28af5335b7af2031266e1b06b Mon Sep 17 00:00:00 2001
From: Steven Rostedt <rostedt(a)goodmis.org>
Date: Wed, 18 Mar 2015 14:49:46 -0400
Subject: [PATCH] sched/rt: Use IPI to trigger RT task push migration instead
of pulling
When debugging the latencies on a 40-core box, where we hit 300 to
500 microsecond latencies, I found there was heavy contention on the
runqueue locks.
Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.
The test that was run was the following:
cyclictest --numa -p95 -m -d0 -i100
This created a thread on each CPU, that would set its wakeup in iterations
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us and wakes up and measures
its latencies.
cyclictest is maintained at:
git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
What happened was another RT task would be scheduled on one of the CPUs
that was running our test, when the other CPU tests went to sleep and
scheduled idle. This caused the "pull" operation to execute on all
these CPUs. Each one of these saw the RT task that was overloaded on
the CPU of the test that was still running, and each one tried
to grab that task in a thundering herd way.
To grab the task, each thread would do a double rq lock grab, grabbing
its own lock as well as the rq lock of the overloaded CPU. As the sched
domains on this box were rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple effect with the
rq locks, especially since the taking was done via a double rq lock, which
means that several of the CPUs had their own rq locks held while trying
to take this rq lock. As these locks were blocked, any wakeups or load
balancing on these CPUs would also block on these locks, and the wait
time escalated.
I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task won't work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, only to find out that the
CPU we picked isn't in the task's affinity.
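The failure mode of the "single winner" idea can be sketched in userspace C11.
This is only an illustration under assumptions: struct overloaded_task,
try_claim() and pull_succeeds() are hypothetical names, and the affinity is
modelled as a plain bitmask where bit n means CPU n is allowed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an overloaded task: a claim token plus affinity. */
struct overloaded_task {
	atomic_flag claimed;        /* set by the first CPU that asks */
	unsigned long allowed_cpus; /* bit n set: task may run on CPU n */
};

/* Only the first caller wins the claim; everyone else backs off. */
static bool try_claim(struct overloaded_task *t)
{
	return !atomic_flag_test_and_set(&t->claimed);
}

/* Returns true only if this CPU won the claim AND is in the affinity mask. */
static bool pull_succeeds(struct overloaded_task *t, int cpu)
{
	assert(cpu >= 0 && cpu < (int)(sizeof(unsigned long) * 8));
	if (!try_claim(t))
		return false; /* another CPU already won the claim */
	return (t->allowed_cpus >> cpu) & 1UL;
}
```

If a CPU outside the affinity mask wins the claim, its pull fails, and the one
CPU that could have taken the task has already been turned away.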
Instead of doing the PULL, I now have the CPUs that want the pull to
send over an IPI to the overloaded CPU, and let that CPU pick what
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.
With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.
I've created a new sched feature called RT_PUSH_IPI, which is enabled
by default.
When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
is enabled, the IPI is sent to the overloaded CPU to do a push.
To enable or disable this at run time:
# mount -t debugfs nodev /sys/kernel/debug
# echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
# echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
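The "NO_" prefix convention used by the two commands above can be sketched as
a small userspace parser. This is a simplified illustration, not the kernel's
code: feat_set() and the two-entry feat_names table are hypothetical, with the
feature names borrowed from the features.h hunk in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative feature table; bit i of the mask corresponds to feat_names[i]. */
static const char *const feat_names[] = { "TTWU_QUEUE", "RT_PUSH_IPI" };
#define NR_FEATS (sizeof(feat_names) / sizeof(feat_names[0]))

/* Set or clear one feature bit; returns 0 on success, -1 for an unknown name. */
static int feat_set(unsigned long *features, const char *cmd)
{
	bool neg = false;
	size_t i;

	assert(features != NULL && cmd != NULL);

	/* A leading "NO_" means "clear the bit" instead of "set the bit". */
	if (strncmp(cmd, "NO_", 3) == 0) {
		neg = true;
		cmd += 3;
	}

	for (i = 0; i < NR_FEATS; i++) {
		if (strcmp(cmd, feat_names[i]) != 0)
			continue;
		if (neg)
			*features &= ~(1UL << i);
		else
			*features |= 1UL << i;
		return 0;
	}
	return -1;
}
```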
Update: This original patch would send an IPI to all CPUs in the RT overload
list. But that could theoretically cause the reverse issue. That is, there
could be lots of overloaded RT queues and one CPU lowers its priority. It would
then send an IPI to all the overloaded RT queues and they could then all try
to grab the rq lock of the CPU lowering its priority, and then we have the
same problem.
The latest design sends out only one IPI to the first overloaded CPU. It tries to
push any tasks that it can, and then looks for the next overloaded CPU that can
push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
tasks that have priorities greater than the source CPU are covered. In case the
source CPU lowers its priority again, a flag is set to tell the IPI traversal to
restart with the first RT overloaded CPU after the source CPU.
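The circular walk over the overloaded CPUs can be modelled in userspace. The
sketch below mirrors the logic of rto_next_cpu() from the diff in this patch,
but with assumed simplifications: the overload mask is a plain unsigned long,
NR_CPUS is fixed at 8, and struct rto_walk, mask_next() and rto_next() are
hypothetical names.

```c
#include <assert.h>

#define NR_CPUS 8 /* assumed small CPU count for the model */

/* Hypothetical walk state: source CPU, last CPU returned, overload mask. */
struct rto_walk {
	int src_cpu;
	int push_cpu;
	unsigned long rto_mask;
};

/* Next set bit strictly after prev, or NR_CPUS if there is none. */
static int mask_next(int prev, unsigned long mask)
{
	int cpu;

	assert(prev >= -1 && prev < NR_CPUS);
	for (cpu = prev + 1; cpu < NR_CPUS; cpu++)
		if ((mask >> cpu) & 1UL)
			return cpu;
	return NR_CPUS;
}

/* Returns the next overloaded CPU after the source, wrapping once. */
static int rto_next(struct rto_walk *w)
{
	int prev = w->push_cpu;
	int cpu = mask_next(prev, w->rto_mask);

	if (prev < w->src_cpu) {
		/* Already wrapped: stop once we reach the source CPU again. */
		if (cpu >= w->src_cpu)
			return NR_CPUS;
	} else if (cpu >= NR_CPUS) {
		/* Ran off the end of the mask: wrap to the first set bit. */
		cpu = mask_next(-1, w->rto_mask);
		if (cpu >= w->src_cpu)
			return NR_CPUS;
	}
	w->push_cpu = cpu;
	return cpu;
}
```

With the source CPU at 3 and CPUs 1, 3, 5 and 6 marked overloaded, the walk
visits 5, 6 and 1, then reports completion without ever returning CPU 3.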
Parts-suggested-by: Peter Zijlstra <peterz(a)infradead.org>
Signed-off-by: Steven Rostedt <rostedt(a)goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: Joern Engel <joern(a)purestorage.com>
Cc: Clark Williams <williams(a)redhat.com>
Cc: Mike Galbraith <umgwanakikbuti(a)gmail.com>
Cc: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 90284d117fe6..91e33cd485f6 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -56,6 +56,19 @@ SCHED_FEAT(NONTASK_CAPACITY, true)
*/
SCHED_FEAT(TTWU_QUEUE, true)
+#ifdef HAVE_RT_PUSH_IPI
+/*
+ * In order to avoid a thundering herd attack of CPUs that are
+ * lowering their priorities at the same time, and there being
+ * a single CPU that has an RT task that can migrate and is waiting
+ * to run, where the other CPUs will try to take that CPUs
+ * rq lock and possibly create a large contention, sending an
+ * IPI to that CPU and let that CPU push the RT task to where
+ * it should go may be a better scenario.
+ */
+SCHED_FEAT(RT_PUSH_IPI, true)
+#endif
+
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f4d4b077eba0..ad0241561c3e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -6,6 +6,7 @@
#include "sched.h"
#include <linux/slab.h>
+#include <linux/irq_work.h>
int sched_rr_timeslice = RR_TIMESLICE;
@@ -59,6 +60,10 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
+#ifdef CONFIG_SMP
+static void push_irq_work_func(struct irq_work *work);
+#endif
+
void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
{
struct rt_prio_array *array;
@@ -78,7 +83,14 @@ void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
plist_head_init(&rt_rq->pushable_tasks);
+
+#ifdef HAVE_RT_PUSH_IPI
+ rt_rq->push_flags = 0;
+ rt_rq->push_cpu = nr_cpu_ids;
+ raw_spin_lock_init(&rt_rq->push_lock);
+ init_irq_work(&rt_rq->push_work, push_irq_work_func);
#endif
+#endif /* CONFIG_SMP */
/* We start is dequeued state, because no RT tasks are queued */
rt_rq->rt_queued = 0;
@@ -1778,6 +1790,164 @@ static void push_rt_tasks(struct rq *rq)
;
}
+#ifdef HAVE_RT_PUSH_IPI
+/*
+ * The search for the next cpu always starts at rq->cpu and ends
+ * when we reach rq->cpu again. It will never return rq->cpu.
+ * This returns the next cpu to check, or nr_cpu_ids if the loop
+ * is complete.
+ *
+ * rq->rt.push_cpu holds the last cpu returned by this function,
+ * or if this is the first instance, it must hold rq->cpu.
+ */
+static int rto_next_cpu(struct rq *rq)
+{
+ int prev_cpu = rq->rt.push_cpu;
+ int cpu;
+
+ cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
+
+ /*
+ * If the previous cpu is less than the rq's CPU, then it already
+ * passed the end of the mask, and has started from the beginning.
+ * We end if the next CPU is greater or equal to rq's CPU.
+ */
+ if (prev_cpu < rq->cpu) {
+ if (cpu >= rq->cpu)
+ return nr_cpu_ids;
+
+ } else if (cpu >= nr_cpu_ids) {
+ /*
+ * We passed the end of the mask, start at the beginning.
+ * If the result is greater or equal to the rq's CPU, then
+ * the loop is finished.
+ */
+ cpu = cpumask_first(rq->rd->rto_mask);
+ if (cpu >= rq->cpu)
+ return nr_cpu_ids;
+ }
+ rq->rt.push_cpu = cpu;
+
+ /* Return cpu to let the caller know if the loop is finished or not */
+ return cpu;
+}
+
+static int find_next_push_cpu(struct rq *rq)
+{
+ struct rq *next_rq;
+ int cpu;
+
+ while (1) {
+ cpu = rto_next_cpu(rq);
+ if (cpu >= nr_cpu_ids)
+ break;
+ next_rq = cpu_rq(cpu);
+
+ /* Make sure the next rq can push to this rq */
+ if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
+ break;
+ }
+
+ return cpu;
+}
+
+#define RT_PUSH_IPI_EXECUTING 1
+#define RT_PUSH_IPI_RESTART 2
+
+static void tell_cpu_to_push(struct rq *rq)
+{
+ int cpu;
+
+ if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+ raw_spin_lock(&rq->rt.push_lock);
+ /* Make sure it's still executing */
+ if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+ /*
+ * Tell the IPI to restart the loop as things have
+ * changed since it started.
+ */
+ rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
+ raw_spin_unlock(&rq->rt.push_lock);
+ return;
+ }
+ raw_spin_unlock(&rq->rt.push_lock);
+ }
+
+ /* When here, there's no IPI going around */
+
+ rq->rt.push_cpu = rq->cpu;
+ cpu = find_next_push_cpu(rq);
+ if (cpu >= nr_cpu_ids)
+ return;
+
+ rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+
+ irq_work_queue_on(&rq->rt.push_work, cpu);
+}
+
+/* Called from hardirq context */
+static void try_to_push_tasks(void *arg)
+{
+ struct rt_rq *rt_rq = arg;
+ struct rq *rq, *src_rq;
+ int this_cpu;
+ int cpu;
+
+ this_cpu = rt_rq->push_cpu;
+
+ /* Paranoid check */
+ BUG_ON(this_cpu != smp_processor_id());
+
+ rq = cpu_rq(this_cpu);
+ src_rq = rq_of_rt_rq(rt_rq);
+
+again:
+ if (has_pushable_tasks(rq)) {
+ raw_spin_lock(&rq->lock);
+ push_rt_task(rq);
+ raw_spin_unlock(&rq->lock);
+ }
+
+ /* Pass the IPI to the next rt overloaded queue */
+ raw_spin_lock(&rt_rq->push_lock);
+ /*
+ * If the source queue changed since the IPI went out,
+ * we need to restart the search from that CPU again.
+ */
+ if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
+ rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
+ rt_rq->push_cpu = src_rq->cpu;
+ }
+
+ cpu = find_next_push_cpu(src_rq);
+
+ if (cpu >= nr_cpu_ids)
+ rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
+ raw_spin_unlock(&rt_rq->push_lock);
+
+ if (cpu >= nr_cpu_ids)
+ return;
+
+ /*
+ * It is possible that a restart caused this CPU to be
+ * chosen again. Don't bother with an IPI, just see if we
+ * have more to push.
+ */
+ if (unlikely(cpu == rq->cpu))
+ goto again;
+
+ /* Try the next RT overloaded CPU */
+ irq_work_queue_on(&rt_rq->push_work, cpu);
+}
+
+static void push_irq_work_func(struct irq_work *work)
+{
+ struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
+
+ try_to_push_tasks(rt_rq);
+}
+#endif /* HAVE_RT_PUSH_IPI */
+
static int pull_rt_task(struct rq *this_rq)
{
int this_cpu = this_rq->cpu, ret = 0, cpu;
@@ -1793,6 +1963,13 @@ static int pull_rt_task(struct rq *this_rq)
*/
smp_rmb();
+#ifdef HAVE_RT_PUSH_IPI
+ if (sched_feat(RT_PUSH_IPI)) {
+ tell_cpu_to_push(this_rq);
+ return 0;
+ }
+#endif
+
for_each_cpu(cpu, this_rq->rd->rto_mask) {
if (this_cpu == cpu)
continue;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc0f435a2779..c2c0d7bd5027 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
+#include <linux/irq_work.h>
#include <linux/tick.h>
#include <linux/slab.h>
@@ -418,6 +419,11 @@ static inline int rt_bandwidth_enabled(void)
return sysctl_sched_rt_runtime >= 0;
}
+/* RT IPI pull logic requires IRQ_WORK */
+#ifdef CONFIG_IRQ_WORK
+# define HAVE_RT_PUSH_IPI
+#endif
+
/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
struct rt_prio_array active;
@@ -435,7 +441,13 @@ struct rt_rq {
unsigned long rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
+#ifdef HAVE_RT_PUSH_IPI
+ int push_flags;
+ int push_cpu;
+ struct irq_work push_work;
+ raw_spinlock_t push_lock;
#endif
+#endif /* CONFIG_SMP */
int rt_queued;
int rt_throttled;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b6366f048e0caff28af5335b7af2031266e1b06b Mon Sep 17 00:00:00 2001
From: Steven Rostedt <rostedt(a)goodmis.org>
Date: Wed, 18 Mar 2015 14:49:46 -0400
Subject: [PATCH] sched/rt: Use IPI to trigger RT task push migration instead
of pulling
When debugging the latencies on a 40-core box, where we hit 300 to
500 microsecond latencies, I found there was heavy contention on the
runqueue locks.
Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.
The test that was run was the following:
cyclictest --numa -p95 -m -d0 -i100
This created a thread on each CPU, that would set its wakeup in iterations
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us and wakes up and measures
its latencies.
cyclictest is maintained at:
git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
What happened was another RT task would be scheduled on one of the CPUs
that was running our test, when the other CPU tests went to sleep and
scheduled idle. This caused the "pull" operation to execute on all
these CPUs. Each one of these saw the RT task that was overloaded on
the CPU of the test that was still running, and each one tried
to grab that task in a thundering herd way.
To grab the task, each thread would do a double rq lock grab, grabbing
its own lock as well as the rq lock of the overloaded CPU. As the sched
domains on this box were rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple effect with the
rq locks, especially since the taking was done via a double rq lock, which
means that several of the CPUs had their own rq locks held while trying
to take this rq lock. As these locks were blocked, any wakeups or load
balancing on these CPUs would also block on these locks, and the wait
time escalated.
I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task won't work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, only to find out that the
CPU we picked isn't in the task's affinity.
Instead of doing the PULL, I now have the CPUs that want the pull to
send over an IPI to the overloaded CPU, and let that CPU pick what
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.
With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.
I've created a new sched feature called RT_PUSH_IPI, which is enabled
by default.
When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
is enabled, the IPI is sent to the overloaded CPU to do a push.
To enable or disable this at run time:
# mount -t debugfs nodev /sys/kernel/debug
# echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
# echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
Update: This original patch would send an IPI to all CPUs in the RT overload
list. But that could theoretically cause the reverse issue. That is, there
could be lots of overloaded RT queues and one CPU lowers its priority. It would
then send an IPI to all the overloaded RT queues and they could then all try
to grab the rq lock of the CPU lowering its priority, and then we have the
same problem.
The latest design sends out only one IPI to the first overloaded CPU. It tries to
push any tasks that it can, and then looks for the next overloaded CPU that can
push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
tasks that have priorities greater than the source CPU are covered. In case the
source CPU lowers its priority again, a flag is set to tell the IPI traversal to
restart with the first RT overloaded CPU after the source CPU.
Parts-suggested-by: Peter Zijlstra <peterz(a)infradead.org>
Signed-off-by: Steven Rostedt <rostedt(a)goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: Joern Engel <joern(a)purestorage.com>
Cc: Clark Williams <williams(a)redhat.com>
Cc: Mike Galbraith <umgwanakikbuti(a)gmail.com>
Cc: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 90284d117fe6..91e33cd485f6 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -56,6 +56,19 @@ SCHED_FEAT(NONTASK_CAPACITY, true)
*/
SCHED_FEAT(TTWU_QUEUE, true)
+#ifdef HAVE_RT_PUSH_IPI
+/*
+ * In order to avoid a thundering herd attack of CPUs that are
+ * lowering their priorities at the same time, and there being
+ * a single CPU that has an RT task that can migrate and is waiting
+ * to run, where the other CPUs will try to take that CPUs
+ * rq lock and possibly create a large contention, sending an
+ * IPI to that CPU and let that CPU push the RT task to where
+ * it should go may be a better scenario.
+ */
+SCHED_FEAT(RT_PUSH_IPI, true)
+#endif
+
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f4d4b077eba0..ad0241561c3e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -6,6 +6,7 @@
#include "sched.h"
#include <linux/slab.h>
+#include <linux/irq_work.h>
int sched_rr_timeslice = RR_TIMESLICE;
@@ -59,6 +60,10 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
+#ifdef CONFIG_SMP
+static void push_irq_work_func(struct irq_work *work);
+#endif
+
void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
{
struct rt_prio_array *array;
@@ -78,7 +83,14 @@ void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
plist_head_init(&rt_rq->pushable_tasks);
+
+#ifdef HAVE_RT_PUSH_IPI
+ rt_rq->push_flags = 0;
+ rt_rq->push_cpu = nr_cpu_ids;
+ raw_spin_lock_init(&rt_rq->push_lock);
+ init_irq_work(&rt_rq->push_work, push_irq_work_func);
#endif
+#endif /* CONFIG_SMP */
/* We start is dequeued state, because no RT tasks are queued */
rt_rq->rt_queued = 0;
@@ -1778,6 +1790,164 @@ static void push_rt_tasks(struct rq *rq)
;
}
+#ifdef HAVE_RT_PUSH_IPI
+/*
+ * The search for the next cpu always starts at rq->cpu and ends
+ * when we reach rq->cpu again. It will never return rq->cpu.
+ * This returns the next cpu to check, or nr_cpu_ids if the loop
+ * is complete.
+ *
+ * rq->rt.push_cpu holds the last cpu returned by this function,
+ * or if this is the first instance, it must hold rq->cpu.
+ */
+static int rto_next_cpu(struct rq *rq)
+{
+ int prev_cpu = rq->rt.push_cpu;
+ int cpu;
+
+ cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
+
+ /*
+ * If the previous cpu is less than the rq's CPU, then it already
+ * passed the end of the mask, and has started from the beginning.
+ * We end if the next CPU is greater or equal to rq's CPU.
+ */
+ if (prev_cpu < rq->cpu) {
+ if (cpu >= rq->cpu)
+ return nr_cpu_ids;
+
+ } else if (cpu >= nr_cpu_ids) {
+ /*
+ * We passed the end of the mask, start at the beginning.
+ * If the result is greater or equal to the rq's CPU, then
+ * the loop is finished.
+ */
+ cpu = cpumask_first(rq->rd->rto_mask);
+ if (cpu >= rq->cpu)
+ return nr_cpu_ids;
+ }
+ rq->rt.push_cpu = cpu;
+
+ /* Return cpu to let the caller know if the loop is finished or not */
+ return cpu;
+}
+
+static int find_next_push_cpu(struct rq *rq)
+{
+ struct rq *next_rq;
+ int cpu;
+
+ while (1) {
+ cpu = rto_next_cpu(rq);
+ if (cpu >= nr_cpu_ids)
+ break;
+ next_rq = cpu_rq(cpu);
+
+ /* Make sure the next rq can push to this rq */
+ if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
+ break;
+ }
+
+ return cpu;
+}
+
+#define RT_PUSH_IPI_EXECUTING 1
+#define RT_PUSH_IPI_RESTART 2
+
+static void tell_cpu_to_push(struct rq *rq)
+{
+ int cpu;
+
+ if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+ raw_spin_lock(&rq->rt.push_lock);
+ /* Make sure it's still executing */
+ if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+ /*
+ * Tell the IPI to restart the loop as things have
+ * changed since it started.
+ */
+ rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
+ raw_spin_unlock(&rq->rt.push_lock);
+ return;
+ }
+ raw_spin_unlock(&rq->rt.push_lock);
+ }
+
+ /* When here, there's no IPI going around */
+
+ rq->rt.push_cpu = rq->cpu;
+ cpu = find_next_push_cpu(rq);
+ if (cpu >= nr_cpu_ids)
+ return;
+
+ rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+
+ irq_work_queue_on(&rq->rt.push_work, cpu);
+}
+
+/* Called from hardirq context */
+static void try_to_push_tasks(void *arg)
+{
+ struct rt_rq *rt_rq = arg;
+ struct rq *rq, *src_rq;
+ int this_cpu;
+ int cpu;
+
+ this_cpu = rt_rq->push_cpu;
+
+ /* Paranoid check */
+ BUG_ON(this_cpu != smp_processor_id());
+
+ rq = cpu_rq(this_cpu);
+ src_rq = rq_of_rt_rq(rt_rq);
+
+again:
+ if (has_pushable_tasks(rq)) {
+ raw_spin_lock(&rq->lock);
+ push_rt_task(rq);
+ raw_spin_unlock(&rq->lock);
+ }
+
+ /* Pass the IPI to the next rt overloaded queue */
+ raw_spin_lock(&rt_rq->push_lock);
+ /*
+ * If the source queue changed since the IPI went out,
+ * we need to restart the search from that CPU again.
+ */
+ if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
+ rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
+ rt_rq->push_cpu = src_rq->cpu;
+ }
+
+ cpu = find_next_push_cpu(src_rq);
+
+ if (cpu >= nr_cpu_ids)
+ rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
+ raw_spin_unlock(&rt_rq->push_lock);
+
+ if (cpu >= nr_cpu_ids)
+ return;
+
+ /*
+ * It is possible that a restart caused this CPU to be
+ * chosen again. Don't bother with an IPI, just see if we
+ * have more to push.
+ */
+ if (unlikely(cpu == rq->cpu))
+ goto again;
+
+ /* Try the next RT overloaded CPU */
+ irq_work_queue_on(&rt_rq->push_work, cpu);
+}
+
+static void push_irq_work_func(struct irq_work *work)
+{
+ struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
+
+ try_to_push_tasks(rt_rq);
+}
+#endif /* HAVE_RT_PUSH_IPI */
+
static int pull_rt_task(struct rq *this_rq)
{
int this_cpu = this_rq->cpu, ret = 0, cpu;
@@ -1793,6 +1963,13 @@ static int pull_rt_task(struct rq *this_rq)
*/
smp_rmb();
+#ifdef HAVE_RT_PUSH_IPI
+ if (sched_feat(RT_PUSH_IPI)) {
+ tell_cpu_to_push(this_rq);
+ return 0;
+ }
+#endif
+
for_each_cpu(cpu, this_rq->rd->rto_mask) {
if (this_cpu == cpu)
continue;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc0f435a2779..c2c0d7bd5027 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
+#include <linux/irq_work.h>
#include <linux/tick.h>
#include <linux/slab.h>
@@ -418,6 +419,11 @@ static inline int rt_bandwidth_enabled(void)
return sysctl_sched_rt_runtime >= 0;
}
+/* RT IPI pull logic requires IRQ_WORK */
+#ifdef CONFIG_IRQ_WORK
+# define HAVE_RT_PUSH_IPI
+#endif
+
/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
struct rt_prio_array active;
@@ -435,7 +441,13 @@ struct rt_rq {
unsigned long rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
+#ifdef HAVE_RT_PUSH_IPI
+ int push_flags;
+ int push_cpu;
+ struct irq_work push_work;
+ raw_spinlock_t push_lock;
#endif
+#endif /* CONFIG_SMP */
int rt_queued;
int rt_throttled;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b6366f048e0caff28af5335b7af2031266e1b06b Mon Sep 17 00:00:00 2001
From: Steven Rostedt <rostedt(a)goodmis.org>
Date: Wed, 18 Mar 2015 14:49:46 -0400
Subject: [PATCH] sched/rt: Use IPI to trigger RT task push migration instead
of pulling
When debugging the latencies on a 40-core box, where we hit 300 to
500 microsecond latencies, I found there was heavy contention on the
runqueue locks.
Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.
The test that was run was the following:
cyclictest --numa -p95 -m -d0 -i100
This created a thread on each CPU, that would set its wakeup in iterations
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us and wakes up and measures
its latencies.
cyclictest is maintained at:
git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
What happened was another RT task would be scheduled on one of the CPUs
that was running our test, when the other CPU tests went to sleep and
scheduled idle. This caused the "pull" operation to execute on all
these CPUs. Each one of these saw the RT task that was overloaded on
the CPU of the test that was still running, and each one tried
to grab that task in a thundering herd way.
To grab the task, each thread would do a double rq lock grab, grabbing
its own lock as well as the rq lock of the overloaded CPU. As the sched
domains on this box were rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple effect with the
rq locks, especially since the taking was done via a double rq lock, which
means that several of the CPUs had their own rq locks held while trying
to take this rq lock. As these locks were blocked, any wakeups or load
balancing on these CPUs would also block on these locks, and the wait
time escalated.
I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task won't work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, only to find out that the
CPU we picked isn't in the task's affinity.
Instead of doing the PULL, I now have the CPUs that want the pull to
send over an IPI to the overloaded CPU, and let that CPU pick what
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.
With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.
I've created a new sched feature called RT_PUSH_IPI, which is enabled
by default.
When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
is enabled, the IPI is sent to the overloaded CPU to do a push.
To enable or disable this at run time:
# mount -t debugfs nodev /sys/kernel/debug
# echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
# echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
Update: This original patch would send an IPI to all CPUs in the RT overload
list. But that could theoretically cause the reverse issue. That is, there
could be lots of overloaded RT queues and one CPU lowers its priority. It would
then send an IPI to all the overloaded RT queues and they could then all try
to grab the rq lock of the CPU lowering its priority, and then we have the
same problem.
The latest design sends out only one IPI to the first overloaded CPU. It tries to
push any tasks that it can, and then looks for the next overloaded CPU that can
push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
tasks that have priorities greater than the source CPU are covered. In case the
source CPU lowers its priority again, a flag is set to tell the IPI traversal to
restart with the first RT overloaded CPU after the source CPU.
Parts-suggested-by: Peter Zijlstra <peterz(a)infradead.org>
Signed-off-by: Steven Rostedt <rostedt(a)goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: Joern Engel <joern(a)purestorage.com>
Cc: Clark Williams <williams(a)redhat.com>
Cc: Mike Galbraith <umgwanakikbuti(a)gmail.com>
Cc: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 90284d117fe6..91e33cd485f6 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -56,6 +56,19 @@ SCHED_FEAT(NONTASK_CAPACITY, true)
*/
SCHED_FEAT(TTWU_QUEUE, true)
+#ifdef HAVE_RT_PUSH_IPI
+/*
+ * In order to avoid a thundering herd attack of CPUs that are
+ * lowering their priorities at the same time, and there being
+ * a single CPU that has an RT task that can migrate and is waiting
+ * to run, where the other CPUs will try to take that CPUs
+ * rq lock and possibly create a large contention, sending an
+ * IPI to that CPU and let that CPU push the RT task to where
+ * it should go may be a better scenario.
+ */
+SCHED_FEAT(RT_PUSH_IPI, true)
+#endif
+
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f4d4b077eba0..ad0241561c3e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -6,6 +6,7 @@
#include "sched.h"
#include <linux/slab.h>
+#include <linux/irq_work.h>
int sched_rr_timeslice = RR_TIMESLICE;
@@ -59,6 +60,10 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
+#ifdef CONFIG_SMP
+static void push_irq_work_func(struct irq_work *work);
+#endif
+
void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
{
struct rt_prio_array *array;
@@ -78,7 +83,14 @@ void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
plist_head_init(&rt_rq->pushable_tasks);
+
+#ifdef HAVE_RT_PUSH_IPI
+ rt_rq->push_flags = 0;
+ rt_rq->push_cpu = nr_cpu_ids;
+ raw_spin_lock_init(&rt_rq->push_lock);
+ init_irq_work(&rt_rq->push_work, push_irq_work_func);
#endif
+#endif /* CONFIG_SMP */
/* We start is dequeued state, because no RT tasks are queued */
rt_rq->rt_queued = 0;
@@ -1778,6 +1790,164 @@ static void push_rt_tasks(struct rq *rq)
;
}
+#ifdef HAVE_RT_PUSH_IPI
+/*
+ * The search for the next cpu always starts at rq->cpu and ends
+ * when we reach rq->cpu again. It will never return rq->cpu.
+ * This returns the next cpu to check, or nr_cpu_ids if the loop
+ * is complete.
+ *
+ * rq->rt.push_cpu holds the last cpu returned by this function,
+ * or if this is the first instance, it must hold rq->cpu.
+ */
+static int rto_next_cpu(struct rq *rq)
+{
+ int prev_cpu = rq->rt.push_cpu;
+ int cpu;
+
+ cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
+
+ /*
+ * If the previous cpu is less than the rq's CPU, then it already
+ * passed the end of the mask, and has started from the beginning.
+ * We end if the next CPU is greater or equal to rq's CPU.
+ */
+ if (prev_cpu < rq->cpu) {
+ if (cpu >= rq->cpu)
+ return nr_cpu_ids;
+
+ } else if (cpu >= nr_cpu_ids) {
+ /*
+ * We passed the end of the mask, start at the beginning.
+ * If the result is greater or equal to the rq's CPU, then
+ * the loop is finished.
+ */
+ cpu = cpumask_first(rq->rd->rto_mask);
+ if (cpu >= rq->cpu)
+ return nr_cpu_ids;
+ }
+ rq->rt.push_cpu = cpu;
+
+ /* Return cpu to let the caller know if the loop is finished or not */
+ return cpu;
+}
+
+static int find_next_push_cpu(struct rq *rq)
+{
+ struct rq *next_rq;
+ int cpu;
+
+ while (1) {
+ cpu = rto_next_cpu(rq);
+ if (cpu >= nr_cpu_ids)
+ break;
+ next_rq = cpu_rq(cpu);
+
+ /* Make sure the next rq can push to this rq */
+ if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
+ break;
+ }
+
+ return cpu;
+}
+
+#define RT_PUSH_IPI_EXECUTING 1
+#define RT_PUSH_IPI_RESTART 2
+
+static void tell_cpu_to_push(struct rq *rq)
+{
+ int cpu;
+
+ if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+ raw_spin_lock(&rq->rt.push_lock);
+ /* Make sure it's still executing */
+ if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+ /*
+ * Tell the IPI to restart the loop as things have
+ * changed since it started.
+ */
+ rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
+ raw_spin_unlock(&rq->rt.push_lock);
+ return;
+ }
+ raw_spin_unlock(&rq->rt.push_lock);
+ }
+
+ /* When here, there's no IPI going around */
+
+ rq->rt.push_cpu = rq->cpu;
+ cpu = find_next_push_cpu(rq);
+ if (cpu >= nr_cpu_ids)
+ return;
+
+ rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+
+ irq_work_queue_on(&rq->rt.push_work, cpu);
+}
+
+/* Called from hardirq context */
+static void try_to_push_tasks(void *arg)
+{
+ struct rt_rq *rt_rq = arg;
+ struct rq *rq, *src_rq;
+ int this_cpu;
+ int cpu;
+
+ this_cpu = rt_rq->push_cpu;
+
+ /* Paranoid check */
+ BUG_ON(this_cpu != smp_processor_id());
+
+ rq = cpu_rq(this_cpu);
+ src_rq = rq_of_rt_rq(rt_rq);
+
+again:
+ if (has_pushable_tasks(rq)) {
+ raw_spin_lock(&rq->lock);
+ push_rt_task(rq);
+ raw_spin_unlock(&rq->lock);
+ }
+
+ /* Pass the IPI to the next rt overloaded queue */
+ raw_spin_lock(&rt_rq->push_lock);
+ /*
+ * If the source queue changed since the IPI went out,
+ * we need to restart the search from that CPU again.
+ */
+ if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
+ rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
+ rt_rq->push_cpu = src_rq->cpu;
+ }
+
+ cpu = find_next_push_cpu(src_rq);
+
+ if (cpu >= nr_cpu_ids)
+ rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
+ raw_spin_unlock(&rt_rq->push_lock);
+
+ if (cpu >= nr_cpu_ids)
+ return;
+
+ /*
+ * It is possible that a restart caused this CPU to be
+ * chosen again. Don't bother with an IPI, just see if we
+ * have more to push.
+ */
+ if (unlikely(cpu == rq->cpu))
+ goto again;
+
+ /* Try the next RT overloaded CPU */
+ irq_work_queue_on(&rt_rq->push_work, cpu);
+}
+
+static void push_irq_work_func(struct irq_work *work)
+{
+ struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
+
+ try_to_push_tasks(rt_rq);
+}
+#endif /* HAVE_RT_PUSH_IPI */
+
static int pull_rt_task(struct rq *this_rq)
{
int this_cpu = this_rq->cpu, ret = 0, cpu;
@@ -1793,6 +1963,13 @@ static int pull_rt_task(struct rq *this_rq)
*/
smp_rmb();
+#ifdef HAVE_RT_PUSH_IPI
+ if (sched_feat(RT_PUSH_IPI)) {
+ tell_cpu_to_push(this_rq);
+ return 0;
+ }
+#endif
+
for_each_cpu(cpu, this_rq->rd->rto_mask) {
if (this_cpu == cpu)
continue;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc0f435a2779..c2c0d7bd5027 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
+#include <linux/irq_work.h>
#include <linux/tick.h>
#include <linux/slab.h>
@@ -418,6 +419,11 @@ static inline int rt_bandwidth_enabled(void)
return sysctl_sched_rt_runtime >= 0;
}
+/* RT IPI pull logic requires IRQ_WORK */
+#ifdef CONFIG_IRQ_WORK
+# define HAVE_RT_PUSH_IPI
+#endif
+
/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
struct rt_prio_array active;
@@ -435,7 +441,13 @@ struct rt_rq {
unsigned long rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
+#ifdef HAVE_RT_PUSH_IPI
+ int push_flags;
+ int push_cpu;
+ struct irq_work push_work;
+ raw_spinlock_t push_lock;
#endif
+#endif /* CONFIG_SMP */
int rt_queued;
int rt_throttled;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b6366f048e0caff28af5335b7af2031266e1b06b Mon Sep 17 00:00:00 2001
From: Steven Rostedt <rostedt(a)goodmis.org>
Date: Wed, 18 Mar 2015 14:49:46 -0400
Subject: [PATCH] sched/rt: Use IPI to trigger RT task push migration instead
of pulling
When debugging the latencies on a 40 core box, where we hit 300 to
500 microsecond latencies, I found there was a huge contention on the
runqueue locks.
Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.
The test that was run was the following:
cyclictest --numa -p95 -m -d0 -i100
This created a thread on each CPU, that would set its wakeup in iterations
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us and wakes up and measures
its latencies.
cyclictest is maintained at:
git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
What happened was another RT task would be scheduled on one of the CPUs
that was running our test, when the other CPU tests went to sleep and
scheduled idle. This caused the "pull" operation to execute on all
these CPUs. Each one of these saw the RT task that was overloaded on
the CPU of the test that was still running, and each one tried
to grab that task in a thundering herd way.
To grab the task, each thread would do a double rq lock grab, grabbing
its own lock as well as the rq of the overloaded CPU. As the sched
domains on this box were rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple effect with the
rq locks especially since the taking was done via a double rq lock, which
means that several of the CPUs had their own rq locks held while trying
to take this rq lock. As these locks were blocked, any wakeups or load
balancing on these CPUs would also block on these locks, and the wait
time escalated.
I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task won't work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, to only find out that the
CPU we picked isn't in the task's affinity.
Instead of doing the PULL, I now have the CPUs that want the pull to
send over an IPI to the overloaded CPU, and let that CPU pick what
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.
With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.
I've created a new sched feature called RT_PUSH_IPI, which is enabled
by default.
When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
is enabled, the IPI is sent to the overloaded CPU to do a push.
To enable or disable this at run time:
# mount -t debugfs nodev /sys/kernel/debug
# echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
# echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
Update: This original patch would send an IPI to all CPUs in the RT overload
list. But that could theoretically cause the reverse issue. That is, there
could be lots of overloaded RT queues and one CPU lowers its priority. It would
then send an IPI to all the overloaded RT queues and they could then all try
to grab the rq lock of the CPU lowering its priority, and then we have the
same problem.
The latest design sends out only one IPI to the first overloaded CPU. It tries to
push any tasks that it can, and then looks for the next overloaded CPU that can
push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
tasks that have priorities greater than the source CPU are covered. In case the
source CPU lowers its priority again, a flag is set to tell the IPI traversal to
restart with the first RT overloaded CPU after the source CPU.
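The circular traversal described above can be sketched in user space. This is a minimal model, not the kernel code: the 64-bit mask, mask_next(), mask_first(), struct model_rq and model_walk() are hypothetical stand-ins for cpumask_next(), cpumask_first(), struct rq and the IPI chain, and the priority filter applied by find_next_push_cpu() is left out. It only shows the order in which overloaded CPUs are visited and that the source CPU itself is never returned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: CPUs 0..63, with the value 64 playing the role of nr_cpu_ids. */
#define MODEL_NR_CPUS 64

/* Stand-in for cpumask_next(): first set bit strictly after prev, else MODEL_NR_CPUS. */
static int mask_next(int prev, uint64_t mask)
{
	for (int cpu = prev + 1; cpu < MODEL_NR_CPUS; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return MODEL_NR_CPUS;
}

/* Stand-in for cpumask_first(). */
static int mask_first(uint64_t mask)
{
	return mask_next(-1, mask);
}

/* Only the fields the traversal needs; names mirror the patch but the type is invented. */
struct model_rq {
	int cpu;		/* the source CPU that lowered its priority */
	int push_cpu;		/* last CPU handed out by the traversal */
	uint64_t rto_mask;	/* set bits are RT overloaded CPUs */
};

/* Same control flow as rto_next_cpu() in the patch, applied to the model types. */
static int model_rto_next_cpu(struct model_rq *rq)
{
	int prev_cpu = rq->push_cpu;
	int cpu = mask_next(prev_cpu, rq->rto_mask);

	if (prev_cpu < rq->cpu) {
		/* Already wrapped around: stop once we reach the source CPU. */
		if (cpu >= rq->cpu)
			return MODEL_NR_CPUS;
	} else if (cpu >= MODEL_NR_CPUS) {
		/* Ran off the end of the mask: wrap to the beginning. */
		cpu = mask_first(rq->rto_mask);
		if (cpu >= rq->cpu)
			return MODEL_NR_CPUS;
	}
	rq->push_cpu = cpu;
	return cpu;
}

/* Record the visit order for one full pass starting at rq->cpu; returns the count. */
static int model_walk(struct model_rq *rq, int *out, int max)
{
	int n = 0;
	int cpu;

	assert(rq->cpu >= 0 && rq->cpu < MODEL_NR_CPUS);
	rq->push_cpu = rq->cpu;	/* a fresh search starts at the source CPU */
	while (n < max && (cpu = model_rto_next_cpu(rq)) < MODEL_NR_CPUS)
		out[n++] = cpu;
	return n;
}
```

With the source on CPU 3 and CPUs 1, 3, 5 and 6 marked overloaded, model_walk() visits 5, 6 and then 1, and stops before reaching CPU 3 again.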
Parts-suggested-by: Peter Zijlstra <peterz(a)infradead.org>
Signed-off-by: Steven Rostedt <rostedt(a)goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: Joern Engel <joern(a)purestorage.com>
Cc: Clark Williams <williams(a)redhat.com>
Cc: Mike Galbraith <umgwanakikbuti(a)gmail.com>
Cc: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 90284d117fe6..91e33cd485f6 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -56,6 +56,19 @@ SCHED_FEAT(NONTASK_CAPACITY, true)
*/
SCHED_FEAT(TTWU_QUEUE, true)
+#ifdef HAVE_RT_PUSH_IPI
+/*
+ * In order to avoid a thundering herd attack of CPUs that are
+ * lowering their priorities at the same time, and there being
+ * a single CPU that has an RT task that can migrate and is waiting
+ * to run, where the other CPUs will try to take that CPUs
+ * rq lock and possibly create a large contention, sending an
+ * IPI to that CPU and let that CPU push the RT task to where
+ * it should go may be a better scenario.
+ */
+SCHED_FEAT(RT_PUSH_IPI, true)
+#endif
+
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f4d4b077eba0..ad0241561c3e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -6,6 +6,7 @@
#include "sched.h"
#include <linux/slab.h>
+#include <linux/irq_work.h>
int sched_rr_timeslice = RR_TIMESLICE;
@@ -59,6 +60,10 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
+#ifdef CONFIG_SMP
+static void push_irq_work_func(struct irq_work *work);
+#endif
+
void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
{
struct rt_prio_array *array;
@@ -78,7 +83,14 @@ void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
plist_head_init(&rt_rq->pushable_tasks);
+
+#ifdef HAVE_RT_PUSH_IPI
+ rt_rq->push_flags = 0;
+ rt_rq->push_cpu = nr_cpu_ids;
+ raw_spin_lock_init(&rt_rq->push_lock);
+ init_irq_work(&rt_rq->push_work, push_irq_work_func);
#endif
+#endif /* CONFIG_SMP */
/* We start is dequeued state, because no RT tasks are queued */
rt_rq->rt_queued = 0;
@@ -1778,6 +1790,164 @@ static void push_rt_tasks(struct rq *rq)
;
}
+#ifdef HAVE_RT_PUSH_IPI
+/*
+ * The search for the next cpu always starts at rq->cpu and ends
+ * when we reach rq->cpu again. It will never return rq->cpu.
+ * This returns the next cpu to check, or nr_cpu_ids if the loop
+ * is complete.
+ *
+ * rq->rt.push_cpu holds the last cpu returned by this function,
+ * or if this is the first instance, it must hold rq->cpu.
+ */
+static int rto_next_cpu(struct rq *rq)
+{
+ int prev_cpu = rq->rt.push_cpu;
+ int cpu;
+
+ cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
+
+ /*
+ * If the previous cpu is less than the rq's CPU, then it already
+ * passed the end of the mask, and has started from the beginning.
+ * We end if the next CPU is greater or equal to rq's CPU.
+ */
+ if (prev_cpu < rq->cpu) {
+ if (cpu >= rq->cpu)
+ return nr_cpu_ids;
+
+ } else if (cpu >= nr_cpu_ids) {
+ /*
+ * We passed the end of the mask, start at the beginning.
+ * If the result is greater or equal to the rq's CPU, then
+ * the loop is finished.
+ */
+ cpu = cpumask_first(rq->rd->rto_mask);
+ if (cpu >= rq->cpu)
+ return nr_cpu_ids;
+ }
+ rq->rt.push_cpu = cpu;
+
+ /* Return cpu to let the caller know if the loop is finished or not */
+ return cpu;
+}
+
+static int find_next_push_cpu(struct rq *rq)
+{
+ struct rq *next_rq;
+ int cpu;
+
+ while (1) {
+ cpu = rto_next_cpu(rq);
+ if (cpu >= nr_cpu_ids)
+ break;
+ next_rq = cpu_rq(cpu);
+
+ /* Make sure the next rq can push to this rq */
+ if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
+ break;
+ }
+
+ return cpu;
+}
+
+#define RT_PUSH_IPI_EXECUTING 1
+#define RT_PUSH_IPI_RESTART 2
+
+static void tell_cpu_to_push(struct rq *rq)
+{
+ int cpu;
+
+ if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+ raw_spin_lock(&rq->rt.push_lock);
+ /* Make sure it's still executing */
+ if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+ /*
+ * Tell the IPI to restart the loop as things have
+ * changed since it started.
+ */
+ rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
+ raw_spin_unlock(&rq->rt.push_lock);
+ return;
+ }
+ raw_spin_unlock(&rq->rt.push_lock);
+ }
+
+ /* When here, there's no IPI going around */
+
+ rq->rt.push_cpu = rq->cpu;
+ cpu = find_next_push_cpu(rq);
+ if (cpu >= nr_cpu_ids)
+ return;
+
+ rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+
+ irq_work_queue_on(&rq->rt.push_work, cpu);
+}
+
+/* Called from hardirq context */
+static void try_to_push_tasks(void *arg)
+{
+ struct rt_rq *rt_rq = arg;
+ struct rq *rq, *src_rq;
+ int this_cpu;
+ int cpu;
+
+ this_cpu = rt_rq->push_cpu;
+
+ /* Paranoid check */
+ BUG_ON(this_cpu != smp_processor_id());
+
+ rq = cpu_rq(this_cpu);
+ src_rq = rq_of_rt_rq(rt_rq);
+
+again:
+ if (has_pushable_tasks(rq)) {
+ raw_spin_lock(&rq->lock);
+ push_rt_task(rq);
+ raw_spin_unlock(&rq->lock);
+ }
+
+ /* Pass the IPI to the next rt overloaded queue */
+ raw_spin_lock(&rt_rq->push_lock);
+ /*
+ * If the source queue changed since the IPI went out,
+ * we need to restart the search from that CPU again.
+ */
+ if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
+ rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
+ rt_rq->push_cpu = src_rq->cpu;
+ }
+
+ cpu = find_next_push_cpu(src_rq);
+
+ if (cpu >= nr_cpu_ids)
+ rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
+ raw_spin_unlock(&rt_rq->push_lock);
+
+ if (cpu >= nr_cpu_ids)
+ return;
+
+ /*
+ * It is possible that a restart caused this CPU to be
+ * chosen again. Don't bother with an IPI, just see if we
+ * have more to push.
+ */
+ if (unlikely(cpu == rq->cpu))
+ goto again;
+
+ /* Try the next RT overloaded CPU */
+ irq_work_queue_on(&rt_rq->push_work, cpu);
+}
+
+static void push_irq_work_func(struct irq_work *work)
+{
+ struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
+
+ try_to_push_tasks(rt_rq);
+}
+#endif /* HAVE_RT_PUSH_IPI */
+
static int pull_rt_task(struct rq *this_rq)
{
int this_cpu = this_rq->cpu, ret = 0, cpu;
@@ -1793,6 +1963,13 @@ static int pull_rt_task(struct rq *this_rq)
*/
smp_rmb();
+#ifdef HAVE_RT_PUSH_IPI
+ if (sched_feat(RT_PUSH_IPI)) {
+ tell_cpu_to_push(this_rq);
+ return 0;
+ }
+#endif
+
for_each_cpu(cpu, this_rq->rd->rto_mask) {
if (this_cpu == cpu)
continue;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc0f435a2779..c2c0d7bd5027 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
+#include <linux/irq_work.h>
#include <linux/tick.h>
#include <linux/slab.h>
@@ -418,6 +419,11 @@ static inline int rt_bandwidth_enabled(void)
return sysctl_sched_rt_runtime >= 0;
}
+/* RT IPI pull logic requires IRQ_WORK */
+#ifdef CONFIG_IRQ_WORK
+# define HAVE_RT_PUSH_IPI
+#endif
+
/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
struct rt_prio_array active;
@@ -435,7 +441,13 @@ struct rt_rq {
unsigned long rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
+#ifdef HAVE_RT_PUSH_IPI
+ int push_flags;
+ int push_cpu;
+ struct irq_work push_work;
+ raw_spinlock_t push_lock;
#endif
+#endif /* CONFIG_SMP */
int rt_queued;
int rt_throttled;
This is a note to let you know that I've just added the patch titled
vsock: use new wait API for vsock_stream_sendmsg()
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
vsock-use-new-wait-api-for-vsock_stream_sendmsg.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 499fde662f1957e3cb8d192a94a099ebe19c714b Mon Sep 17 00:00:00 2001
From: WANG Cong <xiyou.wangcong(a)gmail.com>
Date: Fri, 19 May 2017 11:21:59 -0700
Subject: vsock: use new wait API for vsock_stream_sendmsg()
From: WANG Cong <xiyou.wangcong(a)gmail.com>
commit 499fde662f1957e3cb8d192a94a099ebe19c714b upstream.
As reported by Michal, vsock_stream_sendmsg() could still
sleep at vsock_stream_has_space() after prepare_to_wait():
vsock_stream_has_space
vmci_transport_stream_has_space
vmci_qpair_produce_free_space
qp_lock
qp_acquire_queue_mutex
mutex_lock
Just switch to the new wait API like we did for commit
d9dc8b0f8b4e ("net: fix sleeping for sk_wait_event()").
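As a rough illustration of why the new wait API tolerates a sleeping condition check, here is a user-space model of the "woken" flag. It is a sketch under assumptions: struct model_wait, model_wake() and model_wait_woken() are invented names, and the model ignores task states, timeouts, signals and memory ordering. The idea it shows is that a wakeup is recorded in the wait entry, so the waiter can evaluate its condition while running normally and still not lose a wakeup that arrived in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a wait entry that remembers a pending wakeup. */
struct model_wait {
	bool woken;	/* set by the waker, consumed by the waiter */
	int sleeps;	/* how many times the waiter would really have slept */
};

/* Models the wake callback: record that a wakeup happened. */
static void model_wake(struct model_wait *w)
{
	assert(w != NULL);
	w->woken = true;
}

/*
 * Models the wait step: sleep only if no wakeup was recorded since the
 * previous call, then clear the flag. Returns true if it would have slept.
 */
static bool model_wait_woken(struct model_wait *w)
{
	bool slept = false;

	assert(w != NULL);
	if (!w->woken) {
		w->sleeps++;
		slept = true;
	}
	w->woken = false;
	return slept;
}
```

In this model, a model_wake() that lands while the waiter is still busy checking its condition makes the next model_wait_woken() return immediately instead of sleeping.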
Reported-by: Michal Kubecek <mkubecek(a)suse.cz>
Cc: Stefan Hajnoczi <stefanha(a)redhat.com>
Cc: Jorgen Hansen <jhansen(a)vmware.com>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Claudio Imbrenda <imbrenda(a)linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong(a)gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha(a)redhat.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Cc: "Jorgen S. Hansen" <jhansen(a)vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/vmw_vsock/af_vsock.c | 21 ++++++++-------------
1 file changed, 8 insertions(+), 13 deletions(-)
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1524,8 +1524,7 @@ static int vsock_stream_sendmsg(struct s
long timeout;
int err;
struct vsock_transport_send_notify_data send_data;
-
- DEFINE_WAIT(wait);
+ DEFINE_WAIT_FUNC(wait, woken_wake_function);
sk = sock->sk;
vsk = vsock_sk(sk);
@@ -1568,11 +1567,10 @@ static int vsock_stream_sendmsg(struct s
if (err < 0)
goto out;
-
while (total_written < len) {
ssize_t written;
- prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+ add_wait_queue(sk_sleep(sk), &wait);
while (vsock_stream_has_space(vsk) == 0 &&
sk->sk_err == 0 &&
!(sk->sk_shutdown & SEND_SHUTDOWN) &&
@@ -1581,33 +1579,30 @@ static int vsock_stream_sendmsg(struct s
/* Don't wait for non-blocking sockets. */
if (timeout == 0) {
err = -EAGAIN;
- finish_wait(sk_sleep(sk), &wait);
+ remove_wait_queue(sk_sleep(sk), &wait);
goto out_err;
}
err = transport->notify_send_pre_block(vsk, &send_data);
if (err < 0) {
- finish_wait(sk_sleep(sk), &wait);
+ remove_wait_queue(sk_sleep(sk), &wait);
goto out_err;
}
release_sock(sk);
- timeout = schedule_timeout(timeout);
+ timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
lock_sock(sk);
if (signal_pending(current)) {
err = sock_intr_errno(timeout);
- finish_wait(sk_sleep(sk), &wait);
+ remove_wait_queue(sk_sleep(sk), &wait);
goto out_err;
} else if (timeout == 0) {
err = -EAGAIN;
- finish_wait(sk_sleep(sk), &wait);
+ remove_wait_queue(sk_sleep(sk), &wait);
goto out_err;
}
-
- prepare_to_wait(sk_sleep(sk), &wait,
- TASK_INTERRUPTIBLE);
}
- finish_wait(sk_sleep(sk), &wait);
+ remove_wait_queue(sk_sleep(sk), &wait);
/* These checks occur both as part of and after the loop
* conditional since we need to check before and after
Patches currently in stable-queue which might be from xiyou.wangcong(a)gmail.com are
queue-4.9/vsock-use-new-wait-api-for-vsock_stream_sendmsg.patch
queue-4.9/ipv6-only-call-ip6_route_dev_notify-once-for-netdev_unregister.patch
This is a note to let you know that I've just added the patch titled
ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipv6-only-call-ip6_route_dev_notify-once-for-netdev_unregister.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 76da0704507bbc51875013f6557877ab308cfd0a Mon Sep 17 00:00:00 2001
From: WANG Cong <xiyou.wangcong(a)gmail.com>
Date: Tue, 20 Jun 2017 11:42:27 -0700
Subject: ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
From: WANG Cong <xiyou.wangcong(a)gmail.com>
commit 76da0704507bbc51875013f6557877ab308cfd0a upstream.
In commit 242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf")
I assumed NETDEV_REGISTER and NETDEV_UNREGISTER are paired,
unfortunately, as reported by jeffy, netdev_wait_allrefs()
could rebroadcast NETDEV_UNREGISTER event until all refs are
gone.
We have to add an additional check to avoid this corner case.
For netdev_wait_allrefs() dev->reg_state is NETREG_UNREGISTERED,
for dev_change_net_namespace(), dev->reg_state is
NETREG_REGISTERED. So check for dev->reg_state != NETREG_UNREGISTERED.
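The effect of the extra check can be sketched with a small user-space model. Everything prefixed model_ or MODEL_ is hypothetical and only mirrors the names in the description; the single idev_refs counter stands in for the references taken with in6_dev_get() and dropped with in6_dev_put().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the notifier events and dev->reg_state values. */
enum model_event { MODEL_NETDEV_REGISTER, MODEL_NETDEV_UNREGISTER };
enum model_reg_state { MODEL_NETREG_REGISTERED, MODEL_NETREG_UNREGISTERED };

struct model_dev {
	enum model_reg_state reg_state;
	int idev_refs;	/* references held by the route entries in this model */
};

/* Models the patched notifier: drop the reference only on the first unregister. */
static void model_route_dev_notify(struct model_dev *dev, enum model_event event)
{
	assert(dev != NULL);
	if (event == MODEL_NETDEV_REGISTER) {
		dev->idev_refs++;
	} else if (event == MODEL_NETDEV_UNREGISTER &&
		   dev->reg_state != MODEL_NETREG_UNREGISTERED) {
		/* Rebroadcast events arrive with the state already UNREGISTERED. */
		dev->idev_refs--;
	}
}
```

Without the reg_state test, each rebroadcast in this model would decrement idev_refs again and drive it negative.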
Fixes: 242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf")
Reported-by: jeffy <jeffy.chen(a)rock-chips.com>
Cc: David Ahern <dsahern(a)gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong(a)gmail.com>
Acked-by: David Ahern <dsahern(a)gmail.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv6/route.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3495,7 +3495,11 @@ static int ip6_route_dev_notify(struct n
net->ipv6.ip6_blk_hole_entry->dst.dev = dev;
net->ipv6.ip6_blk_hole_entry->rt6i_idev = in6_dev_get(dev);
#endif
- } else if (event == NETDEV_UNREGISTER) {
+ } else if (event == NETDEV_UNREGISTER &&
+ dev->reg_state != NETREG_UNREGISTERED) {
+ /* NETDEV_UNREGISTER could be fired for multiple times by
+ * netdev_wait_allrefs(). Make sure we only call this once.
+ */
in6_dev_put(net->ipv6.ip6_null_entry->rt6i_idev);
#ifdef CONFIG_IPV6_MULTIPLE_TABLES
in6_dev_put(net->ipv6.ip6_prohibit_entry->rt6i_idev);
Patches currently in stable-queue which might be from xiyou.wangcong(a)gmail.com are
queue-4.9/vsock-use-new-wait-api-for-vsock_stream_sendmsg.patch
queue-4.9/ipv6-only-call-ip6_route_dev_notify-once-for-netdev_unregister.patch
This is a note to let you know that I've just added the patch titled
vsock: use new wait API for vsock_stream_sendmsg()
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
vsock-use-new-wait-api-for-vsock_stream_sendmsg.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 499fde662f1957e3cb8d192a94a099ebe19c714b Mon Sep 17 00:00:00 2001
From: WANG Cong <xiyou.wangcong(a)gmail.com>
Date: Fri, 19 May 2017 11:21:59 -0700
Subject: vsock: use new wait API for vsock_stream_sendmsg()
From: WANG Cong <xiyou.wangcong(a)gmail.com>
commit 499fde662f1957e3cb8d192a94a099ebe19c714b upstream.
As reported by Michal, vsock_stream_sendmsg() could still
sleep at vsock_stream_has_space() after prepare_to_wait():
vsock_stream_has_space
vmci_transport_stream_has_space
vmci_qpair_produce_free_space
qp_lock
qp_acquire_queue_mutex
mutex_lock
Just switch to the new wait API like we did for commit
d9dc8b0f8b4e ("net: fix sleeping for sk_wait_event()").
Reported-by: Michal Kubecek <mkubecek(a)suse.cz>
Cc: Stefan Hajnoczi <stefanha(a)redhat.com>
Cc: Jorgen Hansen <jhansen(a)vmware.com>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Claudio Imbrenda <imbrenda(a)linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong(a)gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha(a)redhat.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Cc: "Jorgen S. Hansen" <jhansen(a)vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/vmw_vsock/af_vsock.c | 21 ++++++++-------------
1 file changed, 8 insertions(+), 13 deletions(-)
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1512,8 +1512,7 @@ static int vsock_stream_sendmsg(struct s
long timeout;
int err;
struct vsock_transport_send_notify_data send_data;
-
- DEFINE_WAIT(wait);
+ DEFINE_WAIT_FUNC(wait, woken_wake_function);
sk = sock->sk;
vsk = vsock_sk(sk);
@@ -1556,11 +1555,10 @@ static int vsock_stream_sendmsg(struct s
if (err < 0)
goto out;
-
while (total_written < len) {
ssize_t written;
- prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+ add_wait_queue(sk_sleep(sk), &wait);
while (vsock_stream_has_space(vsk) == 0 &&
sk->sk_err == 0 &&
!(sk->sk_shutdown & SEND_SHUTDOWN) &&
@@ -1569,33 +1567,30 @@ static int vsock_stream_sendmsg(struct s
/* Don't wait for non-blocking sockets. */
if (timeout == 0) {
err = -EAGAIN;
- finish_wait(sk_sleep(sk), &wait);
+ remove_wait_queue(sk_sleep(sk), &wait);
goto out_err;
}
err = transport->notify_send_pre_block(vsk, &send_data);
if (err < 0) {
- finish_wait(sk_sleep(sk), &wait);
+ remove_wait_queue(sk_sleep(sk), &wait);
goto out_err;
}
release_sock(sk);
- timeout = schedule_timeout(timeout);
+ timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
lock_sock(sk);
if (signal_pending(current)) {
err = sock_intr_errno(timeout);
- finish_wait(sk_sleep(sk), &wait);
+ remove_wait_queue(sk_sleep(sk), &wait);
goto out_err;
} else if (timeout == 0) {
err = -EAGAIN;
- finish_wait(sk_sleep(sk), &wait);
+ remove_wait_queue(sk_sleep(sk), &wait);
goto out_err;
}
-
- prepare_to_wait(sk_sleep(sk), &wait,
- TASK_INTERRUPTIBLE);
}
- finish_wait(sk_sleep(sk), &wait);
+ remove_wait_queue(sk_sleep(sk), &wait);
/* These checks occur both as part of and after the loop
* conditional since we need to check before and after
Patches currently in stable-queue which might be from xiyou.wangcong(a)gmail.com are
queue-4.4/vsock-use-new-wait-api-for-vsock_stream_sendmsg.patch
queue-4.4/ipv6-only-call-ip6_route_dev_notify-once-for-netdev_unregister.patch
This is a note to let you know that I've just added the patch titled
ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipv6-only-call-ip6_route_dev_notify-once-for-netdev_unregister.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 76da0704507bbc51875013f6557877ab308cfd0a Mon Sep 17 00:00:00 2001
From: WANG Cong <xiyou.wangcong(a)gmail.com>
Date: Tue, 20 Jun 2017 11:42:27 -0700
Subject: ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
From: WANG Cong <xiyou.wangcong(a)gmail.com>
commit 76da0704507bbc51875013f6557877ab308cfd0a upstream.
In commit 242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf")
I assumed NETDEV_REGISTER and NETDEV_UNREGISTER are paired,
unfortunately, as reported by jeffy, netdev_wait_allrefs()
could rebroadcast NETDEV_UNREGISTER event until all refs are
gone.
We have to add an additional check to avoid this corner case.
For netdev_wait_allrefs() dev->reg_state is NETREG_UNREGISTERED,
for dev_change_net_namespace(), dev->reg_state is
NETREG_REGISTERED. So check for dev->reg_state != NETREG_UNREGISTERED.
Fixes: 242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf")
Reported-by: jeffy <jeffy.chen(a)rock-chips.com>
Cc: David Ahern <dsahern(a)gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong(a)gmail.com>
Acked-by: David Ahern <dsahern(a)gmail.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv6/route.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3378,7 +3378,11 @@ static int ip6_route_dev_notify(struct n
net->ipv6.ip6_blk_hole_entry->dst.dev = dev;
net->ipv6.ip6_blk_hole_entry->rt6i_idev = in6_dev_get(dev);
#endif
- } else if (event == NETDEV_UNREGISTER) {
+ } else if (event == NETDEV_UNREGISTER &&
+ dev->reg_state != NETREG_UNREGISTERED) {
+ /* NETDEV_UNREGISTER could be fired for multiple times by
+ * netdev_wait_allrefs(). Make sure we only call this once.
+ */
in6_dev_put(net->ipv6.ip6_null_entry->rt6i_idev);
#ifdef CONFIG_IPV6_MULTIPLE_TABLES
in6_dev_put(net->ipv6.ip6_prohibit_entry->rt6i_idev);
Patches currently in stable-queue which might be from xiyou.wangcong(a)gmail.com are
queue-4.4/vsock-use-new-wait-api-for-vsock_stream_sendmsg.patch
queue-4.4/ipv6-only-call-ip6_route_dev_notify-once-for-netdev_unregister.patch
This is a note to let you know that I've just added the patch titled
ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipv6-only-call-ip6_route_dev_notify-once-for-netdev_unregister.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 76da0704507bbc51875013f6557877ab308cfd0a Mon Sep 17 00:00:00 2001
From: WANG Cong <xiyou.wangcong(a)gmail.com>
Date: Tue, 20 Jun 2017 11:42:27 -0700
Subject: ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
From: WANG Cong <xiyou.wangcong(a)gmail.com>
commit 76da0704507bbc51875013f6557877ab308cfd0a upstream.
In commit 242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf")
I assumed NETDEV_REGISTER and NETDEV_UNREGISTER are paired,
unfortunately, as reported by jeffy, netdev_wait_allrefs()
could rebroadcast NETDEV_UNREGISTER event until all refs are
gone.
We have to add an additional check to avoid this corner case.
For netdev_wait_allrefs() dev->reg_state is NETREG_UNREGISTERED,
for dev_change_net_namespace(), dev->reg_state is
NETREG_REGISTERED. So check for dev->reg_state != NETREG_UNREGISTERED.
Fixes: 242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf")
Reported-by: jeffy <jeffy.chen(a)rock-chips.com>
Cc: David Ahern <dsahern(a)gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong(a)gmail.com>
Acked-by: David Ahern <dsahern(a)gmail.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv6/route.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2827,7 +2827,11 @@ static int ip6_route_dev_notify(struct n
net->ipv6.ip6_blk_hole_entry->dst.dev = dev;
net->ipv6.ip6_blk_hole_entry->rt6i_idev = in6_dev_get(dev);
#endif
- } else if (event == NETDEV_UNREGISTER) {
+ } else if (event == NETDEV_UNREGISTER &&
+ dev->reg_state != NETREG_UNREGISTERED) {
+ /* NETDEV_UNREGISTER could be fired for multiple times by
+ * netdev_wait_allrefs(). Make sure we only call this once.
+ */
in6_dev_put(net->ipv6.ip6_null_entry->rt6i_idev);
#ifdef CONFIG_IPV6_MULTIPLE_TABLES
in6_dev_put(net->ipv6.ip6_prohibit_entry->rt6i_idev);
Patches currently in stable-queue which might be from xiyou.wangcong(a)gmail.com are
queue-3.18/ipv6-only-call-ip6_route_dev_notify-once-for-netdev_unregister.patch
Hi,
Customers running VMware Workstation on 4.4 kernels have been hitting a couple of bugs where net/vmw_vsock/af_vsock.c is calling functions that may sleep when not allowed to. These issues have already been fixed in later kernels, and we would like to request these fixes backported to 4.4 and 4.9 LTS branches.
To resolve the above issue, we would like the following commit backported to 4.4 (it applies when using 3-way merge):
commit 265563fc8f123641d006d65f26d18d0b24d3022d "AF_VSOCK: Shrink the area influenced by prepare_to_wait"
and the following commit (that needs to be applied on top of the above) backported to both 4.4 and 4.9:
commit 359669f3b8eab6cbfb83eb1a46ec6ba089d47d18 "vsock: use new wait API for vsock_stream_sendmsg()"
The backports have been tested with kernel 4.4.100 and 4.9.64.
Thanks,
Jorgen
At least 3.18, 4.4 and 4.9 already have
242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf")
which has a bug that is fixed in:
76da0704507b ("ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER")
This is a note to let you know that I've just added the patch titled
ACPI / APEI: Remove arch_apei_flush_tlb_one()
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
acpi-apei-remove-arch_apei_flush_tlb_one.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 4a75aeacda3c2455954596593d89187df5420d0a Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:27 +0000
Subject: ACPI / APEI: Remove arch_apei_flush_tlb_one()
From: James Morse <james.morse(a)arm.com>
commit 4a75aeacda3c2455954596593d89187df5420d0a upstream.
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_pte_vaddr() to do the invalidation when called from clear_fixmap().
Remove arch_apei_flush_tlb_one().
Signed-off-by: James Morse <james.morse(a)arm.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kernel/acpi/apei.c | 5 -----
include/acpi/apei.h | 1 -
2 files changed, 6 deletions(-)
--- a/arch/x86/kernel/acpi/apei.c
+++ b/arch/x86/kernel/acpi/apei.c
@@ -52,8 +52,3 @@ void arch_apei_report_mem_error(int sev,
apei_mce_report_mem_error(sev, mem_err);
#endif
}
-
-void arch_apei_flush_tlb_one(unsigned long addr)
-{
- __flush_tlb_one(addr);
-}
--- a/include/acpi/apei.h
+++ b/include/acpi/apei.h
@@ -51,7 +51,6 @@ int erst_clear(u64 record_id);
int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data);
void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err);
-void arch_apei_flush_tlb_one(unsigned long addr);
#endif
#endif
Patches currently in stable-queue which might be from james.morse(a)arm.com are
queue-4.14/acpi-apei-remove-arch_apei_flush_tlb_one.patch
This is a note to let you know that I've just added the patch titled
ACPI / APEI: Remove arch_apei_flush_tlb_one()
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
acpi-apei-remove-arch_apei_flush_tlb_one.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 4a75aeacda3c2455954596593d89187df5420d0a Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:27 +0000
Subject: ACPI / APEI: Remove arch_apei_flush_tlb_one()
From: James Morse <james.morse(a)arm.com>
commit 4a75aeacda3c2455954596593d89187df5420d0a upstream.
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_pte_vaddr() to do the invalidation when called from clear_fixmap().
Remove arch_apei_flush_tlb_one().
Signed-off-by: James Morse <james.morse(a)arm.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kernel/acpi/apei.c | 5 -----
include/acpi/apei.h | 1 -
2 files changed, 6 deletions(-)
--- a/arch/x86/kernel/acpi/apei.c
+++ b/arch/x86/kernel/acpi/apei.c
@@ -55,8 +55,3 @@ void arch_apei_report_mem_error(int sev,
apei_mce_report_mem_error(sev, mem_err);
#endif
}
-
-void arch_apei_flush_tlb_one(unsigned long addr)
-{
- __flush_tlb_one(addr);
-}
--- a/include/acpi/apei.h
+++ b/include/acpi/apei.h
@@ -44,7 +44,6 @@ int erst_clear(u64 record_id);
int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data);
void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err);
-void arch_apei_flush_tlb_one(unsigned long addr);
#endif
#endif
Patches currently in stable-queue which might be from james.morse(a)arm.com are
queue-4.4/acpi-apei-remove-arch_apei_flush_tlb_one.patch
On Mon, Nov 27, 2017 at 10:39:25AM +0100, Tomas Charvat wrote:
> Ok under heavy load patch
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?i…
>
> cause following error once in few minutes, it doesn't happen instantly
This patch simply reverts to the old behaviour; this bug does not seem
to be related to the revert.
Is the commit that you are referring to the head commit of your branch,
or did you pick this one onto some other branch?
Can you give some details on your setup and send your .config?
snd_usb_copy_string_desc() retrieves the USB string corresponding to
the index number through usb_string(). The problem is that
usb_string() returns the length of the string (>= 0) on success, but
it can also return a negative value on error, including the status of
usb_control_msg().
If iClockSource is '0' as shown below, usb_string() will return -EINVAL.
This will result in '0' being written to buf[-22], and the following
KASAN out-of-bounds error message will be output.
AudioControl Interface Descriptor:
bLength 8
bDescriptorType 36
bDescriptorSubtype 10 (CLOCK_SOURCE)
bClockID 1
bmAttributes 0x07 Internal programmable Clock (synced to SOF)
bmControls 0x07
Clock Frequency Control (read/write)
Clock Validity Control (read-only)
bAssocTerminal 0
iClockSource 0
To fix it, check usb_string() return value and bail out.
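As a rough userspace illustration of the guard described above (not the kernel code itself): fake_usb_string() is a hypothetical stand-in for usb_string() that returns either a length or a negative errno, and copy_string_desc() only mirrors the shape of the fixed helper.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for usb_string(): index 0 means "no string" and
 * yields -EINVAL; any other index copies a fixed name into buf. */
static int fake_usb_string(int index, char *buf, int size)
{
	static const char name[] = "Clock";
	int len = (int)sizeof(name) - 1;

	if (index == 0)
		return -EINVAL;
	if (len > size)
		len = size;
	memcpy(buf, name, (size_t)len);
	return len;
}

/* Same shape as the fixed helper: bail out before indexing buf with a
 * negative length. */
static int copy_string_desc(int index, char *buf, int maxlen)
{
	int len;

	assert(maxlen > 0);
	len = fake_usb_string(index, buf, maxlen - 1);
	if (len < 0)
		return len;	/* without this check, buf[len] is a write before buf */
	buf[len] = 0;
	return len;
}
```

With index 0 the helper returns the error untouched and never writes to the buffer; with a valid index it NUL-terminates as before.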
==================================================================
BUG: KASAN: stack-out-of-bounds in parse_audio_unit+0x1327/0x1960 [snd_usb_audio]
Write of size 1 at addr ffff88007e66735a by task systemd-udevd/18376
CPU: 0 PID: 18376 Comm: systemd-udevd Not tainted 4.13.0+ #3
Hardware name: LG Electronics 15N540-RFLGL/White Tip Mountain, BIOS 15N5
Call Trace:
dump_stack+0x63/0x8d
print_address_description+0x70/0x290
? parse_audio_unit+0x1327/0x1960 [snd_usb_audio]
kasan_report+0x265/0x350
__asan_store1+0x4a/0x50
parse_audio_unit+0x1327/0x1960 [snd_usb_audio]
? save_stack+0xb5/0xd0
? save_stack_trace+0x1b/0x20
? save_stack+0x46/0xd0
? kasan_kmalloc+0xad/0xe0
? kmem_cache_alloc_trace+0xff/0x230
? snd_usb_create_mixer+0xb0/0x4b0 [snd_usb_audio]
? usb_audio_probe+0x4de/0xf40 [snd_usb_audio]
? usb_probe_interface+0x1f5/0x440
? driver_probe_device+0x3ed/0x660
? build_feature_ctl+0xb10/0xb10 [snd_usb_audio]
? save_stack_trace+0x1b/0x20
? init_object+0x69/0xa0
? snd_usb_find_csint_desc+0xa8/0xf0 [snd_usb_audio]
snd_usb_mixer_controls+0x1dc/0x370 [snd_usb_audio]
? build_audio_procunit+0x890/0x890 [snd_usb_audio]
? snd_usb_create_mixer+0xb0/0x4b0 [snd_usb_audio]
? kmem_cache_alloc_trace+0xff/0x230
? usb_ifnum_to_if+0xbd/0xf0
snd_usb_create_mixer+0x25b/0x4b0 [snd_usb_audio]
? snd_usb_create_stream+0x255/0x2c0 [snd_usb_audio]
usb_audio_probe+0x4de/0xf40 [snd_usb_audio]
? snd_usb_autosuspend.part.7+0x30/0x30 [snd_usb_audio]
? __pm_runtime_idle+0x90/0x90
? kernfs_activate+0xa6/0xc0
? usb_match_one_id_intf+0xdc/0x130
? __pm_runtime_set_status+0x2d4/0x450
usb_probe_interface+0x1f5/0x440
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Jaejoong Kim <climbbb.kim(a)gmail.com>
---
sound/usb/mixer.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/sound/usb/mixer.c b/sound/usb/mixer.c
index e630813..da7cbe7 100644
--- a/sound/usb/mixer.c
+++ b/sound/usb/mixer.c
@@ -204,6 +204,10 @@ static int snd_usb_copy_string_desc(struct mixer_build *state,
int index, char *buf, int maxlen)
{
int len = usb_string(state->chip->dev, index, buf, maxlen - 1);
+
+ if (len < 0)
+ return len;
+
buf[len] = 0;
return len;
}
--
2.7.4
Commit cb0631fd3cf9 ("x86/mm: fix use-after-free of vma during userfaultfd
fault") went into mainline without Cc: stable. It appears to be a
use-after-free reachable by unprivileged users -- at least with
CONFIG_USERFAULTFD=y. Can it please be applied to 4.9-stable?
Eric
This is a note to let you know that I've just added the patch titled
x86/mm: fix use-after-free of vma during userfaultfd fault
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-mm-fix-use-after-free-of-vma-during-userfaultfd-fault.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From cb0631fd3cf9e989cd48293fe631cbc402aec9a9 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka(a)suse.cz>
Date: Wed, 1 Nov 2017 08:21:25 +0100
Subject: x86/mm: fix use-after-free of vma during userfaultfd fault
From: Vlastimil Babka <vbabka(a)suse.cz>
commit cb0631fd3cf9e989cd48293fe631cbc402aec9a9 upstream.
Syzkaller with KASAN has reported a use-after-free of vma->vm_flags in
__do_page_fault() with the following reproducer:
mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000011000/0x3000)=nil, 0x3000, 0x1, 0x32, 0xffffffffffffffff, 0x0)
r0 = userfaultfd(0x0)
ioctl$UFFDIO_API(r0, 0xc018aa3f, &(0x7f0000002000-0x18)={0xaa, 0x0, 0x0})
ioctl$UFFDIO_REGISTER(r0, 0xc020aa00, &(0x7f0000019000)={{&(0x7f0000012000/0x2000)=nil, 0x2000}, 0x1, 0x0})
r1 = gettid()
syz_open_dev$evdev(&(0x7f0000013000-0x12)="2f6465762f696e7075742f6576656e742300", 0x0, 0x0)
tkill(r1, 0x7)
The vma should be pinned by mmap_sem, but handle_userfault() might (in a
return to userspace scenario) release it and then acquire again, so when
we return to __do_page_fault() (with other result than VM_FAULT_RETRY),
the vma might be gone.
Specifically, per Andrea the scenario is
"A return to userland to repeat the page fault later with a
VM_FAULT_NOPAGE retval (potentially after handling any pending signal
during the return to userland). The return to userland is identified
whenever FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in
vmf->flags"
However, since commit a3c4fb7c9c2e ("x86/mm: Fix fault error path using
unsafe vma pointer") there is a vma_pkey() read of vma->vm_flags after
that point, which can thus become use-after-free. Fix this by moving
the read before calling handle_mm_fault().
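The ordering issue can be sketched in plain userspace C. This is only an analogy under stated assumptions: struct region, handle_fault() and fault_path() are hypothetical names, and free() stands in for the vma going away while mmap_sem is dropped.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct region {
	unsigned long flags;
};

/* Models a handler that may drop the lock protecting *slot, so the
 * region can be gone by the time it returns. */
static int handle_fault(struct region **slot)
{
	free(*slot);
	*slot = NULL;
	return 0;
}

/* Copy the needed field first; after handle_fault() the old pointer
 * must not be dereferenced. */
static unsigned long fault_path(struct region **slot)
{
	unsigned long key;

	assert(*slot != NULL);
	key = (*slot)->flags;	/* read before the call, as in the fix */
	handle_fault(slot);
	return key;
}
```

Reading the field after handle_fault() instead would dereference freed memory, which is the pattern the patch removes.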
Reported-by: syzbot <bot+6a5269ce759a7bb12754ed9622076dc93f65a1f6(a)syzkaller.appspotmail.com>
Reported-by: Dmitry Vyukov <dvyukov(a)google.com>
Suggested-by: Kirill A. Shutemov <kirill(a)shutemov.name>
Fixes: a3c4fb7c9c2e ("x86/mm: Fix fault error path using unsafe vma pointer")
Reviewed-by: Andrea Arcangeli <aarcange(a)redhat.com>
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Eric Biggers <ebiggers3(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/mm/fault.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1393,7 +1393,17 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault. Since we never set FAULT_FLAG_RETRY_NOWAIT, if
* we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
+ *
+ * Note that handle_userfault() may also release and reacquire mmap_sem
+ * (and not return with VM_FAULT_RETRY), when returning to userland to
+ * repeat the page fault later with a VM_FAULT_NOPAGE retval
+ * (potentially after handling any pending signal during the return to
+ * userland). The return to userland is identified whenever
+ * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
+ * Thus we have to be careful about not touching vma after handling the
+ * fault, so we read the pkey beforehand.
*/
+ pkey = vma_pkey(vma);
fault = handle_mm_fault(vma, address, flags);
major |= fault & VM_FAULT_MAJOR;
@@ -1420,7 +1430,6 @@ good_area:
return;
}
- pkey = vma_pkey(vma);
up_read(&mm->mmap_sem);
if (unlikely(fault & VM_FAULT_ERROR)) {
mm_fault_error(regs, error_code, address, &pkey, fault);
Patches currently in stable-queue which might be from vbabka(a)suse.cz are
queue-4.9/x86-mm-fix-use-after-free-of-vma-during-userfaultfd-fault.patch
From: Eric Biggers <ebiggers(a)google.com>
Adding a specially crafted X.509 certificate whose subjectPublicKey
ASN.1 value is zero-length caused x509_extract_key_data() to set the
public key size to SIZE_MAX, as it subtracted the nonexistent BIT STRING
metadata byte. Then, x509_cert_parse() called kmemdup() with that bogus
size, triggering the WARN_ON_ONCE() in kmalloc_slab().
This appears to be harmless, but it still must be fixed since WARNs are
never supposed to be user-triggerable.
Fix it by updating x509_extract_key_data() to validate that the value has a
BIT STRING metadata byte, and that the byte is 0 which indicates that
the number of bits in the bitstring is a multiple of 8.
It would be nice to handle the metadata byte in asn1_ber_decoder()
instead. But that would be tricky because in the general case a BIT
STRING could be implicitly tagged, and/or could legitimately have a
length that is not a whole number of bytes.
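The wrap-around and the check can be reproduced outside the kernel. The following is a minimal sketch, assuming a hypothetical extract_key() helper that returns -1 where the kernel code returns -EBADMSG; it is not the parser itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the length handling; returns 0 on success and -1
 * (standing in for -EBADMSG) when the metadata byte is missing or nonzero. */
static int extract_key(const uint8_t *value, size_t vlen,
		       const uint8_t **key, size_t *key_size)
{
	assert(key != NULL && key_size != NULL);
	if (vlen < 1 || value[0] != 0)
		return -1;
	*key = value + 1;
	*key_size = vlen - 1;	/* cannot wrap: vlen >= 1 here */
	return 0;
}

/* What the unchecked subtraction produces for an empty value. */
static size_t unchecked_key_size(size_t vlen)
{
	return vlen - 1;	/* wraps to SIZE_MAX when vlen == 0 */
}
```

Because size_t is unsigned, subtracting 1 from a zero length yields SIZE_MAX rather than a negative number, which is why the length has to be tested before the subtraction.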
Here was the WARN (cleaned up slightly):
WARNING: CPU: 1 PID: 202 at mm/slab_common.c:971 kmalloc_slab+0x5d/0x70 mm/slab_common.c:971
Modules linked in:
CPU: 1 PID: 202 Comm: keyctl Tainted: G B 4.14.0-09238-g1d3b78bbc6e9 #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
task: ffff880033014180 task.stack: ffff8800305c8000
Call Trace:
__do_kmalloc mm/slab.c:3706 [inline]
__kmalloc_track_caller+0x22/0x2e0 mm/slab.c:3726
kmemdup+0x17/0x40 mm/util.c:118
kmemdup include/linux/string.h:414 [inline]
x509_cert_parse+0x2cb/0x620 crypto/asymmetric_keys/x509_cert_parser.c:106
x509_key_preparse+0x61/0x750 crypto/asymmetric_keys/x509_public_key.c:174
asymmetric_key_preparse+0xa4/0x150 crypto/asymmetric_keys/asymmetric_type.c:388
key_create_or_update+0x4d4/0x10a0 security/keys/key.c:850
SYSC_add_key security/keys/keyctl.c:122 [inline]
SyS_add_key+0xe8/0x290 security/keys/keyctl.c:62
entry_SYSCALL_64_fastpath+0x1f/0x96
Fixes: 42d5ec27f873 ("X.509: Add an ASN.1 decoder")
Cc: <stable(a)vger.kernel.org> # v3.7+
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
crypto/asymmetric_keys/x509_cert_parser.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/crypto/asymmetric_keys/x509_cert_parser.c b/crypto/asymmetric_keys/x509_cert_parser.c
index dd03fead1ca3..ce2df8c9c583 100644
--- a/crypto/asymmetric_keys/x509_cert_parser.c
+++ b/crypto/asymmetric_keys/x509_cert_parser.c
@@ -409,6 +409,8 @@ int x509_extract_key_data(void *context, size_t hdrlen,
ctx->cert->pub->pkey_algo = "rsa";
/* Discard the BIT STRING metadata */
+ if (vlen < 1 || *(const u8 *)value != 0)
+ return -EBADMSG;
ctx->key = value + 1;
ctx->key_size = vlen - 1;
return 0;
--
2.15.0
From: Eric Biggers <ebiggers(a)google.com>
asn1_ber_decoder() was ignoring errors from actions associated with the
opcodes ASN1_OP_END_SEQ_ACT, ASN1_OP_END_SET_ACT,
ASN1_OP_END_SEQ_OF_ACT, and ASN1_OP_END_SET_OF_ACT. In practice, this
meant the pkcs7_note_signed_info() action (since that was the only user
of those opcodes). Fix it by checking for the error, just like the
decoder does for actions associated with the other opcodes.
This bug allowed users to leak slab memory by repeatedly trying to add a
specially crafted "pkcs7_test" key (requires CONFIG_PKCS7_TEST_KEY).
In theory, this bug could also be used to bypass module signature
verification, by providing a PKCS#7 message that is misparsed such that
a signature's ->authattrs do not contain its ->msgdigest. But it
doesn't seem practical in normal cases, due to restrictions on the
format of the ->authattrs.
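The general pattern of the fix, stopping at the first action that reports an error, can be sketched as follows. The callback type and function names are hypothetical and much simpler than the decoder's real action table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical action callback type; not the kernel's asn1_action_t. */
typedef int (*action_fn)(void *context);

static int act_ok(void *context)
{
	int *calls = context;

	(*calls)++;
	return 0;
}

static int act_fail(void *context)
{
	int *calls = context;

	(*calls)++;
	return -1;	/* stands in for a negative errno */
}

/* Run actions in order and stop at the first failure, instead of
 * discarding the return value and carrying on. */
static int run_actions(const action_fn *actions, size_t n, void *context)
{
	size_t i;

	assert(actions != NULL || n == 0);
	for (i = 0; i < n; i++) {
		int ret = actions[i](context);

		if (ret < 0)
			return ret;
	}
	return 0;
}
```

If the return value were dropped, a failing action would leave the caller's context half-initialized while the overall parse still reported success.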
Fixes: 42d5ec27f873 ("X.509: Add an ASN.1 decoder")
Cc: <stable(a)vger.kernel.org> # v3.7+
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
lib/asn1_decoder.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/lib/asn1_decoder.c b/lib/asn1_decoder.c
index d77cdfc4b554..dc14beae2c9a 100644
--- a/lib/asn1_decoder.c
+++ b/lib/asn1_decoder.c
@@ -439,6 +439,8 @@ int asn1_ber_decoder(const struct asn1_decoder *decoder,
else
act = machine[pc + 1];
ret = actions[act](context, hdr, 0, data + tdp, len);
+ if (ret < 0)
+ return ret;
}
pc += asn1_op_lengths[op];
goto next_op;
--
2.15.0
From: Eric Biggers <ebiggers(a)google.com>
In asn1_ber_decoder(), indefinitely-sized ASN.1 items were being passed
to the action functions before their lengths had been computed, using
the bogus length of 0x80 (ASN1_INDEFINITE_LENGTH). This resulted in
reading data past the end of the input buffer, when given a specially
crafted message.
Fix it by rearranging the code so that the indefinite length is resolved
before the action is called.
This bug was originally found by fuzzing the X.509 parser in userspace
using libFuzzer from the LLVM project.
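For background on why 0x80 is not a usable length: in BER the first length octet 0x80 marks the indefinite form, whose real size is only known once the end-of-contents octets are found. A minimal sketch of length decoding follows; ber_read_length() and LEN_INDEFINITE are names invented for this illustration, and this is not the kernel's decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEN_INDEFINITE ((size_t)-1)	/* sentinel chosen for this sketch */

/* Decode a BER length field starting at p[0]. On success returns the number
 * of length octets consumed and stores the content length (or
 * LEN_INDEFINITE) in *len; returns 0 on malformed or truncated input. */
static size_t ber_read_length(const uint8_t *p, size_t avail, size_t *len)
{
	size_t n, i, v = 0;

	assert(p != NULL && len != NULL);
	if (avail < 1)
		return 0;
	if (p[0] < 0x80) {		/* short form: the octet is the length */
		*len = p[0];
		return 1;
	}
	if (p[0] == 0x80) {		/* indefinite: a marker, not a size */
		*len = LEN_INDEFINITE;
		return 1;
	}
	n = p[0] & 0x7f;		/* long form: n length octets follow */
	if (n >= sizeof(size_t) || n > avail - 1)
		return 0;		/* too large for this sketch, or truncated */
	for (i = 0; i < n; i++)
		v = (v << 8) | p[1 + i];
	*len = v;
	return 1 + n;
}
```

A caller that receives LEN_INDEFINITE has to locate the end of the item before handing data and a length to any consumer, which is the ordering the patch restores.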
KASAN report (cleaned up slightly):
BUG: KASAN: slab-out-of-bounds in memcpy ./include/linux/string.h:341 [inline]
BUG: KASAN: slab-out-of-bounds in x509_fabricate_name.constprop.1+0x1a4/0x940 crypto/asymmetric_keys/x509_cert_parser.c:366
Read of size 128 at addr ffff880035dd9eaf by task keyctl/195
CPU: 1 PID: 195 Comm: keyctl Not tainted 4.14.0-09238-g1d3b78bbc6e9 #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0xd1/0x175 lib/dump_stack.c:53
print_address_description+0x78/0x260 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351 [inline]
kasan_report+0x23f/0x350 mm/kasan/report.c:409
memcpy+0x1f/0x50 mm/kasan/kasan.c:302
memcpy ./include/linux/string.h:341 [inline]
x509_fabricate_name.constprop.1+0x1a4/0x940 crypto/asymmetric_keys/x509_cert_parser.c:366
asn1_ber_decoder+0xb4a/0x1fd0 lib/asn1_decoder.c:447
x509_cert_parse+0x1c7/0x620 crypto/asymmetric_keys/x509_cert_parser.c:89
x509_key_preparse+0x61/0x750 crypto/asymmetric_keys/x509_public_key.c:174
asymmetric_key_preparse+0xa4/0x150 crypto/asymmetric_keys/asymmetric_type.c:388
key_create_or_update+0x4d4/0x10a0 security/keys/key.c:850
SYSC_add_key security/keys/keyctl.c:122 [inline]
SyS_add_key+0xe8/0x290 security/keys/keyctl.c:62
entry_SYSCALL_64_fastpath+0x1f/0x96
Allocated by task 195:
__do_kmalloc_node mm/slab.c:3675 [inline]
__kmalloc_node+0x47/0x60 mm/slab.c:3682
kvmalloc ./include/linux/mm.h:540 [inline]
SYSC_add_key security/keys/keyctl.c:104 [inline]
SyS_add_key+0x19e/0x290 security/keys/keyctl.c:62
entry_SYSCALL_64_fastpath+0x1f/0x96
Fixes: 42d5ec27f873 ("X.509: Add an ASN.1 decoder")
Reported-by: Alexander Potapenko <glider(a)google.com>
Cc: <stable(a)vger.kernel.org> # v3.7+
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
lib/asn1_decoder.c | 47 ++++++++++++++++++++++++++---------------------
1 file changed, 26 insertions(+), 21 deletions(-)
diff --git a/lib/asn1_decoder.c b/lib/asn1_decoder.c
index 1ef0cec38d78..d77cdfc4b554 100644
--- a/lib/asn1_decoder.c
+++ b/lib/asn1_decoder.c
@@ -313,42 +313,47 @@ int asn1_ber_decoder(const struct asn1_decoder *decoder,
/* Decide how to handle the operation */
switch (op) {
- case ASN1_OP_MATCH_ANY_ACT:
- case ASN1_OP_MATCH_ANY_ACT_OR_SKIP:
- case ASN1_OP_COND_MATCH_ANY_ACT:
- case ASN1_OP_COND_MATCH_ANY_ACT_OR_SKIP:
- ret = actions[machine[pc + 1]](context, hdr, tag, data + dp, len);
- if (ret < 0)
- return ret;
- goto skip_data;
-
- case ASN1_OP_MATCH_ACT:
- case ASN1_OP_MATCH_ACT_OR_SKIP:
- case ASN1_OP_COND_MATCH_ACT_OR_SKIP:
- ret = actions[machine[pc + 2]](context, hdr, tag, data + dp, len);
- if (ret < 0)
- return ret;
- goto skip_data;
-
case ASN1_OP_MATCH:
case ASN1_OP_MATCH_OR_SKIP:
+ case ASN1_OP_MATCH_ACT:
+ case ASN1_OP_MATCH_ACT_OR_SKIP:
case ASN1_OP_MATCH_ANY:
case ASN1_OP_MATCH_ANY_OR_SKIP:
+ case ASN1_OP_MATCH_ANY_ACT:
+ case ASN1_OP_MATCH_ANY_ACT_OR_SKIP:
case ASN1_OP_COND_MATCH_OR_SKIP:
+ case ASN1_OP_COND_MATCH_ACT_OR_SKIP:
case ASN1_OP_COND_MATCH_ANY:
case ASN1_OP_COND_MATCH_ANY_OR_SKIP:
- skip_data:
+ case ASN1_OP_COND_MATCH_ANY_ACT:
+ case ASN1_OP_COND_MATCH_ANY_ACT_OR_SKIP:
+
if (!(flags & FLAG_CONS)) {
if (flags & FLAG_INDEFINITE_LENGTH) {
+ size_t tmp = dp;
+
ret = asn1_find_indefinite_length(
- data, datalen, &dp, &len, &errmsg);
+ data, datalen, &tmp, &len, &errmsg);
if (ret < 0)
goto error;
- } else {
- dp += len;
}
pr_debug("- LEAF: %zu\n", len);
}
+
+ if (op & ASN1_OP_MATCH__ACT) {
+ unsigned char act;
+
+ if (op & ASN1_OP_MATCH__ANY)
+ act = machine[pc + 1];
+ else
+ act = machine[pc + 2];
+ ret = actions[act](context, hdr, tag, data + dp, len);
+ if (ret < 0)
+ return ret;
+ }
+
+ if (!(flags & FLAG_CONS))
+ dp += len;
pc += asn1_op_lengths[op];
goto next_op;
--
2.15.0
From: Daniel Jurgens <danielj(a)mellanox.com>
For now the only LSM security enforcement mechanism available is
specific to InfiniBand. Bypass enforcement for non-IB link types.
This fixes a regression where modify_qp fails for iWARP because
querying the PKEY returns -EINVAL.
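The bypass can be modelled in userspace roughly as below. struct fake_dev, enum link_type and both functions are hypothetical; the patch itself uses rdma_start_port(), rdma_end_port() and rdma_protocol_ib() on the real device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model for illustration only. */
enum link_type { LINK_IB, LINK_IWARP };

struct fake_dev {
	const enum link_type *ports;	/* one entry per port */
	size_t nports;
};

/* True if at least one port on the device speaks InfiniBand. */
static bool dev_has_ib_port(const struct fake_dev *dev)
{
	size_t i;

	assert(dev != NULL);
	for (i = 0; i < dev->nports; i++)
		if (dev->ports[i] == LINK_IB)
			return true;
	return false;
}

/* Returns 1 if a security context would be created, 0 if it is skipped. */
static int create_qp_security(const struct fake_dev *dev)
{
	if (!dev_has_ib_port(dev))
		return 0;	/* non-IB device: nothing to enforce */
	return 1;
}
```

Skipping creation means later code paths see no security context for such devices, which is why the patch also adds NULL checks in the destroy and modify paths.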
Cc: Paul Moore <paul(a)paul-moore.com>
Cc: Don Dutile <ddutile(a)redhat.com>
Cc: stable(a)vger.kernel.org
Reported-by: Potnuri Bharat Teja <bharat(a)chelsio.com>
Fixes: d291f1a65232 ("IB/core: Enforce PKey security on QPs")
Fixes: 47a2b338fe63 ("IB/core: Enforce security on management datagrams")
Signed-off-by: Daniel Jurgens <danielj(a)mellanox.com>
Reviewed-by: Parav Pandit <parav(a)mellanox.com>
Tested-by: Potnuri Bharat Teja <bharat(a)chelsio.com>
Signed-off-by: Leon Romanovsky <leon(a)kernel.org>
---
drivers/infiniband/core/security.c | 43 ++++++++++++++++++++++++++++++++++++--
1 file changed, 41 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/core/security.c b/drivers/infiniband/core/security.c
index 23278ed5be45..4b7fd68e1174 100644
--- a/drivers/infiniband/core/security.c
+++ b/drivers/infiniband/core/security.c
@@ -417,8 +417,17 @@ void ib_close_shared_qp_security(struct ib_qp_security *sec)
int ib_create_qp_security(struct ib_qp *qp, struct ib_device *dev)
{
+ u8 i = rdma_start_port(dev);
+ bool is_ib = false;
int ret;
+ while (i <= rdma_end_port(dev) && !is_ib)
+ is_ib = rdma_protocol_ib(dev, i++);
+
+ /* If this isn't an IB device don't create the security context */
+ if (!is_ib)
+ return 0;
+
qp->qp_sec = kzalloc(sizeof(*qp->qp_sec), GFP_KERNEL);
if (!qp->qp_sec)
return -ENOMEM;
@@ -441,6 +450,10 @@ EXPORT_SYMBOL(ib_create_qp_security);
void ib_destroy_qp_security_begin(struct ib_qp_security *sec)
{
+ /* Return if not IB */
+ if (!sec)
+ return;
+
mutex_lock(&sec->mutex);
/* Remove the QP from the lists so it won't get added to
@@ -470,6 +483,10 @@ void ib_destroy_qp_security_abort(struct ib_qp_security *sec)
int ret;
int i;
+ /* Return if not IB */
+ if (!sec)
+ return;
+
/* If a concurrent cache update is in progress this
* QP security could be marked for an error state
* transition. Wait for this to complete.
@@ -505,6 +522,10 @@ void ib_destroy_qp_security_end(struct ib_qp_security *sec)
{
int i;
+ /* Return if not IB */
+ if (!sec)
+ return;
+
/* If a concurrent cache update is occurring we must
* wait until this QP security structure is processed
* in the QP to error flow before destroying it because
@@ -565,13 +586,19 @@ int ib_security_modify_qp(struct ib_qp *qp,
bool pps_change = ((qp_attr_mask & (IB_QP_PKEY_INDEX | IB_QP_PORT)) ||
(qp_attr_mask & IB_QP_ALT_PATH));
+ WARN_ONCE((qp_attr_mask & IB_QP_PORT &&
+ rdma_protocol_ib(real_qp->device, qp_attr->port_num) &&
+ !real_qp->qp_sec),
+ "%s: QP security is not initialized for IB QP: %d\n",
+ __func__, real_qp->qp_num);
+
/* The port/pkey settings are maintained only for the real QP. Open
* handles on the real QP will be in the shared_qp_list. When
* enforcing security on the real QP all the shared QPs will be
* checked as well.
*/
- if (pps_change && !special_qp) {
+ if (pps_change && !special_qp && real_qp->qp_sec) {
mutex_lock(&real_qp->qp_sec->mutex);
new_pps = get_new_pps(real_qp,
qp_attr,
@@ -600,7 +627,7 @@ int ib_security_modify_qp(struct ib_qp *qp,
qp_attr_mask,
udata);
- if (pps_change && !special_qp) {
+ if (pps_change && !special_qp && real_qp->qp_sec) {
/* Clean up the lists and free the appropriate
* ports_pkeys structure.
*/
@@ -631,6 +658,9 @@ int ib_security_pkey_access(struct ib_device *dev,
u16 pkey;
int ret;
+ if (!rdma_protocol_ib(dev, port_num))
+ return 0;
+
ret = ib_get_cached_pkey(dev, port_num, pkey_index, &pkey);
if (ret)
return ret;
@@ -665,6 +695,9 @@ int ib_mad_agent_security_setup(struct ib_mad_agent *agent,
{
int ret;
+ if (!rdma_protocol_ib(agent->device, agent->port_num))
+ return 0;
+
ret = security_ib_alloc_security(&agent->security);
if (ret)
return ret;
@@ -690,6 +723,9 @@ int ib_mad_agent_security_setup(struct ib_mad_agent *agent,
void ib_mad_agent_security_cleanup(struct ib_mad_agent *agent)
{
+ if (!rdma_protocol_ib(agent->device, agent->port_num))
+ return;
+
security_ib_free_security(agent->security);
if (agent->lsm_nb_reg)
unregister_lsm_notifier(&agent->lsm_nb);
@@ -697,6 +733,9 @@ void ib_mad_agent_security_cleanup(struct ib_mad_agent *agent)
int ib_mad_enforce_security(struct ib_mad_agent_private *map, u16 pkey_index)
{
+ if (!rdma_protocol_ib(map->agent.device, map->agent.port_num))
+ return 0;
+
if (map->agent.qp->qp_type == IB_QPT_SMI && !map->agent.smp_allowed)
return -EACCES;
--
2.15.0
Tree/Branch: v3.2.96
Git describe: v3.2.96
Commit: 07a40fa222 Linux 3.2.96
Build Time: 0 min 3 sec
Passed: 0 / 4 ( 0.00 %)
Failed: 4 / 4 (100.00 %)
Errors: 6
Warnings: 20
Section Mismatches: 0
Failed defconfigs:
x86_64-allnoconfig
arm-allmodconfig
arm-allnoconfig
x86_64-defconfig
Errors:
x86_64-allnoconfig
/home/broonie/build/linux-stable/scripts/mod/empty.c:1:0: error: code model kernel does not support PIC mode
/home/broonie/build/linux-stable/kernel/bounds.c:1:0: error: code model kernel does not support PIC mode
arm-allnoconfig
/home/broonie/build/linux-stable/arch/arm/include/asm/div64.h:77:7: error: '__LINUX_ARM_ARCH__' undeclared (first use in this function)
/home/broonie/build/linux-stable/arch/arm/include/asm/glue-cache.h:129:2: error: #error Unknown cache maintenance model
/home/broonie/build/linux-stable/arch/arm/include/asm/glue-df.h:107:2: error: #error Unknown data abort handler type
/home/broonie/build/linux-stable/arch/arm/include/asm/glue-pf.h:54:2: error: #error Unknown prefetch abort handler type
x86_64-defconfig
/home/broonie/build/linux-stable/scripts/mod/empty.c:1:0: error: code model kernel does not support PIC mode
/home/broonie/build/linux-stable/kernel/bounds.c:1:0: error: code model kernel does not support PIC mode
-------------------------------------------------------------------------------
defconfigs with issues (other than build errors):
2 warnings 0 mismatches : arm-allmodconfig
19 warnings 0 mismatches : arm-allnoconfig
-------------------------------------------------------------------------------
Errors summary: 6
2 /home/broonie/build/linux-stable/scripts/mod/empty.c:1:0: error: code model kernel does not support PIC mode
2 /home/broonie/build/linux-stable/kernel/bounds.c:1:0: error: code model kernel does not support PIC mode
1 /home/broonie/build/linux-stable/arch/arm/include/asm/glue-pf.h:54:2: error: #error Unknown prefetch abort handler type
1 /home/broonie/build/linux-stable/arch/arm/include/asm/glue-df.h:107:2: error: #error Unknown data abort handler type
1 /home/broonie/build/linux-stable/arch/arm/include/asm/glue-cache.h:129:2: error: #error Unknown cache maintenance model
1 /home/broonie/build/linux-stable/arch/arm/include/asm/div64.h:77:7: error: '__LINUX_ARM_ARCH__' undeclared (first use in this function)
Warnings Summary: 20
2 .config:27:warning: symbol value '' invalid for PHYS_OFFSET
1 /home/broonie/build/linux-stable/arch/arm/include/asm/system.h:342:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/system.h:272:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/system.h:265:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/system.h:131:35: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/system.h:127:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/system.h:121:3: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/system.h:120:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/system.h:114:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/swab.h:25:28: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/processor.h:82:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/processor.h:102:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/irqflags.h:11:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/fpstate.h:32:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/cachetype.h:33:7: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/cachetype.h:28:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/cacheflush.h:196:7: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/cacheflush.h:194:7: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/bitops.h:217:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
1 /home/broonie/build/linux-stable/arch/arm/include/asm/atomic.h:30:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
===============================================================================
Detailed per-defconfig build reports below:
-------------------------------------------------------------------------------
x86_64-allnoconfig : FAIL, 2 errors, 0 warnings, 0 section mismatches
Errors:
/home/broonie/build/linux-stable/scripts/mod/empty.c:1:0: error: code model kernel does not support PIC mode
/home/broonie/build/linux-stable/kernel/bounds.c:1:0: error: code model kernel does not support PIC mode
-------------------------------------------------------------------------------
arm-allmodconfig : FAIL, 0 errors, 2 warnings, 0 section mismatches
Warnings:
.config:27:warning: symbol value '' invalid for PHYS_OFFSET
.config:27:warning: symbol value '' invalid for PHYS_OFFSET
-------------------------------------------------------------------------------
arm-allnoconfig : FAIL, 4 errors, 19 warnings, 0 section mismatches
Errors:
/home/broonie/build/linux-stable/arch/arm/include/asm/div64.h:77:7: error: '__LINUX_ARM_ARCH__' undeclared (first use in this function)
/home/broonie/build/linux-stable/arch/arm/include/asm/glue-cache.h:129:2: error: #error Unknown cache maintenance model
/home/broonie/build/linux-stable/arch/arm/include/asm/glue-df.h:107:2: error: #error Unknown data abort handler type
/home/broonie/build/linux-stable/arch/arm/include/asm/glue-pf.h:54:2: error: #error Unknown prefetch abort handler type
Warnings:
/home/broonie/build/linux-stable/arch/arm/include/asm/irqflags.h:11:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/system.h:114:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/system.h:120:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/system.h:121:3: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/system.h:127:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/system.h:131:35: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/system.h:265:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/system.h:272:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/system.h:342:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/bitops.h:217:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/swab.h:25:28: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/fpstate.h:32:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/processor.h:82:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/processor.h:102:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/atomic.h:30:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/cachetype.h:28:5: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/cachetype.h:33:7: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/cacheflush.h:194:7: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
/home/broonie/build/linux-stable/arch/arm/include/asm/cacheflush.h:196:7: warning: "__LINUX_ARM_ARCH__" is not defined [-Wundef]
-------------------------------------------------------------------------------
x86_64-defconfig : FAIL, 2 errors, 0 warnings, 0 section mismatches
Errors:
/home/broonie/build/linux-stable/scripts/mod/empty.c:1:0: error: code model kernel does not support PIC mode
/home/broonie/build/linux-stable/kernel/bounds.c:1:0: error: code model kernel does not support PIC mode
-------------------------------------------------------------------------------
Passed with no errors, warnings or mismatches:
Tree/Branch: v3.16.51
Git describe: v3.16.51
Commit: c45c05f42d Linux 3.16.51
Build Time: 37 min 49 sec
Passed: 9 / 9 (100.00 %)
Failed: 0 / 9 ( 0.00 %)
Errors: 0
Warnings: 7
Section Mismatches: 0
-------------------------------------------------------------------------------
defconfigs with issues (other than build errors):
1 warnings 0 mismatches : arm64-allnoconfig
2 warnings 0 mismatches : arm64-allmodconfig
1 warnings 0 mismatches : arm-multi_v5_defconfig
1 warnings 0 mismatches : arm-multi_v7_defconfig
1 warnings 0 mismatches : x86_64-defconfig
4 warnings 0 mismatches : arm-allmodconfig
1 warnings 0 mismatches : arm-allnoconfig
1 warnings 0 mismatches : x86_64-allnoconfig
3 warnings 0 mismatches : arm64-defconfig
-------------------------------------------------------------------------------
Warnings Summary: 7
7 ../include/linux/stddef.h:8:14: warning: 'return' with a value, in function returning void
2 ../ipc/sem.c:377:6: warning: '___p1' may be used uninitialized in this function [-Wmaybe-uninitialized]
2 ../drivers/net/ethernet/broadcom/genet/bcmgenet.c:1346:17: warning: unused variable 'kdev' [-Wunused-variable]
1 ../drivers/platform/x86/eeepc-laptop.c:279:10: warning: 'value' may be used uninitialized in this function [-Wmaybe-uninitialized]
1 ../drivers/media/platform/davinci/vpfe_capture.c:291:12: warning: 'vpfe_get_ccdc_image_format' defined but not used [-Wunused-function]
1 ../drivers/media/platform/davinci/vpfe_capture.c:1718:1: warning: label 'unlock_out' defined but not used [-Wunused-label]
1 ../arch/x86/kernel/cpu/common.c:961:13: warning: 'syscall32_cpu_init' defined but not used [-Wunused-function]
===============================================================================
Detailed per-defconfig build reports below:
-------------------------------------------------------------------------------
arm64-allnoconfig : PASS, 0 errors, 1 warnings, 0 section mismatches
Warnings:
../include/linux/stddef.h:8:14: warning: 'return' with a value, in function returning void
-------------------------------------------------------------------------------
arm64-allmodconfig : PASS, 0 errors, 2 warnings, 0 section mismatches
Warnings:
../include/linux/stddef.h:8:14: warning: 'return' with a value, in function returning void
../drivers/net/ethernet/broadcom/genet/bcmgenet.c:1346:17: warning: unused variable 'kdev' [-Wunused-variable]
-------------------------------------------------------------------------------
arm-multi_v5_defconfig : PASS, 0 errors, 1 warnings, 0 section mismatches
Warnings:
../include/linux/stddef.h:8:14: warning: 'return' with a value, in function returning void
-------------------------------------------------------------------------------
arm-multi_v7_defconfig : PASS, 0 errors, 1 warnings, 0 section mismatches
Warnings:
../include/linux/stddef.h:8:14: warning: 'return' with a value, in function returning void
-------------------------------------------------------------------------------
x86_64-defconfig : PASS, 0 errors, 1 warnings, 0 section mismatches
Warnings:
../drivers/platform/x86/eeepc-laptop.c:279:10: warning: 'value' may be used uninitialized in this function [-Wmaybe-uninitialized]
-------------------------------------------------------------------------------
arm-allmodconfig : PASS, 0 errors, 4 warnings, 0 section mismatches
Warnings:
../drivers/media/platform/davinci/vpfe_capture.c:1718:1: warning: label 'unlock_out' defined but not used [-Wunused-label]
../drivers/media/platform/davinci/vpfe_capture.c:291:12: warning: 'vpfe_get_ccdc_image_format' defined but not used [-Wunused-function]
../include/linux/stddef.h:8:14: warning: 'return' with a value, in function returning void
../drivers/net/ethernet/broadcom/genet/bcmgenet.c:1346:17: warning: unused variable 'kdev' [-Wunused-variable]
-------------------------------------------------------------------------------
arm-allnoconfig : PASS, 0 errors, 1 warnings, 0 section mismatches
Warnings:
../include/linux/stddef.h:8:14: warning: 'return' with a value, in function returning void
-------------------------------------------------------------------------------
x86_64-allnoconfig : PASS, 0 errors, 1 warnings, 0 section mismatches
Warnings:
../arch/x86/kernel/cpu/common.c:961:13: warning: 'syscall32_cpu_init' defined but not used [-Wunused-function]
-------------------------------------------------------------------------------
arm64-defconfig : PASS, 0 errors, 3 warnings, 0 section mismatches
Warnings:
../ipc/sem.c:377:6: warning: '___p1' may be used uninitialized in this function [-Wmaybe-uninitialized]
../ipc/sem.c:377:6: warning: '___p1' may be used uninitialized in this function [-Wmaybe-uninitialized]
../include/linux/stddef.h:8:14: warning: 'return' with a value, in function returning void
-------------------------------------------------------------------------------
Passed with no errors, warnings or mismatches:
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ee70bc1e7b63ac8023c9ff9475d8741e397316e7 Mon Sep 17 00:00:00 2001
From: Alexander Steffen <Alexander.Steffen(a)infineon.com>
Date: Fri, 8 Sep 2017 17:21:32 +0200
Subject: [PATCH] tpm-dev-common: Reject too short writes
tpm_transmit() does not offer an explicit interface to indicate the number
of valid bytes in the communication buffer. Instead, it relies on the
commandSize field in the TPM header that is encoded within the buffer.
Therefore, ensure that a) enough data has been written to the buffer, so
that the commandSize field is present and b) the commandSize field does not
announce more data than has been written to the buffer.
This should have been fixed with CVE-2011-1161 long ago, but apparently
a correct version of that patch never made it into the kernel.
Cc: stable(a)vger.kernel.org
Signed-off-by: Alexander Steffen <Alexander.Steffen(a)infineon.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c
index 610638a80383..461bf0b8a094 100644
--- a/drivers/char/tpm/tpm-dev-common.c
+++ b/drivers/char/tpm/tpm-dev-common.c
@@ -110,6 +110,12 @@ ssize_t tpm_common_write(struct file *file, const char __user *buf,
 		return -EFAULT;
 	}
 
+	if (in_size < 6 ||
+	    in_size < be32_to_cpu(*((__be32 *) (priv->data_buffer + 2)))) {
+		mutex_unlock(&priv->buffer_mutex);
+		return -EINVAL;
+	}
+
 	/* atomic tpm command send and result receive. We only hold the ops
 	 * lock during this period so that the tpm can be unregistered even if
 	 * the char dev is held open.
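The length check added by the tpm-dev-common patch can be tried out in userspace. The following is only a minimal sketch, not the driver code: tpm_cmd_len_ok() and read_be32() are hypothetical names. The only facts taken from the patch are that at least 6 bytes must have been written and that the big-endian 32-bit commandSize field starts at byte offset 2.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 32-bit value byte by byte, so alignment does not matter. */
static uint32_t read_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: true when in_size bytes of buf are enough to hold
 * the commandSize field (bytes 2..5) and the field does not announce more
 * data than was actually written.
 */
static bool tpm_cmd_len_ok(const uint8_t *buf, size_t in_size)
{
	if (in_size < 6)
		return false;	/* commandSize field not fully present */
	return in_size >= read_be32(buf + 2);
}
```

A caller would reject the write (the patch returns -EINVAL) whenever tpm_cmd_len_ok() is false, before the buffer is handed to the transmit path.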
The patch below was submitted to be applied to the 4.14-stable tree.
I fail to see how this patch meets the stable kernel rules as found at
Documentation/process/stable-kernel-rules.rst.
I could be totally wrong, and if so, please respond to
<stable(a)vger.kernel.org> and let me know why this patch should be
applied. Otherwise, it is now dropped from my patch queues, never to be
seen again.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0bbc931a074a741cf8e6279e8045cf7118586780 Mon Sep 17 00:00:00 2001
From: Colin Ian King <colin.king(a)canonical.com>
Date: Fri, 25 Aug 2017 17:45:05 +0100
Subject: [PATCH] tpm_tis: make array cmd_getticks static const to shrink
object code size
Don't populate array cmd_getticks on the stack, instead make it static
const. Makes the object code smaller by over 160 bytes:
Before:
text data bss dec hex filename
18813 3152 128 22093 564d drivers/char/tpm/tpm_tis_core.o
After:
text data bss dec hex filename
18554 3248 128 21930 55aa drivers/char/tpm/tpm_tis_core.o
Cc: stable(a)vger.kernel.org
Signed-off-by: Colin Ian King <colin.king(a)canonical.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index 63bc6c3b949e..1e957e923d21 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -445,7 +445,7 @@ static int probe_itpm(struct tpm_chip *chip)
 {
 	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
 	int rc = 0;
-	u8 cmd_getticks[] = {
+	static const u8 cmd_getticks[] = {
 		0x00, 0xc1, 0x00, 0x00, 0x00, 0x0a,
 		0x00, 0x00, 0x00, 0xf1
 	};
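Going by the size output quoted in the tpm_tis commit message, text shrinks by 18813 - 18554 = 259 bytes and data grows by 3248 - 3152 = 96 bytes, so the object is 22093 - 21930 = 163 bytes smaller overall. The two array forms can be compared with a small userspace sketch. The function names are hypothetical, and whether a given compiler really emits per-call initialization code for the stack variant depends on the compiler and its options.

```c
#include <stddef.h>
#include <stdint.h>

/* Stack variant: the array is typically rebuilt in the frame on each call. */
static unsigned int sum_stack(void)
{
	uint8_t cmd[] = {
		0x00, 0xc1, 0x00, 0x00, 0x00, 0x0a,
		0x00, 0x00, 0x00, 0xf1
	};
	unsigned int sum = 0;

	for (size_t i = 0; i < sizeof(cmd); i++)
		sum += cmd[i];
	return sum;
}

/* static const variant: one read-only copy, no per-call initialization. */
static unsigned int sum_static(void)
{
	static const uint8_t cmd[] = {
		0x00, 0xc1, 0x00, 0x00, 0x00, 0x0a,
		0x00, 0x00, 0x00, 0xf1
	};
	unsigned int sum = 0;

	for (size_t i = 0; i < sizeof(cmd); i++)
		sum += cmd[i];
	return sum;
}
```

Both functions compute the same value. Only where the bytes live differs, which is why the change affects object size but not behaviour.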
On Sun, Nov 26, 2017 at 11:58:43AM +0000, Chris Wilson wrote:
> Quoting Lukas Wunner (2017-11-26 11:49:19)
> > Hm, the race at hand would be solved by the intel_fbdev_sync() below,
> > or am I missing something? Still wondering why it's necessary to
> > leave the fbdev around...
>
> The race is solved, but if we do free ifbdev, we can't dereference
> ifbdev prior to the sync; and we store the async cookie inside ifbdev.
> Bleugh. Catch 22.
Right. Oh dear god! We could move the cookie into dev_priv, the
fbdev_suspend_work is also there, outside of struct intel_fbdev.
> What we might do then is just pull the struct into dev_priv under
> ifdef FBDEV.
I vaguely remember something that dev_priv deliberately only contains
a pointer to struct intel_fbdev, that it *was* embedded in dev_priv
in the past but moved out for some reason.
> > However the "if (ifbdev->vma)" looks a bit fishy, ifbdev could be NULL
> > (e.g. if BIOS fb was too small but intelfb_alloc() failed) so I think
> > this might lead to a null pointer deref. Does it make a difference
> > if we check for ifbdev versus ifbdev->vma? I also notice that you
> > added a check for ifbdev->vma with 15727ed0d944 but Daniel later
> > removed it with 88be58be886f.
>
> We know that ifbdev is non-NULL and can't become NULL until fini. So
> after the sync point, we want to ask the question of whether the config
> was successful, for that I used to use ->fb which now replaced by ->vma.
Yes if the fbdev is kept around then obviously it's fine to deref it.
Thanks,
Lukas
This is a note to let you know that I've just added the patch titled
ACPI / APEI: Remove arch_apei_flush_tlb_one()
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
acpi-apei-remove-arch_apei_flush_tlb_one.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 4a75aeacda3c2455954596593d89187df5420d0a Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:27 +0000
Subject: ACPI / APEI: Remove arch_apei_flush_tlb_one()
From: James Morse <james.morse(a)arm.com>
commit 4a75aeacda3c2455954596593d89187df5420d0a upstream.
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_pte_vaddr() to do the invalidation when called from clear_fixmap().
Remove arch_apei_flush_tlb_one().
Signed-off-by: James Morse <james.morse(a)arm.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kernel/acpi/apei.c | 5 -----
include/acpi/apei.h | 1 -
2 files changed, 6 deletions(-)
--- a/arch/x86/kernel/acpi/apei.c
+++ b/arch/x86/kernel/acpi/apei.c
@@ -55,8 +55,3 @@ void arch_apei_report_mem_error(int sev,
 	apei_mce_report_mem_error(sev, mem_err);
 #endif
 }
-
-void arch_apei_flush_tlb_one(unsigned long addr)
-{
-	__flush_tlb_one(addr);
-}
--- a/include/acpi/apei.h
+++ b/include/acpi/apei.h
@@ -44,7 +44,6 @@ int erst_clear(u64 record_id);
int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data);
void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err);
-void arch_apei_flush_tlb_one(unsigned long addr);
#endif
#endif
Patches currently in stable-queue which might be from james.morse(a)arm.com are
queue-3.18/acpi-apei-remove-arch_apei_flush_tlb_one.patch
From: Emmanuel Grumbach <emmanuel.grumbach(a)intel.com>
When we act as an AP, new firmware versions handle
internally the power saving clients and the driver doesn't
know that the peers went to sleep. It is, hence, possible
that a peer goes to sleep for a long time and stops pulling
frames. This will cause its transmit queue to hang which is
a condition that triggers the recovery flow in the driver.
While this client is certainly buggy (it should have pulled
the frame based on the TIM IE in the beacon), we can't blow
up because of a buggy client.
Change the current implementation to not enable the
transmit queue hang detection on queues that serve peers
when we act as an AP / GO.
We can still enable this mechanism using the debug
configuration which can come in handy when we want to
debug why the client doesn't wake up.
Cc: stable(a)vger.kernel.org # v4.13
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach(a)intel.com>
Signed-off-by: Luca Coelho <luciano.coelho(a)intel.com>
---
drivers/net/wireless/intel/iwlwifi/mvm/utils.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/utils.c b/drivers/net/wireless/intel/iwlwifi/mvm/utils.c
index d46115e2d69e..19c1d1f76e15 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/utils.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/utils.c
@@ -1134,9 +1134,18 @@ unsigned int iwl_mvm_get_wd_timeout(struct iwl_mvm *mvm,
 	unsigned int default_timeout =
 		cmd_q ? IWL_DEF_WD_TIMEOUT : mvm->cfg->base_params->wd_timeout;
 
-	if (!iwl_fw_dbg_trigger_enabled(mvm->fw, FW_DBG_TRIGGER_TXQ_TIMERS))
+	if (!iwl_fw_dbg_trigger_enabled(mvm->fw, FW_DBG_TRIGGER_TXQ_TIMERS)) {
+		/*
+		 * We can't know when the station is asleep or awake, so we
+		 * must disable the queue hang detection.
+		 */
+		if (fw_has_capa(&mvm->fw->ucode_capa,
+				IWL_UCODE_TLV_CAPA_STA_PM_NOTIF) &&
+		    vif && vif->type == NL80211_IFTYPE_AP)
+			return IWL_WATCHDOG_DISABLED;
 		return iwlmvm_mod_params.tfd_q_hang_detect ?
 			default_timeout : IWL_WATCHDOG_DISABLED;
+	}
 
 	trigger = iwl_fw_dbg_get_trigger(mvm->fw, FW_DBG_TRIGGER_TXQ_TIMERS);
 	txq_timer = (void *)trigger->data;
--
2.15.0
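The decision made by the iwlwifi hunk can be sketched as a pure function. This is a userspace model under stated assumptions, not the driver's API: struct wd_inputs, pick_wd_timeout() and WATCHDOG_DISABLED are hypothetical stand-ins, and the debug-trigger path is collapsed into a single dbg_timeout value, which is a simplification.

```c
#include <stdbool.h>

#define WATCHDOG_DISABLED 0u	/* stand-in for IWL_WATCHDOG_DISABLED */

enum iface_type { IFACE_STATION, IFACE_AP };

/* Hypothetical bundle of everything the timeout decision looks at. */
struct wd_inputs {
	bool dbg_trigger_enabled;	/* debug configuration present */
	bool fw_handles_sta_pm;		/* firmware tracks client power save */
	bool has_vif;
	enum iface_type type;
	bool hang_detect_param;		/* models the module parameter */
	unsigned int default_timeout;
	unsigned int dbg_timeout;	/* simplified debug-path result */
};

static unsigned int pick_wd_timeout(const struct wd_inputs *in)
{
	if (!in->dbg_trigger_enabled) {
		/* The driver cannot see sleeping peers: no hang detection. */
		if (in->fw_handles_sta_pm && in->has_vif &&
		    in->type == IFACE_AP)
			return WATCHDOG_DISABLED;
		return in->hang_detect_param ?
			in->default_timeout : WATCHDOG_DISABLED;
	}
	/* The debug configuration can still enable the watchdog. */
	return in->dbg_timeout;
}
```

The ordering mirrors the patch: the AP check only applies when no debug trigger is configured, so the debug configuration keeps its ability to turn detection back on.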
This is a note to let you know that I've just added the patch titled
s390/runtime instrumention: fix possible memory corruption
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
s390-runtime-instrumention-fix-possible-memory-corruption.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d6e646ad7cfa7034d280459b2b2546288f247144 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Mon, 11 Sep 2017 11:24:22 +0200
Subject: s390/runtime instrumention: fix possible memory corruption
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
commit d6e646ad7cfa7034d280459b2b2546288f247144 upstream.
For PREEMPT enabled kernels the runtime instrumentation (RI) code
contains a possible use-after-free bug. If a task that makes use of RI
exits, it will execute do_exit() while still enabled for preemption.
That function will call exit_thread_runtime_instr() via
exit_thread(). If exit_thread_runtime_instr() gets preempted after the
RI control block of the task has been freed but before the pointer to
it is set to NULL, then save_ri_cb(), called from switch_to(), will
write to already freed memory.
Avoid this and simply disable preemption while freeing the control
block and setting the pointer to NULL.
Fixes: e4b8b3f33fca ("s390: add support for runtime instrumentation")
Reviewed-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/s390/kernel/runtime_instr.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/arch/s390/kernel/runtime_instr.c
+++ b/arch/s390/kernel/runtime_instr.c
@@ -47,11 +47,13 @@ void exit_thread_runtime_instr(void)
 {
 	struct task_struct *task = current;
 
+	preempt_disable();
 	if (!task->thread.ri_cb)
 		return;
 	disable_runtime_instr();
 	kfree(task->thread.ri_cb);
 	task->thread.ri_cb = NULL;
+	preempt_enable();
 }
 
 SYSCALL_DEFINE1(s390_runtime_instr, int, command)
@@ -62,9 +64,7 @@ SYSCALL_DEFINE1(s390_runtime_instr, int,
 		return -EOPNOTSUPP;
 
 	if (command == S390_RUNTIME_INSTR_STOP) {
-		preempt_disable();
 		exit_thread_runtime_instr();
-		preempt_enable();
 		return 0;
 	}
 
Patches currently in stable-queue which might be from heiko.carstens(a)de.ibm.com are
queue-4.9/s390-disassembler-add-missing-end-marker-for-e7-table.patch
queue-4.9/s390-fix-transactional-execution-control-register-handling.patch
queue-4.9/s390-runtime-instrumention-fix-possible-memory-corruption.patch
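The idea of the runtime instrumentation fix, freeing the control block and clearing the pointer inside one non-preemptible region, can be modelled in userspace. This is a sketch under loud assumptions: preempt_count, struct thread_model and exit_thread_ri_model() are hypothetical, free() stands in for kfree(), and a plain counter stands in for the kernel's preemption accounting. In the hunk as quoted, the early return appears to leave preemption disabled. The sketch deliberately keeps disable and enable paired on every path.

```c
#include <stdlib.h>

static int preempt_count;	/* models the kernel's preempt counter */

static void preempt_disable(void) { preempt_count++; }
static void preempt_enable(void)  { preempt_count--; }

/* Hypothetical stand-in for the part of thread_struct that matters here. */
struct thread_model {
	void *ri_cb;
};

static void exit_thread_ri_model(struct thread_model *t)
{
	preempt_disable();
	if (t->ri_cb) {
		/* Free and clear with no window in which a task switch
		 * could observe a dangling pointer. */
		free(t->ri_cb);
		t->ri_cb = NULL;
	}
	preempt_enable();	/* reached on both paths */
}
```

Calling exit_thread_ri_model() on a thread with no control block leaves preempt_count at zero, which is the balance property a caller would want to check.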
This is a note to let you know that I've just added the patch titled
s390: fix transactional execution control register handling
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
s390-fix-transactional-execution-control-register-handling.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From a1c5befc1c24eb9c1ee83f711e0f21ee79cbb556 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Thu, 9 Nov 2017 12:29:34 +0100
Subject: s390: fix transactional execution control register handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
commit a1c5befc1c24eb9c1ee83f711e0f21ee79cbb556 upstream.
Dan Horák reported the following crash related to transactional execution:
User process fault: interruption code 0013 ilc:3 in libpthread-2.26.so[3ff93c00000+1b000]
CPU: 2 PID: 1 Comm: /init Not tainted 4.13.4-300.fc27.s390x #1
Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
task: 00000000fafc8000 task.stack: 00000000fafc4000
User PSW : 0705200180000000 000003ff93c14e70
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
User GPRS: 0000000000000077 000003ff00000000 000003ff93144d48 000003ff93144d5e
0000000000000000 0000000000000002 0000000000000000 000003ff00000000
0000000000000000 0000000000000418 0000000000000000 000003ffcc9fe770
000003ff93d28f50 000003ff9310acf0 000003ff92b0319a 000003ffcc9fe6d0
User Code: 000003ff93c14e62: 60e0b030 std %f14,48(%r11)
000003ff93c14e66: 60f0b038 std %f15,56(%r11)
#000003ff93c14e6a: e5600000ff0e tbegin 0,65294
>000003ff93c14e70: a7740006 brc 7,3ff93c14e7c
000003ff93c14e74: a7080000 lhi %r0,0
000003ff93c14e78: a7f40023 brc 15,3ff93c14ebe
000003ff93c14e7c: b2220000 ipm %r0
000003ff93c14e80: 8800001c srl %r0,28
There are several bugs with control register handling with respect to
transactional execution:
- on task switch update_per_regs() is only called if the next task has
an mm (is not a kernel thread). This however is incorrect. This
breaks e.g. for user mode helper handling, where the kernel creates
a kernel thread and then execve's a user space program. Control
register contents related to transactional execution won't be
updated on execve. If the previous task ran with transactional
execution disabled then the new task will also run with
transactional execution disabled, which is incorrect. Therefore call
update_per_regs() unconditionally within switch_to().
- on startup the transactional execution facility is not enabled for
the idle thread. This is not really a bug, but an inconsistency to
other facilities. Therefore enable the facility if it is available.
- on fork the new thread's per_flags field is not cleared. This means
that a child process inherits the PER_FLAG_NO_TE flag. This flag can
be set with a ptrace request to disable transactional execution for
the current process. It should not be inherited by new child
processes in order to be consistent with the handling of all other
PER related debugging options. Therefore clear the per_flags field in
copy_thread_tls().
Reported-and-tested-by: Dan Horák <dan(a)danny.cz>
Fixes: d35339a42dd1 ("s390: add support for transactional memory")
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner(a)linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/s390/include/asm/switch_to.h | 2 +-
arch/s390/kernel/early.c | 4 +++-
arch/s390/kernel/process.c | 1 +
3 files changed, 5 insertions(+), 2 deletions(-)
--- a/arch/s390/include/asm/switch_to.h
+++ b/arch/s390/include/asm/switch_to.h
@@ -34,8 +34,8 @@ static inline void restore_access_regs(u
 		save_access_regs(&prev->thread.acrs[0]);		\
 		save_ri_cb(prev->thread.ri_cb);				\
 	}								\
+	update_cr_regs(next);						\
 	if (next->mm) {							\
-		update_cr_regs(next);					\
 		set_cpu_flag(CIF_FPU);					\
 		restore_access_regs(&next->thread.acrs[0]);		\
 		restore_ri_cb(next->thread.ri_cb, prev->thread.ri_cb);	\
--- a/arch/s390/kernel/early.c
+++ b/arch/s390/kernel/early.c
@@ -345,8 +345,10 @@ static __init void detect_machine_facili
 		S390_lowcore.machine_flags |= MACHINE_FLAG_IDTE;
 	if (test_facility(40))
 		S390_lowcore.machine_flags |= MACHINE_FLAG_LPP;
-	if (test_facility(50) && test_facility(73))
+	if (test_facility(50) && test_facility(73)) {
 		S390_lowcore.machine_flags |= MACHINE_FLAG_TE;
+		__ctl_set_bit(0, 55);
+	}
 	if (test_facility(51))
 		S390_lowcore.machine_flags |= MACHINE_FLAG_TLB_LC;
 	if (test_facility(129)) {
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -120,6 +120,7 @@ int copy_thread(unsigned long clone_flag
 	memset(&p->thread.per_user, 0, sizeof(p->thread.per_user));
 	memset(&p->thread.per_event, 0, sizeof(p->thread.per_event));
 	clear_tsk_thread_flag(p, TIF_SINGLE_STEP);
+	p->thread.per_flags = 0;
 	/* Initialize per thread user and system timer values */
 	ti = task_thread_info(p);
 	ti->user_timer = 0;
Patches currently in stable-queue which might be from heiko.carstens(a)de.ibm.com are
queue-4.9/s390-disassembler-add-missing-end-marker-for-e7-table.patch
queue-4.9/s390-fix-transactional-execution-control-register-handling.patch
queue-4.9/s390-runtime-instrumention-fix-possible-memory-corruption.patch
This is a note to let you know that I've just added the patch titled
s390/disassembler: add missing end marker for e7 table
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
s390-disassembler-add-missing-end-marker-for-e7-table.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 5c50538752af7968f53924b22dede8ed4ce4cb3b Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Tue, 26 Sep 2017 09:16:48 +0200
Subject: s390/disassembler: add missing end marker for e7 table
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
commit 5c50538752af7968f53924b22dede8ed4ce4cb3b upstream.
The e7 opcode table does not have an end marker. Hence when trying to
find an unknown e7 instruction the code will access memory behind the
table until it finds something that matches the opcode, or the kernel
crashes, whichever comes first.
This affects not only the in-kernel disassembler but also uprobes and
kprobes which refuse to set a probe on unknown instructions, and
therefore search the opcode tables to figure out if instructions are
known or not.
Fixes: 3585cb0280654 ("s390/disassembler: add vector instructions")
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/s390/kernel/dis.c | 1 +
1 file changed, 1 insertion(+)
--- a/arch/s390/kernel/dis.c
+++ b/arch/s390/kernel/dis.c
@@ -1548,6 +1548,7 @@ static struct s390_insn opcode_e7[] = {
 	{ "vfsq", 0xce, INSTR_VRR_VV000MM },
 	{ "vfs", 0xe2, INSTR_VRR_VVV00MM },
 	{ "vftci", 0x4a, INSTR_VRI_VVIMM },
+	{ "", 0, INSTR_INVALID }
 };
 
 static struct s390_insn opcode_eb[] = {
Patches currently in stable-queue which might be from heiko.carstens(a)de.ibm.com are
queue-4.9/s390-disassembler-add-missing-end-marker-for-e7-table.patch
queue-4.9/s390-fix-transactional-execution-control-register-handling.patch
queue-4.9/s390-runtime-instrumention-fix-possible-memory-corruption.patch
This is a note to let you know that I've just added the patch titled
ACPI / EC: Fix regression related to triggering source of EC event handling
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
acpi-ec-fix-regression-related-to-triggering-source-of-ec-event-handling.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 53c5eaabaea9a1b7a96f95ccc486d2ad721d95bb Mon Sep 17 00:00:00 2001
From: Lv Zheng <lv.zheng(a)intel.com>
Date: Tue, 26 Sep 2017 16:54:03 +0800
Subject: ACPI / EC: Fix regression related to triggering source of EC event handling
From: Lv Zheng <lv.zheng(a)intel.com>
commit 53c5eaabaea9a1b7a96f95ccc486d2ad721d95bb upstream.
Originally the Samsung quirks removed by commit 4c237371 can be covered
by commit e923e8e7 and ec_freeze_events=Y mode. But commit 9c40f956
changed ec_freeze_events=Y back to N, making this problem re-surface.
Actually, if commit e923e8e7 is robust enough, we can freely change
ec_freeze_events mode, so this patch fixes the issue by improving
commit e923e8e7.
Related commits listed in the merged order:
Commit: e923e8e79e18fd6be9162f1be6b99a002e9df2cb
Subject: ACPI / EC: Fix an issue that SCI_EVT cannot be detected
after event is enabled
Commit: 4c237371f290d1ed3b2071dd43554362137b1cce
Subject: ACPI / EC: Remove old CLEAR_ON_RESUME quirk
Commit: 9c40f956ce9b331493347d1b3cb7e384f7dc0581
Subject: Revert "ACPI / EC: Enable event freeze mode..." to fix
a regression
This patch not only fixes the reported post-resume EC event triggering
source issue, but also fixes an unreported similar issue related to the
driver bind by adding EC event triggering source in ec_install_handlers().
Fixes: e923e8e79e18 (ACPI / EC: Fix an issue that SCI_EVT cannot be detected after event is enabled)
Fixes: 4c237371f290 (ACPI / EC: Remove old CLEAR_ON_RESUME quirk)
Fixes: 9c40f956ce9b (Revert "ACPI / EC: Enable event freeze mode..." to fix a regression)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196833
Signed-off-by: Lv Zheng <lv.zheng(a)intel.com>
Reported-by: Alistair Hamilton <ahpatent(a)gmail.com>
Tested-by: Alistair Hamilton <ahpatent(a)gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/acpi/ec.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -482,8 +482,11 @@ static inline void __acpi_ec_enable_even
 {
 	if (!test_and_set_bit(EC_FLAGS_QUERY_ENABLED, &ec->flags))
 		ec_log_drv("event unblocked");
-	if (!test_bit(EC_FLAGS_QUERY_PENDING, &ec->flags))
-		advance_transaction(ec);
+	/*
+	 * Unconditionally invoke this once after enabling the event
+	 * handling mechanism to detect the pending events.
+	 */
+	advance_transaction(ec);
 }
 
 static inline void __acpi_ec_disable_event(struct acpi_ec *ec)
@@ -1458,11 +1461,10 @@ static int ec_install_handlers(struct ac
 			if (test_bit(EC_FLAGS_STARTED, &ec->flags) &&
 			    ec->reference_count >= 1)
 				acpi_ec_enable_gpe(ec, true);
-
-			/* EC is fully operational, allow queries */
-			acpi_ec_enable_event(ec);
 		}
 	}
+	/* EC is fully operational, allow queries */
+	acpi_ec_enable_event(ec);
 
 	return 0;
 }
Patches currently in stable-queue which might be from lv.zheng(a)intel.com are
queue-4.9/acpi-ec-fix-regression-related-to-triggering-source-of-ec-event-handling.patch
This is a note to let you know that I've just added the patch titled
s390: fix transactional execution control register handling
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
s390-fix-transactional-execution-control-register-handling.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From a1c5befc1c24eb9c1ee83f711e0f21ee79cbb556 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Thu, 9 Nov 2017 12:29:34 +0100
Subject: s390: fix transactional execution control register handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
commit a1c5befc1c24eb9c1ee83f711e0f21ee79cbb556 upstream.
Dan Horák reported the following crash related to transactional execution:
User process fault: interruption code 0013 ilc:3 in libpthread-2.26.so[3ff93c00000+1b000]
CPU: 2 PID: 1 Comm: /init Not tainted 4.13.4-300.fc27.s390x #1
Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
task: 00000000fafc8000 task.stack: 00000000fafc4000
User PSW : 0705200180000000 000003ff93c14e70
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
User GPRS: 0000000000000077 000003ff00000000 000003ff93144d48 000003ff93144d5e
0000000000000000 0000000000000002 0000000000000000 000003ff00000000
0000000000000000 0000000000000418 0000000000000000 000003ffcc9fe770
000003ff93d28f50 000003ff9310acf0 000003ff92b0319a 000003ffcc9fe6d0
User Code: 000003ff93c14e62: 60e0b030 std %f14,48(%r11)
000003ff93c14e66: 60f0b038 std %f15,56(%r11)
#000003ff93c14e6a: e5600000ff0e tbegin 0,65294
>000003ff93c14e70: a7740006 brc 7,3ff93c14e7c
000003ff93c14e74: a7080000 lhi %r0,0
000003ff93c14e78: a7f40023 brc 15,3ff93c14ebe
000003ff93c14e7c: b2220000 ipm %r0
000003ff93c14e80: 8800001c srl %r0,28
There are several bugs with control register handling with respect to
transactional execution:
- on task switch update_cr_regs() is only called if the next task has
an mm (is not a kernel thread). This however is incorrect. This
breaks e.g. for user mode helper handling, where the kernel creates
a kernel thread and then execve's a user space program. Control
register contents related to transactional execution won't be
updated on execve. If the previous task ran with transactional
execution disabled then the new task will also run with
transactional execution disabled, which is incorrect. Therefore call
update_cr_regs() unconditionally within switch_to().
- on startup the transactional execution facility is not enabled for
the idle thread. This is not really a bug, but an inconsistency with
other facilities. Therefore enable the facility if it is available.
- on fork the new thread's per_flags field is not cleared. This means
that a child process inherits the PER_FLAG_NO_TE flag. This flag can
be set with a ptrace request to disable transactional execution for
the current process. It should not be inherited by new child
processes in order to be consistent with the handling of all other
PER related debugging options. Therefore clear the per_flags field in
copy_thread_tls().
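The first and third points can be illustrated with a small user-space
model. This is only a sketch under stated assumptions: struct task,
cr0_te_enabled, update_cr_regs_model(), switch_to_old(),
switch_to_fixed() and fork_task() are hypothetical stand-ins rather
than the kernel's definitions, and the value given to PER_FLAG_NO_TE is
arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

#define PER_FLAG_NO_TE 1UL	/* value chosen for this model only */

struct task {
	bool has_mm;			/* false for a kernel thread */
	unsigned long per_flags;	/* per-task debugging flags */
};

/* Models the CPU's "transactional execution enabled" control bit. */
static bool cr0_te_enabled;

/* Hypothetical stand-in: load the control bit from the next task. */
static void update_cr_regs_model(const struct task *next)
{
	cr0_te_enabled = !(next->per_flags & PER_FLAG_NO_TE);
}

/* Old behaviour: the control bit only follows tasks that have an mm. */
static void switch_to_old(const struct task *next)
{
	if (next->has_mm)
		update_cr_regs_model(next);
}

/* Fixed behaviour: the control bit follows every task. */
static void switch_to_fixed(const struct task *next)
{
	update_cr_regs_model(next);
}

/* Fork model: the child does not inherit the parent's per_flags. */
static struct task fork_task(const struct task *parent)
{
	struct task child = *parent;

	child.per_flags = 0;
	return child;
}
```

In this model, switching from a task with PER_FLAG_NO_TE set to a task
without an mm leaves the control bit stale with switch_to_old(), while
switch_to_fixed() reloads it for every task.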
Reported-and-tested-by: Dan Horák <dan(a)danny.cz>
Fixes: d35339a42dd1 ("s390: add support for transactional memory")
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner(a)linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/s390/include/asm/switch_to.h | 2 +-
arch/s390/kernel/early.c | 4 +++-
arch/s390/kernel/process.c | 1 +
3 files changed, 5 insertions(+), 2 deletions(-)
--- a/arch/s390/include/asm/switch_to.h
+++ b/arch/s390/include/asm/switch_to.h
@@ -34,8 +34,8 @@ static inline void restore_access_regs(u
save_access_regs(&prev->thread.acrs[0]); \
save_ri_cb(prev->thread.ri_cb); \
} \
+ update_cr_regs(next); \
if (next->mm) { \
- update_cr_regs(next); \
set_cpu_flag(CIF_FPU); \
restore_access_regs(&next->thread.acrs[0]); \
restore_ri_cb(next->thread.ri_cb, prev->thread.ri_cb); \
--- a/arch/s390/kernel/early.c
+++ b/arch/s390/kernel/early.c
@@ -325,8 +325,10 @@ static __init void detect_machine_facili
S390_lowcore.machine_flags |= MACHINE_FLAG_IDTE;
if (test_facility(40))
S390_lowcore.machine_flags |= MACHINE_FLAG_LPP;
- if (test_facility(50) && test_facility(73))
+ if (test_facility(50) && test_facility(73)) {
S390_lowcore.machine_flags |= MACHINE_FLAG_TE;
+ __ctl_set_bit(0, 55);
+ }
if (test_facility(51))
S390_lowcore.machine_flags |= MACHINE_FLAG_TLB_LC;
if (test_facility(129)) {
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -137,6 +137,7 @@ int copy_thread(unsigned long clone_flag
memset(&p->thread.per_user, 0, sizeof(p->thread.per_user));
memset(&p->thread.per_event, 0, sizeof(p->thread.per_event));
clear_tsk_thread_flag(p, TIF_SINGLE_STEP);
+ p->thread.per_flags = 0;
/* Initialize per thread user and system timer values */
ti = task_thread_info(p);
ti->user_timer = 0;
Patches currently in stable-queue which might be from heiko.carstens(a)de.ibm.com are
queue-4.4/s390-disassembler-add-missing-end-marker-for-e7-table.patch
queue-4.4/s390-fix-transactional-execution-control-register-handling.patch
queue-4.4/s390-runtime-instrumention-fix-possible-memory-corruption.patch
This is a note to let you know that I've just added the patch titled
s390/runtime instrumention: fix possible memory corruption
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
s390-runtime-instrumention-fix-possible-memory-corruption.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d6e646ad7cfa7034d280459b2b2546288f247144 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Mon, 11 Sep 2017 11:24:22 +0200
Subject: s390/runtime instrumention: fix possible memory corruption
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
commit d6e646ad7cfa7034d280459b2b2546288f247144 upstream.
For PREEMPT enabled kernels the runtime instrumentation (RI) code
contains a possible use-after-free bug. If a task that makes use of RI
exits, it will execute do_exit() while preemption is still enabled.
That function will call exit_thread_runtime_instr() via
exit_thread(). If exit_thread_runtime_instr() gets preempted after the
RI control block of the task has been freed but before the pointer to
it is set to NULL, then save_ri_cb(), called from switch_to(), will
write to already freed memory.
Avoid this and simply disable preemption while freeing the control
block and setting the pointer to NULL.
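The idea of keeping the free and the pointer reset inside one
non-preemptible region can be sketched in user space as follows. The
names struct thread_model, preempt_count_model,
preempt_disable_model(), preempt_enable_model() and exit_thread_model()
are hypothetical; a plain counter stands in for the kernel's preemption
state. The function is shaped so that the counter is rebalanced on
every path, which is a choice of this sketch and not a copy of the
patch.

```c
#include <assert.h>
#include <stdlib.h>

struct thread_model {
	void *ri_cb;	/* stands in for task->thread.ri_cb */
};

/* Counter standing in for the kernel's preemption state. */
static int preempt_count_model;

static void preempt_disable_model(void)
{
	preempt_count_model++;
}

static void preempt_enable_model(void)
{
	assert(preempt_count_model > 0);
	preempt_count_model--;
}

/*
 * Free the control block and clear the pointer as one unit. While the
 * counter is non-zero no context switch can observe the dangling
 * pointer between free() and the assignment.
 */
static void exit_thread_model(struct thread_model *t)
{
	preempt_disable_model();
	if (t->ri_cb) {
		free(t->ri_cb);
		t->ri_cb = NULL;
	}
	preempt_enable_model();
}
```

Calling exit_thread_model() a second time on the same object is
harmless in this model, because the pointer is already NULL.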
Fixes: e4b8b3f33fca ("s390: add support for runtime instrumentation")
Reviewed-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/s390/kernel/runtime_instr.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/arch/s390/kernel/runtime_instr.c
+++ b/arch/s390/kernel/runtime_instr.c
@@ -47,11 +47,13 @@ void exit_thread_runtime_instr(void)
{
struct task_struct *task = current;
+ preempt_disable();
if (!task->thread.ri_cb)
return;
disable_runtime_instr();
kfree(task->thread.ri_cb);
task->thread.ri_cb = NULL;
+ preempt_enable();
}
SYSCALL_DEFINE1(s390_runtime_instr, int, command)
@@ -62,9 +64,7 @@ SYSCALL_DEFINE1(s390_runtime_instr, int,
return -EOPNOTSUPP;
if (command == S390_RUNTIME_INSTR_STOP) {
- preempt_disable();
exit_thread_runtime_instr();
- preempt_enable();
return 0;
}
This is a note to let you know that I've just added the patch titled
s390/disassembler: add missing end marker for e7 table
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
s390-disassembler-add-missing-end-marker-for-e7-table.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 5c50538752af7968f53924b22dede8ed4ce4cb3b Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Tue, 26 Sep 2017 09:16:48 +0200
Subject: s390/disassembler: add missing end marker for e7 table
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
commit 5c50538752af7968f53924b22dede8ed4ce4cb3b upstream.
The e7 opcode table does not have an end marker. Hence when trying to
find an unknown e7 instruction the code will access memory beyond the
table until it finds something that matches the opcode, or the kernel
crashes, whichever comes first.
This affects not only the in-kernel disassembler but also uprobes and
kprobes which refuse to set a probe on unknown instructions, and
therefore search the opcode tables to figure out if instructions are
known or not.
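Why the end marker matters can be shown with a sentinel-terminated
lookup. This is a minimal sketch, assuming the search loop stops at the
first entry whose format is INSTR_INVALID; struct insn_model,
opcode_table_model, find_insn_model() and the enum values are
hypothetical and much simpler than the real tables in dis.c.

```c
#include <assert.h>
#include <stddef.h>

enum { INSTR_INVALID = 0, INSTR_KNOWN = 1 };	/* model values only */

struct insn_model {
	const char *name;
	unsigned char opfrag;
	unsigned char format;
};

static const struct insn_model opcode_table_model[] = {
	{ "vfsq",  0xce, INSTR_KNOWN },
	{ "vfs",   0xe2, INSTR_KNOWN },
	{ "vftci", 0x4a, INSTR_KNOWN },
	{ "",      0,    INSTR_INVALID }	/* end marker */
};

/*
 * Walk the table until the end marker. Without the last entry the
 * loop has no stop condition for an unknown opcode and keeps reading
 * past the end of the array.
 */
static const struct insn_model *find_insn_model(unsigned char opfrag)
{
	const struct insn_model *insn;

	for (insn = opcode_table_model; insn->format != INSTR_INVALID; insn++)
		if (insn->opfrag == opfrag)
			return insn;
	return NULL;
}
```

With the marker in place a lookup for an unknown opcode returns NULL,
which is the answer a probe check needs in order to refuse the
instruction.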
Fixes: 3585cb0280654 ("s390/disassembler: add vector instructions")
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/s390/kernel/dis.c | 1 +
1 file changed, 1 insertion(+)
--- a/arch/s390/kernel/dis.c
+++ b/arch/s390/kernel/dis.c
@@ -1549,6 +1549,7 @@ static struct s390_insn opcode_e7[] = {
{ "vfsq", 0xce, INSTR_VRR_VV000MM },
{ "vfs", 0xe2, INSTR_VRR_VVV00MM },
{ "vftci", 0x4a, INSTR_VRI_VVIMM },
+ { "", 0, INSTR_INVALID }
};
static struct s390_insn opcode_eb[] = {
This is a note to let you know that I've just added the patch titled
s390/runtime instrumention: fix possible memory corruption
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
s390-runtime-instrumention-fix-possible-memory-corruption.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d6e646ad7cfa7034d280459b2b2546288f247144 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Mon, 11 Sep 2017 11:24:22 +0200
Subject: s390/runtime instrumention: fix possible memory corruption
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
commit d6e646ad7cfa7034d280459b2b2546288f247144 upstream.
For PREEMPT enabled kernels the runtime instrumentation (RI) code
contains a possible use-after-free bug. If a task that makes use of RI
exits, it will execute do_exit() while preemption is still enabled.
That function will call exit_thread_runtime_instr() via
exit_thread(). If exit_thread_runtime_instr() gets preempted after the
RI control block of the task has been freed but before the pointer to
it is set to NULL, then save_ri_cb(), called from switch_to(), will
write to already freed memory.
Avoid this and simply disable preemption while freeing the control
block and setting the pointer to NULL.
Fixes: e4b8b3f33fca ("s390: add support for runtime instrumentation")
Reviewed-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/s390/kernel/runtime_instr.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/arch/s390/kernel/runtime_instr.c
+++ b/arch/s390/kernel/runtime_instr.c
@@ -50,11 +50,13 @@ void exit_thread_runtime_instr(void)
{
struct task_struct *task = current;
+ preempt_disable();
if (!task->thread.ri_cb)
return;
disable_runtime_instr();
kfree(task->thread.ri_cb);
task->thread.ri_cb = NULL;
+ preempt_enable();
}
SYSCALL_DEFINE1(s390_runtime_instr, int, command)
@@ -65,9 +67,7 @@ SYSCALL_DEFINE1(s390_runtime_instr, int,
return -EOPNOTSUPP;
if (command == S390_RUNTIME_INSTR_STOP) {
- preempt_disable();
exit_thread_runtime_instr();
- preempt_enable();
return 0;
}
Patches currently in stable-queue which might be from heiko.carstens(a)de.ibm.com are
queue-4.14/s390-guarded-storage-fix-possible-memory-corruption.patch
queue-4.14/s390-disassembler-add-missing-end-marker-for-e7-table.patch
queue-4.14/s390-fix-transactional-execution-control-register-handling.patch
queue-4.14/s390-runtime-instrumention-fix-possible-memory-corruption.patch
queue-4.14/s390-noexec-execute-kexec-datamover-without-dat.patch
This is a note to let you know that I've just added the patch titled
s390/guarded storage: fix possible memory corruption
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
s390-guarded-storage-fix-possible-memory-corruption.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From fa1edf3f63c05ca8eacafcd7048ed91e5360f1a8 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Mon, 11 Sep 2017 11:24:22 +0200
Subject: s390/guarded storage: fix possible memory corruption
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
commit fa1edf3f63c05ca8eacafcd7048ed91e5360f1a8 upstream.
For PREEMPT enabled kernels the guarded storage (GS) code contains a
possible use-after-free bug. If a task that makes use of GS exits, it
will execute do_exit() while preemption is still enabled.
That function will call exit_thread_gs() via exit_thread().
If exit_thread_gs() gets preempted after the GS control block of the
task has been freed but before the pointer to it is set to NULL, then
save_gs_cb(), called from switch_to(), will write to already freed
memory.
Avoid this and simply disable preemption while freeing the control
block and setting the pointer to NULL.
Fixes: 916cda1aa1b4 ("s390: add a system call for guarded storage")
Reviewed-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/s390/kernel/guarded_storage.c | 2 ++
1 file changed, 2 insertions(+)
--- a/arch/s390/kernel/guarded_storage.c
+++ b/arch/s390/kernel/guarded_storage.c
@@ -14,9 +14,11 @@
void exit_thread_gs(void)
{
+ preempt_disable();
kfree(current->thread.gs_cb);
kfree(current->thread.gs_bc_cb);
current->thread.gs_cb = current->thread.gs_bc_cb = NULL;
+ preempt_enable();
}
static int gs_enable(void)
This is a note to let you know that I've just added the patch titled
s390/noexec: execute kexec datamover without DAT
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
s390-noexec-execute-kexec-datamover-without-dat.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d0e810eeb3d326978f248b8f0233a2f30f58c72d Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Thu, 9 Nov 2017 23:00:14 +0100
Subject: s390/noexec: execute kexec datamover without DAT
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
commit d0e810eeb3d326978f248b8f0233a2f30f58c72d upstream.
Rebooting into a new kernel with kexec fails (system dies) if tried on
a machine that has no-execute support. The reason for this is that the
so-called datamover code gets executed with DAT on (MMU is active) and
the page that contains the datamover is marked as non-executable.
Therefore when branching into the datamover an unexpected program
check happens and afterwards the machine is dead.
This can be simply avoided by disabling DAT, which also disables any
no-execute checks, just before the datamover gets executed.
In fact the first thing done by the datamover is to disable DAT. The
code in the datamover that disables DAT can be removed as well.
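The effect of the 0xfb mask used with stnsm can be checked with a few
lines of arithmetic. This is a sketch only: stnsm_model() and
PSW_SYSMASK_DAT are hypothetical names, and the model assumes the usual
layout of the 8-bit PSW system mask in which 0x04 is the DAT bit, 0x02
the I/O interrupt bit and 0x01 the external interrupt bit.

```c
#include <assert.h>
#include <stdint.h>

#define PSW_SYSMASK_DAT 0x04u	/* DAT bit within the 8-bit system mask */

/*
 * Models STNSM: hand back the old system mask, then AND the mask with
 * the immediate operand.
 */
static uint8_t stnsm_model(uint8_t *system_mask, uint8_t and_mask)
{
	uint8_t old = *system_mask;

	*system_mask = old & and_mask;
	return old;
}
```

Starting from a system mask of 0x07, ANDing with 0xfb yields 0x03: only
the DAT bit is cleared and both interrupt mask bits are left as they
were.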
Thanks to Michael Holzheu and Gerald Schaefer for tracking this down.
Reviewed-by: Michael Holzheu <holzheu(a)linux.vnet.ibm.com>
Reviewed-by: Philipp Rudo <prudo(a)linux.vnet.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer(a)de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Fixes: 57d7f939e7bd ("s390: add no-execute support")
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/s390/kernel/machine_kexec.c | 1 +
arch/s390/kernel/relocate_kernel.S | 3 ---
2 files changed, 1 insertion(+), 3 deletions(-)
--- a/arch/s390/kernel/machine_kexec.c
+++ b/arch/s390/kernel/machine_kexec.c
@@ -269,6 +269,7 @@ static void __do_machine_kexec(void *dat
s390_reset_system();
data_mover = (relocate_kernel_t) page_to_phys(image->control_code_page);
+ __arch_local_irq_stnsm(0xfb); /* disable DAT - avoid no-execute */
/* Call the moving routine */
(*data_mover)(&image->head, image->start);
--- a/arch/s390/kernel/relocate_kernel.S
+++ b/arch/s390/kernel/relocate_kernel.S
@@ -29,7 +29,6 @@
ENTRY(relocate_kernel)
basr %r13,0 # base address
.base:
- stnsm sys_msk-.base(%r13),0xfb # disable DAT
stctg %c0,%c15,ctlregs-.base(%r13)
stmg %r0,%r15,gprregs-.base(%r13)
lghi %r0,3
@@ -103,8 +102,6 @@ ENTRY(relocate_kernel)
.align 8
load_psw:
.long 0x00080000,0x80000000
- sys_msk:
- .quad 0
ctlregs:
.rept 16
.quad 0
This is a note to let you know that I've just added the patch titled
s390: fix transactional execution control register handling
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
s390-fix-transactional-execution-control-register-handling.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From a1c5befc1c24eb9c1ee83f711e0f21ee79cbb556 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Thu, 9 Nov 2017 12:29:34 +0100
Subject: s390: fix transactional execution control register handling
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
commit a1c5befc1c24eb9c1ee83f711e0f21ee79cbb556 upstream.
Dan Horák reported the following crash related to transactional execution:
User process fault: interruption code 0013 ilc:3 in libpthread-2.26.so[3ff93c00000+1b000]
CPU: 2 PID: 1 Comm: /init Not tainted 4.13.4-300.fc27.s390x #1
Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
task: 00000000fafc8000 task.stack: 00000000fafc4000
User PSW : 0705200180000000 000003ff93c14e70
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
User GPRS: 0000000000000077 000003ff00000000 000003ff93144d48 000003ff93144d5e
0000000000000000 0000000000000002 0000000000000000 000003ff00000000
0000000000000000 0000000000000418 0000000000000000 000003ffcc9fe770
000003ff93d28f50 000003ff9310acf0 000003ff92b0319a 000003ffcc9fe6d0
User Code: 000003ff93c14e62: 60e0b030 std %f14,48(%r11)
000003ff93c14e66: 60f0b038 std %f15,56(%r11)
#000003ff93c14e6a: e5600000ff0e tbegin 0,65294
>000003ff93c14e70: a7740006 brc 7,3ff93c14e7c
000003ff93c14e74: a7080000 lhi %r0,0
000003ff93c14e78: a7f40023 brc 15,3ff93c14ebe
000003ff93c14e7c: b2220000 ipm %r0
000003ff93c14e80: 8800001c srl %r0,28
There are several bugs with control register handling with respect to
transactional execution:
- on task switch update_cr_regs() is only called if the next task has
an mm (is not a kernel thread). This however is incorrect. This
breaks e.g. for user mode helper handling, where the kernel creates
a kernel thread and then execve's a user space program. Control
register contents related to transactional execution won't be
updated on execve. If the previous task ran with transactional
execution disabled then the new task will also run with
transactional execution disabled, which is incorrect. Therefore call
update_cr_regs() unconditionally within switch_to().
- on startup the transactional execution facility is not enabled for
the idle thread. This is not really a bug, but an inconsistency with
other facilities. Therefore enable the facility if it is available.
- on fork the new thread's per_flags field is not cleared. This means
that a child process inherits the PER_FLAG_NO_TE flag. This flag can
be set with a ptrace request to disable transactional execution for
the current process. It should not be inherited by new child
processes in order to be consistent with the handling of all other
PER related debugging options. Therefore clear the per_flags field in
copy_thread_tls().
Reported-and-tested-by: Dan Horák <dan(a)danny.cz>
Fixes: d35339a42dd1 ("s390: add support for transactional memory")
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner(a)linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/s390/include/asm/switch_to.h | 2 +-
arch/s390/kernel/early.c | 4 +++-
arch/s390/kernel/process.c | 1 +
3 files changed, 5 insertions(+), 2 deletions(-)
--- a/arch/s390/include/asm/switch_to.h
+++ b/arch/s390/include/asm/switch_to.h
@@ -37,8 +37,8 @@ static inline void restore_access_regs(u
save_ri_cb(prev->thread.ri_cb); \
save_gs_cb(prev->thread.gs_cb); \
} \
+ update_cr_regs(next); \
if (next->mm) { \
- update_cr_regs(next); \
set_cpu_flag(CIF_FPU); \
restore_access_regs(&next->thread.acrs[0]); \
restore_ri_cb(next->thread.ri_cb, prev->thread.ri_cb); \
--- a/arch/s390/kernel/early.c
+++ b/arch/s390/kernel/early.c
@@ -375,8 +375,10 @@ static __init void detect_machine_facili
S390_lowcore.machine_flags |= MACHINE_FLAG_IDTE;
if (test_facility(40))
S390_lowcore.machine_flags |= MACHINE_FLAG_LPP;
- if (test_facility(50) && test_facility(73))
+ if (test_facility(50) && test_facility(73)) {
S390_lowcore.machine_flags |= MACHINE_FLAG_TE;
+ __ctl_set_bit(0, 55);
+ }
if (test_facility(51))
S390_lowcore.machine_flags |= MACHINE_FLAG_TLB_LC;
if (test_facility(129)) {
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -100,6 +100,7 @@ int copy_thread_tls(unsigned long clone_
memset(&p->thread.per_user, 0, sizeof(p->thread.per_user));
memset(&p->thread.per_event, 0, sizeof(p->thread.per_event));
clear_tsk_thread_flag(p, TIF_SINGLE_STEP);
+ p->thread.per_flags = 0;
/* Initialize per thread user and system timer values */
p->thread.user_timer = 0;
p->thread.guest_timer = 0;
This is a note to let you know that I've just added the patch titled
s390/disassembler: add missing end marker for e7 table
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
s390-disassembler-add-missing-end-marker-for-e7-table.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 5c50538752af7968f53924b22dede8ed4ce4cb3b Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Tue, 26 Sep 2017 09:16:48 +0200
Subject: s390/disassembler: add missing end marker for e7 table
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
commit 5c50538752af7968f53924b22dede8ed4ce4cb3b upstream.
The e7 opcode table does not have an end marker. Hence when trying to
find an unknown e7 instruction the code will access memory beyond the
table until it finds something that matches the opcode, or the kernel
crashes, whichever comes first.
This affects not only the in-kernel disassembler but also uprobes and
kprobes which refuse to set a probe on unknown instructions, and
therefore search the opcode tables to figure out if instructions are
known or not.
Fixes: 3585cb0280654 ("s390/disassembler: add vector instructions")
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/s390/kernel/dis.c | 1 +
1 file changed, 1 insertion(+)
--- a/arch/s390/kernel/dis.c
+++ b/arch/s390/kernel/dis.c
@@ -1548,6 +1548,7 @@ static struct s390_insn opcode_e7[] = {
{ "vfsq", 0xce, INSTR_VRR_VV000MM },
{ "vfs", 0xe2, INSTR_VRR_VVV00MM },
{ "vftci", 0x4a, INSTR_VRI_VVIMM },
+ { "", 0, INSTR_INVALID }
};
static struct s390_insn opcode_eb[] = {
This is a note to let you know that I've just added the patch titled
ACPI / PM: Fix acpi_pm_notifier_lock vs flush_workqueue() deadlock
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
acpi-pm-fix-acpi_pm_notifier_lock-vs-flush_workqueue-deadlock.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From ff1656790b3a4caca94505c52fd0250f981ea187 Mon Sep 17 00:00:00 2001
From: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Date: Tue, 7 Nov 2017 23:08:10 +0200
Subject: ACPI / PM: Fix acpi_pm_notifier_lock vs flush_workqueue() deadlock
From: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
commit ff1656790b3a4caca94505c52fd0250f981ea187 upstream.
acpi_remove_pm_notifier() ends up calling flush_workqueue() while
holding acpi_pm_notifier_lock, and that same lock is taken
by the work via acpi_pm_notify_handler(). This can deadlock.
To fix the problem let's split the single lock into two: one to
protect the dev->wakeup between the work vs. add/remove, and
another one to handle notifier installation vs. removal.
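The lock split can be modelled without real threads. In the sketch
below every name (struct lock_model, notifier_lock, install_lock,
lock_model_acquire(), notify_handler_model(), flush_model(),
remove_notifier_model()) is hypothetical. A failed acquire stands for
the point where a real mutex_lock() would block forever, and
flush_model() runs the pending handler directly to represent waiting
for the work to finish.

```c
#include <assert.h>
#include <stdbool.h>

struct lock_model {
	bool held;
};

static struct lock_model notifier_lock;	/* protects the wakeup data */
static struct lock_model install_lock;	/* serialises add vs. remove */

static bool work_pending;
static bool handler_ran;

/* Returns false where a real mutex_lock() would block forever. */
static bool lock_model_acquire(struct lock_model *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void lock_model_release(struct lock_model *l)
{
	l->held = false;
}

/* Models the notify handler: it needs the data lock to run. */
static bool notify_handler_model(void)
{
	if (!lock_model_acquire(&notifier_lock))
		return false;	/* stuck behind the lock holder */
	handler_ran = true;
	lock_model_release(&notifier_lock);
	return true;
}

/* Models a workqueue flush: pending work must complete before return. */
static bool flush_model(void)
{
	bool ok = true;

	if (work_pending) {
		ok = notify_handler_model();
		work_pending = false;
	}
	return ok;
}

/* Removal path: flush while holding `outer`; false means deadlock. */
static bool remove_notifier_model(struct lock_model *outer)
{
	bool ok;

	if (!lock_model_acquire(outer))
		return false;
	ok = flush_model();
	lock_model_release(outer);
	return ok;
}
```

Passing &notifier_lock as the outer lock reproduces the single-lock
arrangement and the flush cannot complete; passing &install_lock lets
the handler take notifier_lock and finish. The model leaves out the
later step in which the removal path briefly takes the data lock to
clear the wakeup fields.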
After commit a1d14934ea4b "workqueue/lockdep: 'Fix' flush_work()
annotation" I was able to kill the machine (Intel Braswell)
very easily with 'powertop --auto-tune', runtime suspending i915,
and trying to wake it up via the USB keyboard. The cases when
it didn't die are presumably explained by lockdep getting disabled
by something else (cpu hotplug locking issues usually).
Fortunately I still got a lockdep report over netconsole
(trickling in very slowly), even though the machine was
otherwise practically dead:
[ 112.179806] ======================================================
[ 114.670858] WARNING: possible circular locking dependency detected
[ 117.155663] 4.13.0-rc6-bsw-bisect-00169-ga1d14934ea4b #119 Not tainted
[ 119.658101] ------------------------------------------------------
[ 121.310242] xhci_hcd 0000:00:14.0: xHCI host not responding to stop endpoint command.
[ 121.313294] xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
[ 121.313346] xhci_hcd 0000:00:14.0: HC died; cleaning up
[ 121.313485] usb 1-6: USB disconnect, device number 3
[ 121.313501] usb 1-6.2: USB disconnect, device number 4
[ 134.747383] kworker/0:2/47 is trying to acquire lock:
[ 137.220790] (acpi_pm_notifier_lock){+.+.}, at: [<ffffffff813cafdf>] acpi_pm_notify_handler+0x2f/0x80
[ 139.721524]
[ 139.721524] but task is already holding lock:
[ 144.672922] ((&dpc->work)){+.+.}, at: [<ffffffff8109ce90>] process_one_work+0x160/0x720
[ 147.184450]
[ 147.184450] which lock already depends on the new lock.
[ 147.184450]
[ 154.604711]
[ 154.604711] the existing dependency chain (in reverse order) is:
[ 159.447888]
[ 159.447888] -> #2 ((&dpc->work)){+.+.}:
[ 164.183486] __lock_acquire+0x1255/0x13f0
[ 166.504313] lock_acquire+0xb5/0x210
[ 168.778973] process_one_work+0x1b9/0x720
[ 171.030316] worker_thread+0x4c/0x440
[ 173.257184] kthread+0x154/0x190
[ 175.456143] ret_from_fork+0x27/0x40
[ 177.624348]
[ 177.624348] -> #1 ("kacpi_notify"){+.+.}:
[ 181.850351] __lock_acquire+0x1255/0x13f0
[ 183.941695] lock_acquire+0xb5/0x210
[ 186.046115] flush_workqueue+0xdd/0x510
[ 190.408153] acpi_os_wait_events_complete+0x31/0x40
[ 192.625303] acpi_remove_notify_handler+0x133/0x188
[ 194.820829] acpi_remove_pm_notifier+0x56/0x90
[ 196.989068] acpi_dev_pm_detach+0x5f/0xa0
[ 199.145866] dev_pm_domain_detach+0x27/0x30
[ 201.285614] i2c_device_probe+0x100/0x210
[ 203.411118] driver_probe_device+0x23e/0x310
[ 205.522425] __driver_attach+0xa3/0xb0
[ 207.634268] bus_for_each_dev+0x69/0xa0
[ 209.714797] driver_attach+0x1e/0x20
[ 211.778258] bus_add_driver+0x1bc/0x230
[ 213.837162] driver_register+0x60/0xe0
[ 215.868162] i2c_register_driver+0x42/0x70
[ 217.869551] 0xffffffffa0172017
[ 219.863009] do_one_initcall+0x45/0x170
[ 221.843863] do_init_module+0x5f/0x204
[ 223.817915] load_module+0x225b/0x29b0
[ 225.757234] SyS_finit_module+0xc6/0xd0
[ 227.661851] do_syscall_64+0x5c/0x120
[ 229.536819] return_from_SYSCALL_64+0x0/0x7a
[ 231.392444]
[ 231.392444] -> #0 (acpi_pm_notifier_lock){+.+.}:
[ 235.124914] check_prev_add+0x44e/0x8a0
[ 237.024795] __lock_acquire+0x1255/0x13f0
[ 238.937351] lock_acquire+0xb5/0x210
[ 240.840799] __mutex_lock+0x75/0x940
[ 242.709517] mutex_lock_nested+0x1c/0x20
[ 244.551478] acpi_pm_notify_handler+0x2f/0x80
[ 246.382052] acpi_ev_notify_dispatch+0x44/0x5c
[ 248.194412] acpi_os_execute_deferred+0x14/0x30
[ 250.003925] process_one_work+0x1ec/0x720
[ 251.803191] worker_thread+0x4c/0x440
[ 253.605307] kthread+0x154/0x190
[ 255.387498] ret_from_fork+0x27/0x40
[ 257.153175]
[ 257.153175] other info that might help us debug this:
[ 257.153175]
[ 262.324392] Chain exists of:
[ 262.324392] acpi_pm_notifier_lock --> "kacpi_notify" --> (&dpc->work)
[ 262.324392]
[ 267.391997] Possible unsafe locking scenario:
[ 267.391997]
[ 270.758262] CPU0 CPU1
[ 272.431713] ---- ----
[ 274.060756] lock((&dpc->work));
[ 275.646532] lock("kacpi_notify");
[ 277.260772] lock((&dpc->work));
[ 278.839146] lock(acpi_pm_notifier_lock);
[ 280.391902]
[ 280.391902] *** DEADLOCK ***
[ 280.391902]
[ 284.986385] 2 locks held by kworker/0:2/47:
[ 286.524895] #0: ("kacpi_notify"){+.+.}, at: [<ffffffff8109ce90>] process_one_work+0x160/0x720
[ 288.112927] #1: ((&dpc->work)){+.+.}, at: [<ffffffff8109ce90>] process_one_work+0x160/0x720
[ 289.727725]
Fixes: c072530f391e (ACPI / PM: Revork the handling of ACPI device wakeup notifications)
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/acpi/device_pm.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
--- a/drivers/acpi/device_pm.c
+++ b/drivers/acpi/device_pm.c
@@ -387,6 +387,7 @@ EXPORT_SYMBOL(acpi_bus_power_manageable)
#ifdef CONFIG_PM
static DEFINE_MUTEX(acpi_pm_notifier_lock);
+static DEFINE_MUTEX(acpi_pm_notifier_install_lock);
void acpi_pm_wakeup_event(struct device *dev)
{
@@ -443,24 +444,25 @@ acpi_status acpi_add_pm_notifier(struct
if (!dev && !func)
return AE_BAD_PARAMETER;
- mutex_lock(&acpi_pm_notifier_lock);
+ mutex_lock(&acpi_pm_notifier_install_lock);
if (adev->wakeup.flags.notifier_present)
goto out;
- adev->wakeup.ws = wakeup_source_register(dev_name(&adev->dev));
- adev->wakeup.context.dev = dev;
- adev->wakeup.context.func = func;
-
status = acpi_install_notify_handler(adev->handle, ACPI_SYSTEM_NOTIFY,
acpi_pm_notify_handler, NULL);
if (ACPI_FAILURE(status))
goto out;
+ mutex_lock(&acpi_pm_notifier_lock);
+ adev->wakeup.ws = wakeup_source_register(dev_name(&adev->dev));
+ adev->wakeup.context.dev = dev;
+ adev->wakeup.context.func = func;
adev->wakeup.flags.notifier_present = true;
+ mutex_unlock(&acpi_pm_notifier_lock);
out:
- mutex_unlock(&acpi_pm_notifier_lock);
+ mutex_unlock(&acpi_pm_notifier_install_lock);
return status;
}
@@ -472,7 +474,7 @@ acpi_status acpi_remove_pm_notifier(stru
{
acpi_status status = AE_BAD_PARAMETER;
- mutex_lock(&acpi_pm_notifier_lock);
+ mutex_lock(&acpi_pm_notifier_install_lock);
if (!adev->wakeup.flags.notifier_present)
goto out;
@@ -483,14 +485,15 @@ acpi_status acpi_remove_pm_notifier(stru
if (ACPI_FAILURE(status))
goto out;
+ mutex_lock(&acpi_pm_notifier_lock);
adev->wakeup.context.func = NULL;
adev->wakeup.context.dev = NULL;
wakeup_source_unregister(adev->wakeup.ws);
-
adev->wakeup.flags.notifier_present = false;
+ mutex_unlock(&acpi_pm_notifier_lock);
out:
- mutex_unlock(&acpi_pm_notifier_lock);
+ mutex_unlock(&acpi_pm_notifier_install_lock);
return status;
}
Patches currently in stable-queue which might be from ville.syrjala(a)linux.intel.com are
queue-4.14/acpi-pm-fix-acpi_pm_notifier_lock-vs-flush_workqueue-deadlock.patch
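The lock split in the patch above can be illustrated outside the kernel. What follows is a minimal userspace sketch using POSIX threads, not the kernel code: struct notifier, install_lock, data_lock, count_hit(), notify(), add_notifier() and remove_notifier() are hypothetical names chosen for illustration. The outer lock serializes installation and removal, the inner lock protects only the fields the notification path reads, and any step that may wait for in-flight notifications runs without the inner lock held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-device wakeup state. */
struct notifier {
	bool present;
	void (*func)(void *ctx);
	void *ctx;
};

/* Outer lock: serializes add_notifier() against remove_notifier(). */
static pthread_mutex_t install_lock = PTHREAD_MUTEX_INITIALIZER;
/* Inner lock: protects the fields that notify() reads. */
static pthread_mutex_t data_lock = PTHREAD_MUTEX_INITIALIZER;

/* Example callback: counts how often it was invoked. */
static void count_hit(void *ctx)
{
	++*(int *)ctx;
}

/* Notification path: only ever takes the inner lock. */
static void notify(struct notifier *n)
{
	assert(n != NULL);
	pthread_mutex_lock(&data_lock);
	if (n->func)
		n->func(n->ctx);
	pthread_mutex_unlock(&data_lock);
}

/* Returns 0 on success, -1 if a notifier is already present. */
static int add_notifier(struct notifier *n, void (*func)(void *), void *ctx)
{
	int ret = -1;

	assert(n != NULL);
	pthread_mutex_lock(&install_lock);
	if (n->present)
		goto out;

	/*
	 * A step that may wait for in-flight notifications would run here.
	 * data_lock is not held, so notify() can still make progress.
	 */

	pthread_mutex_lock(&data_lock);
	n->func = func;
	n->ctx = ctx;
	n->present = true;
	pthread_mutex_unlock(&data_lock);
	ret = 0;
out:
	pthread_mutex_unlock(&install_lock);
	return ret;
}

/* Returns 0 on success, -1 if no notifier is present. */
static int remove_notifier(struct notifier *n)
{
	int ret = -1;

	assert(n != NULL);
	pthread_mutex_lock(&install_lock);
	if (!n->present)
		goto out;

	/* As above: any waiting happens before data_lock is taken. */

	pthread_mutex_lock(&data_lock);
	n->func = NULL;
	n->ctx = NULL;
	n->present = false;
	pthread_mutex_unlock(&data_lock);
	ret = 0;
out:
	pthread_mutex_unlock(&install_lock);
	return ret;
}
```

Because notify() never takes install_lock, a thread that holds install_lock while waiting for notifications to drain cannot form the cycle shown in the lockdep report.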
This is a note to let you know that I've just added the patch titled
ACPI / EC: Fix regression related to triggering source of EC event handling
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
acpi-ec-fix-regression-related-to-triggering-source-of-ec-event-handling.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 53c5eaabaea9a1b7a96f95ccc486d2ad721d95bb Mon Sep 17 00:00:00 2001
From: Lv Zheng <lv.zheng(a)intel.com>
Date: Tue, 26 Sep 2017 16:54:03 +0800
Subject: ACPI / EC: Fix regression related to triggering source of EC event handling
From: Lv Zheng <lv.zheng(a)intel.com>
commit 53c5eaabaea9a1b7a96f95ccc486d2ad721d95bb upstream.
Originally, the Samsung quirks removed by commit 4c237371 could be covered
by commit e923e8e7 together with ec_freeze_events=Y mode. However, commit
9c40f956 changed ec_freeze_events=Y back to N, causing the problem to
resurface.
If commit e923e8e7 is made robust enough, the ec_freeze_events mode can be
changed freely, so this patch fixes the issue by improving on commit
e923e8e7.
Related commits, listed in merge order:
Commit: e923e8e79e18fd6be9162f1be6b99a002e9df2cb
Subject: ACPI / EC: Fix an issue that SCI_EVT cannot be detected
after event is enabled
Commit: 4c237371f290d1ed3b2071dd43554362137b1cce
Subject: ACPI / EC: Remove old CLEAR_ON_RESUME quirk
Commit: 9c40f956ce9b331493347d1b3cb7e384f7dc0581
Subject: Revert "ACPI / EC: Enable event freeze mode..." to fix
a regression
This patch not only fixes the reported post-resume issue with the EC event
triggering source, but also fixes a similar, unreported issue at driver
bind time by adding the EC event triggering source to ec_install_handlers().
Fixes: e923e8e79e18 (ACPI / EC: Fix an issue that SCI_EVT cannot be detected after event is enabled)
Fixes: 4c237371f290 (ACPI / EC: Remove old CLEAR_ON_RESUME quirk)
Fixes: 9c40f956ce9b (Revert "ACPI / EC: Enable event freeze mode..." to fix a regression)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196833
Signed-off-by: Lv Zheng <lv.zheng(a)intel.com>
Reported-by: Alistair Hamilton <ahpatent(a)gmail.com>
Tested-by: Alistair Hamilton <ahpatent(a)gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/acpi/ec.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -486,8 +486,11 @@ static inline void __acpi_ec_enable_even
{
if (!test_and_set_bit(EC_FLAGS_QUERY_ENABLED, &ec->flags))
ec_log_drv("event unblocked");
- if (!test_bit(EC_FLAGS_QUERY_PENDING, &ec->flags))
- advance_transaction(ec);
+ /*
+ * Unconditionally invoke this once after enabling the event
+ * handling mechanism to detect the pending events.
+ */
+ advance_transaction(ec);
}
static inline void __acpi_ec_disable_event(struct acpi_ec *ec)
@@ -1456,11 +1459,10 @@ static int ec_install_handlers(struct ac
if (test_bit(EC_FLAGS_STARTED, &ec->flags) &&
ec->reference_count >= 1)
acpi_ec_enable_gpe(ec, true);
-
- /* EC is fully operational, allow queries */
- acpi_ec_enable_event(ec);
}
}
+ /* EC is fully operational, allow queries */
+ acpi_ec_enable_event(ec);
return 0;
}
Patches currently in stable-queue which might be from lv.zheng(a)intel.com are
queue-4.14/acpi-ec-fix-regression-related-to-triggering-source-of-ec-event-handling.patch
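The idea of checking the event source once, unconditionally, right after event handling is enabled can be sketched in plain C. This is a generic illustration under stated assumptions, not the EC driver's logic: struct event_source and the functions poll_source(), raise_event(), enable_events() and disable_events() are hypothetical, and the status bit is assumed to stay set until it is serviced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device whose status bit stays set until serviced. */
struct event_source {
	bool enabled;        /* event handling is allowed */
	bool status_pending; /* level-style status bit */
	unsigned int handled;
};

/* Check the status bit and service it if handling is enabled. */
static void poll_source(struct event_source *src)
{
	assert(src != NULL);
	if (src->enabled && src->status_pending) {
		src->status_pending = false;
		src->handled++;
	}
}

/* The device raises an event; it is serviced right away only if enabled. */
static void raise_event(struct event_source *src)
{
	assert(src != NULL);
	src->status_pending = true;
	poll_source(src);
}

/* Enable handling, then poll once unconditionally to catch earlier events. */
static void enable_events(struct event_source *src)
{
	assert(src != NULL);
	src->enabled = true;
	poll_source(src);
}

static void disable_events(struct event_source *src)
{
	assert(src != NULL);
	src->enabled = false;
}
```

In this model an event raised while handling is disabled is not lost: the poll in enable_events() picks it up, regardless of any other bookkeeping state.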
The patch below does not apply to the 3.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4f89fa286f6729312e227e7c2d764e8e7b9d340e Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:24 +0000
Subject: [PATCH] ACPI / APEI: Replace ioremap_page_range() with fixmap
Replace ghes_io{re,un}map_pfn_{nmi,irq}()'s use of ioremap_page_range()
with __set_fixmap(), as ioremap_page_range() may sleep to allocate a new
level of page-table, even if it's passed an existing final-address to
use in the mapping.
The GHES driver can only be enabled for architectures that select
HAVE_ACPI_APEI: Add fixmap entries to both x86 and arm64.
clear_fixmap() does the TLB invalidation in __set_fixmap() for arm64
and __set_pte_vaddr() for x86. In each case it's the same as the
respective arch_apei_flush_tlb_one().
Reported-by: Fengguang Wu <fengguang.wu(a)intel.com>
Suggested-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: James Morse <james.morse(a)arm.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tyler Baicar <tbaicar(a)codeaurora.org>
Tested-by: Toshi Kani <toshi.kani(a)hpe.com>
[ For the arm64 bits: ]
Acked-by: Will Deacon <will.deacon(a)arm.com>
[ For the x86 bits: ]
Acked-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Cc: All applicable <stable(a)vger.kernel.org>
diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index caf86be815ba..4052ec39e8db 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -51,6 +51,13 @@ enum fixed_addresses {
FIX_EARLYCON_MEM_BASE,
FIX_TEXT_POKE0,
+
+#ifdef CONFIG_ACPI_APEI_GHES
+ /* Used for GHES mapping from assorted contexts */
+ FIX_APEI_GHES_IRQ,
+ FIX_APEI_GHES_NMI,
+#endif /* CONFIG_ACPI_APEI_GHES */
+
__end_of_permanent_fixed_addresses,
/*
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index dcd9fb55e679..b0c505fe9a95 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -104,6 +104,12 @@ enum fixed_addresses {
FIX_GDT_REMAP_BEGIN,
FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
+#ifdef CONFIG_ACPI_APEI_GHES
+ /* Used for GHES mapping from assorted contexts */
+ FIX_APEI_GHES_IRQ,
+ FIX_APEI_GHES_NMI,
+#endif
+
__end_of_permanent_fixed_addresses,
/*
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index cb7aceae3553..572b6c7303ed 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -51,6 +51,7 @@
#include <acpi/actbl1.h>
#include <acpi/ghes.h>
#include <acpi/apei.h>
+#include <asm/fixmap.h>
#include <asm/tlbflush.h>
#include <ras/ras_event.h>
@@ -112,7 +113,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
* Because the memory area used to transfer hardware error information
* from BIOS to Linux can be determined only in NMI, IRQ or timer
* handler, but general ioremap can not be used in atomic context, so
- * a special version of atomic ioremap is implemented for that.
+ * the fixmap is used instead.
*/
/*
@@ -126,8 +127,8 @@ static DEFINE_MUTEX(ghes_list_mutex);
/* virtual memory area for atomic ioremap */
static struct vm_struct *ghes_ioremap_area;
/*
- * These 2 spinlock is used to prevent atomic ioremap virtual memory
- * area from being mapped simultaneously.
+ * These 2 spinlocks are used to prevent the fixmap entries from being used
+ * simultaneously.
*/
static DEFINE_RAW_SPINLOCK(ghes_ioremap_lock_nmi);
static DEFINE_SPINLOCK(ghes_ioremap_lock_irq);
@@ -159,53 +160,36 @@ static void ghes_ioremap_exit(void)
static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
{
- unsigned long vaddr;
phys_addr_t paddr;
pgprot_t prot;
- vaddr = (unsigned long)GHES_IOREMAP_NMI_PAGE(ghes_ioremap_area->addr);
-
paddr = pfn << PAGE_SHIFT;
prot = arch_apei_get_mem_attribute(paddr);
- ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
+ __set_fixmap(FIX_APEI_GHES_NMI, paddr, prot);
- return (void __iomem *)vaddr;
+ return (void __iomem *) fix_to_virt(FIX_APEI_GHES_NMI);
}
static void __iomem *ghes_ioremap_pfn_irq(u64 pfn)
{
- unsigned long vaddr;
phys_addr_t paddr;
pgprot_t prot;
- vaddr = (unsigned long)GHES_IOREMAP_IRQ_PAGE(ghes_ioremap_area->addr);
-
paddr = pfn << PAGE_SHIFT;
prot = arch_apei_get_mem_attribute(paddr);
+ __set_fixmap(FIX_APEI_GHES_IRQ, paddr, prot);
- ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
-
- return (void __iomem *)vaddr;
+ return (void __iomem *) fix_to_virt(FIX_APEI_GHES_IRQ);
}
-static void ghes_iounmap_nmi(void __iomem *vaddr_ptr)
+static void ghes_iounmap_nmi(void)
{
- unsigned long vaddr = (unsigned long __force)vaddr_ptr;
- void *base = ghes_ioremap_area->addr;
-
- BUG_ON(vaddr != (unsigned long)GHES_IOREMAP_NMI_PAGE(base));
- unmap_kernel_range_noflush(vaddr, PAGE_SIZE);
- arch_apei_flush_tlb_one(vaddr);
+ clear_fixmap(FIX_APEI_GHES_NMI);
}
-static void ghes_iounmap_irq(void __iomem *vaddr_ptr)
+static void ghes_iounmap_irq(void)
{
- unsigned long vaddr = (unsigned long __force)vaddr_ptr;
- void *base = ghes_ioremap_area->addr;
-
- BUG_ON(vaddr != (unsigned long)GHES_IOREMAP_IRQ_PAGE(base));
- unmap_kernel_range_noflush(vaddr, PAGE_SIZE);
- arch_apei_flush_tlb_one(vaddr);
+ clear_fixmap(FIX_APEI_GHES_IRQ);
}
static int ghes_estatus_pool_init(void)
@@ -361,10 +345,10 @@ static void ghes_copy_tofrom_phys(void *buffer, u64 paddr, u32 len,
paddr += trunk;
buffer += trunk;
if (in_nmi) {
- ghes_iounmap_nmi(vaddr);
+ ghes_iounmap_nmi();
raw_spin_unlock(&ghes_ioremap_lock_nmi);
} else {
- ghes_iounmap_irq(vaddr);
+ ghes_iounmap_irq();
spin_unlock_irqrestore(&ghes_ioremap_lock_irq, flags);
}
}
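The reason a fixed slot suits atomic context can be shown with a small userspace model. This is a sketch under stated assumptions, not the kernel's fixmap: enum fixed_slot, slot_table, FIXED_TOP, slot_to_virt(), map_pfn() and clear_slot() are hypothetical names, and the address constant is arbitrary. Each slot is reserved statically and its address is computed from its index, so installing a mapping never allocates and therefore never needs to sleep. The sketch omits the per-slot locking that the driver uses to keep a slot from being used twice at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOT_SHIFT 12
#define SLOT_SIZE  (UINT64_C(1) << SLOT_SHIFT)
/* Arbitrary illustrative address for slot 0; not a real kernel constant. */
#define FIXED_TOP  UINT64_C(0xfffff000)

/* Hypothetical slot indices, mirroring the idea of enum fixed_addresses. */
enum fixed_slot {
	SLOT_GHES_IRQ,
	SLOT_GHES_NMI,
	NR_FIXED_SLOTS
};

struct slot_entry {
	bool mapped;
	uint64_t paddr;
};

/* Reserved statically: installing a mapping never allocates. */
static struct slot_entry slot_table[NR_FIXED_SLOTS];

/* Pure arithmetic on the index, analogous in spirit to fix_to_virt(). */
static uint64_t slot_to_virt(enum fixed_slot idx)
{
	assert(idx < NR_FIXED_SLOTS);
	return FIXED_TOP - ((uint64_t)idx << SLOT_SHIFT);
}

/* Point a slot at the page with frame number pfn; returns its address. */
static uint64_t map_pfn(enum fixed_slot idx, uint64_t pfn)
{
	assert(idx < NR_FIXED_SLOTS);
	slot_table[idx].paddr = pfn << SLOT_SHIFT;
	slot_table[idx].mapped = true;
	return slot_to_virt(idx);
}

/* Tear the mapping down again; the slot itself stays reserved. */
static void clear_slot(enum fixed_slot idx)
{
	assert(idx < NR_FIXED_SLOTS);
	slot_table[idx].mapped = false;
	slot_table[idx].paddr = 0;
}
```

Since the slot identifies the address, the unmap side needs no address argument, which matches the way ghes_iounmap_nmi() and ghes_iounmap_irq() lose their parameter in the diff above.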
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4f89fa286f6729312e227e7c2d764e8e7b9d340e Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:24 +0000
Subject: [PATCH] ACPI / APEI: Replace ioremap_page_range() with fixmap
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4f89fa286f6729312e227e7c2d764e8e7b9d340e Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:24 +0000
Subject: [PATCH] ACPI / APEI: Replace ioremap_page_range() with fixmap
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4f89fa286f6729312e227e7c2d764e8e7b9d340e Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:24 +0000
Subject: [PATCH] ACPI / APEI: Replace ioremap_page_range() with fixmap
The patch below does not apply to the 3.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 520e18a5080d2c444a03280d99c8a35cb667d321 Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:25 +0000
Subject: [PATCH] ACPI / APEI: Remove ghes_ioremap_area
Now that nothing is using the ghes_ioremap_area pages, rip them out.
Signed-off-by: James Morse <james.morse(a)arm.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tyler Baicar <tbaicar(a)codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Cc: All applicable <stable(a)vger.kernel.org>
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 572b6c7303ed..f14695e744d0 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -114,19 +114,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
* from BIOS to Linux can be determined only in NMI, IRQ or timer
* handler, but general ioremap can not be used in atomic context, so
* the fixmap is used instead.
- */
-
-/*
- * Two virtual pages are used, one for IRQ/PROCESS context, the other for
- * NMI context (optionally).
- */
-#define GHES_IOREMAP_PAGES 2
-#define GHES_IOREMAP_IRQ_PAGE(base) (base)
-#define GHES_IOREMAP_NMI_PAGE(base) ((base) + PAGE_SIZE)
-
-/* virtual memory area for atomic ioremap */
-static struct vm_struct *ghes_ioremap_area;
-/*
+ *
* These 2 spinlocks are used to prevent the fixmap entries from being used
* simultaneously.
*/
@@ -141,23 +129,6 @@ static atomic_t ghes_estatus_cache_alloced;
static int ghes_panic_timeout __read_mostly = 30;
-static int ghes_ioremap_init(void)
-{
- ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
- VM_IOREMAP, VMALLOC_START, VMALLOC_END);
- if (!ghes_ioremap_area) {
- pr_err(GHES_PFX "Failed to allocate virtual memory area for atomic ioremap.\n");
- return -ENOMEM;
- }
-
- return 0;
-}
-
-static void ghes_ioremap_exit(void)
-{
- free_vm_area(ghes_ioremap_area);
-}
-
static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
{
phys_addr_t paddr;
@@ -1247,13 +1218,9 @@ static int __init ghes_init(void)
ghes_nmi_init_cxt();
- rc = ghes_ioremap_init();
- if (rc)
- goto err;
-
rc = ghes_estatus_pool_init();
if (rc)
- goto err_ioremap_exit;
+ goto err;
rc = ghes_estatus_pool_expand(GHES_ESTATUS_CACHE_AVG_SIZE *
GHES_ESTATUS_CACHE_ALLOCED_MAX);
@@ -1277,8 +1244,6 @@ static int __init ghes_init(void)
return 0;
err_pool_exit:
ghes_estatus_pool_exit();
-err_ioremap_exit:
- ghes_ioremap_exit();
err:
return rc;
}
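The label changes in ghes_init() follow the usual goto-based unwinding pattern: each error label undoes exactly the steps that succeeded before it, so removing a setup step also removes its label. A minimal userspace sketch of the pattern after such a removal, with hypothetical names (pool_init(), cache_init(), subsystem_init() and the fail flags are illustrative only):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static bool pool_ready, cache_ready;

/* Hypothetical setup steps; "fail" injects an allocation failure. */
static int pool_init(bool fail)
{
	if (fail)
		return -ENOMEM;
	pool_ready = true;
	return 0;
}

static void pool_exit(void)
{
	pool_ready = false;
}

static int cache_init(bool fail)
{
	if (fail)
		return -ENOMEM;
	cache_ready = true;
	return 0;
}

static int subsystem_init(bool fail_pool, bool fail_cache)
{
	int rc;

	/* First step: nothing to undo if it fails. */
	rc = pool_init(fail_pool);
	if (rc)
		goto err;

	/* Second step: on failure, undo only the first step. */
	rc = cache_init(fail_cache);
	if (rc)
		goto err_pool_exit;

	return 0;

err_pool_exit:
	pool_exit();
err:
	assert(rc != 0);
	return rc;
}
```

With the earlier step gone, the first failure jumps straight to the final label, just as the pool initialization failure now uses "goto err" in the diff above.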
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 520e18a5080d2c444a03280d99c8a35cb667d321 Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:25 +0000
Subject: [PATCH] ACPI / APEI: Remove ghes_ioremap_area
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 520e18a5080d2c444a03280d99c8a35cb667d321 Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:25 +0000
Subject: [PATCH] ACPI / APEI: Remove ghes_ioremap_area
Now that nothing is using the ghes_ioremap_area pages, rip them out.
Signed-off-by: James Morse <james.morse(a)arm.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tyler Baicar <tbaicar(a)codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Cc: All applicable <stable(a)vger.kernel.org>
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 572b6c7303ed..f14695e744d0 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -114,19 +114,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
* from BIOS to Linux can be determined only in NMI, IRQ or timer
* handler, but general ioremap can not be used in atomic context, so
* the fixmap is used instead.
- */
-
-/*
- * Two virtual pages are used, one for IRQ/PROCESS context, the other for
- * NMI context (optionally).
- */
-#define GHES_IOREMAP_PAGES 2
-#define GHES_IOREMAP_IRQ_PAGE(base) (base)
-#define GHES_IOREMAP_NMI_PAGE(base) ((base) + PAGE_SIZE)
-
-/* virtual memory area for atomic ioremap */
-static struct vm_struct *ghes_ioremap_area;
-/*
+ *
* These 2 spinlocks are used to prevent the fixmap entries from being used
* simultaneously.
*/
@@ -141,23 +129,6 @@ static atomic_t ghes_estatus_cache_alloced;
static int ghes_panic_timeout __read_mostly = 30;
-static int ghes_ioremap_init(void)
-{
- ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
- VM_IOREMAP, VMALLOC_START, VMALLOC_END);
- if (!ghes_ioremap_area) {
- pr_err(GHES_PFX "Failed to allocate virtual memory area for atomic ioremap.\n");
- return -ENOMEM;
- }
-
- return 0;
-}
-
-static void ghes_ioremap_exit(void)
-{
- free_vm_area(ghes_ioremap_area);
-}
-
static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
{
phys_addr_t paddr;
@@ -1247,13 +1218,9 @@ static int __init ghes_init(void)
ghes_nmi_init_cxt();
- rc = ghes_ioremap_init();
- if (rc)
- goto err;
-
rc = ghes_estatus_pool_init();
if (rc)
- goto err_ioremap_exit;
+ goto err;
rc = ghes_estatus_pool_expand(GHES_ESTATUS_CACHE_AVG_SIZE *
GHES_ESTATUS_CACHE_ALLOCED_MAX);
@@ -1277,8 +1244,6 @@ static int __init ghes_init(void)
return 0;
err_pool_exit:
ghes_estatus_pool_exit();
-err_ioremap_exit:
- ghes_ioremap_exit();
err:
return rc;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 520e18a5080d2c444a03280d99c8a35cb667d321 Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Mon, 6 Nov 2017 18:44:25 +0000
Subject: [PATCH] ACPI / APEI: Remove ghes_ioremap_area
Now that nothing is using the ghes_ioremap_area pages, rip them out.
Signed-off-by: James Morse <james.morse(a)arm.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tyler Baicar <tbaicar(a)codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Cc: All applicable <stable(a)vger.kernel.org>
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 572b6c7303ed..f14695e744d0 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -114,19 +114,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
* from BIOS to Linux can be determined only in NMI, IRQ or timer
* handler, but general ioremap can not be used in atomic context, so
* the fixmap is used instead.
- */
-
-/*
- * Two virtual pages are used, one for IRQ/PROCESS context, the other for
- * NMI context (optionally).
- */
-#define GHES_IOREMAP_PAGES 2
-#define GHES_IOREMAP_IRQ_PAGE(base) (base)
-#define GHES_IOREMAP_NMI_PAGE(base) ((base) + PAGE_SIZE)
-
-/* virtual memory area for atomic ioremap */
-static struct vm_struct *ghes_ioremap_area;
-/*
+ *
* These 2 spinlocks are used to prevent the fixmap entries from being used
* simultaneously.
*/
@@ -141,23 +129,6 @@ static atomic_t ghes_estatus_cache_alloced;
static int ghes_panic_timeout __read_mostly = 30;
-static int ghes_ioremap_init(void)
-{
- ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
- VM_IOREMAP, VMALLOC_START, VMALLOC_END);
- if (!ghes_ioremap_area) {
- pr_err(GHES_PFX "Failed to allocate virtual memory area for atomic ioremap.\n");
- return -ENOMEM;
- }
-
- return 0;
-}
-
-static void ghes_ioremap_exit(void)
-{
- free_vm_area(ghes_ioremap_area);
-}
-
static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
{
phys_addr_t paddr;
@@ -1247,13 +1218,9 @@ static int __init ghes_init(void)
ghes_nmi_init_cxt();
- rc = ghes_ioremap_init();
- if (rc)
- goto err;
-
rc = ghes_estatus_pool_init();
if (rc)
- goto err_ioremap_exit;
+ goto err;
rc = ghes_estatus_pool_expand(GHES_ESTATUS_CACHE_AVG_SIZE *
GHES_ESTATUS_CACHE_ALLOCED_MAX);
@@ -1277,8 +1244,6 @@ static int __init ghes_init(void)
return 0;
err_pool_exit:
ghes_estatus_pool_exit();
-err_ioremap_exit:
- ghes_ioremap_exit();
err:
return rc;
}
The patch below does not apply to the 3.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5c50538752af7968f53924b22dede8ed4ce4cb3b Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Tue, 26 Sep 2017 09:16:48 +0200
Subject: [PATCH] s390/disassembler: add missing end marker for e7 table
The e7 opcode table does not have an end marker. Hence when trying to
find an unknown e7 instruction the code will access memory behind the
table until it finds something that matches the opcode, or the kernel
crashes, whichever comes first.
This affects not only the in-kernel disassembler but also uprobes and
kprobes which refuse to set a probe on unknown instructions, and
therefore search the opcode tables to figure out if instructions are
known or not.
Cc: <stable(a)vger.kernel.org> # v3.18+
Fixes: 3585cb0280654 ("s390/disassembler: add vector instructions")
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
diff --git a/arch/s390/kernel/dis.c b/arch/s390/kernel/dis.c
index f7e82302a71e..d9970c15f79d 100644
--- a/arch/s390/kernel/dis.c
+++ b/arch/s390/kernel/dis.c
@@ -1548,6 +1548,7 @@ static struct s390_insn opcode_e7[] = {
{ "vfsq", 0xce, INSTR_VRR_VV000MM },
{ "vfs", 0xe2, INSTR_VRR_VVV00MM },
{ "vftci", 0x4a, INSTR_VRI_VVIMM },
+ { "", 0, INSTR_INVALID }
};
static struct s390_insn opcode_eb[] = {
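The role of the end marker can be illustrated with a small user-space sketch. It is not the kernel's code: struct insn, toy_opcode_e7 and find_insn() are hypothetical stand-ins, and the sketch assumes the lookup stops at the first entry whose format field carries the "invalid" value, which is what the added { "", 0, INSTR_INVALID } entry suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct s390_insn. */
struct insn {
	const char *name;
	unsigned char opfrag;
	int format;		/* 0 plays the role of INSTR_INVALID here */
};

/* Toy table; the last entry is the end marker the lookup relies on. */
const struct insn toy_opcode_e7[] = {
	{ "vfsq",  0xce, 1 },
	{ "vfs",   0xe2, 2 },
	{ "vftci", 0x4a, 3 },
	{ "",      0,    0 }	/* end marker */
};

/*
 * Walk the table until the end marker is reached. Without the marker
 * the loop would run past the array for any opcode that is not listed.
 */
const struct insn *find_insn(const struct insn *table, unsigned char opfrag)
{
	const struct insn *insn;

	assert(table != NULL);
	for (insn = table; insn->format != 0; insn++) {
		if (insn->opfrag == opfrag)
			return insn;
	}
	return NULL;	/* unknown instruction */
}
```

With the marker in place, a lookup for an opcode that is not in the table terminates and returns NULL, which a caller such as a probe-placement check could treat as "unknown instruction".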
The patch below does not apply to the 3.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d6e646ad7cfa7034d280459b2b2546288f247144 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Mon, 11 Sep 2017 11:24:22 +0200
Subject: [PATCH] s390/runtime instrumention: fix possible memory corruption
For PREEMPT enabled kernels the runtime instrumentation (RI) code
contains a possible use-after-free bug. If a task that makes use of RI
exits, it will execute do_exit() while still enabled for preemption.
That function will call exit_thread_runtime_instr() via
exit_thread(). If exit_thread_runtime_instr() gets preempted after the
RI control block of the task has been freed but before the pointer to
it is set to NULL, then save_ri_cb(), called from switch_to(), will
write to already freed memory.
Avoid this and simply disable preemption while freeing the control
block and setting the pointer to NULL.
Fixes: e4b8b3f33fca ("s390: add support for runtime instrumentation")
Cc: <stable(a)vger.kernel.org> # v3.7+
Reviewed-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
diff --git a/arch/s390/kernel/runtime_instr.c b/arch/s390/kernel/runtime_instr.c
index 429d3a782f1c..b9738ae2e1de 100644
--- a/arch/s390/kernel/runtime_instr.c
+++ b/arch/s390/kernel/runtime_instr.c
@@ -49,11 +49,13 @@ void exit_thread_runtime_instr(void)
{
struct task_struct *task = current;
+ preempt_disable();
if (!task->thread.ri_cb)
return;
disable_runtime_instr();
kfree(task->thread.ri_cb);
task->thread.ri_cb = NULL;
+ preempt_enable();
}
SYSCALL_DEFINE1(s390_runtime_instr, int, command)
@@ -64,9 +66,7 @@ SYSCALL_DEFINE1(s390_runtime_instr, int, command)
return -EOPNOTSUPP;
if (command == S390_RUNTIME_INSTR_STOP) {
- preempt_disable();
exit_thread_runtime_instr();
- preempt_enable();
return 0;
}
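The idea of freeing the control block and clearing the pointer inside one non-preemptible region can be sketched in user space as follows. This is only a model under stated assumptions: preempt_disable() and preempt_enable() are replaced by a nesting counter, and struct toy_thread and toy_exit_thread_runtime_instr() are hypothetical names. Note that in the hunk as quoted, the early return for a NULL ri_cb comes after preempt_disable(); the sketch restructures the check so that the disable/enable pair stays balanced on every exit path.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for the kernel primitives: they only count nesting depth. */
static int preempt_depth;

static void preempt_disable(void) { preempt_depth++; }
static void preempt_enable(void)  { preempt_depth--; }

/* Hypothetical per-thread state holding the control block pointer. */
struct toy_thread {
	void *ri_cb;
};

/*
 * Free the control block and clear the pointer within one
 * non-preemptible region. Every path through the function
 * re-enables preemption before returning.
 */
void toy_exit_thread_runtime_instr(struct toy_thread *t)
{
	assert(t != NULL);

	preempt_disable();
	if (t->ri_cb) {
		free(t->ri_cb);
		t->ri_cb = NULL;
	}
	preempt_enable();
}

/* Expose the nesting depth so callers can check that it is balanced. */
int toy_preempt_depth(void)
{
	return preempt_depth;
}
```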
The patch below does not apply to the 3.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a1c5befc1c24eb9c1ee83f711e0f21ee79cbb556 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Date: Thu, 9 Nov 2017 12:29:34 +0100
Subject: [PATCH] s390: fix transactional execution control register handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Dan Horák reported the following crash related to transactional execution:
User process fault: interruption code 0013 ilc:3 in libpthread-2.26.so[3ff93c00000+1b000]
CPU: 2 PID: 1 Comm: /init Not tainted 4.13.4-300.fc27.s390x #1
Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
task: 00000000fafc8000 task.stack: 00000000fafc4000
User PSW : 0705200180000000 000003ff93c14e70
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
User GPRS: 0000000000000077 000003ff00000000 000003ff93144d48 000003ff93144d5e
0000000000000000 0000000000000002 0000000000000000 000003ff00000000
0000000000000000 0000000000000418 0000000000000000 000003ffcc9fe770
000003ff93d28f50 000003ff9310acf0 000003ff92b0319a 000003ffcc9fe6d0
User Code: 000003ff93c14e62: 60e0b030 std %f14,48(%r11)
000003ff93c14e66: 60f0b038 std %f15,56(%r11)
#000003ff93c14e6a: e5600000ff0e tbegin 0,65294
>000003ff93c14e70: a7740006 brc 7,3ff93c14e7c
000003ff93c14e74: a7080000 lhi %r0,0
000003ff93c14e78: a7f40023 brc 15,3ff93c14ebe
000003ff93c14e7c: b2220000 ipm %r0
000003ff93c14e80: 8800001c srl %r0,28
There are several bugs with control register handling with respect to
transactional execution:
- on task switch update_per_regs() is only called if the next task has
an mm (is not a kernel thread). This however is incorrect. This
breaks e.g. for user mode helper handling, where the kernel creates
a kernel thread and then execve's a user space program. Control
register contents related to transactional execution won't be
updated on execve. If the previous task ran with transactional
execution disabled then the new task will also run with
transactional execution disabled, which is incorrect. Therefore call
update_per_regs() unconditionally within switch_to().
- on startup the transactional execution facility is not enabled for
the idle thread. This is not really a bug, but an inconsistency to
other facilities. Therefore enable the facility if it is available.
- on fork the new thread's per_flags field is not cleared. This means
that a child process inherits the PER_FLAG_NO_TE flag. This flag can
be set with a ptrace request to disable transactional execution for
the current process. It should not be inherited by new child
processes in order to be consistent with the handling of all other
PER related debugging options. Therefore clear the per_flags field in
copy_thread_tls().
Reported-and-tested-by: Dan Horák <dan(a)danny.cz>
Fixes: d35339a42dd1 ("s390: add support for transactional memory")
Cc: <stable(a)vger.kernel.org> # v3.7+
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner(a)linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
diff --git a/arch/s390/include/asm/switch_to.h b/arch/s390/include/asm/switch_to.h
index f6c2b5814ab0..8e6b07609ff4 100644
--- a/arch/s390/include/asm/switch_to.h
+++ b/arch/s390/include/asm/switch_to.h
@@ -36,8 +36,8 @@ static inline void restore_access_regs(unsigned int *acrs)
save_ri_cb(prev->thread.ri_cb); \
save_gs_cb(prev->thread.gs_cb); \
} \
+ update_cr_regs(next); \
if (next->mm) { \
- update_cr_regs(next); \
set_cpu_flag(CIF_FPU); \
restore_access_regs(&next->thread.acrs[0]); \
restore_ri_cb(next->thread.ri_cb, prev->thread.ri_cb); \
diff --git a/arch/s390/kernel/early.c b/arch/s390/kernel/early.c
index 389e9f61c76f..5096875f7822 100644
--- a/arch/s390/kernel/early.c
+++ b/arch/s390/kernel/early.c
@@ -238,8 +238,10 @@ static __init void detect_machine_facilities(void)
S390_lowcore.machine_flags |= MACHINE_FLAG_IDTE;
if (test_facility(40))
S390_lowcore.machine_flags |= MACHINE_FLAG_LPP;
- if (test_facility(50) && test_facility(73))
+ if (test_facility(50) && test_facility(73)) {
S390_lowcore.machine_flags |= MACHINE_FLAG_TE;
+ __ctl_set_bit(0, 55);
+ }
if (test_facility(51))
S390_lowcore.machine_flags |= MACHINE_FLAG_TLB_LC;
if (test_facility(129)) {
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 080c851dd9a5..cee658e27732 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -86,6 +86,7 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long new_stackp,
memset(&p->thread.per_user, 0, sizeof(p->thread.per_user));
memset(&p->thread.per_event, 0, sizeof(p->thread.per_event));
clear_tsk_thread_flag(p, TIF_SINGLE_STEP);
+ p->thread.per_flags = 0;
/* Initialize per thread user and system timer values */
p->thread.user_timer = 0;
p->thread.guest_timer = 0;
The skcipher_walk_aead_common function calls scatterwalk_copychunks on
the input and output walks to skip the associated data. If the AD ends
at an SG list entry boundary, then after these calls the walks will
still be pointing to the end of the skipped region.
These offsets are later checked for alignment in skcipher_walk_next,
so the skcipher_walk may detect the alignment incorrectly.
This patch fixes it by calling scatterwalk_done after the copychunks
calls to ensure that the offsets refer to the right SG list entry.
Fixes: b286d8b1a690 ("crypto: skcipher - Add skcipher walk interface")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Ondrej Mosnacek <omosnacek(a)gmail.com>
---
crypto/skcipher.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/crypto/skcipher.c b/crypto/skcipher.c
index 4faa0fd53b0c..6c45ed536664 100644
--- a/crypto/skcipher.c
+++ b/crypto/skcipher.c
@@ -517,6 +517,9 @@ static int skcipher_walk_aead_common(struct skcipher_walk *walk,
scatterwalk_copychunks(NULL, &walk->in, req->assoclen, 2);
scatterwalk_copychunks(NULL, &walk->out, req->assoclen, 2);
+ scatterwalk_done(&walk->in, 0, walk->total);
+ scatterwalk_done(&walk->out, 0, walk->total);
+
walk->iv = req->iv;
walk->oiv = req->iv;
--
2.14.1
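The boundary case described in the skcipher patch above can be modelled in a few lines of user-space C. This is a deliberately simplified sketch, not the crypto API: struct toy_walk, toy_skip() and toy_done() are hypothetical, and segments are reduced to plain lengths. It assumes, as the commit message states, that skipping bytes leaves the walk at the end of a segment when the skipped region ends exactly on a boundary, and that a separate "done" step moves it to the start of the next segment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much simplified model of a scatterlist walk. */
struct toy_walk {
	const size_t *seg_len;	/* length of each segment */
	size_t nsegs;		/* number of segments */
	size_t seg;		/* current segment index */
	size_t off;		/* offset inside the current segment */
};

/*
 * Skip nbytes. The walk only moves on to the next segment when more
 * bytes remain, so a skip that ends exactly on a boundary leaves
 * off == seg_len[seg].
 */
void toy_skip(struct toy_walk *w, size_t nbytes)
{
	assert(w != NULL);
	while (nbytes > 0 && w->seg < w->nsegs) {
		size_t room = w->seg_len[w->seg] - w->off;
		size_t step = nbytes < room ? nbytes : room;

		w->off += step;
		nbytes -= step;
		if (nbytes > 0) {
			w->seg++;
			w->off = 0;
		}
	}
}

/* Normalise: if the walk sits at the end of a segment, step to the next. */
void toy_done(struct toy_walk *w)
{
	assert(w != NULL);
	if (w->seg < w->nsegs && w->off == w->seg_len[w->seg]) {
		w->seg++;
		w->off = 0;
	}
}
```

In this model, skipping 16 bytes over segments of 16 and 32 bytes leaves the walk at offset 16 of segment 0; only after toy_done() does it refer to offset 0 of segment 1, which is the position an alignment check would be expected to look at.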
From: Rui Hua <huarui.dev(a)gmail.com>
When we send a read request and hit clean data in the cache device, a
situation called a cache read race can occur in bcache (see the comment at
the tail of cache_look_up(); the following explanation is copied from there):
The bucket we're reading from might be reused while our bio is in flight,
and we could then end up reading the wrong data. We guard against this
by checking (in bch_cache_read_endio()) if the pointer is stale again;
if so, we treat it as an error (s->iop.error = -EINTR) and reread from
the backing device (but we don't pass that error up anywhere)
Note that a cache read race happens under normal circumstances, not
when the SSD has failed; it is counted and shown in
/sys/fs/bcache/XXX/internal/cache_read_races.
Without this patch, in writeback mode we never reread from the backing
device when a cache read race happens until the whole cache device is
clean, because the condition
(s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
cached_dev_read_error(). In this situation s->iop.error (= -EINTR) is
passed up, and the user finally receives -EINTR when the bio ends. This
is not appropriate, and is confusing to upper-layer applications.
In this patch, we use s->read_dirty_data to judge whether the read
request hit dirty data in the cache device; it is safe to reread data
from the backing device when the read request hit clean data. This not
only handles the cache read race, but also recovers data when a read
request from the cache device fails.
[edited by mlyle to fix up whitespace, commit log title, comment
spelling]
Fixes: d59b23795933 ("bcache: only permit to recovery read error when cache device is clean")
Cc: <stable(a)vger.kernel.org> # 4.14
Signed-off-by: Hua Rui <huarui.dev(a)gmail.com>
Reviewed-by: Michael Lyle <mlyle(a)lyle.org>
Reviewed-by: Coly Li <colyli(a)suse.de>
Signed-off-by: Michael Lyle <mlyle(a)lyle.org>
---
drivers/md/bcache/request.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 3a7aed7282b2..643c3021624f 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -708,16 +708,15 @@ static void cached_dev_read_error(struct closure *cl)
{
struct search *s = container_of(cl, struct search, cl);
struct bio *bio = &s->bio.bio;
- struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
/*
- * If cache device is dirty (dc->has_dirty is non-zero), then
- * recovery a failed read request from cached device may get a
- * stale data back. So read failure recovery is only permitted
- * when cache device is clean.
+ * If read request hit dirty data (s->read_dirty_data is true),
+ * then recovery a failed read request from cached device may
+ * get a stale data back. So read failure recovery is only
+ * permitted when read request hit clean data in cache device,
+ * or when cache read race happened.
*/
- if (s->recoverable &&
- (dc && !atomic_read(&dc->has_dirty))) {
+ if (s->recoverable && !s->read_dirty_data) {
/* Retry from the backing device: */
trace_bcache_read_retry(s->orig_bio);
--
2.14.1
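The change of condition in the bcache patch above can be summarised with two small predicates. This is a sketch only: struct toy_search carries just the two flags the check looks at, the function names are hypothetical, and the old condition is reduced to a boolean cache_has_dirty argument (the dc NULL check of the original is left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the fields the check looks at. */
struct toy_search {
	bool recoverable;
	bool read_dirty_data;
};

/* Before the patch: depends on the state of the whole cache device. */
bool may_retry_old(const struct toy_search *s, bool cache_has_dirty)
{
	assert(s != NULL);
	return s->recoverable && !cache_has_dirty;
}

/* After the patch: depends only on what this particular request hit. */
bool may_retry_new(const struct toy_search *s)
{
	assert(s != NULL);
	return s->recoverable && !s->read_dirty_data;
}
```

For a recoverable request that hit only clean data while the cache device still holds dirty data elsewhere, the old predicate refuses the retry and the new one allows it, which is the writeback-mode case the commit message describes.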
From: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
We're supposed to examine msgs[i] and msgs[i+1] to see if they
form a pair suitable for an indexed transfer. But in reality
we're examining msgs[0] and msgs[1]. Fix this.
Cc: stable(a)vger.kernel.org
Cc: Daniel Kurtz <djkurtz(a)chromium.org>
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Sean Paul <seanpaul(a)chromium.org>
Fixes: 56f9eac05489 ("drm/i915/intel_i2c: use INDEX cycles for i2c read transactions")
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
---
drivers/gpu/drm/i915/intel_i2c.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/intel_i2c.c b/drivers/gpu/drm/i915/intel_i2c.c
index eb5827110d8f..165375cbef2f 100644
--- a/drivers/gpu/drm/i915/intel_i2c.c
+++ b/drivers/gpu/drm/i915/intel_i2c.c
@@ -484,7 +484,7 @@ do_gmbus_xfer(struct i2c_adapter *adapter, struct i2c_msg *msgs, int num)
for (; i < num; i += inc) {
inc = 1;
- if (gmbus_is_index_read(msgs, i, num)) {
+ if (gmbus_is_index_read(&msgs[i], i, num)) {
ret = gmbus_xfer_index_read(dev_priv, &msgs[i]);
inc = 2; /* an index read is two msgs */
} else if (msgs[i].flags & I2C_M_RD) {
--
2.13.6
This is a note to let you know that I've just added the patch titled
mm, hwpoison: fixup "mm: check the return value of lookup_page_ext for all call sites"
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-hwpoison-fixup-mm-check-the-return-value-of-lookup_page_ext-for-all-call-sites.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Fri Nov 24 11:13:07 CET 2017
Date: Fri, 24 Nov 2017 11:13:07 +0100
To: Greg KH <gregkh(a)linuxfoundation.org>
From: Michal Hocko <mhocko(a)suse.com>
Subject: mm, hwpoison: fixup "mm: check the return value of lookup_page_ext for all call sites"
From: Michal Hocko <mhocko(a)suse.com>
Backport of the upstream commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites") is wrong for hwpoison
pages. I have accidentally negated the condition for bailout. This
basically disables hwpoison pages tracking while the code still
might crash on unusual configurations when struct pages do not have
page_ext allocated. The fix is trivial: invert the condition.
Reported-by: Jiri Slaby <jslaby(a)suse.cz>
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/debug-pagealloc.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/mm/debug-pagealloc.c
+++ b/mm/debug-pagealloc.c
@@ -34,7 +34,7 @@ static inline void set_page_poison(struc
struct page_ext *page_ext;
page_ext = lookup_page_ext(page);
- if (page_ext)
+ if (!page_ext)
return;
__set_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
}
@@ -44,7 +44,7 @@ static inline void clear_page_poison(str
struct page_ext *page_ext;
page_ext = lookup_page_ext(page);
- if (page_ext)
+ if (!page_ext)
return;
__clear_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
}
@@ -54,7 +54,7 @@ static inline bool page_poison(struct pa
struct page_ext *page_ext;
page_ext = lookup_page_ext(page);
- if (page_ext)
+ if (!page_ext)
return false;
return test_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
}
Patches currently in stable-queue which might be from gregkh(a)linuxfoundation.org are
queue-4.4/mm-hwpoison-fixup-mm-check-the-return-value-of-lookup_page_ext-for-all-call-sites.patch
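The corrected bail-out condition in the hwpoison fixup above follows the usual guard-clause pattern: return early only when the lookup failed. A minimal user-space sketch, with hypothetical toy_ types standing in for struct page and struct page_ext, might look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page_ext with a single flag. */
struct toy_page_ext {
	bool poisoned;
};

/* Hypothetical stand-in for struct page. */
struct toy_page {
	struct toy_page_ext *ext;	/* NULL when page_ext was not allocated */
};

/* Models a lookup that may legitimately return NULL. */
struct toy_page_ext *toy_lookup_page_ext(struct toy_page *page)
{
	assert(page != NULL);
	return page->ext;
}

/* Bail out only when the lookup failed; otherwise record the flag. */
void toy_set_page_poison(struct toy_page *page)
{
	struct toy_page_ext *ext = toy_lookup_page_ext(page);

	if (!ext)
		return;
	ext->poisoned = true;
}

/* A page without page_ext is reported as not poisoned. */
bool toy_page_poison(struct toy_page *page)
{
	struct toy_page_ext *ext = toy_lookup_page_ext(page);

	if (!ext)
		return false;
	return ext->poisoned;
}
```

With the negated check that the message describes, the set operation would return early exactly when the extension exists (so nothing is ever tracked) and dereference NULL when it does not.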
During an EEH event a kernel oops is reported if no vPHB is allocated to
the AFU. This happens because, during AFU init, an error in the creation
of the vPHB is non-fatal. Hence afu->phb should always be checked for NULL
before iterating over it for the virtual AFU PCI devices.
This patch fixes the kernel oops by adding a NULL pointer check for
afu->phb before it is dereferenced.
Fixes: 9e8df8a2196 ("cxl: EEH support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Vaibhav Jain <vaibhav(a)linux.vnet.ibm.com>
---
Changelog:
v3 -> Fixed a reverse NULL check [Fred]
Resend -> Added the 'Fixes' info and marking the patch to stable tree [Mpe]
v2 -> Added the vphb NULL check to cxl_vphb_error_detected() [Andrew]
---
drivers/misc/cxl/pci.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index bb7fd3f4edab..19969ee86d6f 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -2083,6 +2083,9 @@ static pci_ers_result_t cxl_vphb_error_detected(struct cxl_afu *afu,
/* There should only be one entry, but go through the list
* anyway
*/
+ if (afu->phb == NULL)
+ return result;
+
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
if (!afu_dev->driver)
continue;
@@ -2124,8 +2127,7 @@ static pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
* Tell the AFU drivers; but we don't care what they
* say, we're going away.
*/
- if (afu->phb != NULL)
- cxl_vphb_error_detected(afu, state);
+ cxl_vphb_error_detected(afu, state);
}
return PCI_ERS_RESULT_DISCONNECT;
}
@@ -2265,6 +2267,9 @@ static pci_ers_result_t cxl_pci_slot_reset(struct pci_dev *pdev)
if (cxl_afu_select_best_mode(afu))
goto err;
+ if (afu->phb == NULL)
+ continue;
+
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
/* Reset the device context.
* TODO: make this less disruptive
@@ -2327,6 +2332,9 @@ static void cxl_pci_resume(struct pci_dev *pdev)
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
+ if (afu->phb == NULL)
+ continue;
+
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
if (afu_dev->driver && afu_dev->driver->err_handler &&
afu_dev->driver->err_handler->resume)
--
2.14.3
Commit-ID: 12a78d43de767eaf8fb272facb7a7b6f2dc6a9df
Gitweb: https://git.kernel.org/tip/12a78d43de767eaf8fb272facb7a7b6f2dc6a9df
Author: Masami Hiramatsu <mhiramat(a)kernel.org>
AuthorDate: Fri, 24 Nov 2017 13:56:30 +0900
Committer: Ingo Molnar <mingo(a)kernel.org>
CommitDate: Fri, 24 Nov 2017 08:36:12 +0100
x86/decoder: Add new TEST instruction pattern
The kbuild test robot reported this build warning:
Warning: arch/x86/tools/test_get_len found difference at <jump_table>:ffffffff8103dd2c
Warning: ffffffff8103dd82: f6 09 d8 testb $0xd8,(%rcx)
Warning: objdump says 3 bytes, but insn_get_length() says 2
Warning: decoded and checked 1569014 instructions with 1 warnings
This sequence seems to be a new instruction that is not in the opcode map in the Intel SDM.
The instruction sequence is "F6 09 d8", which means Group3(F6), MOD(00)REG(001)RM(001), and 0xd8.
Intel SDM vol2 A.4 Table A-6 says the table index in the group is "Encoding of Bits 5,4,3 of
the ModR/M Byte (bits 2,1,0 in parenthesis)".
In that table, the opcodes are listed by the REG bits as:
  000         001          010  011  100         101          110         111
  TEST Ib/Iz  (undefined)  NOT  NEG  MUL AL/rAX  IMUL AL/rAX  DIV AL/rAX  IDIV AL/rAX
So, it seems TEST Ib is assigned to 001.
Add the new pattern.
Reported-by: kbuild test robot <fengguang.wu(a)intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org>
Cc: H. Peter Anvin <hpa(a)zytor.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
---
arch/x86/lib/x86-opcode-map.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index 12e3771..c4d5591 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -896,7 +896,7 @@ EndTable
GrpTable: Grp3_1
0: TEST Eb,Ib
-1:
+1: TEST Eb,Ib
2: NOT Eb
3: NEG Eb
4: MUL AL,Eb
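The field split quoted in the commit message above ("F6 09 d8" as MOD(00) REG(001) RM(001)) can be reproduced with a short sketch. The struct and function names are hypothetical and not taken from the kernel's decoder; grp3_f6_imm_bytes() simply encodes the patched map, in which both /0 and /1 of opcode F6 take an 8-bit immediate.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of a ModR/M byte. */
struct modrm {
	unsigned mod;	/* bits 7..6 */
	unsigned reg;	/* bits 5..3, used as the group index for Grp3 */
	unsigned rm;	/* bits 2..0 */
};

/* Split a ModR/M byte into its mod, reg and rm fields. */
struct modrm split_modrm(uint8_t byte)
{
	struct modrm m;

	m.mod = (byte >> 6) & 0x3;
	m.reg = (byte >> 3) & 0x7;
	m.rm  = byte & 0x7;
	return m;
}

/*
 * Immediate size for opcode F6 under the patched opcode map:
 * group entries 0 and 1 (TEST Eb,Ib) carry one immediate byte,
 * the other entries carry none.
 */
int grp3_f6_imm_bytes(unsigned reg)
{
	assert(reg < 8);
	return (reg == 0 || reg == 1) ? 1 : 0;
}
```

For the byte 0x09 this yields mod = 0, reg = 1 and rm = 1. With mod = 00 and rm = 001 there is no SIB byte or displacement, so the instruction is one opcode byte, one ModR/M byte and one immediate byte, matching the 3 bytes objdump reported.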
Commit-ID: 843c885f41b3c74212d8a85da95993075dd41415
Gitweb: https://git.kernel.org/tip/843c885f41b3c74212d8a85da95993075dd41415
Author: Masami Hiramatsu <mhiramat(a)kernel.org>
AuthorDate: Fri, 24 Nov 2017 13:56:30 +0900
Committer: Ingo Molnar <mingo(a)kernel.org>
CommitDate: Fri, 24 Nov 2017 08:16:09 +0100
x86/decoder: Add new TEST instruction pattern
The kbuild test robot reported this build warning:
Warning: arch/x86/tools/test_get_len found difference at <jump_table>:ffffffff8103dd2c
Warning: ffffffff8103dd82: f6 09 d8 testb $0xd8,(%rcx)
Warning: objdump says 3 bytes, but insn_get_length() says 2
Warning: decoded and checked 1569014 instructions with 1 warnings
This sequence seems to be a new instruction that is not in the opcode map in the Intel SDM.
The instruction sequence is "F6 09 d8", which means Group3(F6), MOD(00)REG(001)RM(001), and 0xd8.
Intel SDM vol2 A.4 Table A-6 says the table index in the group is "Encoding of Bits 5,4,3 of
the ModR/M Byte (bits 2,1,0 in parenthesis)".
In that table, the opcodes are listed by the REG bits as:
  000         001          010  011  100         101          110         111
  TEST Ib/Iz  (undefined)  NOT  NEG  MUL AL/rAX  IMUL AL/rAX  DIV AL/rAX  IDIV AL/rAX
So, it seems TEST Ib is assigned to 001.
Add the new pattern.
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Reported-by: kbuild test robot <fengguang.wu(a)intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Cc: H. Peter Anvin <hpa(a)zytor.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
---
arch/x86/lib/x86-opcode-map.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index 12e3771..c4d5591 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -896,7 +896,7 @@ EndTable
GrpTable: Grp3_1
0: TEST Eb,Ib
-1:
+1: TEST Eb,Ib
2: NOT Eb
3: NEG Eb
4: MUL AL,Eb
Changes since v1 [1]:
* fix arm64 compilation, add __HAVE_ARCH_PUD_WRITE
* fix sparc64 compilation, add __HAVE_ARCH_PUD_WRITE
* fix s390 compilation, add a pud_write() helper
---
Andrew,
Here is a third version to the pud_write() fix [2], and some follow-on
patches to use the '_access_permitted' helpers in fault and
get_user_pages() paths where we are checking if the thread has access to
write. I explicitly omit conversions for places where the kernel is
checking the _PAGE_RW flag for kernel purposes, not for userspace
access.
Beyond fixing the crash, this series also fixes get_user_pages() and
fault paths to honor protection keys in the same manner as
get_user_pages_fast(). Only the crash fix is tagged for -stable as the
protection key check is done just for consistency reasons since
userspace can change protection keys at will.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-November/013249.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-November/013237.html
---
Dan Williams (4):
mm: fix device-dax pud write-faults triggered by get_user_pages()
mm: replace pud_write with pud_access_permitted in fault + gup paths
mm: replace pmd_write with pmd_access_permitted in fault + gup paths
mm: replace pte_write with pte_access_permitted in fault + gup paths
arch/arm64/include/asm/pgtable.h | 1 +
arch/s390/include/asm/pgtable.h | 6 ++++++
arch/sparc/include/asm/pgtable_64.h | 1 +
arch/sparc/mm/gup.c | 4 ++--
arch/x86/include/asm/pgtable.h | 6 ++++++
fs/dax.c | 3 ++-
include/asm-generic/pgtable.h | 9 +++++++++
include/linux/hugetlb.h | 8 --------
mm/gup.c | 2 +-
mm/hmm.c | 8 ++++----
mm/huge_memory.c | 6 +++---
mm/memory.c | 8 ++++----
12 files changed, 39 insertions(+), 23 deletions(-)
This is the start of the stable review cycle for the 4.14.2 release.
There are 18 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Fri Nov 24 10:11:38 UTC 2017.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.2-rc1.gz
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.14.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.14.2-rc1
Jan Harkes <jaharkes(a)cs.cmu.edu>
coda: fix 'kernel memory exposure attempt' in fsync
Jaewon Kim <jaewon31.kim(a)samsung.com>
mm/page_ext.c: check if page_ext is not prepared
Pavel Tatashin <pasha.tatashin(a)oracle.com>
mm/page_alloc.c: broken deferred calculation
Corey Minyard <cminyard(a)mvista.com>
ipmi: fix unsigned long underflow
alex chen <alex.chen(a)huawei.com>
ocfs2: should wait dio before inode lock in ocfs2_setattr()
Changwei Ge <ge.changwei(a)h3c.com>
ocfs2: fix cluster hang after a node dies
Jann Horn <jannh(a)google.com>
mm/pagewalk.c: report holes in hugetlb ranges
Neeraj Upadhyay <neeraju(a)codeaurora.org>
rcu: Fix up pending cbs check in rcu_prepare_for_idle
Alexander Steffen <Alexander.Steffen(a)infineon.com>
tpm-dev-common: Reject too short writes
Ji-Ze Hong (Peter Hong) <hpeter(a)gmail.com>
serial: 8250_fintek: Fix finding base_port with activated SuperIO
Lukas Wunner <lukas(a)wunner.de>
serial: omap: Fix EFR write on RTS deassertion
Roberto Sassu <roberto.sassu(a)huawei.com>
ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
Eric W. Biederman <ebiederm(a)xmission.com>
net/sctp: Always set scope_id in sctp_inet6_skb_msgname
Huacai Chen <chenhc(a)lemote.com>
fealnx: Fix building error on MIPS
Bjørn Mork <bjorn(a)mork.no>
net: cdc_ncm: GetNtbFormat endian fix
Xin Long <lucien.xin(a)gmail.com>
vxlan: fix the issue that neigh proxy blocks all icmpv6 packets
Jason A. Donenfeld <Jason(a)zx2c4.com>
af_netlink: ensure that NLMSG_DONE never fails in dumps
Michael Lyle <mlyle(a)lyle.org>
bio: ensure __bio_clone_fast copies bi_partno
-------------
Diffstat:
Makefile | 4 ++--
block/bio.c | 1 +
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
drivers/char/tpm/tpm-dev-common.c | 6 ++++++
drivers/net/ethernet/fealnx.c | 6 +++---
drivers/net/usb/cdc_ncm.c | 4 ++--
drivers/net/vxlan.c | 31 +++++++++++++------------------
drivers/tty/serial/8250/8250_fintek.c | 3 +++
drivers/tty/serial/omap-serial.c | 2 +-
fs/coda/upcall.c | 3 +--
fs/ocfs2/dlm/dlmrecovery.c | 1 +
fs/ocfs2/file.c | 9 +++++++--
include/linux/mmzone.h | 3 ++-
kernel/rcu/tree_plugin.h | 2 +-
mm/page_alloc.c | 27 ++++++++++++++++++---------
mm/page_ext.c | 4 ----
mm/pagewalk.c | 6 +++++-
net/netlink/af_netlink.c | 17 +++++++++++------
net/netlink/af_netlink.h | 1 +
net/sctp/ipv6.c | 5 +++--
security/integrity/ima/ima_appraise.c | 3 +++
21 files changed, 90 insertions(+), 58 deletions(-)
This is the start of the stable review cycle for the 4.13.16 release.
There are 35 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Fri Nov 24 10:11:25 UTC 2017.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.13.16-rc1.gz
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.13.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.13.16-rc1
Jan Harkes <jaharkes(a)cs.cmu.edu>
coda: fix 'kernel memory exposure attempt' in fsync
Suravee Suthikulpanit <suravee.suthikulpanit(a)amd.com>
x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask
Jaewon Kim <jaewon31.kim(a)samsung.com>
mm/page_ext.c: check if page_ext is not prepared
Pavel Tatashin <pasha.tatashin(a)oracle.com>
mm/page_alloc.c: broken deferred calculation
Corey Minyard <cminyard(a)mvista.com>
ipmi: fix unsigned long underflow
alex chen <alex.chen(a)huawei.com>
ocfs2: should wait dio before inode lock in ocfs2_setattr()
Changwei Ge <ge.changwei(a)h3c.com>
ocfs2: fix cluster hang after a node dies
Jann Horn <jannh(a)google.com>
mm/pagewalk.c: report holes in hugetlb ranges
Neeraj Upadhyay <neeraju(a)codeaurora.org>
rcu: Fix up pending cbs check in rcu_prepare_for_idle
Alexander Steffen <Alexander.Steffen(a)infineon.com>
tpm-dev-common: Reject too short writes
Ji-Ze Hong (Peter Hong) <hpeter(a)gmail.com>
serial: 8250_fintek: Fix finding base_port with activated SuperIO
Lukas Wunner <lukas(a)wunner.de>
serial: omap: Fix EFR write on RTS deassertion
Roberto Sassu <roberto.sassu(a)huawei.com>
ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
Eric W. Biederman <ebiederm(a)xmission.com>
net/sctp: Always set scope_id in sctp_inet6_skb_msgname
Huacai Chen <chenhc(a)lemote.com>
fealnx: Fix building error on MIPS
Xin Long <lucien.xin(a)gmail.com>
sctp: do not peel off an assoc from one netns to another one
Bjørn Mork <bjorn(a)mork.no>
net: cdc_ncm: GetNtbFormat endian fix
Xin Long <lucien.xin(a)gmail.com>
vxlan: fix the issue that neigh proxy blocks all icmpv6 packets
Jason A. Donenfeld <Jason(a)zx2c4.com>
af_netlink: ensure that NLMSG_DONE never fails in dumps
Inbar Karmy <inbark(a)mellanox.com>
net/mlx5e: Set page to null in case dma mapping fails
Huy Nguyen <huyn(a)mellanox.com>
net/mlx5: Cancel health poll before sending panic teardown command
Cong Wang <xiyou.wangcong(a)gmail.com>
vlan: fix a use-after-free in vlan_device_event()
Yuchung Cheng <ycheng(a)google.com>
tcp: fix tcp_fastretrans_alert warning
Eric Dumazet <edumazet(a)google.com>
tcp: gso: avoid refcount_t warning from tcp_gso_segment()
Andrey Konovalov <andreyknvl(a)google.com>
net: usb: asix: fill null-ptr-deref in asix_suspend
Kristian Evensen <kristian.evensen(a)gmail.com>
qmi_wwan: Add missing skb_reset_mac_header-call
Bjørn Mork <bjorn(a)mork.no>
net: qmi_wwan: fix divide by 0 on bad descriptors
Bjørn Mork <bjorn(a)mork.no>
net: cdc_ether: fix divide by 0 on bad descriptors
Hangbin Liu <liuhangbin(a)gmail.com>
bonding: discard lowest hash bit for 802.3ad layer3+4
Guillaume Nault <g.nault(a)alphalink.fr>
l2tp: don't use l2tp_tunnel_find() in l2tp_ip and l2tp_ip6
Ye Yin <hustcat(a)gmail.com>
netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed
Florian Fainelli <f.fainelli(a)gmail.com>
net: systemport: Correct IPG length settings
Eric Dumazet <edumazet(a)google.com>
tcp: do not mangle skb->cb[] in tcp_make_synack()
Jeff Barnhill <0xeffeff(a)gmail.com>
net: vrf: correct FRA_L3MDEV encode type
Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
tcp_nv: fix division by zero in tcpnv_acked()
-------------
Diffstat:
Makefile | 4 ++--
arch/x86/kernel/cpu/intel_cacheinfo.c | 32 ++++++++++++++-----------
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++----
drivers/char/tpm/tpm-dev-common.c | 6 +++++
drivers/net/bonding/bond_main.c | 2 +-
drivers/net/ethernet/broadcom/bcmsysport.c | 10 ++++----
drivers/net/ethernet/fealnx.c | 6 ++---
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 12 ++++------
drivers/net/ethernet/mellanox/mlx5/core/main.c | 7 ++++++
drivers/net/usb/asix_devices.c | 4 ++--
drivers/net/usb/cdc_ether.c | 2 +-
drivers/net/usb/cdc_ncm.c | 4 ++--
drivers/net/usb/qmi_wwan.c | 3 ++-
drivers/net/vrf.c | 2 +-
drivers/net/vxlan.c | 31 ++++++++++--------------
drivers/tty/serial/8250/8250_fintek.c | 3 +++
drivers/tty/serial/omap-serial.c | 2 +-
fs/coda/upcall.c | 3 +--
fs/ocfs2/dlm/dlmrecovery.c | 1 +
fs/ocfs2/file.c | 9 +++++--
include/linux/mmzone.h | 3 ++-
include/linux/skbuff.h | 7 ++++++
kernel/rcu/tree_plugin.h | 2 +-
mm/page_alloc.c | 27 ++++++++++++++-------
mm/page_ext.c | 4 ----
mm/pagewalk.c | 6 ++++-
net/8021q/vlan.c | 6 ++---
net/core/skbuff.c | 1 +
net/ipv4/tcp_input.c | 3 +--
net/ipv4/tcp_nv.c | 2 +-
net/ipv4/tcp_offload.c | 12 ++++++++--
net/ipv4/tcp_output.c | 9 ++-----
net/l2tp/l2tp_ip.c | 24 +++++++------------
net/l2tp/l2tp_ip6.c | 24 +++++++------------
net/netlink/af_netlink.c | 17 ++++++++-----
net/netlink/af_netlink.h | 1 +
net/sctp/ipv6.c | 5 ++--
net/sctp/socket.c | 4 ++++
security/integrity/ima/ima_appraise.c | 3 +++
39 files changed, 179 insertions(+), 134 deletions(-)
From: James Hogan <jhogan(a)kernel.org>
Building 32-bit MIPS64r2 kernels produces warnings like the following
on certain toolchains (such as GNU assembler 2.24.90, but not GNU
assembler 2.28.51) since commit 22b8ba765a72 ("MIPS: Fix MIPS64 FP
save/restore on 32-bit kernels"), due to the exposure of fpu_save_16odd
from fpu_save_double and fpu_restore_16odd from fpu_restore_double:
arch/mips/kernel/r4k_fpu.S:47: Warning: float register should be even, was 1
...
arch/mips/kernel/r4k_fpu.S:59: Warning: float register should be even, was 1
...
This appears to be because .set mips64r2 does not change the FPU ABI to
64-bit when -march=mips64r2 (or e.g. -march=xlp) is provided on the
command line on that toolchain, from the default FPU ABI of 32-bit due
to the -mabi=32. This makes access to the odd FPU registers invalid.
Fix by explicitly changing the FPU ABI with .set fp=64 directives in
fpu_save_16odd and fpu_restore_16odd, and moving the undefine of fp up
in asmmacro.h so fp doesn't turn into $30.
Fixes: 22b8ba765a72 ("MIPS: Fix MIPS64 FP save/restore on 32-bit kernels")
Signed-off-by: James Hogan <jhogan(a)kernel.org>
Cc: Ralf Baechle <ralf(a)linux-mips.org>
Cc: Paul Burton <paul.burton(a)imgtec.com>
Cc: linux-mips(a)linux-mips.org
Cc: <stable(a)vger.kernel.org> # 4.0+: 22b8ba765a72: MIPS: Fix MIPS64 FP save/restore on 32-bit kernels
Cc: <stable(a)vger.kernel.org> # 4.0+
---
arch/mips/include/asm/asmmacro.h | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/arch/mips/include/asm/asmmacro.h b/arch/mips/include/asm/asmmacro.h
index b815d7b3bd27..feb069cbf44e 100644
--- a/arch/mips/include/asm/asmmacro.h
+++ b/arch/mips/include/asm/asmmacro.h
@@ -19,6 +19,9 @@
#include <asm/asmmacro-64.h>
#endif
+/* preprocessor replaces the fp in ".set fp=64" with $30 otherwise */
+#undef fp
+
/*
* Helper macros for generating raw instruction encodings.
*/
@@ -105,6 +108,7 @@
.macro fpu_save_16odd thread
.set push
.set mips64r2
+ .set fp=64
SET_HARDFLOAT
sdc1 $f1, THREAD_FPR1(\thread)
sdc1 $f3, THREAD_FPR3(\thread)
@@ -163,6 +167,7 @@
.macro fpu_restore_16odd thread
.set push
.set mips64r2
+ .set fp=64
SET_HARDFLOAT
ldc1 $f1, THREAD_FPR1(\thread)
ldc1 $f3, THREAD_FPR3(\thread)
@@ -234,9 +239,6 @@
.endm
#ifdef TOOLCHAIN_SUPPORTS_MSA
-/* preprocessor replaces the fp in ".set fp=64" with $30 otherwise */
-#undef fp
-
.macro _cfcmsa rd, cs
.set push
.set mips32r2
--
2.14.1
On Thu, 2017-11-23 at 13:08 +0000, Ben Hutchings wrote:
> On Tue, 2017-11-21 at 19:41 -0800, Joe Perches wrote:
> > On Wed, 2017-11-22 at 01:58 +0000, Ben Hutchings wrote:
> > > 3.16.51-rc1 review patch. If anyone has any objections, please let me know.
> > []
> > > --- a/drivers/md/bcache/writeback.h
> > > +++ b/drivers/md/bcache/writeback.h
> > > @@ -14,6 +14,25 @@ static inline uint64_t bcache_dev_sector
> > > return ret;
> > > }
> > >
> > > +static inline uint64_t bcache_flash_devs_sectors_dirty(struct cache_set *c)
> > > +{
> > > + uint64_t i, ret = 0;
> >
> > There's no reason i should be uint64_t
> > as nr_uuids is unsigned int.
>
> But this still works, right? That's a minor issue to deal with
> upstream, not in the backport.
correct
This is the start of the stable review cycle for the 4.9.65 release.
There are 25 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Fri Nov 24 10:11:07 UTC 2017.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.65-rc1.gz
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.9.65-rc1
Jan Harkes <jaharkes(a)cs.cmu.edu>
coda: fix 'kernel memory exposure attempt' in fsync
Pavel Tatashin <pasha.tatashin(a)oracle.com>
mm/page_alloc.c: broken deferred calculation
Corey Minyard <cminyard(a)mvista.com>
ipmi: fix unsigned long underflow
alex chen <alex.chen(a)huawei.com>
ocfs2: should wait dio before inode lock in ocfs2_setattr()
Changwei Ge <ge.changwei(a)h3c.com>
ocfs2: fix cluster hang after a node dies
Adam Wallis <awallis(a)codeaurora.org>
dmaengine: dmatest: warn user when dma test times out
Ji-Ze Hong (Peter Hong) <hpeter(a)gmail.com>
serial: 8250_fintek: Fix finding base_port with activated SuperIO
Lukas Wunner <lukas(a)wunner.de>
serial: omap: Fix EFR write on RTS deassertion
Roberto Sassu <roberto.sassu(a)huawei.com>
ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
Eric Biggers <ebiggers(a)google.com>
crypto: dh - Fix double free of ctx->p
Tudor-Dan Ambarus <tudor.ambarus(a)microchip.com>
crypto: dh - fix memleak in setkey
Eric W. Biederman <ebiederm(a)xmission.com>
net/sctp: Always set scope_id in sctp_inet6_skb_msgname
Huacai Chen <chenhc(a)lemote.com>
fealnx: Fix building error on MIPS
Xin Long <lucien.xin(a)gmail.com>
sctp: do not peel off an assoc from one netns to another one
Jason A. Donenfeld <Jason(a)zx2c4.com>
af_netlink: ensure that NLMSG_DONE never fails in dumps
Cong Wang <xiyou.wangcong(a)gmail.com>
vlan: fix a use-after-free in vlan_device_event()
Andrey Konovalov <andreyknvl(a)google.com>
net: usb: asix: fill null-ptr-deref in asix_suspend
Kristian Evensen <kristian.evensen(a)gmail.com>
qmi_wwan: Add missing skb_reset_mac_header-call
Bjørn Mork <bjorn(a)mork.no>
net: qmi_wwan: fix divide by 0 on bad descriptors
Bjørn Mork <bjorn(a)mork.no>
net: cdc_ether: fix divide by 0 on bad descriptors
Hangbin Liu <liuhangbin(a)gmail.com>
bonding: discard lowest hash bit for 802.3ad layer3+4
Ye Yin <hustcat(a)gmail.com>
netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed
Eric Dumazet <edumazet(a)google.com>
tcp: do not mangle skb->cb[] in tcp_make_synack()
Jeff Barnhill <0xeffeff(a)gmail.com>
net: vrf: correct FRA_L3MDEV encode type
Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
tcp_nv: fix division by zero in tcpnv_acked()
-------------
Diffstat:
Makefile | 4 ++--
crypto/dh.c | 34 +++++++++++++++-------------------
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
drivers/dma/dmatest.c | 1 +
drivers/net/bonding/bond_main.c | 2 +-
drivers/net/ethernet/fealnx.c | 6 +++---
drivers/net/usb/asix_devices.c | 4 ++--
drivers/net/usb/cdc_ether.c | 2 +-
drivers/net/usb/qmi_wwan.c | 3 ++-
drivers/net/vrf.c | 2 +-
drivers/tty/serial/8250/8250_fintek.c | 3 +++
drivers/tty/serial/omap-serial.c | 2 +-
fs/coda/upcall.c | 3 +--
fs/ocfs2/dlm/dlmrecovery.c | 1 +
fs/ocfs2/file.c | 9 +++++++--
include/linux/mmzone.h | 3 ++-
include/linux/skbuff.h | 7 +++++++
mm/page_alloc.c | 27 ++++++++++++++++++---------
net/8021q/vlan.c | 6 +++---
net/core/skbuff.c | 1 +
net/ipv4/tcp_nv.c | 2 +-
net/ipv4/tcp_output.c | 9 ++-------
net/netlink/af_netlink.c | 17 +++++++++++------
net/netlink/af_netlink.h | 1 +
net/sctp/ipv6.c | 5 +++--
net/sctp/socket.c | 4 ++++
security/integrity/ima/ima_appraise.c | 3 +++
27 files changed, 103 insertions(+), 68 deletions(-)
On Thu 23-11-17 13:05:10, Ben Hutchings wrote:
> On Wed, 2017-11-22 at 08:41 +0100, Vlastimil Babka wrote:
> > On 11/22/2017 02:58 AM, Ben Hutchings wrote:
> > > 3.16.51-rc1 review patch. If anyone has any objections, please let me know.
> >
> > I don't really care much in the end, but is "fix wrong comment" really a
> > stable patch material these days? :)
>
> It had a Fixes: field and it clearly won't do any harm.
The Fixes tag is sometimes abused this way. It will not do any harm, but the
fewer patches to backport, the better IMHO.
> Still, none of the other stable branches has it so I'll drop it.
Makes sense.
--
Michal Hocko
SUSE Labs
When no IOMMU is available, all GEM buffers allocated by the Exynos DRM
driver are contiguous, because the underlying dma_alloc_attrs() function
provides only such buffers. In that case it makes no sense to keep the
BO_NONCONTIG flag for the allocated GEM buffers. Dropping the flag avoids
failures of buffer contiguity checks in subsequent operations on GEM
objects.
Signed-off-by: Marek Szyprowski <m.szyprowski(a)samsung.com>
CC: stable(a)vger.kernel.org # v4.4+
---
This issue has been present since commit 0519f9a12d011 ("drm/exynos: add
iommu support for exynos drm framework"), but this patch applies cleanly
only to v4.4+ kernel releases due to changes in the surrounding code.
Changelog:
v2:
- added warning message when buffer flags are updated (requested by Inki)
v1: https://patchwork.kernel.org/patch/10034919/
- initial version
---
drivers/gpu/drm/exynos/exynos_drm_gem.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/gpu/drm/exynos/exynos_drm_gem.c b/drivers/gpu/drm/exynos/exynos_drm_gem.c
index 077de014d610..4400efe3974a 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_gem.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_gem.c
@@ -247,6 +247,15 @@ struct exynos_drm_gem *exynos_drm_gem_create(struct drm_device *dev,
if (IS_ERR(exynos_gem))
return exynos_gem;
+ if (!is_drm_iommu_supported(dev) && (flags & EXYNOS_BO_NONCONTIG)) {
+ /*
+ * when no IOMMU is available, all allocated buffers are
+ * contiguous anyway, so drop EXYNOS_BO_NONCONTIG flag
+ */
+ flags &= ~EXYNOS_BO_NONCONTIG;
+ DRM_WARN("Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer\n");
+ }
+
/* set memory type and cache attribute from user side. */
exynos_gem->flags = flags;
--
2.14.2
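The flag handling added to exynos_drm_gem_create() in this patch can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the driver code: SKETCH_BO_NONCONTIG, SKETCH_BO_CACHABLE and normalize_bo_flags() are hypothetical names with made-up bit values, and the boolean parameter stands in for is_drm_iommu_supported(dev).

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag bits; the real EXYNOS_BO_* values live in the UAPI header. */
#define SKETCH_BO_NONCONTIG (1u << 0)
#define SKETCH_BO_CACHABLE  (1u << 1)

/*
 * Return the flags that should be stored for a new buffer.
 * Without an IOMMU every buffer is contiguous anyway, so a request
 * for a non-contiguous buffer is downgraded and a warning is printed.
 */
unsigned int normalize_bo_flags(unsigned int flags, bool iommu_available)
{
	if (!iommu_available && (flags & SKETCH_BO_NONCONTIG)) {
		flags &= ~SKETCH_BO_NONCONTIG;
		fprintf(stderr, "non-contiguous allocation not supported "
				"without IOMMU, using contiguous buffer\n");
	}
	return flags;
}
```

Only the one bit is cleared; any other requested attributes are passed through unchanged, which mirrors the `flags &= ~EXYNOS_BO_NONCONTIG` line in the diff.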
We added crtc_id to the atomic ioctl, but forgot to add it for vblank
and page flip events. Commit bd386e518056 ("drm: Reorganize
drm_pending_event to support future event types [v2]") added it to
the vblank event, but page flip event was still missing.
Correct this and add a test for making sure we always set crtc_id correctly.
Fixes: bd386e518056 ("drm: Reorganize drm_pending_event to support future event types [v2]")
Fixes: 5db06a8a98f5 ("drm: Pass CRTC ID in userspace vblank events")
Cc: Daniel Stone <daniels(a)collabora.com>
Cc: Daniel Vetter <daniel.vetter(a)intel.com>
Cc: Gustavo Padovan <gustavo(a)padovan.org>
Cc: Sean Paul <seanpaul(a)chromium.org>
Cc: dri-devel(a)lists.freedesktop.org
Cc: <stable(a)vger.kernel.org> # v4.12+
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
---
drivers/gpu/drm/drm_plane.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/drm_plane.c b/drivers/gpu/drm/drm_plane.c
index 19404e34cd59..37a93cdffb4a 100644
--- a/drivers/gpu/drm/drm_plane.c
+++ b/drivers/gpu/drm/drm_plane.c
@@ -1030,6 +1030,7 @@ int drm_mode_page_flip_ioctl(struct drm_device *dev,
e->event.base.type = DRM_EVENT_FLIP_COMPLETE;
e->event.base.length = sizeof(e->event);
e->event.vbl.user_data = page_flip->user_data;
+ e->event.vbl.crtc_id = crtc->base.id;
ret = drm_event_reserve_init(dev, file_priv, &e->base, &e->event.base);
if (ret) {
kfree(e);
--
2.15.0
From: Jan Kara <jack(a)suse.cz>
[ Upstream commit e3fce68cdbed297d927e993b3ea7b8b1cee545da ]
Currently dax_iomap_rw() takes care of invalidating page tables and
evicting hole pages from the radix tree when write(2) to the file
happens. This invalidation is only necessary when there is some block
allocation resulting from write(2). Furthermore, in its current place the
invalidation is racy with respect to a page fault instantiating a hole page
just after we have invalidated it.
So perform the page invalidation inside dax_iomap_actor() where we can
do it only when really necessary and after blocks have been allocated so
nobody will be instantiating new hole pages anymore.
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Jan Kara <jack(a)suse.cz>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
---
fs/dax.c | 28 +++++++++++-----------------
1 file changed, 11 insertions(+), 17 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index bf6218da7928..800748f10b3d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1265,6 +1265,17 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
return -EIO;
+ /*
+ * Write can allocate block for an area which has a hole page mapped
+ * into page tables. We have to tear down these mappings so that data
+ * written by write(2) is visible in mmap.
+ */
+ if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+ invalidate_inode_pages2_range(inode->i_mapping,
+ pos >> PAGE_SHIFT,
+ (end - 1) >> PAGE_SHIFT);
+ }
+
while (pos < end) {
unsigned offset = pos & (PAGE_SIZE - 1);
struct blk_dax_ctl dax = { 0 };
@@ -1329,23 +1340,6 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
if (iov_iter_rw(iter) == WRITE)
flags |= IOMAP_WRITE;
- /*
- * Yes, even DAX files can have page cache attached to them: A zeroed
- * page is inserted into the pagecache when we have to serve a write
- * fault on a hole. It should never be dirtied and can simply be
- * dropped from the pagecache once we get real data for the page.
- *
- * XXX: This is racy against mmap, and there's nothing we can do about
- * it. We'll eventually need to shift this down even further so that
- * we can check if we allocated blocks over a hole first.
- */
- if (mapping->nrpages) {
- ret = invalidate_inode_pages2_range(mapping,
- pos >> PAGE_SHIFT,
- (pos + iov_iter_count(iter) - 1) >> PAGE_SHIFT);
- WARN_ON_ONCE(ret);
- }
-
while (iov_iter_count(iter)) {
ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
iter, iomap_dax_actor);
--
2.11.0
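The range passed to invalidate_inode_pages2_range() in this patch runs from `pos >> PAGE_SHIFT` to `(end - 1) >> PAGE_SHIFT`, both inclusive. A small userspace sketch of that arithmetic follows. It assumes 4 KiB pages; SKETCH_PAGE_SHIFT, struct page_range and pages_covering() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel's PAGE_SHIFT depends on the architecture. */
#define SKETCH_PAGE_SHIFT 12

struct page_range {
	uint64_t first;	/* index of the first page touched */
	uint64_t last;	/* index of the last page touched (inclusive) */
};

/* Compute the inclusive page index range covering bytes [pos, pos + length). */
struct page_range pages_covering(uint64_t pos, uint64_t length)
{
	struct page_range r;
	uint64_t end = pos + length;	/* one past the last byte */

	assert(length > 0);		/* an empty range covers no page */
	r.first = pos >> SKETCH_PAGE_SHIFT;
	r.last = (end - 1) >> SKETCH_PAGE_SHIFT;
	return r;
}
```

Subtracting one from `end` before shifting is what keeps a write that stops exactly on a page boundary from pulling in the following page.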
Running this code with IRQs enabled (where dummy_lock is a spinlock):
static void check_load_gs_index(void)
{
/* This will fail. */
load_gs_index(0xffff);
spin_lock(&dummy_lock);
spin_unlock(&dummy_lock);
}
Will generate a lockdep warning. The issue is that the actual write
to %gs would cause an exception with IRQs disabled, and the exception
handler would, as an inadvertent side effect, update irqflag tracing
to reflect the IRQs-off status. native_load_gs_index() would then
turn IRQs back on and return with irqflag tracing still thinking that
IRQs were off. The dummy lock-and-unlock causes lockdep to notice the
error and warn.
Fix it by adding the missing tracing.
Apparently nothing did this in a context where it mattered. I haven't
tried to find a code path that would actually exhibit the warning if
appropriately nasty user code were running.
I suspect that the security impact of this bug is very, very low --
production systems don't run with lockdep enabled, and the warning is
mostly harmless anyway.
Found during a quick audit of the entry code to try to track down an
unrelated bug that Ingo found in some still-in-development code.
Cc: stable(a)vger.kernel.org
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
---
Hi Ingo-
You asked me to look for an irqflag tracing bug, so I found one :)
arch/x86/entry/entry_64.S | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a2b30ec69497..3c288f260fdf 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -51,15 +51,19 @@ ENTRY(native_usergs_sysret64)
END(native_usergs_sysret64)
#endif /* CONFIG_PARAVIRT */
-.macro TRACE_IRQS_IRETQ
+.macro TRACE_IRQS_FLAGS flags:req
#ifdef CONFIG_TRACE_IRQFLAGS
- bt $9, EFLAGS(%rsp) /* interrupts off? */
+ bt $9, \flags /* interrupts off? */
jnc 1f
TRACE_IRQS_ON
1:
#endif
.endm
+.macro TRACE_IRQS_IRETQ
+ TRACE_IRQS_FLAGS EFLAGS(%rsp)
+.endm
+
/*
* When dynamic function tracer is enabled it will add a breakpoint
* to all locations that it is about to modify, sync CPUs, update
@@ -943,11 +947,13 @@ ENTRY(native_load_gs_index)
FRAME_BEGIN
pushfq
DISABLE_INTERRUPTS(CLBR_ANY & ~CLBR_RDI)
+ TRACE_IRQS_OFF
SWAPGS
.Lgs_change:
movl %edi, %gs
2: ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
SWAPGS
+ TRACE_IRQS_FLAGS (%rsp)
popfq
FRAME_END
ret
--
2.13.6
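The mismatch described in this patch, between the real interrupt flag and what irqflag tracing believes, can be modelled with two booleans. The sketch below is a toy userspace model, not kernel code: struct cpu_model and all function names are hypothetical, and the comments map each step to the instruction or macro it stands in for.

```c
#include <stdbool.h>

struct cpu_model {
	bool irqs_on;	/* what the hardware flag says */
	bool traced_on;	/* what irqflag tracing believes */
};

/* An exception taken with IRQs off: the handler records the IRQs-off state. */
static void fault_handler(struct cpu_model *cpu)
{
	cpu->traced_on = false;
}

/* Behaviour before the patch: flags are saved and restored without tracing. */
void load_gs_untraced(struct cpu_model *cpu, bool faults)
{
	bool saved = cpu->irqs_on;	/* pushfq */

	cpu->irqs_on = false;		/* DISABLE_INTERRUPTS */
	if (faults)
		fault_handler(cpu);	/* bad selector faults on the %gs write */
	cpu->irqs_on = saved;		/* popfq */
}

/* Behaviour after the patch: tracing follows both transitions. */
void load_gs_traced(struct cpu_model *cpu, bool faults)
{
	bool saved = cpu->irqs_on;	/* pushfq */

	cpu->irqs_on = false;		/* DISABLE_INTERRUPTS */
	cpu->traced_on = false;		/* TRACE_IRQS_OFF */
	if (faults)
		fault_handler(cpu);
	if (saved)
		cpu->traced_on = true;	/* TRACE_IRQS_FLAGS (%rsp) */
	cpu->irqs_on = saved;		/* popfq */
}
```

In the untraced variant a faulting load leaves `irqs_on` true and `traced_on` false, which is the inconsistent state lockdep later complains about.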
During an EEH event a kernel oops is reported if no vPHB is allocated to
the AFU. This happens because, during AFU init, an error in creating the
vPHB is non-fatal. Hence afu->phb should always be checked for NULL before
iterating over it for the virtual AFU PCI devices.
This patch fixes the kernel oops by adding a NULL pointer check for
afu->phb before it is dereferenced.
Fixes: 9e8df8a2196 ("cxl: EEH support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Vaibhav Jain <vaibhav(a)linux.vnet.ibm.com>
---
Changelog:
Resend -> Added the 'Fixes' info and marking the patch to stable tree [Mpe]
v2 -> Added the vphb NULL check to cxl_vphb_error_detected() [Andrew]
---
drivers/misc/cxl/pci.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index bb7fd3f4edab..18773343ab3e 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -2083,6 +2083,9 @@ static pci_ers_result_t cxl_vphb_error_detected(struct cxl_afu *afu,
/* There should only be one entry, but go through the list
* anyway
*/
+ if (afu->phb == NULL)
+ return result;
+
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
if (!afu_dev->driver)
continue;
@@ -2124,8 +2127,7 @@ static pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
* Tell the AFU drivers; but we don't care what they
* say, we're going away.
*/
- if (afu->phb != NULL)
- cxl_vphb_error_detected(afu, state);
+ cxl_vphb_error_detected(afu, state);
}
return PCI_ERS_RESULT_DISCONNECT;
}
@@ -2265,6 +2267,9 @@ static pci_ers_result_t cxl_pci_slot_reset(struct pci_dev *pdev)
if (cxl_afu_select_best_mode(afu))
goto err;
+ if (afu->phb == NULL)
+ continue;
+
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
/* Reset the device context.
* TODO: make this less disruptive
@@ -2327,6 +2332,9 @@ static void cxl_pci_resume(struct pci_dev *pdev)
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
+ if (afu->phb == NULL)
+ continue;
+
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
if (afu_dev->driver && afu_dev->driver->err_handler &&
afu_dev->driver->err_handler->resume)
--
2.14.3
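The guard this patch places in front of each walk of afu->phb->bus->devices follows one pattern: skip any AFU whose optional vPHB was never created. A minimal userspace sketch of that pattern is shown below. The structure layouts and count_vphb_devices() are hypothetical simplifications, not the cxl driver's types.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the driver structures. */
struct sketch_vphb {
	int ndevices;			/* devices on the virtual bus */
};

struct sketch_afu {
	struct sketch_vphb *phb;	/* NULL if vPHB creation failed */
};

/* Sum the devices of all AFUs, skipping AFUs that have no vPHB. */
int count_vphb_devices(struct sketch_afu **afus, size_t n)
{
	int total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (afus[i]->phb == NULL)
			continue;	/* nothing to iterate, and no dereference */
		total += afus[i]->phb->ndevices;
	}
	return total;
}
```

The condition has to be `== NULL`. With the opposite test the loop would skip every AFU that has a vPHB and dereference the NULL pointer of every AFU that lacks one.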
On Wed, Nov 15, 2017 at 03:39:22PM +0000, Moore, Robert wrote:
>> -----Original Message-----
>> From: alexander.levin(a)verizon.com [mailto:alexander.levin@verizon.com]
>> Sent: Tuesday, November 14, 2017 6:46 PM
>> To: linux-kernel(a)vger.kernel.org; stable(a)vger.kernel.org
>> Cc: Moore, Robert <robert.moore(a)intel.com>; Zheng, Lv
>> <lv.zheng(a)intel.com>; Wysocki, Rafael J <rafael.j.wysocki(a)intel.com>;
>> alexander.levin(a)verizon.com
>> Subject: [PATCH AUTOSEL for 4.9 01/56] ACPICA: Resources: Not a valid
>> resource if buffer length too long
>>
>> From: Bob Moore <robert.moore(a)intel.com>
>>
>> [ Upstream commit 57707a9a7780fab426b8ae9b4c7b65b912a748b3 ]
>>
>> ACPICA commit 9f76de2d249b18804e35fb55d14b1c2604d627a1
>> ACPICA commit b2e89d72ef1e9deefd63c3fd1dee90f893575b3a
>> ACPICA commit 23b5bbe6d78afd3c5abf3adb91a1b098a3000b2e
>>
>> The declared buffer length must be the same as the length of the byte
>> initializer list, otherwise not a valid resource descriptor.
[snip]
>[Moore, Robert]
>
>Please explain what you are doing here.
Proposing this commit for the 4.9 LTS tree.
--
Thanks,
Sasha
From: Peter Ujfalusi <peter.ujfalusi(a)ti.com>
[ Upstream commit 657279778af54f35e54b07b6687918f254a2992c ]
OMAP1510, OMAP5910 and OMAP310 have only 9 logical channels.
OMAP1610, OMAP5912, OMAP1710, OMAP730, and OMAP850 have 16 logical channels
available.
The wired 17 for the lch_count must have been used to cover the 16 + 1
dedicated LCD channel, in reality we can only use 9 or 16 channels.
The d->chan_count is not used by the omap-dma stack, so we can skip the
setup. chan_count was configured to the number of logical channels and not
the actual number of physical channels anyways.
Signed-off-by: Peter Ujfalusi <peter.ujfalusi(a)ti.com>
Acked-by: Aaro Koskinen <aaro.koskinen(a)iki.fi>
Signed-off-by: Tony Lindgren <tony(a)atomide.com>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
---
arch/arm/mach-omap1/dma.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/arch/arm/mach-omap1/dma.c b/arch/arm/mach-omap1/dma.c
index 4be601b638d7..8129e5f9c94d 100644
--- a/arch/arm/mach-omap1/dma.c
+++ b/arch/arm/mach-omap1/dma.c
@@ -31,7 +31,6 @@
#include <mach/irqs.h>
#define OMAP1_DMA_BASE (0xfffed800)
-#define OMAP1_LOGICAL_DMA_CH_COUNT 17
static u32 enable_1510_mode;
@@ -311,8 +310,6 @@ static int __init omap1_system_dma_init(void)
goto exit_iounmap;
}
- d->lch_count = OMAP1_LOGICAL_DMA_CH_COUNT;
-
/* Valid attributes for omap1 plus processors */
if (cpu_is_omap15xx())
d->dev_caps = ENABLE_1510_MODE;
@@ -329,13 +326,14 @@ static int __init omap1_system_dma_init(void)
d->dev_caps |= CLEAR_CSR_ON_READ;
d->dev_caps |= IS_WORD_16;
- if (cpu_is_omap15xx())
- d->chan_count = 9;
- else if (cpu_is_omap16xx() || cpu_is_omap7xx()) {
- if (!(d->dev_caps & ENABLE_1510_MODE))
- d->chan_count = 16;
+ /* available logical channels */
+ if (cpu_is_omap15xx()) {
+ d->lch_count = 9;
+ } else {
+ if (d->dev_caps & ENABLE_1510_MODE)
+ d->lch_count = 9;
else
- d->chan_count = 9;
+ d->lch_count = 16;
}
p = dma_plat_info;
--
2.11.0
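The channel-count selection introduced by this patch reduces to a two-input decision. The sketch below restates it as a standalone function. The name omap1_logical_channels() and its boolean parameters are hypothetical: they stand in for cpu_is_omap15xx() and for the ENABLE_1510_MODE bit in d->dev_caps.

```c
#include <stdbool.h>

/*
 * Number of usable logical DMA channels.
 * 15xx-class parts, and later parts running in 1510 compatibility
 * mode, expose 9 channels; everything else exposes 16.
 */
int omap1_logical_channels(bool is_omap15xx, bool mode_1510_enabled)
{
	if (is_omap15xx)
		return 9;
	if (mode_1510_enabled)
		return 9;
	return 16;
}
```

No input yields 17, the value the removed OMAP1_LOGICAL_DMA_CH_COUNT define used to assign.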
From: Florian Fainelli <f.fainelli(a)gmail.com>
[ Upstream commit bb7da333d0a9f3bddc08f84187b7579a3f68fd24 ]
Since we need to pad our packets, utilize skb_put_padto() which
increases skb->len by how much we need to pad, allowing us to eliminate
the test on skb->len right below.
Signed-off-by: Florian Fainelli <f.fainelli(a)gmail.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
---
drivers/net/ethernet/broadcom/bcmsysport.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
index 8860e74aa28f..fae1a1ff53ab 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -1061,13 +1061,12 @@ static netdev_tx_t bcm_sysport_xmit(struct sk_buff *skb,
* (including FCS and tag) because the length verification is done after
* the Broadcom tag is stripped off the ingress packet.
*/
- if (skb_padto(skb, ETH_ZLEN + ENET_BRCM_TAG_LEN)) {
+ if (skb_put_padto(skb, ETH_ZLEN + ENET_BRCM_TAG_LEN)) {
ret = NETDEV_TX_OK;
goto out;
}
- skb_len = skb->len < ETH_ZLEN + ENET_BRCM_TAG_LEN ?
- ETH_ZLEN + ENET_BRCM_TAG_LEN : skb->len;
+ skb_len = skb->len;
mapping = dma_map_single(kdev, skb->data, skb_len, DMA_TO_DEVICE);
if (dma_mapping_error(kdev, mapping)) {
--
2.11.0
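This patch depends on padding a frame and updating its recorded length in one step, so no separate length check is needed afterwards. The userspace sketch below illustrates that idea on a fixed-size buffer. struct sketch_frame and frame_put_padto() are hypothetical and only loosely modelled on skb_put_padto(); they do not reproduce the sk_buff API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-capacity frame; a real sk_buff is far more involved. */
struct sketch_frame {
	unsigned char data[128];
	size_t len;		/* bytes currently in use */
};

/*
 * Zero-pad the frame up to min_len and grow len to match.
 * Returns 0 on success, -1 if min_len does not fit in the buffer.
 * Frames that are already long enough are left untouched.
 */
int frame_put_padto(struct sketch_frame *f, size_t min_len)
{
	if (min_len > sizeof(f->data))
		return -1;
	if (f->len < min_len) {
		memset(f->data + f->len, 0, min_len - f->len);
		f->len = min_len;
	}
	return 0;
}
```

Because `len` already includes the padding on return, a caller can use it directly as the DMA mapping length, which is what the `skb_len = skb->len;` line in the diff does.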
From: Bob Moore <robert.moore(a)intel.com>
[ Upstream commit 57707a9a7780fab426b8ae9b4c7b65b912a748b3 ]
ACPICA commit 9f76de2d249b18804e35fb55d14b1c2604d627a1
ACPICA commit b2e89d72ef1e9deefd63c3fd1dee90f893575b3a
ACPICA commit 23b5bbe6d78afd3c5abf3adb91a1b098a3000b2e
The declared buffer length must be the same as the length of the
byte initializer list, otherwise not a valid resource descriptor.
Link: https://github.com/acpica/acpica/commit/9f76de2d
Link: https://github.com/acpica/acpica/commit/b2e89d72
Link: https://github.com/acpica/acpica/commit/23b5bbe6
Signed-off-by: Bob Moore <robert.moore(a)intel.com>
Signed-off-by: Lv Zheng <lv.zheng(a)intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
---
drivers/acpi/acpica/utresrc.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/acpi/acpica/utresrc.c b/drivers/acpi/acpica/utresrc.c
index 1de3376da66a..2ad99ea3d496 100644
--- a/drivers/acpi/acpica/utresrc.c
+++ b/drivers/acpi/acpica/utresrc.c
@@ -421,8 +421,10 @@ acpi_ut_walk_aml_resources(struct acpi_walk_state *walk_state,
ACPI_FUNCTION_TRACE(ut_walk_aml_resources);
- /* The absolute minimum resource template is one end_tag descriptor */
-
+ /*
+ * The absolute minimum resource template is one end_tag descriptor.
+ * However, we will treat a lone end_tag as just a simple buffer.
+ */
if (aml_length < sizeof(struct aml_resource_end_tag)) {
return_ACPI_STATUS(AE_AML_NO_RESOURCE_END_TAG);
}
@@ -454,9 +456,8 @@ acpi_ut_walk_aml_resources(struct acpi_walk_state *walk_state,
/* Invoke the user function */
if (user_function) {
- status =
- user_function(aml, length, offset, resource_index,
- context);
+ status = user_function(aml, length, offset,
+ resource_index, context);
if (ACPI_FAILURE(status)) {
return_ACPI_STATUS(status);
}
@@ -480,6 +481,12 @@ acpi_ut_walk_aml_resources(struct acpi_walk_state *walk_state,
*context = aml;
}
+ /* Check if buffer is defined to be longer than the resource length */
+
+ if (aml_length > (offset + length)) {
+ return_ACPI_STATUS(AE_AML_NO_RESOURCE_END_TAG);
+ }
+
/* Normal exit */
return_ACPI_STATUS(AE_OK);
--
2.11.0
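The length check added to acpi_ut_walk_aml_resources() above can be sketched with a simplified walker. The descriptor layout here is an assumption made for illustration only (one tag byte, one body-length byte, then the body; an end tag of 0x79 followed by a checksum byte), and toy_walk() and the TOY_* names are hypothetical, not ACPICA code. The point is the final comparison: a buffer declared longer than the descriptors it contains is rejected.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_END_TAG      0x79	/* first byte of the toy end_tag descriptor */
#define TOY_END_TAG_SIZE 2	/* tag byte plus checksum byte */

enum toy_status { TOY_OK = 0, TOY_NO_END_TAG = -1 };

/* Walk a toy resource template and validate its declared length. */
static enum toy_status toy_walk(const uint8_t *aml, size_t aml_length)
{
	size_t offset = 0;

	/* The absolute minimum template is one end_tag descriptor. */
	if (aml_length < TOY_END_TAG_SIZE)
		return TOY_NO_END_TAG;

	while (offset + TOY_END_TAG_SIZE <= aml_length) {
		size_t length;

		if (aml[offset] == TOY_END_TAG) {
			length = TOY_END_TAG_SIZE;
			/* Buffer declared longer than the resource data. */
			if (aml_length > offset + length)
				return TOY_NO_END_TAG;
			return TOY_OK;
		}

		/* Skip tag byte, length byte and the descriptor body. */
		length = 2 + (size_t)aml[offset + 1];
		offset += length;
	}

	/* Ran off the end of the buffer without seeing an end tag. */
	return TOY_NO_END_TAG;
}
```

A template whose end tag is followed by extra bytes fails in the same way as one with no end tag at all, which mirrors the status code the patch returns.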
This is the start of the stable review cycle for the 3.18.84 release.
There are 12 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Fri Nov 24 10:10:45 UTC 2017.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v3.x/stable-review/patch-3.18.84-rc1.gz
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-3.18.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 3.18.84-rc1
Jan Harkes <jaharkes(a)cs.cmu.edu>
coda: fix 'kernel memory exposure attempt' in fsync
Corey Minyard <cminyard(a)mvista.com>
ipmi: fix unsigned long underflow
alex chen <alex.chen(a)huawei.com>
ocfs2: should wait dio before inode lock in ocfs2_setattr()
Roberto Sassu <roberto.sassu(a)huawei.com>
ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
Cong Wang <xiyou.wangcong(a)gmail.com>
vlan: fix a use-after-free in vlan_device_event()
Jason A. Donenfeld <Jason(a)zx2c4.com>
af_netlink: ensure that NLMSG_DONE never fails in dumps
Huacai Chen <chenhc(a)lemote.com>
fealnx: Fix building error on MIPS
Xin Long <lucien.xin(a)gmail.com>
sctp: do not peel off an assoc from one netns to another one
Ye Yin <hustcat(a)gmail.com>
netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed
Eric Dumazet <edumazet(a)google.com>
tcp: do not mangle skb->cb[] in tcp_make_synack()
Eric W. Biederman <ebiederm(a)xmission.com>
net/sctp: Always set scope_id in sctp_inet6_skb_msgname
WANG Cong <xiyou.wangcong(a)gmail.com>
ipv6/dccp: do not inherit ipv6_mc_list from parent
-------------
Diffstat:
Makefile | 4 ++--
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
drivers/net/ethernet/fealnx.c | 6 +++---
fs/coda/upcall.c | 3 +--
fs/ocfs2/file.c | 9 +++++++--
include/linux/skbuff.h | 7 +++++++
net/8021q/vlan.c | 6 +++---
net/core/skbuff.c | 1 +
net/dccp/ipv6.c | 7 +++++++
net/ipv4/tcp_output.c | 9 ++-------
net/ipv6/tcp_ipv6.c | 2 ++
net/netlink/af_netlink.c | 17 +++++++++++------
net/netlink/af_netlink.h | 1 +
net/sctp/ipv6.c | 2 ++
net/sctp/socket.c | 4 ++++
security/integrity/ima/ima_appraise.c | 3 +++
16 files changed, 62 insertions(+), 29 deletions(-)
This is the start of the stable review cycle for the 4.14.1 release.
There are 31 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Tue Nov 21 14:59:32 UTC 2017.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.1-rc1.gz
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.14.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.14.1-rc1
Johan Hovold <johan(a)kernel.org>
spi: fix use-after-free at controller deregistration
Hans de Goede <hdegoede(a)redhat.com>
staging: rtl8188eu: Revert 4 commits breaking ARP
Hans de Goede <hdegoede(a)redhat.com>
staging: vboxvideo: Fix reporting invalid suggested-offset-properties
Johan Hovold <johan(a)kernel.org>
staging: greybus: spilib: fix use-after-free after deregistration
Gilad Ben-Yossef <gilad(a)benyossef.com>
staging: ccree: fix 64 bit scatter/gather DMA ops
Huacai Chen <chenhc(a)lemote.com>
staging: sm750fb: Fix parameter mistake in poke32
Aditya Shankar <aditya.shankar(a)microchip.com>
staging: wilc1000: Fix bssid buffer offset in Txq
Bjorn Andersson <bjorn.andersson(a)linaro.org>
rpmsg: glink: Add missing MODULE_LICENSE
Jason Gerecke <killertofu(a)gmail.com>
HID: wacom: generic: Recognize WACOM_HID_WD_PEN as a type of pen collection
Sébastien Szymanski <sebastien.szymanski(a)armadeus.com>
HID: cp2112: add HIDRAW dependency
Hans de Goede <hdegoede(a)redhat.com>
platform/x86: peaq_wmi: Fix missing terminating entry for peaq_dmi_table
Hans de Goede <hdegoede(a)redhat.com>
platform/x86: peaq-wmi: Add DMI check before binding to the WMI interface
Yazen Ghannam <yazen.ghannam(a)amd.com>
x86/MCE/AMD: Always give panic severity for UC errors in kernel context
Andy Lutomirski <luto(a)kernel.org>
selftests/x86/protection_keys: Fix syscall NR redefinition warnings
Johan Hovold <johan(a)kernel.org>
USB: serial: garmin_gps: fix memory leak on probe errors
Johan Hovold <johan(a)kernel.org>
USB: serial: garmin_gps: fix I/O after failed probe and remove
Douglas Fischer <douglas.fischer(a)outlook.com>
USB: serial: qcserial: add pid/vid for Sierra Wireless EM7355 fw update
Lu Baolu <baolu.lu(a)linux.intel.com>
USB: serial: Change DbC debug device binding ID
Johan Hovold <johan(a)kernel.org>
USB: serial: metro-usb: stop I/O after failed open
Andrew Gabbasov <andrew_gabbasov(a)mentor.com>
usb: gadget: f_fs: Fix use-after-free in ffs_free_inst
Bernhard Rosenkraenzer <bernhard.rosenkranzer(a)linaro.org>
USB: Add delay-init quirk for Corsair K70 LUX keyboards
Alan Stern <stern(a)rowland.harvard.edu>
USB: usbfs: compute urb->actual_length for isochronous
Lu Baolu <baolu.lu(a)linux.intel.com>
USB: early: Use new USB product ID and strings for DbC device
raveendra padasalagi <raveendra.padasalagi(a)broadcom.com>
crypto: brcm - Explicity ACK mailbox message
Eric Biggers <ebiggers(a)google.com>
crypto: dh - Don't permit 'key' or 'g' size longer than 'p'
Eric Biggers <ebiggers(a)google.com>
crypto: dh - Don't permit 'p' to be 0
Eric Biggers <ebiggers(a)google.com>
crypto: dh - Fix double free of ctx->p
Andrey Konovalov <andreyknvl(a)google.com>
media: dib0700: fix invalid dvb_detach argument
Arvind Yadav <arvind.yadav.cs(a)gmail.com>
media: imon: Fix null-ptr-deref in imon_probe
Adam Wallis <awallis(a)codeaurora.org>
dmaengine: dmatest: warn user when dma test times out
Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
EDAC, sb_edac: Don't create a second memory controller if HA1 is not present
-------------
Diffstat:
Makefile | 4 +-
arch/x86/kernel/cpu/mcheck/mce-severity.c | 7 +-
crypto/dh.c | 33 ++++-----
crypto/dh_helper.c | 16 ++++
drivers/crypto/bcm/cipher.c | 101 ++++++++++++--------------
drivers/dma/dmatest.c | 1 +
drivers/edac/sb_edac.c | 9 ++-
drivers/hid/Kconfig | 2 +-
drivers/hid/wacom_wac.h | 1 +
drivers/media/rc/imon.c | 5 ++
drivers/media/usb/dvb-usb/dib0700_devices.c | 24 +++---
drivers/platform/x86/peaq-wmi.c | 19 +++++
drivers/rpmsg/qcom_glink_native.c | 3 +
drivers/spi/spi.c | 5 +-
drivers/staging/ccree/cc_lli_defs.h | 2 +-
drivers/staging/greybus/spilib.c | 8 +-
drivers/staging/rtl8188eu/core/rtw_recv.c | 83 ++++++++++++---------
drivers/staging/rtl8188eu/os_dep/mon.c | 34 ++-------
drivers/staging/sm750fb/ddk750_chip.h | 2 +-
drivers/staging/vboxvideo/vbox_drv.h | 8 +-
drivers/staging/vboxvideo/vbox_irq.c | 4 +-
drivers/staging/vboxvideo/vbox_mode.c | 26 +++++--
drivers/staging/wilc1000/wilc_wlan.c | 2 +-
drivers/usb/core/devio.c | 14 ++++
drivers/usb/core/quirks.c | 3 +
drivers/usb/early/xhci-dbc.h | 6 +-
drivers/usb/gadget/function/f_fs.c | 1 +
drivers/usb/serial/garmin_gps.c | 22 +++++-
drivers/usb/serial/metro-usb.c | 11 ++-
drivers/usb/serial/qcserial.c | 1 +
drivers/usb/serial/usb_debug.c | 4 +-
tools/testing/selftests/x86/protection_keys.c | 24 ++++--
32 files changed, 289 insertions(+), 196 deletions(-)
Kthread function bch_allocator_thread() references allocator_wait(ca, cond)
and when kthread_should_stop() is true, this kthread exits.
The problem is that if kthread_should_stop() is true, the allocator_wait()
macro executes "return 0" while the current task state is still
TASK_INTERRUPTIBLE. After bch_allocator_thread() returns to do_exit(), some
blocking operations are called, and __might_sleep() in
kernel/sched/core.c emits a kernel warning:
"WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at [xxxx]"
If the task is interrupted and preempted while its state is
TASK_INTERRUPTIBLE, the scheduler will never pick it to run again,
and the allocator thread may hang in do_exit().
This patch sets allocator kthread state back to TASK_RUNNING before it
returns to do_exit(), which avoids a potential deadlock.
Signed-off-by: Coly Li <colyli(a)suse.de>
Cc: stable(a)vger.kernel.org
---
drivers/md/bcache/alloc.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index a27d85232ce1..996ebbabd819 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -286,9 +286,12 @@ do { \
if (cond) \
break; \
\
+ \
mutex_unlock(&(ca)->set->bucket_lock); \
- if (kthread_should_stop()) \
+ if (kthread_should_stop()) { \
+ __set_current_state(TASK_RUNNING); \
return 0; \
+ } \
\
schedule(); \
mutex_lock(&(ca)->set->bucket_lock); \
--
2.13.6
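The state handling in the allocator_wait() hunk above can be modelled in user space. This is a sketch under stated assumptions: toy_current_state stands in for the task state, toy_allocator_thread() is a hypothetical single-pass version of the macro's loop, and the apply_fix flag exists only to compare the behaviour before and after the patch.

```c
#include <stdbool.h>

/* Values chosen to match the "state=1" seen in the warning text. */
enum toy_state { TOY_TASK_RUNNING = 0, TOY_TASK_INTERRUPTIBLE = 1 };

/* Hypothetical stand-in for the current task's state. */
static enum toy_state toy_current_state = TOY_TASK_RUNNING;

/*
 * One pass through the shape of allocator_wait(): mark the task as
 * sleeping, test the condition, and bail out early if a stop was requested.
 */
static int toy_allocator_thread(bool cond, bool should_stop, bool apply_fix)
{
	toy_current_state = TOY_TASK_INTERRUPTIBLE;

	if (!cond) {
		if (should_stop) {
			/* The patch restores the state before returning. */
			if (apply_fix)
				toy_current_state = TOY_TASK_RUNNING;
			return 0;
		}
		/* The real macro would call schedule() and loop here. */
	}

	/* Normal exit from the wait: the task is runnable again. */
	toy_current_state = TOY_TASK_RUNNING;
	return 1;
}
```

Calling toy_allocator_thread(false, true, false) returns with the state still at TOY_TASK_INTERRUPTIBLE, which is the situation __might_sleep() complains about; with apply_fix set, the early return leaves the state at TOY_TASK_RUNNING.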
This is the start of the stable review cycle for the 4.4.100 release.
There are 59 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Tue Nov 21 14:31:34 UTC 2017.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.100-rc1.gz
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.4.100-rc1
Johan Hovold <johan(a)kernel.org>
USB: serial: garmin_gps: fix memory leak on probe errors
Johan Hovold <johan(a)kernel.org>
USB: serial: garmin_gps: fix I/O after failed probe and remove
Douglas Fischer <douglas.fischer(a)outlook.com>
USB: serial: qcserial: add pid/vid for Sierra Wireless EM7355 fw update
Bernhard Rosenkraenzer <bernhard.rosenkranzer(a)linaro.org>
USB: Add delay-init quirk for Corsair K70 LUX keyboards
Alan Stern <stern(a)rowland.harvard.edu>
USB: usbfs: compute urb->actual_length for isochronous
Dmitry V. Levin <ldv(a)altlinux.org>
uapi: fix linux/rds.h userspace compilation errors
Dmitry V. Levin <ldv(a)altlinux.org>
uapi: fix linux/rds.h userspace compilation error
Sasha Levin <alexander.levin(a)verizon.com>
Revert "uapi: fix linux/rds.h userspace compilation errors"
Sasha Levin <alexander.levin(a)verizon.com>
Revert "crypto: xts - Add ECB dependency"
Paul Burton <paul.burton(a)imgtec.com>
MIPS: Netlogic: Exclude netlogic,xlp-pic code from XLR builds
Marcin Nowakowski <marcin.nowakowski(a)imgtec.com>
MIPS: init: Ensure reserved memory regions are not added to bootmem
Marcin Nowakowski <marcin.nowakowski(a)imgtec.com>
MIPS: init: Ensure bootmem does not corrupt reserved memory
Paul Burton <paul.burton(a)imgtec.com>
MIPS: End asm function prologue macros with .insn
Jannik Becher <becher.jannik(a)gmail.com>
staging: rtl8712: fixed little endian problem
Emil Tantilov <emil.s.tantilov(a)intel.com>
ixgbe: do not disable FEC from the driver
Emil Tantilov <emil.s.tantilov(a)intel.com>
ixgbe: add mask for 64 RSS queues
Tony Nguyen <anthony.l.nguyen(a)intel.com>
ixgbe: Reduce I2C retry count on X550 devices
Emil Tantilov <emil.s.tantilov(a)intel.com>
ixgbe: handle close/suspend race with netif_device_detach/present
Emil Tantilov <emil.s.tantilov(a)intel.com>
ixgbe: fix AER error handling
Jon Mason <jon.mason(a)broadcom.com>
arm64: dts: NS2: reserve memory for Nitro firmware
Kailang Yang <kailang(a)realtek.com>
ALSA: hda/realtek - Add new codec ID ALC299
Arvind Yadav <arvind.yadav.cs(a)gmail.com>
gpu: drm: mgag200: mgag200_main:- Handle error from pci_iomap
Alexey Khoroshilov <khoroshilov(a)ispras.ru>
backlight: adp5520: Fix error handling in adp5520_bl_probe()
Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
backlight: lcd: Fix race condition during register
Takashi Iwai <tiwai(a)suse.de>
ALSA: vx: Fix possible transfer overflow
Takashi Iwai <tiwai(a)suse.de>
ALSA: vx: Don't try to update capture stream before running
James Smart <james.smart(a)broadcom.com>
scsi: lpfc: Clear the VendorVersion in the PLOGI/PLOGI ACC payload
James Smart <james.smart(a)broadcom.com>
scsi: lpfc: Correct issue leading to oops during link reset
James Smart <james.smart(a)broadcom.com>
scsi: lpfc: Correct host name in symbolic_name field
James Smart <james.smart(a)broadcom.com>
scsi: lpfc: FCoE VPort enable-disable does not bring up the VPort
James Smart <james.smart(a)broadcom.com>
scsi: lpfc: Add missing memory barrier
Galo Navarro <anglorvaroa(a)gmail.com>
staging: rtl8188eu: fix incorrect ERROR tags from logs
subhashj(a)codeaurora.org <subhashj(a)codeaurora.org>
scsi: ufs: add capability to keep auto bkops always enabled
Javier Martinez Canillas <javier(a)osg.samsung.com>
scsi: ufs-qcom: Fix module autoload
Hannu Lounento <hannu.lounento(a)ge.com>
igb: Fix hw_dbg logging in igb_update_flash_i210
Todd Fujinaka <todd.fujinaka(a)intel.com>
igb: close/suspend race in netif_device_detach
Aaron Sierra <asierra(a)xes-inc.com>
igb: reset the PHY before reading the PHY ID
Arvind Yadav <arvind.yadav.cs(a)gmail.com>
drm/sti: sti_vtg: Handle return NULL error from devm_ioremap_nocache
Geert Uytterhoeven <geert(a)linux-m68k.org>
ata: SATA_MV should depend on HAS_DMA
Geert Uytterhoeven <geert(a)linux-m68k.org>
ata: SATA_HIGHBANK should depend on HAS_DMA
Geert Uytterhoeven <geert(a)linux-m68k.org>
ata: ATA_BMDMA should depend on HAS_DMA
Tony Lindgren <tony(a)atomide.com>
ARM: dts: Fix omap3 off mode pull defines
Tony Lindgren <tony(a)atomide.com>
ARM: OMAP2+: Fix init for multiple quirks for the same SoC
Tony Lindgren <tony(a)atomide.com>
ARM: dts: Fix am335x and dm814x scm syscon to probe children
Tony Lindgren <tony(a)atomide.com>
ARM: dts: Fix compatible for ti81xx uarts for 8250
Ngai-Mint Kwan <ngai-mint.kwan(a)intel.com>
fm10k: request reset when mbx->state changes
Roger Quadros <rogerq(a)ti.com>
extcon: palmas: Check the parent instance to prevent the NULL
Adam Wallis <awallis(a)codeaurora.org>
dmaengine: dmatest: warn user when dma test times out
Leif Liddy <leif.linux(a)gmail.com>
Bluetooth: btusb: fix QCA Rome suspend/resume
Eric Biggers <ebiggers(a)google.com>
arm: crypto: reduce priority of bit-sliced AES cipher
Bjørn Mork <bjorn(a)mork.no>
net: qmi_wwan: fix divide by 0 on bad descriptors
Bjørn Mork <bjorn(a)mork.no>
net: cdc_ether: fix divide by 0 on bad descriptors
Xin Long <lucien.xin(a)gmail.com>
sctp: do not peel off an assoc from one netns to another one
Jan Beulich <jbeulich(a)suse.com>
xen-blkback: don't leak stack data via response ring
Daniel Borkmann <daniel(a)iogearbox.net>
bpf: don't let ldimm64 leak map addresses on unprivileged
Paolo Bonzini <pbonzini(a)redhat.com>
KVM: x86: fix singlestepping over syscall
Jan Kara <jack(a)suse.cz>
ext4: fix data exposure after a crash
Andrey Konovalov <andreyknvl(a)google.com>
media: dib0700: fix invalid dvb_detach argument
Arvind Yadav <arvind.yadav.cs(a)gmail.com>
media: imon: Fix null-ptr-deref in imon_probe
-------------
Diffstat:
Makefile | 4 +-
arch/arm/boot/dts/am33xx.dtsi | 3 +-
arch/arm/boot/dts/dm814x.dtsi | 9 ++-
arch/arm/boot/dts/dm816x.dtsi | 6 +-
arch/arm/crypto/aesbs-glue.c | 6 +-
arch/arm/mach-omap2/pdata-quirks.c | 1 -
arch/arm64/boot/dts/broadcom/ns2.dtsi | 2 +
arch/mips/include/asm/asm.h | 10 ++-
arch/mips/kernel/setup.c | 78 +++++++++++++++++++-
arch/mips/netlogic/common/irq.c | 4 +-
arch/x86/include/asm/kvm_emulate.h | 1 +
arch/x86/kvm/emulate.c | 1 +
arch/x86/kvm/x86.c | 52 ++++++-------
crypto/Kconfig | 1 -
drivers/ata/Kconfig | 3 +
drivers/block/xen-blkback/blkback.c | 23 +++---
drivers/block/xen-blkback/common.h | 25 ++-----
drivers/bluetooth/btusb.c | 6 ++
drivers/dma/dmatest.c | 1 +
drivers/extcon/extcon-palmas.c | 5 ++
drivers/gpu/drm/mgag200/mgag200_main.c | 2 +
drivers/gpu/drm/sti/sti_vtg.c | 4 +
drivers/media/rc/imon.c | 5 ++
drivers/media/usb/dvb-usb/dib0700_devices.c | 24 +++---
drivers/net/ethernet/intel/fm10k/fm10k_mbx.c | 10 ++-
drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 6 +-
drivers/net/ethernet/intel/igb/e1000_82575.c | 11 +++
drivers/net/ethernet/intel/igb/e1000_i210.c | 4 +-
drivers/net/ethernet/intel/igb/igb_main.c | 21 +++---
drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c | 8 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 +++---
drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c | 4 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 2 -
drivers/net/usb/cdc_ether.c | 2 +-
drivers/net/usb/qmi_wwan.c | 2 +-
drivers/scsi/lpfc/lpfc_attr.c | 17 +++++
drivers/scsi/lpfc/lpfc_els.c | 6 ++
drivers/scsi/lpfc/lpfc_hw.h | 6 ++
drivers/scsi/lpfc/lpfc_sli.c | 3 +
drivers/scsi/lpfc/lpfc_vport.c | 8 ++
drivers/scsi/ufs/ufs-qcom.c | 1 +
drivers/scsi/ufs/ufshcd.c | 33 ++++++---
drivers/scsi/ufs/ufshcd.h | 13 ++++
drivers/staging/rtl8188eu/include/rtw_debug.h | 2 +-
drivers/staging/rtl8712/rtl871x_ioctl_linux.c | 2 +-
drivers/usb/core/devio.c | 14 ++++
drivers/usb/core/quirks.c | 3 +
drivers/usb/serial/garmin_gps.c | 22 +++++-
drivers/usb/serial/qcserial.c | 1 +
drivers/video/backlight/adp5520_bl.c | 12 ++-
drivers/video/backlight/lcd.c | 4 +-
fs/ext4/inode.c | 23 +++---
include/dt-bindings/pinctrl/omap.h | 4 +-
include/uapi/linux/rds.h | 102 +++++++++++++-------------
kernel/bpf/verifier.c | 21 ++++--
net/sctp/socket.c | 4 +
sound/drivers/vx/vx_pcm.c | 8 +-
sound/pci/hda/patch_realtek.c | 10 +++
sound/pci/vx222/vx222_ops.c | 12 +--
sound/pcmcia/vx/vxp_ops.c | 12 +--
60 files changed, 481 insertions(+), 231 deletions(-)
On Wed 22-11-17 10:09:13, Zi Yan wrote:
>
>
> Michal Hocko wrote:
> > On Wed 22-11-17 09:43:46, Zi Yan wrote:
> >>
> >> Michal Hocko wrote:
[...]
> >>> but why is it unsafe to enable the feature on other arches which support
> >>> THP? Is there any plan to do the next step and remove this config
> >>> option?
> >> Because different architectures have their own way of specifying a swap
> >> entry. This means, to support THP migration, each architecture needs to
> >> add its own __pmd_to_swp_entry() and __swp_entry_to_pmd(), which are
> >> used for arch-independent pmd_to_swp_entry() and swp_entry_to_pmd().
> >
> > I understand that part. But this smells like a matter of coding, no?
> > I was surprised to see the note about safety which didn't make much sense
> > to me.
>
> And testing as well. I had powerpc book3s support in my initial patch
> submission, but removed it because I do not have access to the powerpc
> machine any more. I also tried ARM64, which seems to work after adding the
> code, but I have no hardware to test it now.
>
> Any suggestions?
Cc arch maintainers and mailing lists?
--
Michal Hocko
SUSE Labs
This is the start of the stable review cycle for the 3.2.96 release.
There are 61 patches in this series, which will be posted as responses
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Fri Nov 24 20:00:00 UTC 2017.
Anything received after that time might be too late.
A combined patch relative to 3.2.95 will be posted as an additional
response to this. A shortlog and diffstat can be found below.
Ben.
-------------
Aleksandr Bezzubikov (1):
PCI: shpchp: Enable bridge bus mastering if MSI is enabled
[48b79a14505349a29b3e20f03619ada9b33c4b17]
Amir Goldstein (1):
xfs: fix incorrect log_flushed on fsync
[47c7d0b19502583120c3f396c7559e7a77288a68]
Andrey Korolyov (1):
cs5536: add support for IDE controller variant
[591b6bb605785c12a21e8b07a08a277065b655a5]
Andy Lutomirski (1):
x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
[9584d98bed7a7a904d0702ad06bbcc94703cb5b4]
Arvind Yadav (1):
media: imon: Fix null-ptr-deref in imon_probe
[58fd55e838276a0c13d1dc7c387f90f25063cbf3]
Bart Van Assche (1):
block: Relax a check in blk_start_queue()
[4ddd56b003f251091a67c15ae3fe4a5c5c5e390a]
Ben Hutchings (1):
mac80211: Fix null dereference in ieee80211_key_link()
[not upstream; fixes a regression specific to 3.2-stable]
Benjamin Block (1):
scsi: zfcp: add handling for FCP_RESID_OVER to the fcp ingress path
[a099b7b1fc1f0418ab8d79ecf98153e1e134656e]
Bjørn Mork (1):
net: cdc_ether: fix divide by 0 on bad descriptors
[2cb80187ba065d7decad7c6614e35e07aec8a974]
Brian King (1):
scsi: aacraid: Fix command send race condition
[1ae948fa4f00f3a2823e7cb19a3049ef27dd6947]
Cameron Gutman (2):
Input: xpad - don't depend on endpoint order
[c01b5e7464f0cf20936d7467c7528163c4e2782d]
Input: xpad - validate USB endpoint type during probe
[122d6a347329818419b032c5a1776e6b3866d9b9]
Chad Dupuis (1):
[SCSI] qla2xxx: Add mutex around optrom calls to serialize accesses.
[7a8ab9c840b5dff9bb70328338a86444ed1c2415]
Christophe JAILLET (1):
driver core: bus: Fix a potential double free
[0f9b011d3321ca1079c7a46c18cb1956fbdb7bcb]
Colin Ian King (1):
media: em28xx: calculate left volume level correctly
[801e3659bf2c87c31b7024087d61e89e172b5651]
Dan Carpenter (2):
powerpc/44x: Fix mask and shift to zero bug
[8d046759f6ad75824fdf7b9c9a3da0272ea9ea92]
scsi: qla2xxx: Fix an integer overflow in sysfs code
[e6f77540c067b48dee10f1e33678415bfcc89017]
Dmitry Fleytman (1):
usb: Add device quirk for Logitech HD Pro Webcam C920-C
[a1279ef74eeeb5f627f091c71d80dd7ac766c99d]
Dmitry Torokhov (1):
Input: gtco - fix potential out-of-bound access
[a50829479f58416a013a4ccca791336af3c584c7]
Douglas Anderson (1):
USB: core: Avoid race of async_completed() w/ usbdev_release()
[ed62ca2f4f51c17841ea39d98c0c409cb53a3e10]
Edwin Török (1):
dlm: avoid double-free on error path in dlm_device_{register,unregister}
[55acdd926f6b21a5cdba23da98a48aedf19ac9c3]
Eric Dumazet (1):
ipv6: fix typo in fib6_net_exit()
[32a805baf0fb70b6dbedefcd7249ac7f580f9e3b]
Eric W. Biederman (1):
fcntl: Don't use ambiguous SIG_POLL si_codes
[d08477aa975e97f1dc64c0ae59cebf98520456ce]
Eryu Guan (1):
ext4: validate s_first_meta_bg at mount time
[3a4b77cd47bb837b8557595ec7425f281f2ca1fe]
Finn Thain (1):
scsi: mac_esp: Fix PIO transfers for MESSAGE IN phase
[7640d91d285893a5cf1e62b2cd00f0884c401d93]
Guenter Roeck (1):
media: uvcvideo: Prevent heap overflow when accessing mapped controls
[7e09f7d5c790278ab98e5f2c22307ebe8ad6e8ba]
Guillaume Nault (2):
l2tp: pass tunnel pointer to ->session_create()
[f026bc29a8e093edfbb2a77700454b285c97e8ad]
l2tp: prevent creation of sessions on terminated tunnels
[f3c66d4e144a0904ea9b95d23ed9f8eb38c11bfb]
Guillermo A. Amaral (1):
Input: xpad - add a few new VID/PID combinations
[540602a43ae5fa94064f8fae100f5ca75d4c002b]
Jan H . Schönherr (1):
KVM: SVM: Add a missing 'break' statement
[49a8afca386ee1775519a4aa80f8e121bd227dd4]
Joe Carnuccio (1):
[SCSI] qla2xxx: Corrections to returned sysfs error codes.
[71dfe9e776878d9583d004edade55edc2bdac5eb]
Johan Hovold (2):
USB: serial: console: fix use-after-free after failed setup
[299d7572e46f98534033a9e65973f13ad1ce9047]
[media] cx231xx-cards: fix NULL-deref on missing association descriptor
[6c3b047fa2d2286d5e438bcb470c7b1a49f415f6]
Johannes Berg (1):
mac80211: don't compare TKIP TX MIC key in reinstall prevention
[cfbb0d90a7abb289edc91833d0905931f8805f12]
Jonas Gorski (2):
MIPS: AR7: allow NULL clock for clk_get_rate
[585e0e9d02a690c29932b2fc0789835c7b91d448]
MIPS: BCM63XX: allow NULL clock for clk_get_rate
[1b495faec231980b6c719994b24044ccc04ae06c]
Kai-Heng Feng (2):
Input: i8042 - add Gigabyte P57 to the keyboard reset table
[697c5d8a36768b36729533fb44622b35d56d6ad0]
usb: quirks: add delay init quirk for Corsair Strafe RGB keyboard
[de3af5bf259d7a0bfaac70441c8568ab5998d80c]
Leon Romanovsky (1):
net/mlx4_core: Make explicit conversion to 64bit value
[187782eb58a89ea030731114c6ae37842a4472fe]
Mike Marciniszyn (1):
IB/{qib, hfi1}: Avoid flow control testing for RDMA write operation
[5b0ef650bd0f820e922fcc42f1985d4621ae19cf]
Nisar Sayed (1):
smsc95xx: Configure pause time to 0xffff when tx flow control enabled
[9c0827317f235865ae421293f8aecf6cb327a63e]
Noa Osherovich (1):
IB/core: Fix the validations of a multicast LID in attach or detach operations
[5236333592244557a19694a51337df6ac018f0a7]
Oleg Nesterov (1):
signal: move the "sig < SIGRTMIN" check into siginmask(sig)
[5c8ccefdf46c5f87d87b694c7fbc04941c2c99a5]
Paul Mackerras (1):
powerpc: Correct instruction code for xxlor instruction
[93b2d3cf3733b4060d3623161551f51ea1ab5499]
Rui Teng (1):
powerpc/mm: Fix check of multiple 16G pages from device tree
[23493c121912a39f0262e0dbeb236e1d39efa4d5]
Sabrina Dubroca (1):
ipv6: fix memory leak with multiple tables during netns destruction
[ba1cc08d9488c94cb8d94f545305688b72a2a300]
Sean Young (1):
media: lirc_zilog: driver only sends LIRCCODE
[89d8a2cc51d1f29ea24a0b44dde13253141190a0]
SeongJae Park (1):
mm/vmstat.c: fix wrong comment
[f113e64121ba9f4791332248b315d9f57ee33a6b]
Steffen Maier (6):
scsi: zfcp: fix capping of unsuccessful GPN_FT SAN response trace records
[975171b4461be296a35e83ebd748946b81cf0635]
scsi: zfcp: fix missing trace records for early returns in TMF eh handlers
[1a5d999ebfc7bfe28deb48931bb57faa8e4102b6]
scsi: zfcp: fix passing fsf_req to SCSI trace on TMF to correlate with HBA
[9fe5d2b2fd30aa8c7827ec62cbbe6d30df4fe3e3]
scsi: zfcp: fix payload with full FCP_RSP IU in SCSI trace records
[12c3e5754c8022a4f2fd1e9f00d19e99ee0d3cc1]
scsi: zfcp: fix queuecommand for scsi_eh commands when DIX enabled
[71b8e45da51a7b64a23378221c0a5868bd79da4f]
scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response
[fdb7cee3b9e3c561502e58137a837341f10cbf8b]
Steven Rostedt (1):
ftrace: Fix selftest goto location on error
[46320a6acc4fb58f04bcf78c4c942cc43b20f986]
Ted Mielczarek (1):
Input: xpad - add support for Xbox One controllers
[1a48ff81b3912be5fadae3fafde6c2f632246a4c]
Theodore Ts'o (1):
ext4: fix fencepost in s_first_meta_bg validation
[2ba3e6e8afc9b6188b471f27cf2b5e3cf34e7af2]
Thomas Gleixner (1):
genirq: Make sparse_irq_lock protect what it should protect
[12ac1d0f6c3e95732d144ffa65c8b20fbd9aa462]
Wanpeng Li (1):
KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously
[9a6e7c39810e4a8bc7fc95056cefb40583fe07ef]
Xiangliang.Yu (1):
drm/ttm: Fix accounting error when fail to get pages for pool
[9afae2719273fa1d406829bf3498f82dbdba71c7]
Xin Long (1):
sctp: do not peel off an assoc from one netns to another one
[df80cd9b28b9ebaa284a41df611dbf3a2d05ca74]
Makefile | 4 +-
arch/mips/ar7/clock.c | 3 +
arch/mips/bcm63xx/clk.c | 3 +
arch/powerpc/boot/4xx.c | 2 +-
arch/powerpc/include/asm/ppc-opcode.h | 2 +-
arch/powerpc/mm/hash_utils_64.c | 2 +-
arch/x86/include/asm/elf.h | 5 +-
arch/x86/kvm/svm.c | 1 +
arch/x86/kvm/x86.c | 34 ++++-
block/blk-core.c | 2 +-
drivers/ata/pata_amd.c | 1 +
drivers/ata/pata_cs5536.c | 1 +
drivers/base/bus.c | 2 +-
drivers/gpu/drm/ttm/ttm_page_alloc.c | 2 +-
drivers/infiniband/core/verbs.c | 44 +++++-
drivers/infiniband/hw/qib/qib_rc.c | 3 +-
drivers/input/joystick/xpad.c | 218 ++++++++++++++++++++++++----
drivers/input/serio/i8042-x86ia64io.h | 7 +
drivers/input/tablet/gtco.c | 17 ++-
drivers/media/rc/imon.c | 5 +
drivers/media/video/cx231xx/cx231xx-cards.c | 2 +-
drivers/media/video/em28xx/em28xx-audio.c | 2 +-
drivers/media/video/uvc/uvc_ctrl.c | 7 +
drivers/net/ethernet/mellanox/mlx4/fw.c | 2 +-
drivers/net/usb/cdc_ether.c | 5 +-
drivers/net/usb/smsc95xx.c | 11 +-
drivers/pci/hotplug/shpchp_hpc.c | 2 +
drivers/s390/scsi/zfcp_dbf.c | 31 +++-
drivers/s390/scsi/zfcp_dbf.h | 13 +-
drivers/s390/scsi/zfcp_fc.h | 6 +-
drivers/s390/scsi/zfcp_fsf.c | 7 +-
drivers/s390/scsi/zfcp_scsi.c | 16 +-
drivers/scsi/aacraid/aachba.c | 48 +++---
drivers/scsi/mac_esp.c | 35 ++---
drivers/scsi/qla2xxx/qla_attr.c | 71 ++++++---
drivers/scsi/qla2xxx/qla_bsg.c | 12 +-
drivers/scsi/qla2xxx/qla_def.h | 1 +
drivers/scsi/qla2xxx/qla_os.c | 1 +
drivers/staging/media/lirc/lirc_zilog.c | 8 +-
drivers/usb/core/devio.c | 4 +-
drivers/usb/core/quirks.c | 6 +-
drivers/usb/serial/console.c | 1 +
fs/dlm/user.c | 4 +
fs/ext4/super.c | 9 ++
fs/fcntl.c | 13 +-
fs/xfs/xfs_log.c | 7 -
include/asm-generic/siginfo.h | 4 +-
include/linux/pci_ids.h | 1 +
include/linux/signal.h | 24 +--
kernel/irq/irqdesc.c | 24 +--
kernel/trace/trace_selftest.c | 2 +-
mm/vmstat.c | 2 +-
net/ipv6/ip6_fib.c | 17 ++-
net/l2tp/l2tp_core.c | 38 +++--
net/l2tp/l2tp_core.h | 8 +-
net/l2tp/l2tp_eth.c | 11 +-
net/l2tp/l2tp_netlink.c | 8 +-
net/l2tp/l2tp_ppp.c | 19 +--
net/mac80211/key.c | 38 ++++-
net/sctp/socket.c | 5 +
60 files changed, 632 insertions(+), 251 deletions(-)
--
Ben Hutchings
Beware of programmers who carry screwdrivers. - Leonard Brandwein
This is a note to let you know that I've just added the patch titled
mm/pagewalk.c: report holes in hugetlb ranges
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: mm/pagewalk.c: report holes in hugetlb ranges
From: Jann Horn <jannh(a)google.com>
commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Changed for 4.4/4.9 stable backport:
- fix up conflict in the huge_pte_offset() call
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/pagewalk.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -142,8 +142,12 @@ static int walk_hugetlb_range(unsigned l
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
Patches currently in stable-queue which might be from jannh(a)google.com are
queue-4.9/mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
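For readers following the pagewalk change, here is a minimal user-space sketch of the control flow the patch introduces. It is only a model, not the kernel code: `walk_model`, `walk_range_model`, `mark_present`, `mark_hole`, `demo_walk`, and the 2 MiB `STEP` are hypothetical names and values chosen for illustration. The point it tries to show is that, without a hole callback, the result slot for an unmapped range is never written and keeps whatever bytes it had before.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSLOTS 3
#define STEP   0x200000UL	/* 2 MiB, an assumed huge page size */

struct walk_model;
typedef int (*walk_cb)(unsigned long addr, unsigned long next,
		       struct walk_model *w);

/* Hypothetical stand-in for the walker state; not the kernel's mm_walk. */
struct walk_model {
	walk_cb entry;			/* plays the role of ->hugetlb_entry */
	walk_cb hole;			/* plays the role of ->pte_hole, may be NULL */
	const unsigned char *present;	/* 1 = slot is mapped, 0 = hole */
	unsigned char *vec;		/* per-slot result, like mincore()'s vector */
};

static int mark_present(unsigned long addr, unsigned long next,
			struct walk_model *w)
{
	(void)next;
	w->vec[addr / STEP] = 1;
	return 0;
}

static int mark_hole(unsigned long addr, unsigned long next,
		     struct walk_model *w)
{
	(void)next;
	w->vec[addr / STEP] = 0;
	return 0;
}

/* Same loop shape as the patched walk_hugetlb_range(), on model data. */
static int walk_range_model(struct walk_model *w, unsigned long addr,
			    unsigned long end)
{
	unsigned long next;
	int err = 0;

	assert(w->entry != NULL);	/* the caller is assumed to have checked this */
	assert(addr < end);
	do {
		next = addr + STEP;
		if (w->present[addr / STEP])
			err = w->entry(addr, next, w);
		else if (w->hole)	/* tolerate a missing hole callback */
			err = w->hole(addr, next, w);
		if (err)
			break;
	} while (addr = next, addr != end);
	return err;
}

/* Walk three slots (mapped, hole, mapped) over a vector pre-filled with
 * 0xAA, which stands in for uninitialized memory. */
static int demo_walk(int with_hole_cb, unsigned char vec[NSLOTS])
{
	static const unsigned char present[NSLOTS] = { 1, 0, 1 };
	struct walk_model w = {
		mark_present, with_hole_cb ? mark_hole : NULL, present, vec
	};

	memset(vec, 0xAA, NSLOTS);
	return walk_range_model(&w, 0, NSLOTS * STEP);
}
```

Calling `demo_walk(1, vec)` writes every slot, while `demo_walk(0, vec)` leaves the middle slot at its stale fill value, which is the model's analogue of the uninitialized bytes mentioned in the commit message.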
commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Changed for 4.4/4.9 stable backport:
- fix up conflict in the huge_pte_offset() call
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
---
Please apply this patch to <=4.9 stable trees instead of the
original patch.
mm/pagewalk.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 29f2f8b853ae..c2cbd2620169 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -142,8 +142,12 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
--
2.15.0.448.gf294e3d99a-goog
This is a note to let you know that I've just added the patch titled
mm/pagewalk.c: report holes in hugetlb ranges
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: mm/pagewalk.c: report holes in hugetlb ranges
From: Jann Horn <jannh(a)google.com>
commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Changed for 4.4/4.9 stable backport:
- fix up conflict in the huge_pte_offset() call
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/pagewalk.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -142,8 +142,12 @@ static int walk_hugetlb_range(unsigned l
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
Patches currently in stable-queue which might be from jannh(a)google.com are
queue-4.4/mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
On Wed 22-11-17 09:43:46, Zi Yan wrote:
>
>
> Michal Hocko wrote:
> > On Wed 22-11-17 09:54:16, Michal Hocko wrote:
> >> On Mon 20-11-17 21:18:55, Zi Yan wrote:
> > [...]
> >>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >>> index 895ec0c4942e..a2246cf670ba 100644
> >>> --- a/include/linux/migrate.h
> >>> +++ b/include/linux/migrate.h
> >>> @@ -54,7 +54,7 @@ static inline struct page *new_page_nodemask(struct page *page,
> >>> new_page = __alloc_pages_nodemask(gfp_mask, order,
> >>> preferred_nid, nodemask);
> >>>
> >>> - if (new_page && PageTransHuge(page))
> >>> + if (new_page && PageTransHuge(new_page))
> >>> prep_transhuge_page(new_page);
> >> I would keep the two checks consistent. But that leads to a more
> >> interesting question. new_page_nodemask does
> >>
> >> if (thp_migration_supported() && PageTransHuge(page)) {
> >> order = HPAGE_PMD_ORDER;
> >> gfp_mask |= GFP_TRANSHUGE;
> >> }
> >
> > And one more question/note. Why do we need thp_migration_supported
> > in the first place? 9c670ea37947 ("mm: thp: introduce
> > CONFIG_ARCH_ENABLE_THP_MIGRATION") says
> > : Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
> > : functionality to x86_64, which should be safer at the first step.
> >
> > but why is unsafe to enable the feature on other arches which support
> > THP? Is there any plan to do the next step and remove this config
> > option?
>
> Because different architectures have their own way of specifying a swap
> entry. This means, to support THP migration, each architecture needs to
> add its own __pmd_to_swp_entry() and __swp_entry_to_pmd(), which are
> used for arch-independent pmd_to_swp_entry() and swp_entry_to_pmd().
I understand that part. But this smells like a matter of coding, no?
I was surprised to see the note about safety, which didn't make much sense
to me.
--
Michal Hocko
SUSE Labs
This is a note to let you know that I've just added the patch titled
ipmi: Prefer ACPI system interfaces over SMBIOS ones
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-prefer-acpi-system-interfaces-over-smbios-ones.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 7e030d6dff713250c7dcfb543cad2addaf479b0e Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Fri, 8 Sep 2017 14:05:58 -0500
Subject: ipmi: Prefer ACPI system interfaces over SMBIOS ones
From: Corey Minyard <cminyard(a)mvista.com>
commit 7e030d6dff713250c7dcfb543cad2addaf479b0e upstream.
The recent changes to add SMBIOS (DMI) IPMI interfaces as platform
devices caused DMI to be selected before ACPI, causing ACPI type
of operations to not work.
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_si_intf.c | 33 +++++++++++++++++++++++----------
1 file changed, 23 insertions(+), 10 deletions(-)
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -3424,7 +3424,7 @@ static inline void wait_for_timer_and_th
del_timer_sync(&smi_info->si_timer);
}
-static int is_new_interface(struct smi_info *info)
+static struct smi_info *find_dup_si(struct smi_info *info)
{
struct smi_info *e;
@@ -3439,24 +3439,36 @@ static int is_new_interface(struct smi_i
*/
if (info->slave_addr && !e->slave_addr)
e->slave_addr = info->slave_addr;
- return 0;
+ return e;
}
}
- return 1;
+ return NULL;
}
static int add_smi(struct smi_info *new_smi)
{
int rv = 0;
+ struct smi_info *dup;
mutex_lock(&smi_infos_lock);
- if (!is_new_interface(new_smi)) {
- pr_info(PFX "%s-specified %s state machine: duplicate\n",
- ipmi_addr_src_to_str(new_smi->addr_source),
- si_to_str[new_smi->si_type]);
- rv = -EBUSY;
- goto out_err;
+ dup = find_dup_si(new_smi);
+ if (dup) {
+ if (new_smi->addr_source == SI_ACPI &&
+ dup->addr_source == SI_SMBIOS) {
+ /* We prefer ACPI over SMBIOS. */
+ dev_info(dup->dev,
+ "Removing SMBIOS-specified %s state machine in favor of ACPI\n",
+ si_to_str[new_smi->si_type]);
+ cleanup_one_si(dup);
+ } else {
+ dev_info(new_smi->dev,
+ "%s-specified %s state machine: duplicate\n",
+ ipmi_addr_src_to_str(new_smi->addr_source),
+ si_to_str[new_smi->si_type]);
+ rv = -EBUSY;
+ goto out_err;
+ }
}
pr_info(PFX "Adding %s-specified %s state machine\n",
@@ -3865,7 +3877,8 @@ static void cleanup_one_si(struct smi_in
poll(to_clean);
schedule_timeout_uninterruptible(1);
}
- disable_si_irq(to_clean, false);
+ if (to_clean->handlers)
+ disable_si_irq(to_clean, false);
while (to_clean->curr_msg || (to_clean->si_state != SI_NORMAL)) {
poll(to_clean);
schedule_timeout_uninterruptible(1);
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-4.14/ipmi-fix-unsigned-long-underflow.patch
queue-4.14/ipmi-prefer-acpi-system-interfaces-over-smbios-ones.patch
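The duplicate-handling policy in the patch above can be sketched in plain C as follows. This is a simplified user-space model under stated assumptions: `si_model`, `si_list_model`, `find_dup_model`, `add_si_model`, and the result codes are hypothetical, duplicates are detected by address alone, and clearing `in_use` stands in for `cleanup_one_si()`.

```c
#include <assert.h>
#include <stddef.h>

enum src_model { SRC_SMBIOS, SRC_ACPI, SRC_OTHER };
enum add_result_model {
	ADD_NEW, ADD_REPLACED_DUP, ADD_REJECTED_DUP, ADD_NO_SPACE
};

/* Hypothetical, heavily reduced stand-in for struct smi_info. */
struct si_model {
	unsigned long addr;
	enum src_model src;
	int in_use;
};

#define MAX_SI 4
struct si_list_model {
	struct si_model slot[MAX_SI];
};

/* Return the registered interface with the same address, or NULL. */
static struct si_model *find_dup_model(struct si_list_model *l,
				       const struct si_model *info)
{
	size_t i;

	for (i = 0; i < MAX_SI; i++)
		if (l->slot[i].in_use && l->slot[i].addr == info->addr)
			return &l->slot[i];
	return NULL;
}

static enum add_result_model add_si_model(struct si_list_model *l,
					  const struct si_model *new_si)
{
	struct si_model *dup;
	size_t i;

	assert(l != NULL && new_si != NULL);
	dup = find_dup_model(l, new_si);
	if (dup) {
		if (new_si->src == SRC_ACPI && dup->src == SRC_SMBIOS)
			dup->in_use = 0;	/* prefer ACPI: drop the SMBIOS entry */
		else
			return ADD_REJECTED_DUP;	/* models the -EBUSY path */
	}
	for (i = 0; i < MAX_SI; i++) {
		if (!l->slot[i].in_use) {
			l->slot[i] = *new_si;
			l->slot[i].in_use = 1;
			return dup ? ADD_REPLACED_DUP : ADD_NEW;
		}
	}
	return ADD_NO_SPACE;
}
```

Returning the duplicate itself rather than a yes/no answer is what lets the caller compare the two address sources before deciding which entry survives.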
I'm requesting a backport of
7e030d6dff713250c7dcfb543cad2addaf479b0e ipmi: Prefer ACPI system
interfaces over SMBIOS ones
to the 4.14 stable kernel tree only. This was already staged for
Linus in my public tree before Andrew noticed an issue that this
patch fixes, where the system can oops when the ipmi_si module
is removed. Since it fixes an oops that likely can affect people, I'd
like it to be added to the stable tree.
Since this bug was added in 4.13 by:
0944d889a237b6107f9ceeee053fe7221cdd1089 ipmi: Convert DMI handling
over to a platform device
it should only require a backport to 4.14.
Thank you,
-corey
On 11/21/2017 09:14 PM, Jens Axboe wrote:
> On 11/21/2017 01:12 PM, Christian Borntraeger wrote:
>>
>>
>> On 11/21/2017 08:30 PM, Jens Axboe wrote:
>>> On 11/21/2017 12:15 PM, Christian Borntraeger wrote:
>>>>
>>>>
>>>> On 11/21/2017 07:39 PM, Jens Axboe wrote:
>>>>> On 11/21/2017 11:27 AM, Jens Axboe wrote:
>>>>>> On 11/21/2017 11:12 AM, Christian Borntraeger wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 11/21/2017 07:09 PM, Jens Axboe wrote:
>>>>>>>> On 11/21/2017 10:27 AM, Jens Axboe wrote:
>>>>>>>>> On 11/21/2017 03:14 AM, Christian Borntraeger wrote:
>>>>>>>>>> Bisect points to
>>>>>>>>>>
>>>>>>>>>> 1b5a7455d345b223d3a4658a9e5fce985b7998c1 is the first bad commit
>>>>>>>>>> commit 1b5a7455d345b223d3a4658a9e5fce985b7998c1
>>>>>>>>>> Author: Christoph Hellwig <hch(a)lst.de>
>>>>>>>>>> Date: Mon Jun 26 12:20:57 2017 +0200
>>>>>>>>>>
>>>>>>>>>> blk-mq: Create hctx for each present CPU
>>>>>>>>>>
>>>>>>>>>> commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream.
>>>>>>>>>>
>>>>>>>>>> Currently we only create hctx for online CPUs, which can lead to a lot
>>>>>>>>>> of churn due to frequent soft offline / online operations. Instead
>>>>>>>>>> allocate one for each present CPU to avoid this and dramatically simplify
>>>>>>>>>> the code.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Christoph Hellwig <hch(a)lst.de>
>>>>>>>>>> Reviewed-by: Jens Axboe <axboe(a)kernel.dk>
>>>>>>>>>> Cc: Keith Busch <keith.busch(a)intel.com>
>>>>>>>>>> Cc: linux-block(a)vger.kernel.org
>>>>>>>>>> Cc: linux-nvme(a)lists.infradead.org
>>>>>>>>>> Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
>>>>>>>>>> Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
>>>>>>>>>> Cc: Oleksandr Natalenko <oleksandr(a)natalenko.name>
>>>>>>>>>> Cc: Mike Galbraith <efault(a)gmx.de>
>>>>>>>>>> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
>>>>>>>>>
>>>>>>>>> I wonder if we're simply not getting the masks updated correctly. I'll
>>>>>>>>> take a look.
>>>>>>>>
>>>>>>>> Can't make it trigger here. We do init for each present CPU, which means
>>>>>>>> that if I offline a few CPUs here and register a queue, those still show
>>>>>>>> up as present (just offline) and get mapped accordingly.
>>>>>>>>
>>>>>>>> From the looks of it, your setup is different. If the CPU doesn't show
>>>>>>>> up as present and it gets hotplugged, then I can see how this condition
>>>>>>>> would trigger. What environment are you running this in? We might have
>>>>>>>> to re-introduce the cpu hotplug notifier, right now we just monitor
>>>>>>>> for a dead cpu and handle that.
>>>>>>>
>>>>>>> I am not doing a hot unplug and the replug, I use KVM and add a previously
>>>>>>> not available CPU.
>>>>>>>
>>>>>>> in libvirt/virsh speak:
>>>>>>> <vcpu placement='static' current='1'>4</vcpu>
>>>>>>
>>>>>> So that's why we run into problems. It's not present when we load the device,
>>>>>> but becomes present and online afterwards.
>>>>>>
>>>>>> Christoph, we used to handle this just fine, your patch broke it.
>>>>>>
>>>>>> I'll see if I can come up with an appropriate fix.
>>>>>
>>>>> Can you try the below?
>>>>
>>>>
>>>> It does prevent the crash but it seems that the new CPU is not "used " after the hotplug for mq:
>>>>
>>>>
>>>> output with 2 cpus:
>>>> /sys/kernel/debug/block/vda
>>>> /sys/kernel/debug/block/vda/hctx0
>>>> /sys/kernel/debug/block/vda/hctx0/cpu0
>>>> /sys/kernel/debug/block/vda/hctx0/cpu0/completed
>>>> /sys/kernel/debug/block/vda/hctx0/cpu0/merged
>>>> /sys/kernel/debug/block/vda/hctx0/cpu0/dispatched
>>>> /sys/kernel/debug/block/vda/hctx0/cpu0/rq_list
>>>> /sys/kernel/debug/block/vda/hctx0/active
>>>> /sys/kernel/debug/block/vda/hctx0/run
>>>> /sys/kernel/debug/block/vda/hctx0/queued
>>>> /sys/kernel/debug/block/vda/hctx0/dispatched
>>>> /sys/kernel/debug/block/vda/hctx0/io_poll
>>>> /sys/kernel/debug/block/vda/hctx0/sched_tags_bitmap
>>>> /sys/kernel/debug/block/vda/hctx0/sched_tags
>>>> /sys/kernel/debug/block/vda/hctx0/tags_bitmap
>>>> /sys/kernel/debug/block/vda/hctx0/tags
>>>> /sys/kernel/debug/block/vda/hctx0/ctx_map
>>>> /sys/kernel/debug/block/vda/hctx0/busy
>>>> /sys/kernel/debug/block/vda/hctx0/dispatch
>>>> /sys/kernel/debug/block/vda/hctx0/flags
>>>> /sys/kernel/debug/block/vda/hctx0/state
>>>> /sys/kernel/debug/block/vda/sched
>>>> /sys/kernel/debug/block/vda/sched/dispatch
>>>> /sys/kernel/debug/block/vda/sched/starved
>>>> /sys/kernel/debug/block/vda/sched/batching
>>>> /sys/kernel/debug/block/vda/sched/write_next_rq
>>>> /sys/kernel/debug/block/vda/sched/write_fifo_list
>>>> /sys/kernel/debug/block/vda/sched/read_next_rq
>>>> /sys/kernel/debug/block/vda/sched/read_fifo_list
>>>> /sys/kernel/debug/block/vda/write_hints
>>>> /sys/kernel/debug/block/vda/state
>>>> /sys/kernel/debug/block/vda/requeue_list
>>>> /sys/kernel/debug/block/vda/poll_stat
>>>
>>> Try this, basically just a revert.
>>
>> Yes, seems to work.
>>
>> Tested-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
>
> Great, thanks for testing.
>
>> Do you know why the original commit made it into 4.12 stable? After all
>> it has no Fixes tag and no cc stable-
>
> I was wondering the same thing when you said it was in 4.12.stable and
> not in 4.12 release. That patch should absolutely not have gone into
> stable, it's not marked as such and it's not fixing a problem that is
> stable worthy. In fact, it's causing a regression...
>
> Greg? Upstream commit is mentioned higher up, start of the email.
>
Forgot to cc Greg?
Hi all,
Since I'm going slightly off-topic, I've tweaked the subject line and
trimmed some of the conversation.
I believe everyone in the CC list might be interested in the
following, yet feel free to adjust.
Above all, I'd kindly ask everyone to skim through and draw their conclusions.
If the ideas put forward have some value - great, if not - let my email rot.
On 17 November 2017 at 13:57, Greg KH <gregkh(a)linuxfoundation.org> wrote:
>>
>> I still have no idea how this autoselect picks up patches that do *not*
>> have cc: stable nor Fixes: from us. What information do you have that we
>> don't for making that call?
>
> I'll let Sasha describe how he's doing this, but in the end, does it
> really matter _how_ it is done, vs. the fact that it seems to at least
> one human reviewer that this is a patch that _should_ be included based
> on the changelog text and the code patch?
>
> By having this review process that Sasha is providing, he's saying
> "Here's a patch that I think might be good for stable, do you object?"
>
> If you do, great, no harm done, all is fine, the patch is dropped. If
> you don't object, just ignore the email and the patch gets merged.
>
> If you don't want any of this to happen for your subsystem at all, then
> also fine, just let us know and we will ignore it entirely.
>
Let me start with saying that I'm handling the releases for Mesa 3D -
the project providing OpenGL, Vulkan and many other userspace graphics
drivers.
I've been doing that for 3 years now, which admittedly is quite a
short time relative to the kernel.
There is a procedure quite similar to the kernel, with a few
differences - see below for details.
We also autoselect patches, hence my interest in the heuristics
applied for nominating patches ;-)
That aside, here are some things I've learned from my experience.
Some of those may not be applicable - hope you'll find them useful:
- Try to refer developers to existing documentation/procedure.
Was just reminded that even long standing developers can forget detail X or Y.
- CC developers for the important stuff - aka do not CC on each accepted patch.
Accepted patches are merged in pre-release branch and an email with
accepted/deferred/rejected list is sent.
Patches that had conflicts merging, and ones that are rejected have
their author in the CC list.
Rejected patches have brief description + developers are contacted beforehand.
- Autoselect patches are merged only with the approval from the
respective developers.
IMHO this engages developers into the process, without distracting
them too much.
It is by no means a perfect system - input and changes are always appreciated.
That said, here are some suggestions which should make autosel smoother:
- Document the autoselect process
Information about What, Why, and [ideally] How - analogous to
the normal stable nominations.
Insert reference to the process in the patch notification email.
- Make the autoselect nominations _more_ distinct than the normal stable ones.
Maintainers will want to put more cognitive effort into the patches.
HTH
Emil
This is a note to let you know that I've just added the patch titled
mm/page_ext.c: check if page_ext is not prepared
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From e492080e640c2d1235ddf3441cae634cfffef7e1 Mon Sep 17 00:00:00 2001
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
Date: Wed, 15 Nov 2017 17:39:07 -0800
Subject: mm/page_ext.c: check if page_ext is not prepared
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
commit e492080e640c2d1235ddf3441cae634cfffef7e1 upstream.
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). Then section->page_ext remains as NULL.
lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner will try to get page_ext through
lookup_page_ext. Without CONFIG_DEBUG_VM lookup_page_ext will misuse
NULL pointer as value 0. This incurs an invalid address access.
This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext. section->page_ext is NULL,
get_entry returned invalid page_ext address as 0x1DFA000 for a PFN
0x13FC00.
To avoid this panic, CONFIG_DEBUG_VM should be removed so that page_ext
will be checked at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Joonsoo Kim <js1304(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/page_ext.c | 4 ----
1 file changed, 4 deletions(-)
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -106,7 +106,6 @@ struct page_ext *lookup_page_ext(struct
struct page_ext *base;
base = NODE_DATA(page_to_nid(page))->node_page_ext;
-#ifdef CONFIG_DEBUG_VM
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -115,7 +114,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (unlikely(!base))
return NULL;
-#endif
offset = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
return base + offset;
@@ -180,7 +178,6 @@ struct page_ext *lookup_page_ext(struct
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
-#ifdef CONFIG_DEBUG_VM
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -189,7 +186,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (!section->page_ext)
return NULL;
-#endif
return section->page_ext + pfn;
}
Patches currently in stable-queue which might be from jaewon31.kim(a)samsung.com are
queue-4.4/mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
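The effect of making the NULL check unconditional can be illustrated with a small user-space sketch. The names here (`page_ext_model`, `section_model`, `lookup_model`, `set_owner_model`) are hypothetical, and the model indexes the array directly instead of using the kernel's PFN-based offset; it only shows the lookup returning NULL for an unprepared section and the caller skipping the write.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced stand-ins for struct page_ext / struct mem_section. */
struct page_ext_model {
	unsigned long flags;
};

struct section_model {
	struct page_ext_model *page_ext;	/* NULL if never allocated */
};

/* Always check for a missing array, not only in debug builds. */
static struct page_ext_model *lookup_model(const struct section_model *s,
					   unsigned long idx)
{
	assert(s != NULL);
	if (!s->page_ext)
		return NULL;
	return s->page_ext + idx;
}

/* Caller in the style of __set_page_owner(): returns 1 if the owner
 * flag was recorded, 0 if the extension data was not available. */
static int set_owner_model(const struct section_model *s, unsigned long idx)
{
	struct page_ext_model *ext = lookup_model(s, idx);

	if (!ext)
		return 0;
	ext->flags |= 1UL;
	return 1;
}
```

Without the check in `lookup_model`, the unprepared case would compute an address from a NULL base and the write in `set_owner_model` would touch an invalid address, which matches the oops shown in the commit message.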
From: Zi Yan <zi.yan(a)cs.rutgers.edu>
In [1], Andrea reported that during memory hotplug/hot remove
prep_transhuge_page() is called incorrectly on non-THP pages for
migration, when THP is on but THP migration is not enabled.
This leads to a bad state of target pages for migration.
This patch fixes it by only calling prep_transhuge_page() when we are
certain that the target page is THP.
[1] https://lkml.org/lkml/2017/11/20/411
Cc: stable(a)vger.kernel.org # v4.14
Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration")
Reported-by: Andrea Reale <ar(a)linux.vnet.ibm.com>
Signed-off-by: Zi Yan <zi.yan(a)cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
---
include/linux/migrate.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 895ec0c4942e..a2246cf670ba 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -54,7 +54,7 @@ static inline struct page *new_page_nodemask(struct page *page,
 	new_page = __alloc_pages_nodemask(gfp_mask, order,
 				preferred_nid, nodemask);

-	if (new_page && PageTransHuge(page))
+	if (new_page && PageTransHuge(new_page))
 		prep_transhuge_page(new_page);

 	return new_page;
--
2.14.2
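A rough user-space model of why the check has to look at the newly allocated page rather than the source page is sketched below. Everything in it is an assumption made for illustration: `page_model`, `new_page_model`, `corrupts_neighbours`, and the order value 9 (the usual PMD order on x86_64 with 4 KiB pages) are not taken from the patch, and setting `prepped` merely stands in for `prep_transhuge_page()`.

```c
#include <assert.h>
#include <stddef.h>

#define HPAGE_ORDER_MODEL 9	/* assumed huge page order */

/* Hypothetical reduced page descriptor. */
struct page_model {
	unsigned int order;
	int prepped;		/* 1 if THP preparation was applied */
};

static int is_trans_huge(const struct page_model *p)
{
	return p->order == HPAGE_ORDER_MODEL;
}

/* Models new_page_nodemask(): the target is huge only when THP migration
 * is supported; check_new_page selects the patched behaviour. */
static void new_page_model(const struct page_model *src,
			   int thp_migration_supported,
			   int check_new_page, struct page_model *dst)
{
	assert(src != NULL && dst != NULL);
	dst->order = 0;
	if (thp_migration_supported && is_trans_huge(src))
		dst->order = HPAGE_ORDER_MODEL;
	dst->prepped = 0;
	if (is_trans_huge(check_new_page ? dst : src))
		dst->prepped = 1;
}

/* THP preparation on an order-0 page would write into page + 1 and
 * page + 2, which do not belong to it. */
static int corrupts_neighbours(const struct page_model *p)
{
	return p->prepped && !is_trans_huge(p);
}
```

With THP migration unsupported and a huge source page, the unpatched check (`check_new_page == 0`) marks an order-0 target as prepared, while the patched check leaves it alone.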
On Wed 22-11-17 07:29:38, Zi Yan wrote:
> On 22 Nov 2017, at 7:13, Zi Yan wrote:
>
> > On 22 Nov 2017, at 5:14, Michal Hocko wrote:
> >
> >> On Wed 22-11-17 10:35:10, Michal Hocko wrote:
> >> [...]
> >>> Moreover I am not really sure this is really working properly. Just look
> >>> at the split_huge_page. It moves all the tail pages to the LRU list
> >>> while migrate_pages has a list of pages to migrate. So we will migrate
> >>> the head page and all the rest will get back to the LRU list. What
> >>> guarantees that they will get migrated as well.
> >>
> >> OK, so this is as I've expected. It doesn't work! Some pfn walker based
> >> migration will just skip tail pages see madvise_inject_error.
> >> __alloc_contig_migrate_range will simply fail on THP page see
> >> isolate_migratepages_block so we even do not try to migrate it.
> >> do_move_page_to_node_array will simply migrate head and do not care
> >> about tail pages. do_mbind splits the page and then fall back to pte
> >> walk when thp migration is not supported but it doesn't handle tail
> >> pages if the THP migration path is not able to allocate a fresh THP
> >> AFAICS. Memory hotplug should be safe because it doesn't skip the whole
> >> THP when doing pfn walk.
> >>
> >> Unless I am missing something here this looks like a huge mess to me.
> >
> > +Kirill
> >
> > First, I agree with you that splitting a THP and only migrating its head page
> > is a mess. But what you describe is also the behavior of migrate_page()
> > _before_ THP migration support is added. I thought that was intended.
> >
> > Look at http://elixir.free-electrons.com/linux/v4.13.15/source/mm/migrate.c#L1091,
> > unmap_and_move() splits THPs and only migrates the head page in v4.13 before THP
> > migration is added. I think the behavior was introduced since v4.5 (I just skimmed
> > v4.0 to v4.13 code and did not have time to use git blame), before that THPs are
> > not migrated but shown as successfully migrated (at least from v4.4’s code).
>
> Sorry, I misread v4.4’s code, it also does ‘splitting a THP and migrating its head page’.
> This behavior was there for a long time, at least since v3.0.
>
> The code in unmap_and_move() is:
>
> if (unlikely(PageTransHuge(page)))
> if (unlikely(split_huge_page(page)))
> goto out;
I _think_ that this all should be handled at the migrate_pages layer. Try to
migrate the THP and fall back to split_huge_page into the list when it
fails. I haven't checked whether there is something which would prevent
that though. THP tricks in specific paths then should be removed.
--
Michal Hocko
SUSE Labs
I made a mistake while converting hugetlb code to 5-level paging:
in huge_pte_alloc() we have to use p4d_alloc(), not p4d_offset().
Otherwise it leads to a crash -- NULL-pointer dereference in pud_alloc()
if p4d table is not yet allocated.
It can only happen in 5-level paging mode. In 4-level paging mode
p4d_offset() always returns pgd, so we are fine.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Fixes: c2febafc6773 ("mm: convert generic code to 5-level paging")
Cc: <stable(a)vger.kernel.org> # v4.11+
---
mm/hugetlb.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2d2ff5e8bf2b..94a4c0b63580 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4617,7 +4617,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pte_t *pte = NULL;

 	pgd = pgd_offset(mm, addr);
-	p4d = p4d_offset(pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return NULL;
 	pud = pud_alloc(mm, p4d, addr);
 	if (pud) {
 		if (sz == PUD_SIZE) {
--
2.15.0
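The difference between an offset-style lookup and an alloc-style lookup can be sketched with a two-level table in user space. This is a deliberately simplified model: `upper_model`, `lower_model`, `lower_offset_model`, `lower_alloc_model`, and `entry_alloc_model` are hypothetical, and an empty slot is represented as a plain NULL pointer, which is not how the kernel encodes an unpopulated pgd entry.

```c
#include <assert.h>
#include <stdlib.h>

#define ENTRIES 4

/* Hypothetical two-level table: an upper level pointing at lower tables. */
struct lower_model {
	int entry[ENTRIES];
};

struct upper_model {
	struct lower_model *slot[ENTRIES];	/* NULL = not yet allocated */
};

/* Offset-style lookup: returns whatever is in the slot, possibly NULL. */
static struct lower_model *lower_offset_model(struct upper_model *u,
					      unsigned int idx)
{
	assert(idx < ENTRIES);
	return u->slot[idx];
}

/* Alloc-style lookup: populates the slot on demand and returns NULL
 * only when the allocation itself fails. */
static struct lower_model *lower_alloc_model(struct upper_model *u,
					     unsigned int idx)
{
	assert(idx < ENTRIES);
	if (!u->slot[idx])
		u->slot[idx] = calloc(1, sizeof(struct lower_model));
	return u->slot[idx];
}

/* An allocating walk must use the alloc-style helper at every level
 * and bail out on failure, as the patched huge_pte_alloc() does. */
static int *entry_alloc_model(struct upper_model *u, unsigned int hi,
			      unsigned int lo)
{
	struct lower_model *l = lower_alloc_model(u, hi);

	if (!l)
		return NULL;
	assert(lo < ENTRIES);
	return &l->entry[lo];
}
```

Using `lower_offset_model` inside `entry_alloc_model` would hand an unpopulated slot to the next step, which is the model's analogue of the crash described in the commit message.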
Hi,
Is there a way to get notifications about patches queued up for the
-rc1 commits in linux-stable-rc.git? I currently use the push that is
associated with email to this list from Greg. But that only gives 48h
window to do the building and testing. If there is a way to get the
patches earlier, that would give more time for testing activities. Any
hints are much appreciated.
Best Regards,
Milosz
On Wed 22-11-17 04:18:35, Zi Yan wrote:
> On 22 Nov 2017, at 3:54, Michal Hocko wrote:
[...]
> > I would keep the two checks consistent. But that leads to a more
> > interesting question. new_page_nodemask does
> >
> > if (thp_migration_supported() && PageTransHuge(page)) {
> > order = HPAGE_PMD_ORDER;
> > gfp_mask |= GFP_TRANSHUGE;
> > }
> >
> > How come it is safe to allocate an order-0 page if
> > !thp_migration_supported() when we are about to migrate THP? This
> > doesn't make any sense to me. Are we working around this somewhere else?
> > Why shouldn't we simply return NULL here?
>
> If !thp_migration_supported(), we will first split a THP and migrate
> its head page. This process is done in unmap_and_move() after
> get_new_page() (the function pointer to this new_page_nodemask()) is
> called. The situation can be PageTransHuge(page) is true here, but the
> page is split in unmap_and_move(), so we want to return a order-0 page
> here.
This deserves a big fat comment in the code because this is not clear
from the code!
> I think the confusion comes from that there is no guarantee of THP
> allocation when we are doing THP migration. If we can allocate a THP
> during THP migration, we are good. Otherwise, we want to fallback to
> the old way, splitting the original THP and migrating the head page,
> to preserve the original code behavior.
I understand that but that should be done explicitly rather than relying
on two functions doing the right thing because this is just too fragile.
Moreover I am not really sure this is really working properly. Just look
at the split_huge_page. It moves all the tail pages to the LRU list
while migrate_pages has a list of pages to migrate. So we will migrate
the head page and all the rest will get back to the LRU list. What
guarantees that they will get migrated as well?
This all looks like a mess!
--
Michal Hocko
SUSE Labs
On Tue 21 Nov 2017, 17:35, Zi Yan wrote:
> On 21 Nov 2017, at 17:12, Andrew Morton wrote:
>
> > On Mon, 20 Nov 2017 21:18:55 -0500 Zi Yan <zi.yan(a)sent.com> wrote:
> >
> >> This patch fixes it by only calling prep_transhuge_page() when we are
> >> certain that the target page is THP.
> >
> > What are the user-visible effects of the bug?
>
> By inspecting the code, if called on a non-THP, prep_transhuge_page() will
> 1) change the value of the mapping of (page + 2), since it is used for THP deferred list;
> 2) change the lru value of (page + 1), since it is used for THP’s dtor.
>
> Both can lead to data corruption of these two pages.
Pragmatically and from the point of view of the memory_hotplug subsys,
the effect is a kernel crash when pages are being migrated during a memory
hot remove offline and migration target pages are found in a bad state.
Best,
Andrea
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda a small request with
just the file identifier was allocated, but the declared length was set
to the size of union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.9/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
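The size mismatch described in the coda fsync patch above can be modelled in
userspace. This is only a rough sketch: every type and function name below is
a hypothetical stand-in, not the real Coda definition, and the struct sizes are
invented for illustration. It shows why declaring sizeof(union inputArgs) for a
request that was allocated with the smaller insize makes the copy run past the
allocation, which is the kind of condition the usercopy check reports.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Coda upcall types (sizes are made up). */
struct hdr_model { unsigned int opcode, unique; };
struct fid_model { unsigned int opaque[4]; };
struct fsync_req_model { struct hdr_model ih; struct fid_model fid; };
struct big_req_model { struct hdr_model ih; char payload[256]; };

/* Models "union of all possible upcall requests". */
union input_args_model {
	struct hdr_model ih;
	struct fsync_req_model fsync;
	struct big_req_model big;
};

/* True when copying declared_len bytes would read past the allocation. */
bool copy_overreads(size_t alloc_len, size_t declared_len)
{
	return declared_len > alloc_len;
}

/* Before the fix: small allocation, length declared as the whole union. */
bool fsync_before_fix(void)
{
	size_t insize = sizeof(struct fsync_req_model);

	return copy_overreads(insize, sizeof(union input_args_model));
}

/* After the fix: the declared length is the allocated length. */
bool fsync_after_fix(void)
{
	size_t insize = sizeof(struct fsync_req_model);

	return copy_overreads(insize, insize);
}
```

In this model fsync_before_fix() reports an over-read and fsync_after_fix()
does not, which mirrors the one-line change of passing insize to coda_upcall().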
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux 4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request, at which point it is killed. As a result, there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.4/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux 4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request, at which point it is killed. As a result, there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -447,8 +447,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.14/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux 4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request, at which point it is killed. As a result, there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.13/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux 4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request, at which point it is killed. As a result, there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-3.18/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
The patch below does not apply to the 4.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e9a6effa500526e2a19d5ad042cb758b55b1ef93 Mon Sep 17 00:00:00 2001
From: Huang Ying <huang.ying.caritas(a)gmail.com>
Date: Wed, 15 Nov 2017 17:33:15 -0800
Subject: [PATCH] mm, swap: fix false error message in __swp_swapcount()
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA-based swap readahead) may read ahead several swap entries
after the faulting swap entry. The readahead algorithm calculates some
of the swap entries to read ahead by increasing the offset of the
faulting swap entry without checking whether they are beyond the end of
the swap device, and it relies on __swp_swapcount() and
swapcache_prepare() to check this. Although __swp_swapcount() checks
the swap entry passed in, it complains with the following error message
for the expected invalid swap entry, which may confuse end users:
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, swap entry checking is added in
swapin_readahead() to avoid passing out-of-bound swap entries and the
swap entry reserved for the swap header to __swp_swapcount() and
swapcache_prepare().
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Christian Kujau <lists(a)nerdbynature.de>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.11+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 326439428daf..f2face8b889e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -559,6 +559,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
+ struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
bool do_poll = true, page_allocated;
@@ -572,6 +573,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
+ if (end_offset >= si->max)
+ end_offset = si->max - 1;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
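The window clamping added by the swap readahead patch above can be sketched as
a standalone function. This is a simplified userspace model, not the kernel
code: struct ra_window and swap_ra_window() are hypothetical names, max stands
in for si->max, and the start of the window is assumed to be offset & ~mask
(that line is not part of the hunk shown above).

```c
#include <assert.h>

/* Hypothetical helper type: inclusive range of swap offsets to read ahead. */
struct ra_window {
	unsigned long start;
	unsigned long end;
};

/*
 * Compute the readahead window around 'offset'.
 * 'mask' is the readahead window size minus one (a power of two minus one),
 * 'max' models si->max, the number of slots in the swap device.
 */
struct ra_window swap_ra_window(unsigned long offset, unsigned long mask,
				unsigned long max)
{
	struct ra_window w;

	assert(max > 0);		/* a swap device has at least the header slot */

	w.start = offset & ~mask;	/* assumption: aligned window start */
	w.end = offset | mask;
	if (!w.start)			/* first page is the swap header */
		w.start++;
	if (w.end >= max)		/* the clamp added by the patch */
		w.end = max - 1;
	return w;
}
```

With the clamp in place, offsets at or beyond max are never handed on for
checking, so the "Bad swap offset entry" message is no longer triggered by
readahead itself.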
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e9a6effa500526e2a19d5ad042cb758b55b1ef93 Mon Sep 17 00:00:00 2001
From: Huang Ying <huang.ying.caritas(a)gmail.com>
Date: Wed, 15 Nov 2017 17:33:15 -0800
Subject: [PATCH] mm, swap: fix false error message in __swp_swapcount()
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA-based swap readahead) may read ahead several swap entries
after the faulting swap entry. The readahead algorithm calculates some
of the swap entries to read ahead by increasing the offset of the
faulting swap entry without checking whether they are beyond the end of
the swap device, and it relies on __swp_swapcount() and
swapcache_prepare() to check this. Although __swp_swapcount() checks
the swap entry passed in, it complains with the following error message
for the expected invalid swap entry, which may confuse end users:
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, swap entry checking is added in
swapin_readahead() to avoid passing out-of-bound swap entries and the
swap entry reserved for the swap header to __swp_swapcount() and
swapcache_prepare().
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Christian Kujau <lists(a)nerdbynature.de>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.11+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 326439428daf..f2face8b889e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -559,6 +559,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
+ struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
bool do_poll = true, page_allocated;
@@ -572,6 +573,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
+ if (end_offset >= si->max)
+ end_offset = si->max - 1;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
This is a note to let you know that I've just added the patch titled
mm, swap: fix false error message in __swp_swapcount()
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-swap-fix-false-error-message-in-__swp_swapcount.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From e9a6effa500526e2a19d5ad042cb758b55b1ef93 Mon Sep 17 00:00:00 2001
From: Huang Ying <huang.ying.caritas(a)gmail.com>
Date: Wed, 15 Nov 2017 17:33:15 -0800
Subject: mm, swap: fix false error message in __swp_swapcount()
From: Huang Ying <huang.ying.caritas(a)gmail.com>
commit e9a6effa500526e2a19d5ad042cb758b55b1ef93 upstream.
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA-based swap readahead) may read ahead several swap entries
after the faulting swap entry. The readahead algorithm calculates some
of the swap entries to read ahead by increasing the offset of the
faulting swap entry without checking whether they are beyond the end of
the swap device, and it relies on __swp_swapcount() and
swapcache_prepare() to check this. Although __swp_swapcount() checks
the swap entry passed in, it complains with the following error message
for the expected invalid swap entry, which may confuse end users:
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, swap entry checking is added in
swapin_readahead() to avoid passing out-of-bound swap entries and the
swap entry reserved for the swap header to __swp_swapcount() and
swapcache_prepare().
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Christian Kujau <lists(a)nerdbynature.de>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Hugh Dickins <hughd(a)google.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/swap_state.c | 3 +++
1 file changed, 3 insertions(+)
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -506,6 +506,7 @@ struct page *swapin_readahead(swp_entry_
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
+ struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
bool do_poll = true;
@@ -519,6 +520,8 @@ struct page *swapin_readahead(swp_entry_
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
+ if (end_offset >= si->max)
+ end_offset = si->max - 1;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
Patches currently in stable-queue which might be from huang.ying.caritas(a)gmail.com are
queue-4.13/mm-swap-fix-false-error-message-in-__swp_swapcount.patch
This is a note to let you know that I've just added the patch titled
x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-cpu-amd-derive-l3-shared_cpu_map-from-cpu_llc_shared_mask.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 2b83809a5e6d619a780876fcaf68cdc42b50d28c Mon Sep 17 00:00:00 2001
From: Suravee Suthikulpanit <suravee.suthikulpanit(a)amd.com>
Date: Mon, 31 Jul 2017 10:51:59 +0200
Subject: x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask
From: Suravee Suthikulpanit <suravee.suthikulpanit(a)amd.com>
commit 2b83809a5e6d619a780876fcaf68cdc42b50d28c upstream.
For systems with X86_FEATURE_TOPOEXT, current logic uses the APIC ID
to calculate shared_cpu_map. However, APIC IDs are not guaranteed to
be contiguous for cores across different L3s (e.g. family17h system
w/ downcore configuration). This breaks the logic, and results in an
incorrect L3 shared_cpu_map.
Instead, always use the previously calculated cpu_llc_shared_mask of
each CPU to derive the L3 shared_cpu_map.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit(a)amd.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20170731085159.9455-3-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kernel/cpu/intel_cacheinfo.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -811,7 +811,24 @@ static int __cache_amd_cpumap_setup(unsi
struct cacheinfo *this_leaf;
int i, sibling;
- if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
+ /*
+ * For L3, always use the pre-calculated cpu_llc_shared_mask
+ * to derive shared_cpu_map.
+ */
+ if (index == 3) {
+ for_each_cpu(i, cpu_llc_shared_mask(cpu)) {
+ this_cpu_ci = get_cpu_cacheinfo(i);
+ if (!this_cpu_ci->info_list)
+ continue;
+ this_leaf = this_cpu_ci->info_list + index;
+ for_each_cpu(sibling, cpu_llc_shared_mask(cpu)) {
+ if (!cpu_online(sibling))
+ continue;
+ cpumask_set_cpu(sibling,
+ &this_leaf->shared_cpu_map);
+ }
+ }
+ } else if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
unsigned int apicid, nshared, first, last;
this_leaf = this_cpu_ci->info_list + index;
@@ -837,19 +854,6 @@ static int __cache_amd_cpumap_setup(unsi
continue;
cpumask_set_cpu(sibling,
&this_leaf->shared_cpu_map);
- }
- }
- } else if (index == 3) {
- for_each_cpu(i, cpu_llc_shared_mask(cpu)) {
- this_cpu_ci = get_cpu_cacheinfo(i);
- if (!this_cpu_ci->info_list)
- continue;
- this_leaf = this_cpu_ci->info_list + index;
- for_each_cpu(sibling, cpu_llc_shared_mask(cpu)) {
- if (!cpu_online(sibling))
- continue;
- cpumask_set_cpu(sibling,
- &this_leaf->shared_cpu_map);
}
}
} else
Patches currently in stable-queue which might be from suravee.suthikulpanit(a)amd.com are
queue-4.13/x86-cpu-amd-derive-l3-shared_cpu_map-from-cpu_llc_shared_mask.patch
Upstream commit ID: 2b83809a5e6d619a780876fcaf68cdc42b50d28c
Stable kernel version to apply: 4.13.x
Reason: This patch fixes the L3 topology for the AMD Zen-based
(family17h) processors with down-core configuration.
Thanks,
Suravee
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
We should wait for dio requests to finish before taking the inode lock
in ocfs2_setattr(); otherwise the following deadlock will happen:
process 1                   process 2                                     process 3
truncate file 'A'           end_io of writing file 'A'                    receiving the bast messages
ocfs2_setattr
ocfs2_inode_lock_tracker
ocfs2_inode_lock_full
inode_dio_wait
__inode_dio_wait
-->waiting for all dio
requests finish
                                                                          dlm_proxy_ast_handler
                                                                          dlm_do_local_bast
                                                                          ocfs2_blocking_ast
                                                                          ocfs2_generic_handle_bast
                                                                          set OCFS2_LOCK_BLOCKED flag
                            dio_end_io
                            dio_bio_end_aio
                            dio_complete
                            ocfs2_dio_end_io
                            ocfs2_dio_end_io_write
                            ocfs2_inode_lock
                            __ocfs2_cluster_lock
                            ocfs2_wait_for_mask
                            -->waiting for OCFS2_LOCK_BLOCKED
                            flag to be cleared, that is waiting
                            for 'process 1' unlocking the inode lock
                            inode_dio_end
                            -->here dec the i_dio_count, but will never
                            be called, so a deadlock happened.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1166,6 +1166,13 @@ int ocfs2_setattr(struct dentry *dentry,
}
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1186,8 +1193,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-4.9/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
ocfs2: fix cluster hang after a node dies
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-fix-cluster-hang-after-a-node-dies.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 1c01967116a678fed8e2c68a6ab82abc8effeddc Mon Sep 17 00:00:00 2001
From: Changwei Ge <ge.changwei(a)h3c.com>
Date: Wed, 15 Nov 2017 17:31:33 -0800
Subject: ocfs2: fix cluster hang after a node dies
From: Changwei Ge <ge.changwei(a)h3c.com>
commit 1c01967116a678fed8e2c68a6ab82abc8effeddc upstream.
When a node dies, the other live nodes have to choose a new master for
each existing lock resource mastered by the dead node.
In the ocfs2/dlm implementation, this is done by
dlm_move_lockres_to_recovery_list(), which marks those lock resources as
DLM_LOCK_RES_RECOVERING and manages them via a list from which DLM
changes each lock resource's master later.
So without invoking dlm_move_lockres_to_recovery_list(), no master will
be chosen after dlm recovery completes, since no lock resource can be
found through the ::resource list.
What's worse, if DLM_LOCK_RES_RECOVERING is not set on lock resources
mastered by a dead node, synchronization among nodes breaks.
So invoke dlm_move_lockres_to_recovery_list() again.
Fixes: ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-…
Signed-off-by: Changwei Ge <ge.changwei(a)h3c.com>
Reported-by: Vitaly Mayatskikh <v.mayatskih(a)gmail.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih(a)gmail.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Joseph Qi <jiangqi903(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/dlm/dlmrecovery.c | 1 +
1 file changed, 1 insertion(+)
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanu
dlm_lockres_put(res);
continue;
}
+ dlm_move_lockres_to_recovery_list(dlm, res);
} else if (res->owner == dlm->node_num) {
dlm_free_dead_locks(dlm, res, dead_node);
__dlm_lockres_calc_usage(dlm, res);
Patches currently in stable-queue which might be from ge.changwei(a)h3c.com are
queue-4.9/ocfs2-fix-cluster-hang-after-a-node-dies.patch
queue-4.9/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
ipmi: fix unsigned long underflow
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-fix-unsigned-long-underflow.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 392a17b10ec4320d3c0e96e2a23ebaad1123b989 Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Sat, 29 Jul 2017 21:14:55 -0500
Subject: ipmi: fix unsigned long underflow
From: Corey Minyard <cminyard(a)mvista.com>
commit 392a17b10ec4320d3c0e96e2a23ebaad1123b989 upstream.
When I set the timeout to a specific value such as 500ms, the timeout
event will not happen in time due to an unsigned underflow in
check_msg_timeout():
...
ent->timeout -= timeout_period;
if (ent->timeout > 0)
return;
...
The type of timeout_period is long, but ent->timeout is unsigned long,
so the subtraction wraps around to a huge value instead of going negative.
This patch makes the types consistent.
Reported-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Tested-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4029,7 +4029,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struc
}
static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
int slot, unsigned long *flags,
unsigned int *waiting_msgs)
{
@@ -4042,8 +4043,8 @@ static void check_msg_timeout(ipmi_smi_t
if (!ent->inuse)
return;
- ent->timeout -= timeout_period;
- if (ent->timeout > 0) {
+ if (timeout_period < ent->timeout) {
+ ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4109,7 +4110,8 @@ static void check_msg_timeout(ipmi_smi_t
}
}
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+ unsigned long timeout_period)
{
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-4.9/ipmi-fix-unsigned-long-underflow.patch
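The underflow fixed by the ipmi patch above is easy to reproduce in plain C.
The two functions below are a minimal sketch with hypothetical names
(still_waiting_old() and still_waiting_new()); they model only the
subtract-and-compare logic of check_msg_timeout(), not the rest of the driver.

```c
#include <stdbool.h>

/*
 * Pre-fix logic: subtract first, then test the result.
 * Because *timeout is unsigned, the result can never be negative; when
 * period exceeds *timeout it wraps to a huge positive value.
 */
bool still_waiting_old(unsigned long *timeout, long period)
{
	*timeout -= period;
	return *timeout > 0;
}

/*
 * Post-fix logic: compare first and subtract only when the message has
 * time left, so the unsigned value never wraps.
 */
bool still_waiting_new(unsigned long *timeout, unsigned long period)
{
	if (period < *timeout) {
		*timeout -= period;
		return true;
	}
	return false;
}
```

With a remaining timeout of 500 and a period of 600, the old logic reports the
message as still waiting (the timeout has wrapped), while the new logic reports
it as expired and leaves the stored value untouched.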
This is a note to let you know that I've just added the patch titled
mm/page_alloc.c: broken deferred calculation
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_alloc.c-broken-deferred-calculation.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d135e5750205a21a212a19dbb05aeb339e2cbea7 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Date: Wed, 15 Nov 2017 17:38:41 -0800
Subject: mm/page_alloc.c: broken deferred calculation
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.
In reset_deferred_meminit() we determine the number of pages that must
not be deferred. We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.
The reserved memory is determined by memblock_reserved_memory_within(),
which operates on physical addresses and returns a size in bytes.
However, reset_deferred_meminit() assumes that this function operates
on pfns and returns a page count.
The result is that in the best case the machine boots slower than
expected due to initializing more pages than needed in a single thread,
and in the worst case it panics because fewer pages than needed are
initialized early.
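The unit mismatch described above (pfns and page counts versus physical
addresses and bytes) can be illustrated with a small userspace model. This is
a sketch under stated assumptions: 4 KiB pages, a single reserved region, and
hypothetical helper names (pfn_to_phys(), reserved_bytes_within(),
nondeferred_pages()) that only stand in for the kernel's PFN_PHYS() and
memblock_reserved_memory_within().

```c
#include <stdint.h>

#define PAGE_SHIFT_MODEL 12	/* assumption: 4 KiB pages */

/* Convert a page frame number to a physical byte address. */
uint64_t pfn_to_phys(uint64_t pfn)
{
	return pfn << PAGE_SHIFT_MODEL;
}

/*
 * Model of a byte-based reservation query: how many bytes of the single
 * reserved region [res_start, res_end) fall inside [start, end).
 * All four arguments are physical addresses in bytes.
 */
uint64_t reserved_bytes_within(uint64_t res_start, uint64_t res_end,
			       uint64_t start, uint64_t end)
{
	uint64_t lo = res_start > start ? res_start : start;
	uint64_t hi = res_end < end ? res_end : end;

	return hi > lo ? hi - lo : 0;
}

/*
 * Number of pages to initialise early: the base page count plus the
 * reserved memory, with pfns converted to bytes before the query and
 * the byte result converted back to pages afterwards.
 */
uint64_t nondeferred_pages(uint64_t node_start_pfn, uint64_t base_pgcnt,
			   uint64_t res_start, uint64_t res_end)
{
	uint64_t start = pfn_to_phys(node_start_pfn);
	uint64_t end = pfn_to_phys(node_start_pfn + base_pgcnt);
	uint64_t reserved = reserved_bytes_within(res_start, res_end,
						  start, end);

	return base_pgcnt + (reserved >> PAGE_SHIFT_MODEL);
}
```

Skipping either conversion reproduces the bug class: passing pfns where byte
addresses are expected shrinks the queried range, and adding a byte count to a
page count inflates the result by a factor of the page size.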
Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 10 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -672,7 +672,8 @@ typedef struct pglist_data {
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
- unsigned long static_init_size;
+ /* Number of non-deferred pages */
+ unsigned long static_init_pgcnt;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -284,28 +284,37 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+
+/*
+ * Determine how many pages need to be initialized durig early boot
+ * (non-deferred initialization).
+ * The value of first_deferred_pfn will be set later, once non-deferred pages
+ * are initialized, but for now set it ULONG_MAX.
+ */
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
- unsigned long max_initialise;
- unsigned long reserved_lowmem;
+ phys_addr_t start_addr, end_addr;
+ unsigned long max_pgcnt;
+ unsigned long reserved;
/*
* Initialise at least 2G of a node but also take into account that
* two large system hashes that can take up 1GB for 0.25TB/node.
*/
- max_initialise = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
+ max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
+ (pgdat->node_spanned_pages >> 8));
/*
* Compensate the all the memblock reservations (e.g. crash kernel)
* from the initial estimation to make sure we will initialize enough
* memory to boot.
*/
- reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
- pgdat->node_start_pfn + max_initialise);
- max_initialise += reserved_lowmem;
+ start_addr = PFN_PHYS(pgdat->node_start_pfn);
+ end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
+ reserved = memblock_reserved_memory_within(start_addr, end_addr);
+ max_pgcnt += PHYS_PFN(reserved);
- pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
+ pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
pgdat->first_deferred_pfn = ULONG_MAX;
}
@@ -332,7 +341,7 @@ static inline bool update_defer_init(pg_
if (zone_end < pgdat_end_pfn(pgdat))
return true;
(*nr_initialised)++;
- if ((*nr_initialised > pgdat->static_init_size) &&
+ if ((*nr_initialised > pgdat->static_init_pgcnt) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
Patches currently in stable-queue which might be from pasha.tatashin(a)oracle.com are
queue-4.9/mm-page_alloc.c-broken-deferred-calculation.patch
This is a note to let you know that I've just added the patch titled
dmaengine: dmatest: warn user when dma test times out
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dmaengine-dmatest-warn-user-when-dma-test-times-out.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From a9df21e34b422f79d9a9fa5c3eff8c2a53491be6 Mon Sep 17 00:00:00 2001
From: Adam Wallis <awallis(a)codeaurora.org>
Date: Thu, 2 Nov 2017 08:53:30 -0400
Subject: dmaengine: dmatest: warn user when dma test times out
From: Adam Wallis <awallis(a)codeaurora.org>
commit a9df21e34b422f79d9a9fa5c3eff8c2a53491be6 upstream.
Commit adfa543e7314 ("dmatest: don't use set_freezable_with_signal()")
introduced a bug (that is in fact documented by the patch commit text)
that leaves behind a dangling pointer. Since the done_wait structure is
allocated on the stack, future invocations to the DMATEST can produce
undesirable results (e.g., corrupted spinlocks). Ideally, this would be
cleaned up in the thread handler, but at the very least, the kernel
is left in a very precarious scenario that can lead to some long debug
sessions when the crash comes later.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=197605
Signed-off-by: Adam Wallis <awallis(a)codeaurora.org>
Signed-off-by: Vinod Koul <vinod.koul(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/dma/dmatest.c | 1 +
1 file changed, 1 insertion(+)
--- a/drivers/dma/dmatest.c
+++ b/drivers/dma/dmatest.c
@@ -666,6 +666,7 @@ static int dmatest_func(void *data)
* free it this time?" dancing. For now, just
* leave it dangling.
*/
+ WARN(1, "dmatest: Kernel stack may be corrupted!!\n");
dmaengine_unmap_put(um);
result("test timed out", total_tests, src_off, dst_off,
len, 0);
Patches currently in stable-queue which might be from awallis(a)codeaurora.org are
queue-4.9/dmaengine-dmatest-warn-user-when-dma-test-times-out.patch
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
We should wait for dio requests to finish before taking the inode lock in
ocfs2_setattr(); otherwise the following deadlock will happen:
process 1                  process 2                    process 3
truncate file 'A'          end_io of writing file 'A'   receiving the bast messages
ocfs2_setattr
 ocfs2_inode_lock_tracker
  ocfs2_inode_lock_full
 inode_dio_wait
  __inode_dio_wait
  -->waiting for all dio
  requests finish
                                                        dlm_proxy_ast_handler
                                                         dlm_do_local_bast
                                                          ocfs2_blocking_ast
                                                           ocfs2_generic_handle_bast
                                                            set OCFS2_LOCK_BLOCKED flag
                           dio_end_io
                            dio_bio_end_aio
                             dio_complete
                              ocfs2_dio_end_io
                               ocfs2_dio_end_io_write
                                ocfs2_inode_lock
                                 __ocfs2_cluster_lock
                                  ocfs2_wait_for_mask
                                  -->waiting for OCFS2_LOCK_BLOCKED
                                  flag to be cleared, that is waiting
                                  for 'process 1' unlocking the inode lock
                           inode_dio_end
                           -->here dec the i_dio_count, but will never
                           be called, so a deadlock happened.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1166,6 +1166,13 @@ int ocfs2_setattr(struct dentry *dentry,
}
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1186,8 +1193,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-4.4/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This patch is a fix specific to the 3.19 - 4.4 kernels. The 4.5 kernel
inadvertently fixed this bug differently (db3cbfff5bcc0), but that commit
is not a stable candidate because it is a complicated rewrite of the
entire feature.
This patch fixes a potential timing bug with nvme's asynchronous queue
deletion, which causes an allocated request to be accidentally released
due to the ordering of the shared completion context among the sq/cq
pair. The completion context saves the request that issued the queue
deletion. If the submission side deletion happens to reset the active
request, the completion side will release the wrong request tag back into
the pool of available tags. This means the driver will create multiple
commands with the same tag, corrupting the queue context.
The error is observable in the kernel logs like:
"nvme XX:YY:ZZ completed id XX twice on qid:0"
In this particular case, this message occurs because the queue is
corrupted.
The following timing sequence demonstrates the error:
  CPU A                                 CPU B
  -----------------------               -----------------------------
  nvme_irq
   nvme_process_cq
    async_completion
     queue_kthread_work  ----------->   nvme_del_sq_work_handler
                                         nvme_delete_cq
                                          adapter_async_del_queue
                                           nvme_submit_admin_async_cmd
                                            cmdinfo->req = req;
     blk_mq_free_request(cmdinfo->req); <-- wrong request!!!
This patch fixes the bug by releasing the request in the completion side
prior to waking the submission thread, such that that thread can't muck
with the shared completion context.
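The ordering problem can be modelled deterministically in userspace. The sketch below is not driver code: struct cmd_ctx is a hypothetical stand-in for the shared async_cmd_info, free_request() stands in for blk_mq_free_request(), and the woken worker is run synchronously to force the worst-case interleaving shown in the timing sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the completion context shared by the sq/cq pair. */
struct cmd_ctx {
	int req_tag;	/* tag of the request that issued the current deletion */
};

static int last_freed_tag = -1;

/* Models blk_mq_free_request(cmdinfo->req): records which tag was released. */
static void free_request(const struct cmd_ctx *ctx)
{
	last_freed_tag = ctx->req_tag;
}

/*
 * Models the woken worker submitting the next deletion command: it reuses
 * the shared context and overwrites the saved request.
 */
static void worker_submits_next(struct cmd_ctx *ctx, int next_tag)
{
	ctx->req_tag = next_tag;
}

/* Pre-fix order: wake the worker first, then free whatever ctx now holds. */
static int complete_buggy(struct cmd_ctx *ctx, int next_tag)
{
	assert(ctx != NULL);
	worker_submits_next(ctx, next_tag);
	free_request(ctx);
	return last_freed_tag;
}

/* Post-fix order: free the completed request before waking the worker. */
static int complete_fixed(struct cmd_ctx *ctx, int next_tag)
{
	assert(ctx != NULL);
	free_request(ctx);
	worker_submits_next(ctx, next_tag);
	return last_freed_tag;
}
```

If the completed request has tag 7 and the worker submits a new command with tag 8, the pre-fix order releases tag 8 (still in flight), while the post-fix order releases tag 7 as intended.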
Fixes: a4aea5623d4a5 ("NVMe: Convert to blk-mq")
Cc: <stable(a)vger.kernel.org> # 4.4.x
Cc: <stable(a)vger.kernel.org> # 4.1.x
Signed-off-by: Keith Busch <keith.busch(a)intel.com>
---
drivers/nvme/host/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 669edbd47602..d6ceb8b91cd6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -350,8 +350,8 @@ static void async_completion(struct nvme_queue *nvmeq, void *ctx,
struct async_cmd_info *cmdinfo = ctx;
cmdinfo->result = le32_to_cpup(&cqe->result);
cmdinfo->status = le16_to_cpup(&cqe->status) >> 1;
- queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
blk_mq_free_request(cmdinfo->req);
+ queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
}
static inline struct nvme_cmd_info *get_cmd_from_tag(struct nvme_queue *nvmeq,
--
2.13.6
This is a note to let you know that I've just added the patch titled
mm/page_alloc.c: broken deferred calculation
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_alloc.c-broken-deferred-calculation.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d135e5750205a21a212a19dbb05aeb339e2cbea7 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Date: Wed, 15 Nov 2017 17:38:41 -0800
Subject: mm/page_alloc.c: broken deferred calculation
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.
In reset_deferred_meminit() we determine the number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.
The reserved memory is determined by memblock_reserved_memory_within(),
which operates on physical addresses and returns a size in bytes. However,
reset_deferred_meminit() assumes that this function operates on pfns and
returns a page count.
The result is that in the best case the machine boots slower than expected
due to initializing more pages than needed in a single thread, and in the
worst case it panics because fewer pages than needed are initialized early.
Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 10 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -688,7 +688,8 @@ typedef struct pglist_data {
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
- unsigned long static_init_size;
+ /* Number of non-deferred pages */
+ unsigned long static_init_pgcnt;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
} pg_data_t;
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -267,28 +267,37 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+
+/*
+ * Determine how many pages need to be initialized durig early boot
+ * (non-deferred initialization).
+ * The value of first_deferred_pfn will be set later, once non-deferred pages
+ * are initialized, but for now set it ULONG_MAX.
+ */
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
- unsigned long max_initialise;
- unsigned long reserved_lowmem;
+ phys_addr_t start_addr, end_addr;
+ unsigned long max_pgcnt;
+ unsigned long reserved;
/*
* Initialise at least 2G of a node but also take into account that
* two large system hashes that can take up 1GB for 0.25TB/node.
*/
- max_initialise = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
+ max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
+ (pgdat->node_spanned_pages >> 8));
/*
* Compensate the all the memblock reservations (e.g. crash kernel)
* from the initial estimation to make sure we will initialize enough
* memory to boot.
*/
- reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
- pgdat->node_start_pfn + max_initialise);
- max_initialise += reserved_lowmem;
+ start_addr = PFN_PHYS(pgdat->node_start_pfn);
+ end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
+ reserved = memblock_reserved_memory_within(start_addr, end_addr);
+ max_pgcnt += PHYS_PFN(reserved);
- pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
+ pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
pgdat->first_deferred_pfn = ULONG_MAX;
}
@@ -324,7 +333,7 @@ static inline bool update_defer_init(pg_
return true;
/* Initialise at least 2G of the highest zone */
(*nr_initialised)++;
- if ((*nr_initialised > pgdat->static_init_size) &&
+ if ((*nr_initialised > pgdat->static_init_pgcnt) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
Patches currently in stable-queue which might be from pasha.tatashin(a)oracle.com are
queue-4.4/mm-page_alloc.c-broken-deferred-calculation.patch
This is a note to let you know that I've just added the patch titled
ipmi: fix unsigned long underflow
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-fix-unsigned-long-underflow.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 392a17b10ec4320d3c0e96e2a23ebaad1123b989 Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Sat, 29 Jul 2017 21:14:55 -0500
Subject: ipmi: fix unsigned long underflow
From: Corey Minyard <cminyard(a)mvista.com>
commit 392a17b10ec4320d3c0e96e2a23ebaad1123b989 upstream.
When I set the timeout to a specific value such as 500ms, the timeout
event will not happen in time due to the overflow in function
check_msg_timeout:
...
ent->timeout -= timeout_period;
if (ent->timeout > 0)
return;
...
The type of timeout_period is long, but ent->timeout is unsigned long.
This patch makes the type consistent.
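The wrap-around is easy to reproduce outside the kernel. The following userspace sketch uses hypothetical helper names (still_waiting_buggy() and still_waiting_fixed()) that mirror the old and new checks in check_msg_timeout(); it only models the arithmetic, not the IPMI message handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Pre-fix logic: subtract first, then test. Because *timeout is unsigned,
 * subtracting a larger period wraps to a huge value instead of going
 * negative, so the "> 0" test still reports the message as waiting.
 */
static bool still_waiting_buggy(unsigned long *timeout, long timeout_period)
{
	assert(timeout != NULL);
	*timeout -= timeout_period;
	return *timeout > 0;
}

/*
 * Post-fix logic: compare before subtracting, with both operands unsigned,
 * so the subtraction can never wrap.
 */
static bool still_waiting_fixed(unsigned long *timeout,
				unsigned long timeout_period)
{
	assert(timeout != NULL);
	if (timeout_period < *timeout) {
		*timeout -= timeout_period;
		return true;
	}
	return false;
}
```

With 500 units remaining and a period of 1000, the pre-fix check wraps and keeps waiting, while the post-fix check reports the timeout and leaves the remaining value untouched.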
Reported-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Tested-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4029,7 +4029,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struc
}
static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
int slot, unsigned long *flags,
unsigned int *waiting_msgs)
{
@@ -4042,8 +4043,8 @@ static void check_msg_timeout(ipmi_smi_t
if (!ent->inuse)
return;
- ent->timeout -= timeout_period;
- if (ent->timeout > 0) {
+ if (timeout_period < ent->timeout) {
+ ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4109,7 +4110,8 @@ static void check_msg_timeout(ipmi_smi_t
}
}
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+ unsigned long timeout_period)
{
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-4.4/ipmi-fix-unsigned-long-underflow.patch
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
We should wait for dio requests to finish before taking the inode lock in
ocfs2_setattr(); otherwise the following deadlock will happen:
process 1                  process 2                    process 3
truncate file 'A'          end_io of writing file 'A'   receiving the bast messages
ocfs2_setattr
 ocfs2_inode_lock_tracker
  ocfs2_inode_lock_full
 inode_dio_wait
  __inode_dio_wait
  -->waiting for all dio
  requests finish
                                                        dlm_proxy_ast_handler
                                                         dlm_do_local_bast
                                                          ocfs2_blocking_ast
                                                           ocfs2_generic_handle_bast
                                                            set OCFS2_LOCK_BLOCKED flag
                           dio_end_io
                            dio_bio_end_aio
                             dio_complete
                              ocfs2_dio_end_io
                               ocfs2_dio_end_io_write
                                ocfs2_inode_lock
                                 __ocfs2_cluster_lock
                                  ocfs2_wait_for_mask
                                  -->waiting for OCFS2_LOCK_BLOCKED
                                  flag to be cleared, that is waiting
                                  for 'process 1' unlocking the inode lock
                           inode_dio_end
                           -->here dec the i_dio_count, but will never
                           be called, so a deadlock happened.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1161,6 +1161,13 @@ int ocfs2_setattr(struct dentry *dentry,
}
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1200,8 +1207,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-4.14/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
mm, swap: fix false error message in __swp_swapcount()
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-swap-fix-false-error-message-in-__swp_swapcount.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From e9a6effa500526e2a19d5ad042cb758b55b1ef93 Mon Sep 17 00:00:00 2001
From: Huang Ying <huang.ying.caritas(a)gmail.com>
Date: Wed, 15 Nov 2017 17:33:15 -0800
Subject: mm, swap: fix false error message in __swp_swapcount()
From: Huang Ying <huang.ying.caritas(a)gmail.com>
commit e9a6effa500526e2a19d5ad042cb758b55b1ef93 upstream.
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA-based swap readahead) may read ahead several swap entries
after the faulting swap entry. The readahead algorithm calculates some of
the swap entries to read ahead by increasing the offset of the faulting
swap entry without checking whether they are beyond the end of the swap
device, and it relies on __swp_swapcount() and swapcache_prepare() to
check that. Although __swp_swapcount() checks the swap entry passed
in, it complains with the following error message for the expected
invalid swap entry. This may confuse end users.
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, a swap entry check is added to
swapin_readahead() to avoid passing out-of-bound swap entries and
the swap entry reserved for the swap header to __swp_swapcount() and
swapcache_prepare().
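The window computation with the added clamp can be sketched in userspace as follows. The names struct ra_window and readahead_window() are hypothetical; `max` stands in for si->max, and the `offset & ~mask` start computation is an assumption, since the diff only shows how end_offset is derived.

```c
#include <assert.h>

/* Inclusive range of swap offsets to read ahead. */
struct ra_window {
	unsigned long start;
	unsigned long end;
};

/*
 * Sketch of the readahead window: `mask` is (window size - 1) with the
 * window size a power of two, and `max` is the number of slots on the
 * swap device (stand-in for si->max).
 */
static struct ra_window readahead_window(unsigned long offset,
					 unsigned long mask,
					 unsigned long max)
{
	struct ra_window w;

	assert(max > 1);		/* slot 0 is the header, need one more */
	w.start = offset & ~mask;	/* assumed start computation */
	w.end = offset | mask;
	if (!w.start)			/* first page is the swap header */
		w.start++;
	if (w.end >= max)		/* the clamp added by this patch */
		w.end = max - 1;
	return w;
}
```

For a fault at offset 1001 with an 8-slot window on a 1004-slot device, the unclamped window would end at 1007; the clamp limits it to 1003, so no out-of-bound entry is ever handed to the checking functions.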
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Christian Kujau <lists(a)nerdbynature.de>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Hugh Dickins <hughd(a)google.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/swap_state.c | 3 +++
1 file changed, 3 insertions(+)
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -559,6 +559,7 @@ struct page *swapin_readahead(swp_entry_
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
+ struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
bool do_poll = true, page_allocated;
@@ -572,6 +573,8 @@ struct page *swapin_readahead(swp_entry_
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
+ if (end_offset >= si->max)
+ end_offset = si->max - 1;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
Patches currently in stable-queue which might be from huang.ying.caritas(a)gmail.com are
queue-4.14/mm-swap-fix-false-error-message-in-__swp_swapcount.patch
This is a note to let you know that I've just added the patch titled
ocfs2: fix cluster hang after a node dies
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-fix-cluster-hang-after-a-node-dies.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 1c01967116a678fed8e2c68a6ab82abc8effeddc Mon Sep 17 00:00:00 2001
From: Changwei Ge <ge.changwei(a)h3c.com>
Date: Wed, 15 Nov 2017 17:31:33 -0800
Subject: ocfs2: fix cluster hang after a node dies
From: Changwei Ge <ge.changwei(a)h3c.com>
commit 1c01967116a678fed8e2c68a6ab82abc8effeddc upstream.
When a node dies, the other live nodes have to choose a new master for
each existing lock resource mastered by the dead node.
In the ocfs2/dlm implementation, this is done by
dlm_move_lockres_to_recovery_list(), which marks those lock resources as
DLM_LOCK_RES_RECOVERING and manages them via a list from which DLM
changes the lock resource's master later.
So without invoking dlm_move_lockres_to_recovery_list(), no master will be
chosen after dlm recovery completes, since no lock resource can be
found through the ::resource list.
What's worse, if DLM_LOCK_RES_RECOVERING is not set for lock
resources mastered by a dead node, synchronization among nodes breaks.
So invoke dlm_move_lockres_to_recovery_list() again.
Fixes: ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-…
Signed-off-by: Changwei Ge <ge.changwei(a)h3c.com>
Reported-by: Vitaly Mayatskikh <v.mayatskih(a)gmail.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih(a)gmail.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Joseph Qi <jiangqi903(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/dlm/dlmrecovery.c | 1 +
1 file changed, 1 insertion(+)
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanu
dlm_lockres_put(res);
continue;
}
+ dlm_move_lockres_to_recovery_list(dlm, res);
} else if (res->owner == dlm->node_num) {
dlm_free_dead_locks(dlm, res, dead_node);
__dlm_lockres_calc_usage(dlm, res);
Patches currently in stable-queue which might be from ge.changwei(a)h3c.com are
queue-4.14/ocfs2-fix-cluster-hang-after-a-node-dies.patch
queue-4.14/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
mm/pagewalk.c: report holes in hugetlb ranges
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: mm/pagewalk.c: report holes in hugetlb ranges
From: Jann Horn <jannh(a)google.com>
commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
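A small userspace model shows why unreported holes leak stale data. Everything here is hypothetical scaffolding: struct walk only mimics the two callbacks involved, `present[]` stands in for the huge_pte_offset() lookup, and `vec` plays the role of the mincore result buffer that the caller does not pre-initialize.

```c
#include <assert.h>
#include <stddef.h>

struct walk;
typedef int (*walk_cb)(size_t idx, struct walk *walk);

struct walk {
	walk_cb hugetlb_entry;	/* called for populated entries */
	walk_cb pte_hole;	/* called for holes; may be NULL */
	unsigned char *vec;	/* mincore-style result, one byte per entry */
};

static int mark_present(size_t idx, struct walk *walk)
{
	walk->vec[idx] = 1;
	return 0;
}

static int mark_hole(size_t idx, struct walk *walk)
{
	walk->vec[idx] = 0;
	return 0;
}

/*
 * Model of the range walk. With report_holes == 0 it behaves like the
 * pre-fix code and silently skips holes; with report_holes != 0 it calls
 * ->pte_hole when that callback is set, as the patch does.
 */
static int walk_range(const int *present, size_t n, struct walk *walk,
		      int report_holes)
{
	int err = 0;

	assert(walk->hugetlb_entry != NULL);	/* the caller guarantees this */
	for (size_t i = 0; i < n; i++) {
		if (present[i])
			err = walk->hugetlb_entry(i, walk);
		else if (report_holes && walk->pte_hole)
			err = walk->pte_hole(i, walk);
		if (err)
			break;
	}
	return err;
}
```

If `vec` starts out filled with a garbage pattern, the pre-fix walk leaves that pattern in every slot that corresponds to a hole, which is what would then be copied to userspace; the post-fix walk overwrites it.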
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/pagewalk.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -188,8 +188,12 @@ static int walk_hugetlb_range(unsigned l
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
Patches currently in stable-queue which might be from jannh(a)google.com are
queue-4.14/mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
This is a note to let you know that I've just added the patch titled
mm/page_alloc.c: broken deferred calculation
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_alloc.c-broken-deferred-calculation.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d135e5750205a21a212a19dbb05aeb339e2cbea7 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Date: Wed, 15 Nov 2017 17:38:41 -0800
Subject: mm/page_alloc.c: broken deferred calculation
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.
In reset_deferred_meminit() we determine the number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.
The reserved memory is determined by memblock_reserved_memory_within(),
which operates on physical addresses and returns a size in bytes. However,
reset_deferred_meminit() assumes that this function operates on pfns and
returns a page count.
The result is that in the best case the machine boots slower than expected
due to initializing more pages than needed in a single thread, and in the
worst case it panics because fewer pages than needed are initialized early.
Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 10 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -700,7 +700,8 @@ typedef struct pglist_data {
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
- unsigned long static_init_size;
+ /* Number of non-deferred pages */
+ unsigned long static_init_pgcnt;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -290,28 +290,37 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+
+/*
+ * Determine how many pages need to be initialized durig early boot
+ * (non-deferred initialization).
+ * The value of first_deferred_pfn will be set later, once non-deferred pages
+ * are initialized, but for now set it ULONG_MAX.
+ */
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
- unsigned long max_initialise;
- unsigned long reserved_lowmem;
+ phys_addr_t start_addr, end_addr;
+ unsigned long max_pgcnt;
+ unsigned long reserved;
/*
* Initialise at least 2G of a node but also take into account that
* two large system hashes that can take up 1GB for 0.25TB/node.
*/
- max_initialise = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
+ max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
+ (pgdat->node_spanned_pages >> 8));
/*
* Compensate the all the memblock reservations (e.g. crash kernel)
* from the initial estimation to make sure we will initialize enough
* memory to boot.
*/
- reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
- pgdat->node_start_pfn + max_initialise);
- max_initialise += reserved_lowmem;
+ start_addr = PFN_PHYS(pgdat->node_start_pfn);
+ end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
+ reserved = memblock_reserved_memory_within(start_addr, end_addr);
+ max_pgcnt += PHYS_PFN(reserved);
- pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
+ pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
pgdat->first_deferred_pfn = ULONG_MAX;
}
@@ -338,7 +347,7 @@ static inline bool update_defer_init(pg_
if (zone_end < pgdat_end_pfn(pgdat))
return true;
(*nr_initialised)++;
- if ((*nr_initialised > pgdat->static_init_size) &&
+ if ((*nr_initialised > pgdat->static_init_pgcnt) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
Patches currently in stable-queue which might be from pasha.tatashin(a)oracle.com are
queue-4.14/mm-page_alloc.c-broken-deferred-calculation.patch
This is a note to let you know that I've just added the patch titled
mm/page_ext.c: check if page_ext is not prepared
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From e492080e640c2d1235ddf3441cae634cfffef7e1 Mon Sep 17 00:00:00 2001
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
Date: Wed, 15 Nov 2017 17:39:07 -0800
Subject: mm/page_ext.c: check if page_ext is not prepared
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
commit e492080e640c2d1235ddf3441cae634cfffef7e1 upstream.
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they skip the allocation if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). In that case section->page_ext remains NULL.
lookup_page_ext() checks for NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner() will try to get page_ext through
lookup_page_ext(). Without CONFIG_DEBUG_VM, lookup_page_ext() will use the
NULL pointer as base address 0, which results in an invalid address access.
The panic below is an example where PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext. section->page_ext is NULL, so
get_entry() returned the invalid page_ext address 0x1DFA000 for PFN
0x13FC00.
To avoid this panic, the CONFIG_DEBUG_VM guard should be removed so that
page_ext is checked at all times.
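The address in the report can be reproduced with plain arithmetic. The
sketch below is user-space C, not the kernel code: entry_addr() and
lookup_checked() are hypothetical names that only mirror the "base plus
size times index" shape of get_entry(), and the 24-byte entry size is an
assumption inferred from 0x1DFA000 / 0x13FC00, not something the report
states.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-PFN entry size, inferred from the addresses in the report. */
#define ENTRY_SIZE 24UL

/* Same shape as get_entry(): base + entry size * index, with no NULL check. */
static uintptr_t entry_addr(uintptr_t base, unsigned long index)
{
	return base + ENTRY_SIZE * index;
}

/*
 * Checked variant: return 0 (standing in for NULL) when no table was
 * allocated, instead of computing an address relative to address 0.
 */
static uintptr_t lookup_checked(uintptr_t base, unsigned long index)
{
	if (!base)
		return 0;
	return entry_addr(base, index);
}
```

With a zero base and index 0x13FC00, entry_addr() yields 0x1DFA000, which
matches the bogus page_ext address above; the faulting address 01dfa014 is
a small offset into that bogus entry. lookup_checked() returns 0 for the
same input, which is the behaviour the unconditional check restores.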
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Joonsoo Kim <js1304(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/page_ext.c | 4 ----
1 file changed, 4 deletions(-)
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -125,7 +125,6 @@ struct page_ext *lookup_page_ext(struct
struct page_ext *base;
base = NODE_DATA(page_to_nid(page))->node_page_ext;
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -134,7 +133,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (unlikely(!base))
return NULL;
-#endif
index = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
@@ -199,7 +197,6 @@ struct page_ext *lookup_page_ext(struct
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -208,7 +205,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (!section->page_ext)
return NULL;
-#endif
return get_entry(section->page_ext, pfn);
}
Patches currently in stable-queue which might be from jaewon31.kim(a)samsung.com are
queue-4.14/mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
This is a note to let you know that I've just added the patch titled
ipmi: fix unsigned long underflow
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-fix-unsigned-long-underflow.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 392a17b10ec4320d3c0e96e2a23ebaad1123b989 Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Sat, 29 Jul 2017 21:14:55 -0500
Subject: ipmi: fix unsigned long underflow
From: Corey Minyard <cminyard(a)mvista.com>
commit 392a17b10ec4320d3c0e96e2a23ebaad1123b989 upstream.
When I set the timeout to a specific value such as 500ms, the timeout
event will not happen in time due to the overflow in function
check_msg_timeout:
...
ent->timeout -= timeout_period;
if (ent->timeout > 0)
return;
...
The type of timeout_period is long, but ent->timeout is unsigned long.
This patch makes the type consistent.
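The wraparound is easy to reproduce outside the kernel. The following is
a user-space sketch, not the driver code: old_still_waiting() and
new_still_waiting() are hypothetical names that mirror the two versions
of the test in check_msg_timeout(), and the values used with them are
made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Old logic: subtract first, then test. *timeout is unsigned, so when
 * period is larger than *timeout the result wraps to a huge positive
 * value and the entry still looks like it is waiting.
 */
static bool old_still_waiting(unsigned long *timeout, long period)
{
	*timeout -= period;
	return *timeout > 0;
}

/* New logic: compare first, and subtract only when it cannot wrap. */
static bool new_still_waiting(unsigned long *timeout, unsigned long period)
{
	if (period < *timeout) {
		*timeout -= period;
		return true;
	}
	return false;
}
```

Starting from a remaining timeout of 500 and a period of 1000, the old
version reports the entry as still waiting and leaves a very large value
in the timeout field, while the new version reports it as expired and
leaves the field untouched.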
Reported-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Tested-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4030,7 +4030,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struc
}
static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
int slot, unsigned long *flags,
unsigned int *waiting_msgs)
{
@@ -4043,8 +4044,8 @@ static void check_msg_timeout(ipmi_smi_t
if (!ent->inuse)
return;
- ent->timeout -= timeout_period;
- if (ent->timeout > 0) {
+ if (timeout_period < ent->timeout) {
+ ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4110,7 +4111,8 @@ static void check_msg_timeout(ipmi_smi_t
}
}
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+ unsigned long timeout_period)
{
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-4.14/ipmi-fix-unsigned-long-underflow.patch
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
We should wait for dio requests to finish before taking the inode lock in
ocfs2_setattr(); otherwise the following deadlock will happen:

process 1 (truncate file 'A'):
  ocfs2_setattr
  ocfs2_inode_lock_tracker
  ocfs2_inode_lock_full
  inode_dio_wait
  __inode_dio_wait
  --> waiting for all dio requests to finish

process 3 (receiving the bast messages):
  dlm_proxy_ast_handler
  dlm_do_local_bast
  ocfs2_blocking_ast
  ocfs2_generic_handle_bast
  set OCFS2_LOCK_BLOCKED flag

process 2 (end_io of writing file 'A'):
  dio_end_io
  dio_bio_end_aio
  dio_complete
  ocfs2_dio_end_io
  ocfs2_dio_end_io_write
  ocfs2_inode_lock
  __ocfs2_cluster_lock
  ocfs2_wait_for_mask
  --> waiting for the OCFS2_LOCK_BLOCKED flag to be cleared, that is,
      waiting for 'process 1' to unlock the inode lock
  inode_dio_end
  --> this would decrement i_dio_count, but it is never called,
      so a deadlock happens.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1168,6 +1168,13 @@ int ocfs2_setattr(struct dentry *dentry,
}
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1207,8 +1214,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-4.13/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
rcu: Fix up pending cbs check in rcu_prepare_for_idle
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
rcu-fix-up-pending-cbs-check-in-rcu_prepare_for_idle.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 135bd1a230bb69a68c9808a7d25467318900b80a Mon Sep 17 00:00:00 2001
From: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Date: Mon, 7 Aug 2017 11:20:10 +0530
Subject: rcu: Fix up pending cbs check in rcu_prepare_for_idle
From: Neeraj Upadhyay <neeraju(a)codeaurora.org>
commit 135bd1a230bb69a68c9808a7d25467318900b80a upstream.
The pending-callbacks check in rcu_prepare_for_idle() is backwards.
It should accelerate if there are pending callbacks, but the check
rather uselessly accelerates only if there are no callbacks. This commit
therefore inverts this check.
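The intended control flow can be shown with a toy model. This is only a
sketch: struct flavor_state and accelerate_pending() are hypothetical
stand-ins for the per-flavor loop in rcu_prepare_for_idle(), and nothing
here is kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-flavor state: are callbacks pending, was it accelerated? */
struct flavor_state {
	bool has_pending_cbs;
	bool accelerated;
};

/*
 * Corrected loop shape: skip a flavor only when it has NO pending
 * callbacks, and accelerate the rest. Returns how many were accelerated.
 */
static size_t accelerate_pending(struct flavor_state *f, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (!f[i].has_pending_cbs)
			continue;	/* nothing to accelerate here */
		f[i].accelerated = true;
		count++;
	}
	return count;
}
```

With the test written the other way round, the loop would touch only the
entries that have nothing queued, which is the useless behaviour the
commit message describes.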
Fixes: 15fecf89e46a ("srcu: Abstract multi-tail callback list handling")
Signed-off-by: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
kernel/rcu/tree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1493,7 +1493,7 @@ static void rcu_prepare_for_idle(void)
rdtp->last_accelerate = jiffies;
for_each_rcu_flavor(rsp) {
rdp = this_cpu_ptr(rsp->rda);
- if (rcu_segcblist_pend_cbs(&rdp->cblist))
+ if (!rcu_segcblist_pend_cbs(&rdp->cblist))
continue;
rnp = rdp->mynode;
raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
Patches currently in stable-queue which might be from neeraju(a)codeaurora.org are
queue-4.13/rcu-fix-up-pending-cbs-check-in-rcu_prepare_for_idle.patch
This is a note to let you know that I've just added the patch titled
ocfs2: fix cluster hang after a node dies
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-fix-cluster-hang-after-a-node-dies.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 1c01967116a678fed8e2c68a6ab82abc8effeddc Mon Sep 17 00:00:00 2001
From: Changwei Ge <ge.changwei(a)h3c.com>
Date: Wed, 15 Nov 2017 17:31:33 -0800
Subject: ocfs2: fix cluster hang after a node dies
From: Changwei Ge <ge.changwei(a)h3c.com>
commit 1c01967116a678fed8e2c68a6ab82abc8effeddc upstream.
When a node dies, the other live nodes have to choose a new master for
each existing lock resource mastered by the dead node.
In the ocfs2/dlm implementation, this is done by
dlm_move_lockres_to_recovery_list(), which marks those lock resources as
DLM_LOCK_RES_RECOVERING and manages them via a list from which DLM
changes the lock resource's master later.
So without invoking dlm_move_lockres_to_recovery_list(), no master will be
chosen after dlm recovery completes, since no lock resource can be
found through the ::resource list.
What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for lock
resources mastered by a dead node, synchronization among nodes breaks.
So invoke dlm_move_lockres_to_recovery_list() again.
Fixes: ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-…
Signed-off-by: Changwei Ge <ge.changwei(a)h3c.com>
Reported-by: Vitaly Mayatskikh <v.mayatskih(a)gmail.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih(a)gmail.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Joseph Qi <jiangqi903(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/dlm/dlmrecovery.c | 1 +
1 file changed, 1 insertion(+)
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanu
dlm_lockres_put(res);
continue;
}
+ dlm_move_lockres_to_recovery_list(dlm, res);
} else if (res->owner == dlm->node_num) {
dlm_free_dead_locks(dlm, res, dead_node);
__dlm_lockres_calc_usage(dlm, res);
Patches currently in stable-queue which might be from ge.changwei(a)h3c.com are
queue-4.13/ocfs2-fix-cluster-hang-after-a-node-dies.patch
queue-4.13/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
mm/pagewalk.c: report holes in hugetlb ranges
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: mm/pagewalk.c: report holes in hugetlb ranges
From: Jann Horn <jannh(a)google.com>
commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
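The effect on a caller such as mincore can be modelled in user space. The
sketch below is not the kernel walker: struct walk_ops, walk_range() and
the two marker callbacks are hypothetical, and the "present" array stands
in for whether huge_pte_offset() found an entry at each step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback table; hole may be NULL, as pte_hole may be. */
struct walk_ops {
	int (*entry)(size_t idx, void *priv);
	int (*hole)(size_t idx, void *priv);
};

/* Visit every step: call entry() where something is mapped, else hole(). */
static int walk_range(const unsigned char *present, size_t n,
		      const struct walk_ops *ops, void *priv)
{
	int err = 0;

	for (size_t i = 0; i < n; i++) {
		if (present[i])
			err = ops->entry(i, priv);
		else if (ops->hole)
			err = ops->hole(i, priv);
		if (err)
			break;
	}
	return err;
}

/* Callbacks that fill a result vector, one byte per step. */
static int mark_resident(size_t idx, void *priv)
{
	((unsigned char *)priv)[idx] = 1;
	return 0;
}

static int mark_absent(size_t idx, void *priv)
{
	((unsigned char *)priv)[idx] = 0;
	return 0;
}
```

When a hole callback is supplied, every slot of the result vector is
written. When it is NULL, slots for holes keep whatever they held before
the walk, which is how stale data could reach the caller before this fix.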
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/pagewalk.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -187,8 +187,12 @@ static int walk_hugetlb_range(unsigned l
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
Patches currently in stable-queue which might be from jannh(a)google.com are
queue-4.13/mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
This is a note to let you know that I've just added the patch titled
mm/page_alloc.c: broken deferred calculation
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_alloc.c-broken-deferred-calculation.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d135e5750205a21a212a19dbb05aeb339e2cbea7 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Date: Wed, 15 Nov 2017 17:38:41 -0800
Subject: mm/page_alloc.c: broken deferred calculation
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.
In reset_deferred_meminit() we determine number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.
The reserved memory is determined in this function:
memblock_reserved_memory_within(), which operates over physical
addresses, and returns size in bytes. However, reset_deferred_meminit()
assumes that this function operates with pfns, and returns page
count.
The result is that in the best case the machine boots slower than expected
due to initializing more pages than needed in a single thread, and in the
worst case it panics because fewer pages than needed are initialized early.
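The unit mismatch can be illustrated with a small user-space sketch. It
is not the kernel code: reserved_bytes_within() is a hypothetical
stand-in for memblock_reserved_memory_within() backed by one made-up
reserved region, and 4 KiB pages (PAGE_SHIFT of 12) are an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define PFN_PHYS(pfn)	((uint64_t)(pfn) << PAGE_SHIFT)
#define PHYS_PFN(addr)	((unsigned long)((addr) >> PAGE_SHIFT))

/* Made-up reserved region, expressed in physical byte addresses. */
static const uint64_t resv_start = 0x10000000ULL;	/* starts at 256 MiB */
static const uint64_t resv_size  = 0x08000000ULL;	/* 128 MiB long */

/* Stand-in: takes byte addresses, returns the overlapping reserved bytes. */
static uint64_t reserved_bytes_within(uint64_t start, uint64_t end)
{
	uint64_t resv_end = resv_start + resv_size;
	uint64_t lo = start > resv_start ? start : resv_start;
	uint64_t hi = end < resv_end ? end : resv_end;

	return hi > lo ? hi - lo : 0;
}

/* Old pattern: PFNs passed as addresses, byte result used as a page count. */
static uint64_t extra_pages_buggy(unsigned long start_pfn,
				  unsigned long max_pgcnt)
{
	return reserved_bytes_within(start_pfn, start_pfn + max_pgcnt);
}

/* Fixed pattern: convert PFNs to bytes going in, bytes to pages coming out. */
static unsigned long extra_pages_fixed(unsigned long start_pfn,
				       unsigned long max_pgcnt)
{
	uint64_t reserved = reserved_bytes_within(PFN_PHYS(start_pfn),
						  PFN_PHYS(start_pfn + max_pgcnt));

	return PHYS_PFN(reserved);
}
```

For a node starting at PFN 0 with a 2G estimate (524288 pages), the old
pattern queries only the first 512 KiB of physical memory and finds no
reservation, so too few pages are initialized early. The fixed pattern
sees the whole 128 MiB region and adds 32768 pages. A reservation at a
very low address would instead make the old pattern add a byte count as
if it were pages, which is the slow-boot case.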
Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 10 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -691,7 +691,8 @@ typedef struct pglist_data {
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
- unsigned long static_init_size;
+ /* Number of non-deferred pages */
+ unsigned long static_init_pgcnt;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -289,28 +289,37 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+
+/*
+ * Determine how many pages need to be initialized durig early boot
+ * (non-deferred initialization).
+ * The value of first_deferred_pfn will be set later, once non-deferred pages
+ * are initialized, but for now set it ULONG_MAX.
+ */
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
- unsigned long max_initialise;
- unsigned long reserved_lowmem;
+ phys_addr_t start_addr, end_addr;
+ unsigned long max_pgcnt;
+ unsigned long reserved;
/*
* Initialise at least 2G of a node but also take into account that
* two large system hashes that can take up 1GB for 0.25TB/node.
*/
- max_initialise = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
+ max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
+ (pgdat->node_spanned_pages >> 8));
/*
* Compensate the all the memblock reservations (e.g. crash kernel)
* from the initial estimation to make sure we will initialize enough
* memory to boot.
*/
- reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
- pgdat->node_start_pfn + max_initialise);
- max_initialise += reserved_lowmem;
+ start_addr = PFN_PHYS(pgdat->node_start_pfn);
+ end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
+ reserved = memblock_reserved_memory_within(start_addr, end_addr);
+ max_pgcnt += PHYS_PFN(reserved);
- pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
+ pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
pgdat->first_deferred_pfn = ULONG_MAX;
}
@@ -337,7 +346,7 @@ static inline bool update_defer_init(pg_
if (zone_end < pgdat_end_pfn(pgdat))
return true;
(*nr_initialised)++;
- if ((*nr_initialised > pgdat->static_init_size) &&
+ if ((*nr_initialised > pgdat->static_init_pgcnt) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
Patches currently in stable-queue which might be from pasha.tatashin(a)oracle.com are
queue-4.13/mm-page_alloc.c-broken-deferred-calculation.patch
This is a note to let you know that I've just added the patch titled
mm/page_ext.c: check if page_ext is not prepared
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From e492080e640c2d1235ddf3441cae634cfffef7e1 Mon Sep 17 00:00:00 2001
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
Date: Wed, 15 Nov 2017 17:39:07 -0800
Subject: mm/page_ext.c: check if page_ext is not prepared
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
commit e492080e640c2d1235ddf3441cae634cfffef7e1 upstream.
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they skip the allocation if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). In that case section->page_ext remains NULL.
lookup_page_ext() checks for NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner() will try to get page_ext through
lookup_page_ext(). Without CONFIG_DEBUG_VM, lookup_page_ext() will use the
NULL pointer as base address 0, which results in an invalid address access.
The panic below is an example where PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext. section->page_ext is NULL, so
get_entry() returned the invalid page_ext address 0x1DFA000 for PFN
0x13FC00.
To avoid this panic, the CONFIG_DEBUG_VM guard should be removed so that
page_ext is checked at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Joonsoo Kim <js1304(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/page_ext.c | 4 ----
1 file changed, 4 deletions(-)
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -124,7 +124,6 @@ struct page_ext *lookup_page_ext(struct
struct page_ext *base;
base = NODE_DATA(page_to_nid(page))->node_page_ext;
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -133,7 +132,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (unlikely(!base))
return NULL;
-#endif
index = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
@@ -198,7 +196,6 @@ struct page_ext *lookup_page_ext(struct
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -207,7 +204,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (!section->page_ext)
return NULL;
-#endif
return get_entry(section->page_ext, pfn);
}
Patches currently in stable-queue which might be from jaewon31.kim(a)samsung.com are
queue-4.13/mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
This is a note to let you know that I've just added the patch titled
ipmi: fix unsigned long underflow
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-fix-unsigned-long-underflow.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 392a17b10ec4320d3c0e96e2a23ebaad1123b989 Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Sat, 29 Jul 2017 21:14:55 -0500
Subject: ipmi: fix unsigned long underflow
From: Corey Minyard <cminyard(a)mvista.com>
commit 392a17b10ec4320d3c0e96e2a23ebaad1123b989 upstream.
When I set the timeout to a specific value such as 500ms, the timeout
event will not happen in time due to the overflow in function
check_msg_timeout:
...
ent->timeout -= timeout_period;
if (ent->timeout > 0)
return;
...
The type of timeout_period is long, but ent->timeout is unsigned long.
This patch makes the type consistent.
Reported-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Tested-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4030,7 +4030,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struc
}
static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
int slot, unsigned long *flags,
unsigned int *waiting_msgs)
{
@@ -4043,8 +4044,8 @@ static void check_msg_timeout(ipmi_smi_t
if (!ent->inuse)
return;
- ent->timeout -= timeout_period;
- if (ent->timeout > 0) {
+ if (timeout_period < ent->timeout) {
+ ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4110,7 +4111,8 @@ static void check_msg_timeout(ipmi_smi_t
}
}
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+ unsigned long timeout_period)
{
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-4.13/ipmi-fix-unsigned-long-underflow.patch
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
We should wait for dio requests to finish before taking the inode lock in
ocfs2_setattr(); otherwise the following deadlock will happen:

process 1 (truncate file 'A'):
  ocfs2_setattr
  ocfs2_inode_lock_tracker
  ocfs2_inode_lock_full
  inode_dio_wait
  __inode_dio_wait
  --> waiting for all dio requests to finish

process 3 (receiving the bast messages):
  dlm_proxy_ast_handler
  dlm_do_local_bast
  ocfs2_blocking_ast
  ocfs2_generic_handle_bast
  set OCFS2_LOCK_BLOCKED flag

process 2 (end_io of writing file 'A'):
  dio_end_io
  dio_bio_end_aio
  dio_complete
  ocfs2_dio_end_io
  ocfs2_dio_end_io_write
  ocfs2_inode_lock
  __ocfs2_cluster_lock
  ocfs2_wait_for_mask
  --> waiting for the OCFS2_LOCK_BLOCKED flag to be cleared, that is,
      waiting for 'process 1' to unlock the inode lock
  inode_dio_end
  --> this would decrement i_dio_count, but it is never called,
      so a deadlock happens.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1151,6 +1151,13 @@ int ocfs2_setattr(struct dentry *dentry,
dquot_initialize(inode);
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1170,8 +1177,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-3.18/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
ipmi: fix unsigned long underflow
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-fix-unsigned-long-underflow.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 392a17b10ec4320d3c0e96e2a23ebaad1123b989 Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Sat, 29 Jul 2017 21:14:55 -0500
Subject: ipmi: fix unsigned long underflow
From: Corey Minyard <cminyard(a)mvista.com>
commit 392a17b10ec4320d3c0e96e2a23ebaad1123b989 upstream.
When I set the timeout to a specific value such as 500ms, the timeout
event will not happen in time due to an unsigned underflow in the function
check_msg_timeout:
...
ent->timeout -= timeout_period;
if (ent->timeout > 0)
return;
...
The type of timeout_period is long, but ent->timeout is unsigned long.
This patch makes the type consistent.
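To illustrate the wraparound described above, here is a minimal user-space
sketch. The helper names tick_buggy() and tick_fixed() are hypothetical and
only mirror the before and after shape of the check; they are not the
kernel's check_msg_timeout().

```c
#include <stdbool.h>

/*
 * Hypothetical helpers for illustration only. Each returns true while
 * the timer still has time left, false once it has expired.
 */

/* Old shape: subtract first, then test an unsigned value against 0. */
static bool tick_buggy(unsigned long *timeout, long period)
{
	/* period is converted to unsigned long; if it exceeds *timeout,
	 * the result wraps around to a very large value. */
	*timeout -= period;
	/* An unsigned value is > 0 unless it is exactly 0, so a wrapped
	 * timer looks as if it still had time left. */
	return *timeout > 0;
}

/* New shape: compare first, subtract only when no wrap can occur. */
static bool tick_fixed(unsigned long *timeout, unsigned long period)
{
	if (period < *timeout) {
		*timeout -= period;
		return true;
	}
	return false;	/* expired; *timeout is left untouched */
}
```

With a remaining timeout of 500 and a period of 1000, tick_buggy() reports
that time is left, while tick_fixed() reports expiry.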
Reported-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Tested-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4010,7 +4010,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struc
}
static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
int slot, unsigned long *flags,
unsigned int *waiting_msgs)
{
@@ -4023,8 +4024,8 @@ static void check_msg_timeout(ipmi_smi_t
if (!ent->inuse)
return;
- ent->timeout -= timeout_period;
- if (ent->timeout > 0) {
+ if (timeout_period < ent->timeout) {
+ ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4091,7 +4092,8 @@ static void check_msg_timeout(ipmi_smi_t
}
}
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+ unsigned long timeout_period)
{
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-3.18/ipmi-fix-unsigned-long-underflow.patch
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e492080e640c2d1235ddf3441cae634cfffef7e1 Mon Sep 17 00:00:00 2001
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
Date: Wed, 15 Nov 2017 17:39:07 -0800
Subject: [PATCH] mm/page_ext.c: check if page_ext is not prepared
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). Then section->page_ext remains as NULL.
lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner will try to get page_ext through
lookup_page_ext. Without CONFIG_DEBUG_VM lookup_page_ext will misuse
NULL pointer as value 0. This incurs an invalid address access.
This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext. section->page_ext is NULL,
get_entry returned invalid page_ext address as 0x1DFA000 for a PFN
0x13FC00.
To avoid this panic, the CONFIG_DEBUG_VM guard should be removed so that
page_ext will be checked for NULL at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
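The two halves of the fix (an unconditional NULL check in the lookup, and
callers that tolerate a NULL result) can be sketched in user space as
follows. All type and function names here are hypothetical stand-ins, and
the indexing is simplified compared with the kernel's get_entry().

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct page_ext_sketch {
	unsigned long flags;
};

struct section_sketch {
	struct page_ext_sketch *page_ext;	/* NULL if never allocated */
};

/* The NULL check is always compiled in, not only in debug builds. */
static struct page_ext_sketch *lookup_sketch(struct section_sketch *s,
					     size_t idx)
{
	if (!s->page_ext)
		return NULL;
	return &s->page_ext[idx];
}

/* A caller has to handle NULL instead of dereferencing it blindly. */
static int set_owner_sketch(struct section_sketch *s, size_t idx)
{
	struct page_ext_sketch *ext = lookup_sketch(s, idx);

	if (!ext)
		return -1;	/* table not prepared: skip the bookkeeping */
	ext->flags |= 1UL;
	return 0;
}
```

Without the check in lookup_sketch(), the address arithmetic would start
from a NULL base and produce a small bogus pointer, much like the
0x1DFA000 address in the report.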
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Joonsoo Kim <js1304(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [depends on f86e427197, see above]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 4f0367d472c4..2c16216c29b6 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -125,7 +125,6 @@ struct page_ext *lookup_page_ext(struct page *page)
struct page_ext *base;
base = NODE_DATA(page_to_nid(page))->node_page_ext;
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -134,7 +133,6 @@ struct page_ext *lookup_page_ext(struct page *page)
*/
if (unlikely(!base))
return NULL;
-#endif
index = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
@@ -199,7 +197,6 @@ struct page_ext *lookup_page_ext(struct page *page)
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -208,7 +205,6 @@ struct page_ext *lookup_page_ext(struct page *page)
*/
if (!section->page_ext)
return NULL;
-#endif
return get_entry(section->page_ext, pfn);
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e492080e640c2d1235ddf3441cae634cfffef7e1 Mon Sep 17 00:00:00 2001
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
Date: Wed, 15 Nov 2017 17:39:07 -0800
Subject: [PATCH] mm/page_ext.c: check if page_ext is not prepared
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). Then section->page_ext remains as NULL.
lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner will try to get page_ext through
lookup_page_ext. Without CONFIG_DEBUG_VM lookup_page_ext will misuse
NULL pointer as value 0. This incurs an invalid address access.
This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext. section->page_ext is NULL,
get_entry returned invalid page_ext address as 0x1DFA000 for a PFN
0x13FC00.
To avoid this panic, the CONFIG_DEBUG_VM guard should be removed so that
page_ext will be checked for NULL at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Joonsoo Kim <js1304(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [depends on f86e427197, see above]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 4f0367d472c4..2c16216c29b6 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -125,7 +125,6 @@ struct page_ext *lookup_page_ext(struct page *page)
struct page_ext *base;
base = NODE_DATA(page_to_nid(page))->node_page_ext;
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -134,7 +133,6 @@ struct page_ext *lookup_page_ext(struct page *page)
*/
if (unlikely(!base))
return NULL;
-#endif
index = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
@@ -199,7 +197,6 @@ struct page_ext *lookup_page_ext(struct page *page)
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -208,7 +205,6 @@ struct page_ext *lookup_page_ext(struct page *page)
*/
if (!section->page_ext)
return NULL;
-#endif
return get_entry(section->page_ext, pfn);
}
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with i40evf as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Andrew Bowers <andrewx.bowers(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index fe817e2b6fef..50864f99446d 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -179,7 +179,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
i40e_trace(clean_tx_irq, tx_ring, tx_desc, tx_buf);
/* if the descriptor isn't done, no work yet to do */
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with fm10k as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index dbd69310f263..538b42d5c187 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -1231,7 +1231,7 @@ static bool fm10k_clean_tx_irq(struct fm10k_q_vector *q_vector,
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if DD is not set pending work has not been completed */
if (!(eop_desc->flags & FM10K_TXD_FLAG_DONE))
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with igb as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Aaron Brown <aaron.f.brown(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index e94d3c256667..c208753ff5b7 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -7317,7 +7317,7 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if DD is not set pending work has not been completed */
if (!(eop_desc->wb.status & cpu_to_le32(E1000_TXD_STAT_DD)))
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with igbvf as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Aaron Brown <aaron.f.brown(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/igbvf/netdev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index 713e8df23744..4214c1519a87 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -810,7 +810,7 @@ static bool igbvf_clean_tx_irq(struct igbvf_ring *tx_ring)
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if DD is not set pending work has not been completed */
if (!(eop_desc->wb.status & cpu_to_le32(E1000_TXD_STAT_DD)))
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with ixgbevf as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Andrew Bowers <andrewx.bowers(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index feed11bc9ddf..1f4a69134ade 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -326,7 +326,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if DD is not set pending work has not been completed */
if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with i40e as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Andrew Bowers <andrewx.bowers(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 775d5a125887..4c08cc86463e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3966,7 +3966,7 @@ static bool i40e_clean_fdir_tx_irq(struct i40e_ring *tx_ring, int budget)
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if the descriptor isn't done, no work yet to do */
if (!(eop_desc->cmd_type_offset_bsz &
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index d6d352a6e6ea..4566d66ffc7c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -759,7 +759,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
i40e_trace(clean_tx_irq, tx_ring, tx_desc, tx_buf);
/* we have caught up to head, no work left to do */
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
This patch fixes an issue seen on Power systems with ixgbe which results
in skb list corruption and an eventual kernel oops. The following is what
was observed:
   CPU 1                            CPU2
============================        ============================
1: ixgbe_xmit_frame_ring            ixgbe_clean_tx_irq
2: first->skb = skb                 eop_desc = tx_buffer->next_to_watch
3: ixgbe_tx_map                     read_barrier_depends()
4: wmb                              check adapter written status bit
5: first->next_to_watch = tx_desc   napi_consume_skb(tx_buffer->skb ..);
6: writel(i, tx_ring->tail);
The read_barrier_depends is insufficient to ensure that tx_buffer->skb does not
get loaded prior to tx_buffer->next_to_watch, which then results in loading
a stale skb pointer. This patch replaces the read_barrier_depends with
smp_rmb to ensure loads are ordered with respect to the load of
tx_buffer->next_to_watch.
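The ordering requirement can be sketched in user space with C11 atomics.
This is only an analogy: struct slot, publish() and consume() are
hypothetical names, and release/acquire ordering stands in for the roles
that wmb() and smp_rmb() play in the driver. It is not driver code.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical analogue of one TX buffer slot. */
struct slot {
	void *skb;			/* payload pointer, written first */
	_Atomic(void *) next_to_watch;	/* published last */
};

/*
 * Producer side: store skb, then publish next_to_watch with release
 * ordering so the skb store is visible before the published pointer.
 */
static void publish(struct slot *s, void *skb, void *desc)
{
	s->skb = skb;
	atomic_store_explicit(&s->next_to_watch, desc, memory_order_release);
}

/*
 * Consumer side: the acquire load prevents the later read of skb from
 * being satisfied before next_to_watch has been read, so a non-NULL
 * next_to_watch implies a current skb.
 */
static void *consume(struct slot *s)
{
	void *desc = atomic_load_explicit(&s->next_to_watch,
					  memory_order_acquire);

	if (!desc)
		return NULL;	/* nothing published yet */
	return s->skb;
}
```

If the consumer's load were relaxed, a weakly ordered CPU could read skb
before next_to_watch, which is the stale-pointer case described above.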
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Andrew Bowers <andrewx.bowers(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ca06c3cc2ca8..62a18914f00f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1192,7 +1192,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if DD is not set pending work has not been completed */
if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
--
2.15.0
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: [PATCH] mm/pagewalk.c: report holes in hugetlb ranges
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
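The control flow after this change can be sketched with a simplified,
user-space walker. Everything here is hypothetical (struct walk_sketch,
walk_range_sketch() and the two callbacks); it only mirrors the
"entry, else optional hole callback" shape and is not the mm/pagewalk.c
API.

```c
#include <stddef.h>

/* Hypothetical, simplified walker state. */
struct walk_sketch {
	int (*entry)(int *pte, size_t idx, void *priv);	/* required */
	int (*hole)(size_t idx, void *priv);		/* may be NULL */
	void *priv;
};

/* Visit every slot; report missing entries if a hole callback is set. */
static int walk_range_sketch(int **ptes, size_t n, struct walk_sketch *w)
{
	int err = 0;

	for (size_t i = 0; i < n; i++) {
		if (ptes[i])
			err = w->entry(ptes[i], i, w->priv);
		else if (w->hole)
			err = w->hole(i, w->priv);
		if (err)
			break;
	}
	return err;
}

/* Example callbacks in the spirit of mincore(): one byte per slot. */
static int mark_present(int *pte, size_t idx, void *priv)
{
	(void)pte;
	((unsigned char *)priv)[idx] = 1;
	return 0;
}

static int mark_hole(size_t idx, void *priv)
{
	((unsigned char *)priv)[idx] = 0;
	return 0;
}
```

When no hole callback is supplied, the output byte for a missing entry is
never written, which corresponds to the uninitialized data that mincore
would otherwise copy to userspace.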
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 8bd4afa83cb8..23a3e415ac2c 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -188,8 +188,12 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: [PATCH] mm/pagewalk.c: report holes in hugetlb ranges
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 8bd4afa83cb8..23a3e415ac2c 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -188,8 +188,12 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
When I added entry_SYSCALL_64_after_hwframe, I left TRACE_IRQS_OFF
before it. This means that users of entry_SYSCALL_64_after_hwframe
were responsible for invoking TRACE_IRQS_OFF, and the one and only
user (added in the same commit) got it wrong.
I think this would manifest as a warning if a Xen PV guest with
CONFIG_DEBUG_LOCKDEP=y were used with context tracking. (The
context tracking bit is to cause lockdep to get invoked before we
turn IRQs back on.) I haven't tested that for real yet because I
can't get a kernel configured like that to boot at all on Xen PV.
I've reported it upstream. The problem seems to be that Xen PV is
missing early #UD handling, is hitting some WARN, and we rely on
Move TRACE_IRQS_OFF below the label.
Cc: stable(a)vger.kernel.org
Cc: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
Cc: Juergen Gross <jgross(a)suse.com>
Fixes: 8a9949bc71a7 ("x86/xen/64: Rearrange the SYSCALL entries")
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
---
arch/x86/entry/entry_64.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a2b30ec69497..5063ed1214dd 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -148,8 +148,6 @@ ENTRY(entry_SYSCALL_64)
movq %rsp, PER_CPU_VAR(rsp_scratch)
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
- TRACE_IRQS_OFF
-
/* Construct struct pt_regs on stack */
pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(rsp_scratch) /* pt_regs->sp */
@@ -170,6 +168,8 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
sub $(6*8), %rsp /* pt_regs->bp, bx, r12-15 not saved */
UNWIND_HINT_REGS extra=0
+ TRACE_IRQS_OFF
+
/*
* If we need to do entry work or if we guess we'll need to do
* exit work, go straight to the slow path.
--
2.13.6
This patch converts several network drivers to use smp_rmb
rather than read_barrier_depends. The initial issue was
discovered with ixgbe on a Power machine which resulted
in skb list corruption due to fetching a stale skb pointer.
More details can be found in the ixgbe patch description.
Changes since v1:
- Remove NULLing of tx_buffer->skb in the ixgbe patch
Brian King (7):
ixgbe: Fix skb list corruption on Power systems
i40e: Use smp_rmb rather than read_barrier_depends
ixgbevf: Use smp_rmb rather than read_barrier_depends
igbvf: Use smp_rmb rather than read_barrier_depends
igb: Use smp_rmb rather than read_barrier_depends
fm10k: Use smp_rmb rather than read_barrier_depends
i40evf: Use smp_rmb rather than read_barrier_depends
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +-
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
drivers/net/ethernet/intel/igbvf/netdev.c | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
8 files changed, 8 insertions(+), 8 deletions(-)
--
1.8.3.1