As the comment notes, the return codes for TSYNC and NEW_LISTENER conflict,
because they both return positive values, one in the case of success and
one in the case of error. So, let's disallow both of these flags together.
While this is technically a userspace break, all the users I know of are
still waiting on me to land this feature in libseccomp, so I think it'll be
safe. Also, at present my use case doesn't require TSYNC at all, so this
isn't a big deal to disallow. If someone wanted to support this, a path
forward would be to add a new flag like
TSYNC_AND_LISTENER_YES_I_UNDERSTAND_THAT_TSYNC_WILL_JUST_RETURN_EAGAIN, but
the use cases are so different I don't see it really happening.
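To make the ambiguity concrete, here is a minimal userspace sketch (illustration
only, not part of the patch; filter setup is omitted, prog is assumed to be a
valid struct sock_fprog, and the flag definitions need v5.0+ kernel headers):

	#include <linux/filter.h>
	#include <linux/seccomp.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static long install_filter_both_flags(struct sock_fprog *prog)
	{
		long ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
				   SECCOMP_FILTER_FLAG_TSYNC |
				   SECCOMP_FILTER_FLAG_NEW_LISTENER, prog);

		/*
		 * A positive ret here could be either the new listener fd
		 * (success) or the TID of a thread that could not be
		 * synchronized (failure); the caller has no way to tell.
		 * With this patch, the combination fails with -1/EINVAL
		 * instead.
		 */
		return ret;
	}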
Finally, it's worth noting that this does actually fix a UAF issue: at the end
of seccomp_set_mode_filter(), we have:
	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
		if (ret < 0) {
			listener_f->private_data = NULL;
			fput(listener_f);
			put_unused_fd(listener);
		} else {
			fd_install(listener, listener_f);
			ret = listener;
		}
	}
out_free:
	seccomp_filter_free(prepared);
But if ret > 0 because TSYNC raced, we'll install the listener fd and then free
the filter out from underneath it, causing a UAF when the task closes it or
dies. This patch also switches the condition to be simply if (ret), so that
if someone does add the flag mentioned above, they won't have to remember
to fix this too.
Signed-off-by: Tycho Andersen <tycho(a)tycho.ws>
Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
CC: stable(a)vger.kernel.org # v5.0+
---
kernel/seccomp.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d0d355ded2f4..79bada51091b 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -500,7 +500,10 @@ seccomp_prepare_user_filter(const char __user *user_filter)
*
* Caller must be holding current->sighand->siglock lock.
*
- * Returns 0 on success, -ve on error.
+ * Returns 0 on success, -ve on error, or
+ * - in TSYNC mode: the pid of a thread which was either not in the correct
+ * seccomp mode or did not have an ancestral seccomp filter
+ * - in NEW_LISTENER mode: the fd of the new listener
*/
static long seccomp_attach_filter(unsigned int flags,
struct seccomp_filter *filter)
@@ -1256,6 +1259,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
if (flags & ~SECCOMP_FILTER_FLAG_MASK)
return -EINVAL;
+ /*
+ * In the successful case, NEW_LISTENER returns the new listener fd.
+ * But in the failure case, TSYNC returns the thread that died. If you
+ * combine these two flags, there's no way to tell whether something
+ * succeeded or failed. So, let's disallow this combination.
+ */
+ if ((flags & SECCOMP_FILTER_FLAG_TSYNC) &&
+ (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER))
+ return -EINVAL;
+
/* Prepare the new filter before holding any locks. */
prepared = seccomp_prepare_user_filter(filter);
if (IS_ERR(prepared))
@@ -1302,7 +1315,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
mutex_unlock(&current->signal->cred_guard_mutex);
out_put_fd:
if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
- if (ret < 0) {
+ if (ret) {
listener_f->private_data = NULL;
fput(listener_f);
put_unused_fd(listener);
--
2.19.1
Lars Persson <lists(a)bofh.nu> reported that a label was unused in
the 4.14 version of this patchset, and the issue was present in
the 4.19 patchset as well, so I'm sending a v2 that fixes it.
The original 4.19 patchset queued for stable is OK, and
can be used as is, but this v2 is a bit better: it fixes the
unused label issue and handles overlapping fragments better.
Sorry for the mess/v2.
=======================
Currently, 4.19 and earlier stable kernels contain a security fix
that is not fully compliant with the IPv6 standard.
This patchset backports IPv6 defrag fixes from 5.1-rc that restore
standards compliance.
Original 5.1 patchset: https://patchwork.ozlabs.org/cover/1029418/
v2 changes: handle overlapping fragments the way it is done upstream
Peter Oskolkov (3):
net: IP defrag: encapsulate rbtree defrag code into callable functions
net: IP6 defrag: use rbtrees for IPv6 defrag
net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c
include/net/inet_frag.h | 16 +-
include/net/ipv6_frag.h | 11 +-
net/ipv4/inet_fragment.c | 293 +++++++++++++++++++++++
net/ipv4/ip_fragment.c | 302 +++---------------------
net/ipv6/netfilter/nf_conntrack_reasm.c | 260 ++++++--------------
net/ipv6/reassembly.c | 240 ++++++-------------
6 files changed, 488 insertions(+), 634 deletions(-)
--
2.21.0.593.g511ec345e18-goog
Hello,
We ran automated tests on a patchset that was proposed for merging into this
kernel tree. The patches were applied to:
Kernel repo: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Commit: e4abcebedac3 - Linux 5.0.9
The results of these automated tests are provided below.
Overall result: PASSED
Merge: OK
Compile: OK
Tests: OK
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
,-. ,-.
( C ) ( K ) Continuous
`-',-.`-' Kernel
( I ) Integration
`-'
______________________________________________________________________________
Merge testing
-------------
We cloned this repository and checked out a ref:
Repo: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Ref: e4abcebedac3 - Linux 5.0.9
We then merged the patchset with `git am`:
bonding-fix-event-handling-for-stacked-bonds.patch
failover-allow-name-change-on-iff_up-slave-interfaces.patch
net-atm-fix-potential-spectre-v1-vulnerabilities.patch
net-bridge-fix-per-port-af_packet-sockets.patch
net-bridge-multicast-use-rcu-to-access-port-list-from-br_multicast_start_querier.patch
net-fec-manage-ahb-clock-in-runtime-pm.patch
net-fix-missing-meta-data-in-skb-with-vlan-packet.patch
net-fou-do-not-use-guehdr-after-iptunnel_pull_offloads-in-gue_udp_recv.patch
tcp-tcp_grow_window-needs-to-respect-tcp_space.patch
team-set-slave-to-promisc-if-team-is-already-in-promisc-mode.patch
tipc-missing-entries-in-name-table-of-publications.patch
vhost-reject-zero-size-iova-range.patch
ipv4-recompile-ip-options-in-ipv4_link_failure.patch
ipv4-ensure-rcu_read_lock-in-ipv4_link_failure.patch
mlxsw-spectrum_switchdev-add-mdb-entries-in-prepare-phase.patch
mlxsw-core-do-not-use-wq_mem_reclaim-for-emad-workqueue.patch
mlxsw-core-do-not-use-wq_mem_reclaim-for-mlxsw-ordered-workqueue.patch
mlxsw-core-do-not-use-wq_mem_reclaim-for-mlxsw-workqueue.patch
mlxsw-spectrum_router-do-not-check-vrf-mac-address.patch
net-thunderx-raise-xdp-mtu-to-1508.patch
net-thunderx-don-t-allow-jumbo-frames-with-xdp.patch
net-tls-fix-the-iv-leaks.patch
net-tls-don-t-leak-partially-sent-record-in-device-mode.patch
net-strparser-partially-revert-strparser-call-skb_unclone-conditionally.patch
net-tls-fix-build-without-config_tls_device.patch
net-bridge-fix-netlink-export-of-vlan_stats_per_port-option.patch
net-mlx5e-xdp-avoid-checksum-complete-when-xdp-prog-is-loaded.patch
net-mlx5e-protect-against-non-uplink-representor-for-encap.patch
net-mlx5e-switch-to-toeplitz-rss-hash-by-default.patch
net-mlx5e-rx-fixup-skb-checksum-for-packets-with-tail-padding.patch
net-mlx5e-rx-check-ip-headers-sanity.patch
revert-net-mlx5e-enable-reporting-checksum-unnecessary-also-for-l3-packets.patch
net-mlx5-fpga-tls-hold-rcu-read-lock-a-bit-longer.patch
net-tls-prevent-bad-memory-access-in-tls_is_sk_tx_device_offloaded.patch
net-mlx5-fpga-tls-idr-remove-on-flow-delete.patch
route-avoid-crash-from-dereferencing-null-rt-from.patch
nfp-flower-replace-cfi-with-vlan-present.patch
nfp-flower-remove-vlan-cfi-bit-from-push-vlan-action.patch
sch_cake-use-tc_skb_protocol-helper-for-getting-packet-protocol.patch
sch_cake-make-sure-we-can-write-the-ip-header-before-changing-dscp-bits.patch
nfc-nci-add-some-bounds-checking-in-nci_hci_cmd_received.patch
nfc-nci-potential-off-by-one-in-pipes-array.patch
sch_cake-simplify-logic-in-cake_select_tin.patch
nfit-ars-remove-ars_start_flags.patch
nfit-ars-introduce-scrub_flags.patch
nfit-ars-allow-root-to-busy-poll-the-ars-state-machi.patch
nfit-ars-avoid-stale-ars-results.patch
tpm-tpm_i2c_atmel-return-e2big-when-the-transfer-is-.patch
tpm-fix-the-type-of-the-return-value-in-calc_tpm2_ev.patch
Compile testing
---------------
We compiled the kernel for 4 architectures:
aarch64:
build options: -j20 INSTALL_MOD_STRIP=1 targz-pkg
configuration: https://artifacts.cki-project.org/builds/aarch64/kernel-stable_queue-aarch6…
kernel build: https://artifacts.cki-project.org/builds/aarch64/kernel-stable_queue-aarch6…
ppc64le:
build options: -j20 INSTALL_MOD_STRIP=1 targz-pkg
configuration: https://artifacts.cki-project.org/builds/ppc64le/kernel-stable_queue-ppc64l…
kernel build: https://artifacts.cki-project.org/builds/ppc64le/kernel-stable_queue-ppc64l…
s390x:
build options: -j20 INSTALL_MOD_STRIP=1 targz-pkg
configuration: https://artifacts.cki-project.org/builds/s390x/kernel-stable_queue-s390x-e0…
kernel build: https://artifacts.cki-project.org/builds/s390x/kernel-stable_queue-s390x-e0…
x86_64:
build options: -j20 INSTALL_MOD_STRIP=1 targz-pkg
configuration: https://artifacts.cki-project.org/builds/x86_64/kernel-stable_queue-x86_64-…
kernel build: https://artifacts.cki-project.org/builds/x86_64/kernel-stable_queue-x86_64-…
Hardware testing
----------------
We booted each kernel and ran the following tests:
aarch64:
✅ Boot test [0]
✅ LTP lite [1]
✅ AMTU (Abstract Machine Test Utility) [2]
✅ httpd: mod_ssl smoke sanity [3]
✅ iotop: sanity [4]
✅ tuned: tune-processes-through-perf [5]
🚧 ✅ Networking route: pmtu [6]
🚧 ✅ audit: audit testsuite test [7]
🚧 ✅ httpd: php sanity [8]
🚧 ✅ stress: stress-ng [9]
ppc64le:
✅ Boot test [0]
✅ LTP lite [1]
✅ AMTU (Abstract Machine Test Utility) [2]
✅ httpd: mod_ssl smoke sanity [3]
✅ iotop: sanity [4]
✅ tuned: tune-processes-through-perf [5]
🚧 ✅ Networking route: pmtu [6]
🚧 ✅ audit: audit testsuite test [7]
🚧 ✅ httpd: php sanity [8]
🚧 ✅ selinux-policy: serge-testsuite [10]
🚧 ✅ stress: stress-ng [9]
s390x:
✅ Boot test [0]
✅ LTP lite [1]
✅ httpd: mod_ssl smoke sanity [3]
✅ iotop: sanity [4]
✅ tuned: tune-processes-through-perf [5]
🚧 ✅ Networking route: pmtu [6]
🚧 ✅ audit: audit testsuite test [7]
🚧 ✅ httpd: php sanity [8]
🚧 ✅ stress: stress-ng [9]
x86_64:
Test source:
[0]: https://github.com/CKI-project/tests-beaker/archive/master.zip#distribution…
[1]: https://github.com/CKI-project/tests-beaker/archive/master.zip#distribution…
[2]: https://github.com/CKI-project/tests-beaker/archive/master.zip#misc/amtu
[3]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/htt…
[4]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/iot…
[5]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/tun…
[6]: https://github.com/CKI-project/tests-beaker/archive/master.zip#/networking/…
[7]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/aud…
[8]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/htt…
[9]: https://github.com/CKI-project/tests-beaker/archive/master.zip#stress/stres…
[10]: https://github.com/CKI-project/tests-beaker/archive/master.zip#/packages/se…
Waived tests (marked with 🚧)
-----------------------------
This test run included waived tests. Such tests are executed but their results
are not taken into account. Tests are waived when their results are not
reliable enough, e.g. when they're just introduced or are being fixed.
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2e8e19226398db8265a8e675fcc0118b9e80c9e8 Mon Sep 17 00:00:00 2001
From: Phil Auld <pauld(a)redhat.com>
Date: Tue, 19 Mar 2019 09:00:05 -0400
Subject: [PATCH] sched/fair: Limit sched_cfs_period_timer() loop to avoid hard
lockup
With extremely short cfs_period_us setting on a parent task group with a large
number of children the for loop in sched_cfs_period_timer() can run until the
watchdog fires. There is no guarantee that the call to hrtimer_forward_now()
will ever return 0. The large number of children can make
do_sched_cfs_period_timer() take longer than the period.
NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
RIP: 0010:tg_nop+0x0/0x10
<IRQ>
walk_tg_tree_from+0x29/0xb0
unthrottle_cfs_rq+0xe0/0x1a0
distribute_cfs_runtime+0xd3/0xf0
sched_cfs_period_timer+0xcb/0x160
? sched_cfs_slack_timer+0xd0/0xd0
__hrtimer_run_queues+0xfb/0x270
hrtimer_interrupt+0x122/0x270
smp_apic_timer_interrupt+0x6a/0x140
apic_timer_interrupt+0xf/0x20
</IRQ>
To prevent this we add protection to the loop that detects when the loop has run
too many times and scales the period and quota up, proportionally, so that the timer
can complete before the next period expires. This preserves the relative runtime
quota while preventing the hard lockup.
A warning is issued reporting this state and the new values.
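As a rough illustration of the arithmetic in the hunk below (example values
only, written as plain userspace C rather than kernel code): both period and
quota grow by the same 147/128 factor (~14.8%), so the group's quota/period
CPU share is unchanged while each timer period gets longer.

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t old_period = 100000;	/* 100us in ns, made-up example */
		uint64_t quota      = 50000;	/*  50us in ns, made-up example */

		uint64_t new_period = (old_period * 147) / 128;	/* 114843 ns */

		/* same scaling the patch applies to cfs_b->quota */
		quota = (quota * new_period) / old_period;	/* 57421 ns */

		printf("period %llu -> %llu ns, quota -> %llu ns (share still ~0.5)\n",
		       (unsigned long long)old_period,
		       (unsigned long long)new_period,
		       (unsigned long long)quota);
		return 0;
	}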
Signed-off-by: Phil Auld <pauld(a)redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Cc: Anton Blanchard <anton(a)ozlabs.org>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40bd1e27b1b7..a4d9e14bf138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
+extern const u64 max_cfs_quota_period;
+
static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
struct cfs_bandwidth *cfs_b =
@@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
unsigned long flags;
int overrun;
int idle = 0;
+ int count = 0;
raw_spin_lock_irqsave(&cfs_b->lock, flags);
for (;;) {
@@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
if (!overrun)
break;
+ if (++count > 3) {
+ u64 new, old = ktime_to_ns(cfs_b->period);
+
+ new = (old * 147) / 128; /* ~115% */
+ new = min(new, max_cfs_quota_period);
+
+ cfs_b->period = ns_to_ktime(new);
+
+ /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
+ cfs_b->quota *= new;
+ cfs_b->quota = div64_u64(cfs_b->quota, old);
+
+ pr_warn_ratelimited(
+ "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
+ smp_processor_id(),
+ div_u64(new, NSEC_PER_USEC),
+ div_u64(cfs_b->quota, NSEC_PER_USEC));
+
+ /* reset count so we don't come right back in here */
+ count = 0;
+ }
+
idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
}
if (idle)
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2e8e19226398db8265a8e675fcc0118b9e80c9e8 Mon Sep 17 00:00:00 2001
From: Phil Auld <pauld(a)redhat.com>
Date: Tue, 19 Mar 2019 09:00:05 -0400
Subject: [PATCH] sched/fair: Limit sched_cfs_period_timer() loop to avoid hard
lockup
With extremely short cfs_period_us setting on a parent task group with a large
number of children the for loop in sched_cfs_period_timer() can run until the
watchdog fires. There is no guarantee that the call to hrtimer_forward_now()
will ever return 0. The large number of children can make
do_sched_cfs_period_timer() take longer than the period.
NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
RIP: 0010:tg_nop+0x0/0x10
<IRQ>
walk_tg_tree_from+0x29/0xb0
unthrottle_cfs_rq+0xe0/0x1a0
distribute_cfs_runtime+0xd3/0xf0
sched_cfs_period_timer+0xcb/0x160
? sched_cfs_slack_timer+0xd0/0xd0
__hrtimer_run_queues+0xfb/0x270
hrtimer_interrupt+0x122/0x270
smp_apic_timer_interrupt+0x6a/0x140
apic_timer_interrupt+0xf/0x20
</IRQ>
To prevent this we add protection to the loop that detects when the loop has run
too many times and scales the period and quota up, proportionally, so that the timer
can complete before the next period expires. This preserves the relative runtime
quota while preventing the hard lockup.
A warning is issued reporting this state and the new values.
Signed-off-by: Phil Auld <pauld(a)redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Cc: Anton Blanchard <anton(a)ozlabs.org>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40bd1e27b1b7..a4d9e14bf138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
+extern const u64 max_cfs_quota_period;
+
static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
struct cfs_bandwidth *cfs_b =
@@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
unsigned long flags;
int overrun;
int idle = 0;
+ int count = 0;
raw_spin_lock_irqsave(&cfs_b->lock, flags);
for (;;) {
@@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
if (!overrun)
break;
+ if (++count > 3) {
+ u64 new, old = ktime_to_ns(cfs_b->period);
+
+ new = (old * 147) / 128; /* ~115% */
+ new = min(new, max_cfs_quota_period);
+
+ cfs_b->period = ns_to_ktime(new);
+
+ /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
+ cfs_b->quota *= new;
+ cfs_b->quota = div64_u64(cfs_b->quota, old);
+
+ pr_warn_ratelimited(
+ "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
+ smp_processor_id(),
+ div_u64(new, NSEC_PER_USEC),
+ div_u64(cfs_b->quota, NSEC_PER_USEC));
+
+ /* reset count so we don't come right back in here */
+ count = 0;
+ }
+
idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
}
if (idle)
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2e8e19226398db8265a8e675fcc0118b9e80c9e8 Mon Sep 17 00:00:00 2001
From: Phil Auld <pauld(a)redhat.com>
Date: Tue, 19 Mar 2019 09:00:05 -0400
Subject: [PATCH] sched/fair: Limit sched_cfs_period_timer() loop to avoid hard
lockup
With extremely short cfs_period_us setting on a parent task group with a large
number of children the for loop in sched_cfs_period_timer() can run until the
watchdog fires. There is no guarantee that the call to hrtimer_forward_now()
will ever return 0. The large number of children can make
do_sched_cfs_period_timer() take longer than the period.
NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
RIP: 0010:tg_nop+0x0/0x10
<IRQ>
walk_tg_tree_from+0x29/0xb0
unthrottle_cfs_rq+0xe0/0x1a0
distribute_cfs_runtime+0xd3/0xf0
sched_cfs_period_timer+0xcb/0x160
? sched_cfs_slack_timer+0xd0/0xd0
__hrtimer_run_queues+0xfb/0x270
hrtimer_interrupt+0x122/0x270
smp_apic_timer_interrupt+0x6a/0x140
apic_timer_interrupt+0xf/0x20
</IRQ>
To prevent this we add protection to the loop that detects when the loop has run
too many times and scales the period and quota up, proportionally, so that the timer
can complete before the next period expires. This preserves the relative runtime
quota while preventing the hard lockup.
A warning is issued reporting this state and the new values.
Signed-off-by: Phil Auld <pauld(a)redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Cc: Anton Blanchard <anton(a)ozlabs.org>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40bd1e27b1b7..a4d9e14bf138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
+extern const u64 max_cfs_quota_period;
+
static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
struct cfs_bandwidth *cfs_b =
@@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
unsigned long flags;
int overrun;
int idle = 0;
+ int count = 0;
raw_spin_lock_irqsave(&cfs_b->lock, flags);
for (;;) {
@@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
if (!overrun)
break;
+ if (++count > 3) {
+ u64 new, old = ktime_to_ns(cfs_b->period);
+
+ new = (old * 147) / 128; /* ~115% */
+ new = min(new, max_cfs_quota_period);
+
+ cfs_b->period = ns_to_ktime(new);
+
+ /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
+ cfs_b->quota *= new;
+ cfs_b->quota = div64_u64(cfs_b->quota, old);
+
+ pr_warn_ratelimited(
+ "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
+ smp_processor_id(),
+ div_u64(new, NSEC_PER_USEC),
+ div_u64(cfs_b->quota, NSEC_PER_USEC));
+
+ /* reset count so we don't come right back in here */
+ count = 0;
+ }
+
idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
}
if (idle)
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2e8e19226398db8265a8e675fcc0118b9e80c9e8 Mon Sep 17 00:00:00 2001
From: Phil Auld <pauld(a)redhat.com>
Date: Tue, 19 Mar 2019 09:00:05 -0400
Subject: [PATCH] sched/fair: Limit sched_cfs_period_timer() loop to avoid hard
lockup
With extremely short cfs_period_us setting on a parent task group with a large
number of children the for loop in sched_cfs_period_timer() can run until the
watchdog fires. There is no guarantee that the call to hrtimer_forward_now()
will ever return 0. The large number of children can make
do_sched_cfs_period_timer() take longer than the period.
NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
RIP: 0010:tg_nop+0x0/0x10
<IRQ>
walk_tg_tree_from+0x29/0xb0
unthrottle_cfs_rq+0xe0/0x1a0
distribute_cfs_runtime+0xd3/0xf0
sched_cfs_period_timer+0xcb/0x160
? sched_cfs_slack_timer+0xd0/0xd0
__hrtimer_run_queues+0xfb/0x270
hrtimer_interrupt+0x122/0x270
smp_apic_timer_interrupt+0x6a/0x140
apic_timer_interrupt+0xf/0x20
</IRQ>
To prevent this we add protection to the loop that detects when the loop has run
too many times and scales the period and quota up, proportionally, so that the timer
can complete before the next period expires. This preserves the relative runtime
quota while preventing the hard lockup.
A warning is issued reporting this state and the new values.
Signed-off-by: Phil Auld <pauld(a)redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Cc: Anton Blanchard <anton(a)ozlabs.org>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40bd1e27b1b7..a4d9e14bf138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
+extern const u64 max_cfs_quota_period;
+
static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
struct cfs_bandwidth *cfs_b =
@@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
unsigned long flags;
int overrun;
int idle = 0;
+ int count = 0;
raw_spin_lock_irqsave(&cfs_b->lock, flags);
for (;;) {
@@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
if (!overrun)
break;
+ if (++count > 3) {
+ u64 new, old = ktime_to_ns(cfs_b->period);
+
+ new = (old * 147) / 128; /* ~115% */
+ new = min(new, max_cfs_quota_period);
+
+ cfs_b->period = ns_to_ktime(new);
+
+ /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
+ cfs_b->quota *= new;
+ cfs_b->quota = div64_u64(cfs_b->quota, old);
+
+ pr_warn_ratelimited(
+ "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
+ smp_processor_id(),
+ div_u64(new, NSEC_PER_USEC),
+ div_u64(cfs_b->quota, NSEC_PER_USEC));
+
+ /* reset count so we don't come right back in here */
+ count = 0;
+ }
+
idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
}
if (idle)
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2f5fb19341883bb6e37da351bc3700489d8506a7 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Sun, 14 Apr 2019 19:51:06 +0200
Subject: [PATCH] x86/speculation: Prevent deadlock on ssb_state::lock
Mikhail reported a lockdep splat related to the AMD specific ssb_state
lock:
  CPU0                    CPU1
  lock(&st->lock);
                          local_irq_disable();
                          lock(&(&sighand->siglock)->rlock);
                          lock(&st->lock);
  <Interrupt>
    lock(&(&sighand->siglock)->rlock);

   *** DEADLOCK ***
The connection between sighand->siglock and st->lock comes through seccomp,
which takes st->lock while holding sighand->siglock.
Make sure interrupts are disabled when __speculation_ctrl_update() is
invoked via prctl() -> speculation_ctrl_update(). Add a lockdep assert to
catch future offenders.
Fixes: 1f50ddb4f418 ("x86/speculation: Handle HT correctly on AMD")
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov(a)gmail.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov(a)gmail.com>
Cc: Thomas Lendacky <thomas.lendacky(a)amd.com>
Cc: stable(a)vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1904141948200.4917@nanos.tec.linu…
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 58ac7be52c7a..957eae13b370 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -426,6 +426,8 @@ static __always_inline void __speculation_ctrl_update(unsigned long tifp,
u64 msr = x86_spec_ctrl_base;
bool updmsr = false;
+ lockdep_assert_irqs_disabled();
+
/*
* If TIF_SSBD is different, select the proper mitigation
* method. Note that if SSBD mitigation is disabled or permanentely
@@ -477,10 +479,12 @@ static unsigned long speculation_ctrl_update_tif(struct task_struct *tsk)
void speculation_ctrl_update(unsigned long tif)
{
+ unsigned long flags;
+
/* Forced update. Make sure all relevant TIF flags are different */
- preempt_disable();
+ local_irq_save(flags);
__speculation_ctrl_update(~tif, tif);
- preempt_enable();
+ local_irq_restore(flags);
}
/* Called from seccomp/prctl update */