From: Michal Hocko <mhocko(a)suse.com>
[ Upstream commit f01f17d3705bb6081c9e5728078f64067982be36 ]
Mike has reported a considerable overhead of refresh_cpu_vm_stats from
the idle entry during pipe test:
12.89% [kernel] [k] refresh_cpu_vm_stats.isra.12
4.75% [kernel] [k] __schedule
4.70% [kernel] [k] mutex_unlock
3.14% [kernel] [k] __switch_to
This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater
deferrable again and shut down on idle") which has placed quiet_vmstat
into cpu_idle_loop. The main reason here seems to be that the idle
entry has to iterate over all zones and perform atomic operations for
each vmstat entry even though there might be no per-cpu diffs. This is
pointless overhead for _each_ idle entry.
Make sure that quiet_vmstat is as light as possible.
First of all, it doesn't make any sense to do any local sync if the
current cpu is already set in cpu_stat_off, because vmstat_update puts
itself there only when there is nothing to do.
Then we can check need_update which should be a cheap way to check for
potential per-cpu diffs and only then do refresh_cpu_vm_stats.
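For reference, need_update() is not shown in full in the diff below; it
is cheap because the per-cpu diffs are byte-sized items that can be
scanned with memchr_inv(). In the v4.4-era mm/vmstat.c it looks roughly
like this (a sketch from memory, not part of this patch):

static bool need_update(int cpu)
{
	struct zone *zone;

	for_each_populated_zone(zone) {
		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);

		BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
		/*
		 * The fast way of checking if there are any vmstat
		 * diffs. This works because the diffs are byte-sized
		 * items.
		 */
		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
			return true;
	}
	return false;
}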
The original patch also did cancel_delayed_work, which we are not doing
here, for two reasons. Firstly, cancel_delayed_work from idle context
will blow up on RT kernels, because try_to_grab_pending() takes a
spinlock which becomes a sleeping lock on RT and so must not be taken
from the idle loop (reported by Mike):
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7
Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
Call Trace:
dump_stack+0x49/0x67
___might_sleep+0xf5/0x180
rt_spin_lock+0x20/0x50
try_to_grab_pending+0x69/0x240
cancel_delayed_work+0x26/0xe0
quiet_vmstat+0x75/0xa0
cpu_idle_loop+0x38/0x3e0
cpu_startup_entry+0x13/0x20
start_secondary+0x114/0x140
And secondly, even on !RT kernels it might add some non-trivial overhead
which is not necessary. Even if the vmstat worker wakes up and preempts
idle, it will most likely be a single-shot noop because the stats were
already synced, and so it would end up in cpu_stat_off anyway.
We just need to teach both vmstat_shepherd and vmstat_update to stop
scheduling the worker if there is nothing to do.
[mgalbraith(a)suse.de: cancel pending work of the cpu_stat_off CPU]
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Mike Galbraith <umgwanakikbuti(a)gmail.com>
Acked-by: Christoph Lameter <cl(a)linux.com>
Signed-off-by: Mike Galbraith <mgalbraith(a)suse.de>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Daniel Wagner <wagi(a)monom.org>
---
Hi Greg,
Upstream commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable
again and shut down on idle") was backported in v4.4.178
(bdf3c006b9a2). For -rt we definitely need the bugfix f01f17d3705b
("mm, vmstat: make quiet_vmstat lighter") as well.
Since the offending patch was backported to v4.4 stable only, the
other stable branches don't need an update (offending patch and bug
fix are already in).
Could you please queue the above patch for v4.4.y?
Thanks,
Daniel
mm/vmstat.c | 68 ++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 46 insertions(+), 22 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d0ff5de06bfa..929f1fc3a4b1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1406,10 +1406,15 @@ static void vmstat_update(struct work_struct *w)
* Counters were updated so we expect more updates
* to occur in the future. Keep on running the
* update worker thread.
+ * If we were marked on cpu_stat_off clear the flag
+ * so that vmstat_shepherd doesn't schedule us again.
*/
- queue_delayed_work_on(smp_processor_id(), vmstat_wq,
- this_cpu_ptr(&vmstat_work),
- round_jiffies_relative(sysctl_stat_interval));
+ if (!cpumask_test_and_clear_cpu(smp_processor_id(),
+ cpu_stat_off)) {
+ queue_delayed_work_on(smp_processor_id(), vmstat_wq,
+ this_cpu_ptr(&vmstat_work),
+ round_jiffies_relative(sysctl_stat_interval));
+ }
} else {
/*
* We did not update any counters so the app may be in
@@ -1437,18 +1442,6 @@ static void vmstat_update(struct work_struct *w)
* until the diffs stay at zero. The function is used by NOHZ and can only be
* invoked when tick processing is not active.
*/
-void quiet_vmstat(void)
-{
- if (system_state != SYSTEM_RUNNING)
- return;
-
- do {
- if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
- cancel_delayed_work(this_cpu_ptr(&vmstat_work));
-
- } while (refresh_cpu_vm_stats(false));
-}
-
/*
* Check if the diffs for a certain cpu indicate that
* an update is needed.
@@ -1472,6 +1465,30 @@ static bool need_update(int cpu)
return false;
}
+void quiet_vmstat(void)
+{
+ if (system_state != SYSTEM_RUNNING)
+ return;
+
+ /*
+ * If we are already in hands of the shepherd then there
+ * is nothing for us to do here.
+ */
+ if (cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+ return;
+
+ if (!need_update(smp_processor_id()))
+ return;
+
+ /*
+ * Just refresh counters and do not care about the pending delayed
+ * vmstat_update. It doesn't fire that often to matter and canceling
+ * it would be too expensive from this path.
+ * vmstat_shepherd will take care about that for us.
+ */
+ refresh_cpu_vm_stats(false);
+}
+
/*
* Shepherd worker thread that checks the
@@ -1489,18 +1506,25 @@ static void vmstat_shepherd(struct work_struct *w)
get_online_cpus();
/* Check processors whose vmstat worker threads have been disabled */
- for_each_cpu(cpu, cpu_stat_off)
- if (need_update(cpu) &&
- cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
-
- queue_delayed_work_on(cpu, vmstat_wq,
- &per_cpu(vmstat_work, cpu), 0);
+ for_each_cpu(cpu, cpu_stat_off) {
+ struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
+ if (need_update(cpu)) {
+ if (cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
+ queue_delayed_work_on(cpu, vmstat_wq, dw, 0);
+ } else {
+ /*
+ * Cancel the work if quiet_vmstat has put this
+ * cpu on cpu_stat_off because the work item might
+ * be still scheduled
+ */
+ cancel_delayed_work(dw);
+ }
+ }
put_online_cpus();
schedule_delayed_work(&shepherd,
round_jiffies_relative(sysctl_stat_interval));
-
}
static void __init start_shepherd_timer(void)
--
2.20.1
On Thu, May 2, 2019 at 9:26 AM Sasha Levin <sashal(a)kernel.org> wrote:
>
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 1e434b703248 ARM: imx: update the cpu power up timing setting on i.mx6sx.
>
> The bot has tested the following trees: v5.0.10, v4.19.37, v4.14.114, v4.9.171, v4.4.179.
>
> v5.0.10: Build OK!
> v4.19.37: Build OK!
> v4.14.114: Build OK!
> v4.9.171: Build OK!
> v4.4.179: Failed to apply! Possible dependencies:
> 6ae44aa651d0 ("ARM: imx: enable WAIT mode hardware workaround for imx6sx")
>
>
> How should we proceed with this patch?
I can submit a version for 4.4 stable tree once this hits mainline.
The conflict resolution is very simple.
"To track whether a request has started on HW, we can emit a breadcrumb at
the beginning of the request and check its timeline's HWSP to see if the
breadcrumb has advanced past the start of this request." It means all the
request which timeline's has_init_breadcrumb is true, then the
emit_init_breadcrumb process must have before emitting the real commands,
otherwise, the scheduler might get a wrong state of this request during
reset. If the request is exactly the guilty one, the scheduler won't
terminate it with the wrong state. To avoid this, do emit_init_breadcrumb
for all the requests from gvt.
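For context, the start-detection check described above boils down to a
HWSP seqno comparison; a minimal sketch, using the hwsp_seqno() and
i915_seqno_passed() helpers of that era (the wrapper name below is made
up for illustration; the real code lives in i915_request.h):

/*
 * Illustrative sketch only: a request counts as started once the
 * breadcrumb emitted at its head has advanced the timeline's HWSP
 * past the end of the previous request (seqno - 1). If the init
 * breadcrumb is never emitted although has_init_breadcrumb is set,
 * this keeps reading "not started" even while the request is
 * executing on HW, so reset handling misjudges the request.
 */
static bool request_has_started(const struct i915_request *rq)
{
	return i915_seqno_passed(hwsp_seqno(rq), rq->fence.seqno - 1);
}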
v2: cc to stable kernel
Fixes: 8547444137ec ("drm/i915: Identify active requests")
Cc: stable(a)vger.kernel.org
Signed-off-by: Weinan <weinan.z.li(a)intel.com>
---
drivers/gpu/drm/i915/gvt/scheduler.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/gpu/drm/i915/gvt/scheduler.c b/drivers/gpu/drm/i915/gvt/scheduler.c
index 7c99bbc3e2b8..ccd71152c9bc 100644
--- a/drivers/gpu/drm/i915/gvt/scheduler.c
+++ b/drivers/gpu/drm/i915/gvt/scheduler.c
@@ -298,12 +298,31 @@ static int copy_workload_to_ring_buffer(struct intel_vgpu_workload *workload)
struct i915_request *req = workload->req;
void *shadow_ring_buffer_va;
u32 *cs;
+ int err;
if ((IS_KABYLAKE(req->i915) || IS_BROXTON(req->i915)
|| IS_COFFEELAKE(req->i915))
&& is_inhibit_context(req->hw_context))
intel_vgpu_restore_inhibit_context(vgpu, req);
+	/*
+	 * To track whether a request has started on HW, we can emit a
+	 * breadcrumb at the beginning of the request and check its
+	 * timeline's HWSP to see if the breadcrumb has advanced past the
+	 * start of this request. Actually, a request must have the init
+	 * breadcrumb if its timeline has has_init_breadcrumb set, or the
+	 * scheduler might get a wrong state of it during reset. Since the
+	 * requests from gvt always set the has_init_breadcrumb flag, we
+	 * need to do emit_init_breadcrumb for all the requests here.
+	 */
+ if (req->engine->emit_init_breadcrumb) {
+ err = req->engine->emit_init_breadcrumb(req);
+ if (err) {
+ gvt_vgpu_err("fail to emit init breadcrumb\n");
+ return err;
+ }
+ }
+
/* allocate shadow ring buffer */
cs = intel_ring_begin(workload->req, workload->rb_len / sizeof(u32));
if (IS_ERR(cs)) {
--
2.17.1
I'm announcing the release of the 3.16.67 kernel.
All users of the 3.16 kernel series should upgrade.
The updated 3.16.y git tree can be found at:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-3.16.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git
The diff from 3.16.66 is attached to this message.
Ben.
------------
Makefile | 2 +-
arch/x86/kvm/vmx.c | 4 +++-
drivers/net/vxlan.c | 2 +-
drivers/net/wireless/brcm80211/brcmfmac/wl_cfg80211.c | 16 +++++++++++-----
drivers/spi/spi-omap-100k.c | 4 ----
include/net/ip_fib.h | 2 +-
kernel/fork.c | 15 ++++++++++++---
kernel/time/timer_stats.c | 2 +-
mm/percpu.c | 8 ++++----
net/ipv4/fib_semantics.c | 8 +++++---
net/ipv4/route.c | 10 ++++++----
net/ipv6/ip6_output.c | 3 +++
12 files changed, 48 insertions(+), 28 deletions(-)
Amit Klein (1):
inet: update the IP ID generation algorithm to higher standards.
Arend Van Spriel (1):
brcmfmac: add length checks in scheduled scan result handler
Ben Hutchings (4):
Revert "brcmfmac: assure SSID length from firmware is limited"
vxlan: Fix big-endian declaration of VNI
timer/debug: Change /proc/timer_stats from 0644 to 0600
Linux 3.16.67
David Herrmann (1):
fork: record start_time late
Eric Dumazet (1):
ipv4: fix a race in update_or_create_fnhe()
Joerg Roedel (1):
KVM: VMX: Fix x2apic check in vmx_msr_bitmap_mode()
Matteo Croce (1):
percpu: stop printing kernel addresses
Nick Krause (1):
spi: omap-100k: Remove unused definitions
The patch titled
Subject: fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
has been added to the -mm tree. Its filename is
fs-writeback-use-rcu_barrier-to-wait-for-inflight-wb-switches-going-into-workqueue-when-umount.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/fs-writeback-use-rcu_barrier-to-wa…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/fs-writeback-use-rcu_barrier-to-wa…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Jiufei Xue <jiufei.xue(a)linux.alibaba.com>
Subject: fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
synchronize_rcu() doesn't wait for call_rcu() callbacks, so an inode wb
switch may not have reached the workqueue after synchronize_rcu()
returns. Thus previously scheduled switches were not finished even after
flushing the workqueue, which causes the NULL pointer dereference shown
below.
VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds. Have a nice day...
BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
[<ffffffff8126a303>] evict+0xb3/0x180
[<ffffffff8126a760>] iput+0x1b0/0x230
[<ffffffff8127c690>] inode_switch_wbs_work_fn+0x3c0/0x6a0
[<ffffffff810a5b2e>] worker_thread+0x4e/0x490
[<ffffffff810a5ae0>] ? process_one_work+0x410/0x410
[<ffffffff810ac056>] kthread+0xe6/0x100
[<ffffffff8173c199>] ret_from_fork+0x39/0x50
Replace the synchronize_rcu() call with rcu_barrier() to wait for all
pending callbacks to finish. Also move the isw_nr_in_flight increment
after the call_rcu() in inode_switch_wbs(), where it makes more sense.
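To illustrate the distinction (a schematic sketch, not part of the
patch, reusing the isw/isw_wq names from fs-writeback.c):
synchronize_rcu() only waits for a grace period to elapse, whereas
rcu_barrier() also waits until every previously queued call_rcu()
callback has actually been invoked:

	/* queue a callback that will later schedule the switch work */
	call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn);

	synchronize_rcu();	/* grace period has elapsed, but the
				 * callback above may still be pending
				 * in RCU's callback list */

	rcu_barrier();		/* returns only after all previously
				 * queued call_rcu() callbacks have run,
				 * so the work item is now on isw_wq */

	flush_workqueue(isw_wq);	/* ... and this really waits for it */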
Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.com
Signed-off-by: Jiufei Xue <jiufei.xue(a)linux.alibaba.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
Suggested-by: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/fs-writeback.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
--- a/fs/fs-writeback.c~fs-writeback-use-rcu_barrier-to-wait-for-inflight-wb-switches-going-into-workqueue-when-umount
+++ a/fs/fs-writeback.c
@@ -523,8 +523,6 @@ static void inode_switch_wbs(struct inod
isw->inode = inode;
- atomic_inc(&isw_nr_in_flight);
-
/*
* In addition to synchronizing among switchers, I_WB_SWITCH tells
* the RCU protected stat update paths to grab the i_page
@@ -532,6 +530,9 @@ static void inode_switch_wbs(struct inod
* Let's continue after I_WB_SWITCH is guaranteed to be visible.
*/
call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn);
+
+ atomic_inc(&isw_nr_in_flight);
+
goto out_unlock;
out_free:
@@ -901,7 +902,11 @@ restart:
void cgroup_writeback_umount(void)
{
if (atomic_read(&isw_nr_in_flight)) {
- synchronize_rcu();
+ /*
+ * Use rcu_barrier() to wait for all pending callbacks to
+ * ensure that all in-flight wb switches are in the workqueue.
+ */
+ rcu_barrier();
flush_workqueue(isw_wq);
}
}
_
Patches currently in -mm which might be from jiufei.xue(a)linux.alibaba.com are
fs-writeback-use-rcu_barrier-to-wait-for-inflight-wb-switches-going-into-workqueue-when-umount.patch