[ACTIVITY] (Linus Walleij) 2012-09-15 - 2012-09-23 - linaro-kernel

Linus Walleij

24 Sep 2012

7:37 a.m.

== Linus Walleij linusw ==

=== Highlights ===

* Drilled into the RCU lockup. With some help from Paul McKenney we got a task dump from CPU0 (the CPU that is locking) but not much more - only the swapper thread is running there and we need to dump its stack.

* Puffed and pinged on the SPARSE_IRQ patch, yet no reaction so far.

* Mainlining some stuff in the PL011 driver that has been sitting internally for some time. Easy stuff went in first and then we ran into the trickier patches - some odd baudrates are impossible since the TTY layer will "snap" the baudrate to a standard baudrate from a table if it's "close enough". We used to work around this by not calling out to the TTY layer in the PL011 driver for certain baudrates, but this is duct-taping. Still debating the real solution with Alan Cox and Russell King who help out a lot.

* Reviewed and merged various GPIO patches.

* Reviewed and merged various pinctrl patches.

* Reviewed some ux500 patches coming from Lee.

* Reviewed Arnd's iomem cleanups.

* Sent out Patrice Chotard's pinctrl patch for the PL022 SPI driver, Mark has merged it.
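
To illustrate the baudrate "snapping" described in the PL011 item above, here is a minimal sketch in C. It is not the TTY layer's actual code: the table `std_bauds`, the function name `snap_baud`, and the 2% closeness window are all assumptions chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, shortened table of standard rates (assumption). */
static const unsigned int std_bauds[] = {
    9600, 19200, 38400, 57600, 115200, 230400, 460800, 921600
};

/*
 * Return the standard rate that `requested` snaps to, or `requested`
 * itself when no table entry is close enough. The window of 2% of the
 * requested rate is an assumption made for this sketch.
 */
unsigned int snap_baud(unsigned int requested)
{
    unsigned int window;
    size_t i;

    assert(requested != 0); /* a zero rate is out of scope for this sketch */
    window = requested / 50; /* 2% of the requested rate */

    for (i = 0; i < sizeof(std_bauds) / sizeof(std_bauds[0]); i++) {
        unsigned int std = std_bauds[i];
        unsigned int diff = requested > std ? requested - std
                                            : std - requested;

        if (diff <= window)
            return std; /* "close enough": snap to the table entry */
    }
    return requested; /* far from every entry: keep the odd rate */
}
```

Under these assumptions a request for 116000 baud comes back as 115200, which is the kind of case where an odd rate cannot be requested exactly, while 250000 is far enough from every table entry to pass through unchanged.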

=== Plans ===

* Ongoing work on sparse IRQ for Nomadik and Ux500.

* Test the PL08x patches on the Ericsson Research PB11MPCore and submit platform data for using pl08x DMA on that platform.

* Look into other Ux500 stuff in need of mainlining... using an internal tracking sheet for this.

* Look into regmap. Try something out, get to know it.

=== Issues ===

* N/A

Thanks, Linus Walleij

Jon Medhurst (Tixy)

24 Sep

9:08 a.m.

On Mon, 2012-09-24 at 09:37 +0200, Linus Walleij wrote:

...

Drilled into the RCU lockup. With some help from Paul McKenney we got a task dump from CPU0 (the CPU that is locking) but not much more - only the swapper thread is running there and we need to dump its stack.

Quite probably unrelated, but I've seen an RCU lockup on vexpress TC2... https://bugs.launchpad.net/linaro-landing-team-arm/+bug/1051993

That was a kernel built from linux-linaro branch at this tag http://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git%3Ba=shortlog...

This is 3.6-rc5-ish and includes Linaro's big.LITTLE MP branch http://git.linaro.org/gitweb?p=arm/big.LITTLE/mp.git%3Ba=shortlog%3Bh=refs/h... which has a couple of RCU patches from Paul, amongst other things.

-- Tixy

Linus Walleij

25 Sep

8:14 a.m.

On Mon, Sep 24, 2012 at 11:08 AM, Jon Medhurst (Tixy) tixy@linaro.org wrote:

...

On Mon, 2012-09-24 at 09:37 +0200, Linus Walleij wrote:

...

Drilled into the RCU lockup. With some help from Paul McKenney we got a task dump from CPU0 (the CPU that is locking) but not much more - only the swapper thread is running there and we need to dump its stack.

Quite probably unrelated, but I've seen an RCU lockup on vexpress TC2... https://bugs.launchpad.net/linaro-landing-team-arm/+bug/1051993

Actually it looks related.

Both lockups are triggered by cpuidle and I also have an indication from John Stultz who has seen the same lockup (not clear which system). This is how it looks to me: http://marc.info/?l=linux-kernel&m=134753979619804&w=2

It appears the rate of triggering is system-dependent; possibly this even affects x86 and they just haven't seen it because of ... something. So Vexpress seems to see even more of it.

Pawel have you seen the RCU lockups on the v3.6 kernels on your boards?

Yours, Linus Walleij

Pawel Moll

10:54 a.m.

On Tue, 2012-09-25 at 09:14 +0100, Linus Walleij wrote:

...

On Mon, Sep 24, 2012 at 11:08 AM, Jon Medhurst (Tixy) tixy@linaro.org wrote:

...
On Mon, 2012-09-24 at 09:37 +0200, Linus Walleij wrote:

...

Drilled into the RCU lockup. With some help from Paul McKenney we got a task dump from CPU0 (the CPU that is locking) but not much more - only the swapper thread is running there and we need to dump its stack.

Quite probably unrelated, but I've seen an RCU lockup on vexpress TC2... https://bugs.launchpad.net/linaro-landing-team-arm/+bug/1051993

Actually it looks related.

Both lockups are triggered by cpuidle and I also have an indication from John Stultz who has seen the same lockup (not clear which system). This is how it looks to me: http://marc.info/?l=linux-kernel&m=134753979619804&w=2

It appears the rate of triggering is system-dependent; possibly this even affects x86 and they just haven't seen it because of ... something. So Vexpress seems to see even more of it.

Pawel have you seen the RCU lockups on the v3.6 kernels on your boards?

No, not RCU lockups, however some of my colleagues had some problems originating at twd_handler, similarly to your callstack. Apparently reverting commit 1e75fa8be9fb61e1af46b5b3b176347a4c958ca1 "time: Condense timekeeper.xtime into xtime_sec" helped.

Paweł

John Stultz

4:41 p.m.

On 09/25/2012 03:54 AM, Pawel Moll wrote:

...

No, not RCU lockups, however some of my colleagues had some problems originating at twd_handler, similarly to your callstack. Apparently reverting commit 1e75fa8be9fb61e1af46b5b3b176347a4c958ca1 "time: Condense timekeeper.xtime into xtime_sec" helped.

I've had a fix that was pending in tip/timers/urgent and just landed upstream on Friday. Do let me know if you continue to see this issue with -rc7.

thanks -john

Linus Walleij

8:16 p.m.

On Tue, Sep 25, 2012 at 6:41 PM, John Stultz john.stultz@linaro.org wrote:

...

I've had a fix that was pending in tip/timers/urgent and just landed upstream on Friday. Do let me know if you continue to see this issue with -rc7.

I've reproduced the bug on v3.6-rc7, and actually David Lezcano told me to test that patch already, so sorry...

Testing Paul's patch next.

Thanks for all help! Linus Walleij

John Stultz

8:28 p.m.

On 09/25/2012 01:16 PM, Linus Walleij wrote:

...

On Tue, Sep 25, 2012 at 6:41 PM, John Stultz john.stultz@linaro.org wrote:

...
I've had a fix that was pending in tip/timers/urgent and just landed upstream on Friday. Do let me know if you continue to see this issue with -rc7.

I've reproduced the bug on v3.6-rc7, and actually David Lezcano told me to test that patch already, so sorry...

Testing Paul's patch next.

Thanks for all help!

Ok, but it sounds like your issue and Pawel's are different, no?

Or are you seeing 1e75fa8be9fb61e1af46b5b3b176347a4c958ca1 "time: Condense timekeeper.xtime into xtime_sec" as the culprit too?

Either way, I'd be interested in hearing if Pawel continues to see issues w/ 3.6-rc7+ kernels.

thanks -john

Linus Walleij

8:37 p.m.

On Tue, Sep 25, 2012 at 10:28 PM, John Stultz john.stultz@linaro.org wrote:

...

Ok, but it sounds like your issue and Pawel's are different, no?

Probably... I got it wrong.

...

Or are you seeing 1e75fa8be9fb61e1af46b5b3b176347a4c958ca1 "time: Condense timekeeper.xtime into xtime_sec" as the culprit too?

Actually I reverted three totally different patches and they *did* seem to actually fix it (or maybe just make it all the less common):

commit 8db25e7891a47e03db6f04344a9c92be16e391bb workqueue: simplify CPU hotplug code
commit 628c78e7ea19d5b70d2b6a59030362168cdbe1ad workqueue: remove CPU offline trustee
commit 3ce63377305b694f53e7dd0c72907591c5344224 workqueue: don't butcher idle workers on an offline CPU

However: as of writing, I haven't had a lockup since I booted with Paul's patch, so I'm hoping he found the silver bullet!

Linus

Paul E. McKenney

11:03 p.m.

On Tue, Sep 25, 2012 at 10:37:42PM +0200, Linus Walleij wrote:

...

On Tue, Sep 25, 2012 at 10:28 PM, John Stultz john.stultz@linaro.org wrote:

...
Ok, but it sounds like your issue and Pawel's are different, no?

Probably... I got it wrong.

...
Or are you seeing 1e75fa8be9fb61e1af46b5b3b176347a4c958ca1 "time: Condense timekeeper.xtime into xtime_sec" as the culprit too?

Actually I reverted three totally different patches and they *did* seem to actually fix it (or maybe just make it all the less common):

commit 8db25e7891a47e03db6f04344a9c92be16e391bb workqueue: simplify CPU hotplug code
commit 628c78e7ea19d5b70d2b6a59030362168cdbe1ad workqueue: remove CPU offline trustee
commit 3ce63377305b694f53e7dd0c72907591c5344224 workqueue: don't butcher idle workers on an offline CPU

However: as of writing, I haven't had a lockup since I booted with Paul's patch, so I'm hoping he found the silver bullet!

Keeping fingers firmly crossed...

Thanx, Paul

Pawel Moll

26 Sep

12:15 p.m.

On Tue, 2012-09-25 at 21:28 +0100, John Stultz wrote:

...

Or are you seeing 1e75fa8be9fb61e1af46b5b3b176347a4c958ca1 "time: Condense timekeeper.xtime into xtime_sec" as the culprit too?

Either way, I'd be interested in hearing if Pawel continues to see issues w/ 3.6-rc7+ kernels.

It seems to be fixed indeed (at least that's what I'm told :-)

Thanks!

Paweł

Paul E. McKenney

25 Sep

12:52 p.m.

On Tue, Sep 25, 2012 at 10:14:36AM +0200, Linus Walleij wrote:

...

On Mon, Sep 24, 2012 at 11:08 AM, Jon Medhurst (Tixy) tixy@linaro.org wrote:

...
On Mon, 2012-09-24 at 09:37 +0200, Linus Walleij wrote:

...

Drilled into the RCU lockup. With some help from Paul McKenney we got a task dump from CPU0 (the CPU that is locking) but not much more - only the swapper thread is running there and we need to dump its stack.

Quite probably unrelated, but I've seen an RCU lockup on vexpress TC2... https://bugs.launchpad.net/linaro-landing-team-arm/+bug/1051993

Actually it looks related.

Both lockups are triggered by cpuidle and I also have an indication from John Stultz who has seen the same lockup (not clear which system). This is how it looks to me: http://marc.info/?l=linux-kernel&m=134753979619804&w=2

It appears the rate of triggering is system-dependent; possibly this even affects x86 and they just haven't seen it because of ... something. So Vexpress seems to see even more of it.

Pawel have you seen the RCU lockups on the v3.6 kernels on your boards?

Does the following patch help?

Thanx, Paul

------------------------------------------------------------------------

rcu: Fix day-one dyntick-idle stall-warning bug

Each grace period is supposed to have at least one callback waiting for that grace period to complete. However, if CONFIG_NO_HZ=n, an extra callback-free grace period is no big problem -- it will chew up a tiny bit of CPU time, but it will complete normally. In contrast, CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to sleep indefinitely, in turn indefinitely delaying completion of the callback-free grace period. Given that nothing is waiting on this grace period, this is also not a problem.

That is, unless RCU CPU stall warnings are also enabled, as they are in recent kernels. In this case, if a CPU wakes up after at least one minute of inactivity, an RCU CPU stall warning will result. The reason that no one noticed until quite recently is that most systems have enough OS noise that they will never remain absolutely idle for a full minute. But there are some embedded systems with cut-down userspace configurations that consistently get into this situation.

All this begs the question of exactly how a callback-free grace period gets started in the first place. This can happen due to the fact that CPUs do not necessarily agree on which grace period is in progress. If a CPU still believes that the grace period that just completed is still ongoing, it will believe that it has callbacks that need to wait for another grace period, never mind the fact that the grace period that they were waiting for just completed. This CPU can therefore erroneously decide to start a new grace period. Note that this can happen in TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system: Deadlock considerations mean that the CPU that detected the end of the grace period is not necessarily officially informed of this fact for some time.

Once this CPU notices that the earlier grace period completed, it will invoke its callbacks. It then won't have any callbacks left. If no other CPU has any callbacks, we now have a callback-free grace period.

This commit therefore makes CPUs check more carefully before starting a new grace period. This new check relies on an array of tail pointers into each CPU's list of callbacks. If the CPU is up to date on which grace periods have completed, it checks to see if any callbacks follow the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks follow the RCU_WAIT_TAIL segment. The reason that this works is that the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment as soon as the CPU is officially notified that the old grace period has ended.

This change is to cpu_needs_another_gp(), which is called in a number of places. The only one that really matters is in rcu_start_gp(), where the root rcu_node structure's ->lock is held, which prevents any other CPU from starting or completing a grace period, so that the comparison that determines whether the CPU is missing the completion of a grace period is stable.

Reported-by: Becky Bruce bgillbruce@gmail.com
Reported-by: Subodh Nijsure snijsure@grid-net.com
Reported-by: Paul Walmsley paul@pwsan.com
Signed-off-by: Paul E. McKenney paul.mckenney@linaro.org
Signed-off-by: Paul E. McKenney paulmck@linux.vnet.ibm.com
Tested-by: Paul Walmsley paul@pwsan.com # OMAP3730, OMAP4430
Cc: stable@vger.kernel.org

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f280e54..f7bcd9e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -305,7 +305,9 @@ cpu_has_callbacks_ready_to_invoke(struct rcu_data *rdp)
 static int
 cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
 {
-	return *rdp->nxttail[RCU_DONE_TAIL] && !rcu_gp_in_progress(rsp);
+	return *rdp->nxttail[RCU_DONE_TAIL +
+			     ACCESS_ONCE(rsp->completed) != rdp->completed] &&
+	       !rcu_gp_in_progress(rsp);
 }
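
The check described in the commit message can be modelled outside the kernel. The following is a minimal user-space sketch in C, not the kernel code: the types `struct cb`, `struct cpu_data` and `struct rcu_global`, and the helpers `setup_stale_cpu()`, `needs_another_gp_old()` and `needs_another_gp_new()` are hypothetical names for illustration, and locking and `ACCESS_ONCE()` are left out. The segment indices are assumed to start at `RCU_DONE_TAIL == 0`, and the comparison is parenthesized explicitly so that the index is either `RCU_DONE_TAIL` or `RCU_WAIT_TAIL`, as the description states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Segment indices into the per-CPU callback list (values assumed). */
enum { RCU_DONE_TAIL, RCU_WAIT_TAIL, RCU_NEXT_READY_TAIL, RCU_NEXT_TAIL,
       RCU_NEXT_SIZE };

struct cb { struct cb *next; };          /* one queued callback */

struct cpu_data {                        /* simplified per-CPU state */
    unsigned long completed;             /* last GP this CPU knows ended */
    struct cb *nxtlist;                  /* head of the callback list */
    struct cb **nxttail[RCU_NEXT_SIZE];  /* tail pointer of each segment */
};

struct rcu_global {                      /* simplified global state */
    unsigned long completed;             /* last GP that really ended */
    unsigned long gpnum;                 /* most recently started GP */
};

static bool gp_in_progress(const struct rcu_global *g)
{
    return g->completed != g->gpnum;
}

/* Old check: any callback beyond the "done" segment requests a new GP. */
bool needs_another_gp_old(const struct rcu_global *g,
                          const struct cpu_data *d)
{
    return *d->nxttail[RCU_DONE_TAIL] != NULL && !gp_in_progress(g);
}

/* New check: a CPU that has not yet noticed the end of the last GP
 * looks past the "wait" segment instead of the "done" segment. */
bool needs_another_gp_new(const struct rcu_global *g,
                          const struct cpu_data *d)
{
    int seg = RCU_DONE_TAIL + (g->completed != d->completed);

    return *d->nxttail[seg] != NULL && !gp_in_progress(g);
}

/* Build a CPU that still believes GP 0 was the last to complete, with
 * `waiting` in the wait segment and, optionally, `later` queued after it. */
void setup_stale_cpu(struct cpu_data *d, struct cb *waiting,
                     struct cb *later)
{
    assert(waiting != NULL);
    waiting->next = later;
    if (later)
        later->next = NULL;
    d->completed = 0;
    d->nxtlist = waiting;
    d->nxttail[RCU_DONE_TAIL] = &d->nxtlist;       /* done segment empty */
    d->nxttail[RCU_WAIT_TAIL] = &waiting->next;    /* holds `waiting` */
    d->nxttail[RCU_NEXT_READY_TAIL] = &waiting->next;
    d->nxttail[RCU_NEXT_TAIL] = later ? &later->next : &waiting->next;
}
```

With the global state at `completed == gpnum == 1` and a stale CPU holding a single waiting callback, the old check in this model asks for a new grace period even though that callback's grace period has already ended, while the new check does not. If a second callback is queued behind the first, the new check still asks for a grace period, since that callback genuinely needs one.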

Linus Walleij

26 Sep

6:56 a.m.

On Tue, Sep 25, 2012 at 2:52 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:

...

Does the following patch help?

Thanx, Paul

Rock solid! Left it overnight and no lockups so far!

...

The reason that no one noticed until quite recently is that most systems have enough OS noise that they will never remain absolutely idle for a full minute.

Ux500 totally rocks. ;-)

My dead simple busybox userspace is very good at doing absolutely nothing...

...

Reported-by: Becky Bruce bgillbruce@gmail.com
Reported-by: Subodh Nijsure snijsure@grid-net.com
Reported-by: Paul Walmsley paul@pwsan.com
Signed-off-by: Paul E. McKenney paul.mckenney@linaro.org
Signed-off-by: Paul E. McKenney paulmck@linux.vnet.ibm.com
Tested-by: Paul Walmsley paul@pwsan.com # OMAP3730, OMAP4430
Cc: stable@vger.kernel.org

Reported-by/Tested-by: Linus Walleij linus.walleij@linaro.org

I trust you to get this to Mr. Torvalds in no time ;-)

Yours, Linus Walleij

Paul E. McKenney

1:22 p.m.

On Wed, Sep 26, 2012 at 08:56:41AM +0200, Linus Walleij wrote:

...

On Tue, Sep 25, 2012 at 2:52 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:

...
Does the following patch help?
Thanx, Paul
Rock solid! Left it overnight and no lockups so far!

Cool!!! ;-)

...

...
The reason that no one noticed until quite recently is that most systems have enough OS noise that they will never remain absolutely idle for a full minute.

Ux500 totally rocks. ;-)

My dead simple busybox userspace is very good at doing absolutely nothing...

;-) ;-) ;-)

...

...
Reported-by: Becky Bruce bgillbruce@gmail.com
Reported-by: Subodh Nijsure snijsure@grid-net.com
Reported-by: Paul Walmsley paul@pwsan.com
Signed-off-by: Paul E. McKenney paul.mckenney@linaro.org
Signed-off-by: Paul E. McKenney paulmck@linux.vnet.ibm.com
Tested-by: Paul Walmsley paul@pwsan.com # OMAP3730, OMAP4430
Cc: stable@vger.kernel.org

Reported-by/Tested-by: Linus Walleij linus.walleij@linaro.org

I trust you to get this to Mr. Torvalds in no time ;-)

The good news is that it is already in -tip, and if Ingo's testing of it goes well, it should hit mainline soon. The corresponding bad news is that this means that I no longer am able to add your Reported-by and Tested-by...

Thanx, Paul

Linus Walleij

1:28 p.m.

On Wed, Sep 26, 2012 at 3:22 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:

...

The corresponding bad news is that this means that I no longer am able to add your Reported-by and Tested-by...

Who cares. The important thing is that it gets fixed.

Thanks, Linus Walleij

Jon Medhurst (Tixy)

26 Oct

1:55 p.m.

New subject: rcu: Fix day-one dyntick-idle stall-warning bug! [Re: [ACTIVITY] (Linus Walleij) 2012-09-15 - 2012-09-23]

On Wed, 2012-09-26 at 06:22 -0700, Paul E. McKenney wrote:

...

On Wed, Sep 26, 2012 at 08:56:41AM +0200, Linus Walleij wrote:

...
On Tue, Sep 25, 2012 at 2:52 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:

...
Does the following patch help?
Thanx, Paul
Rock solid! Left it overnight and no lockups so far!
Cool!!! ;-)

...
...
The reason that no one noticed until quite recently is that most systems have enough OS noise that they will never remain absolutely idle for a full minute.

Ux500 totally rocks. ;-)

My dead simple busybox userspace is very good at doing absolutely nothing...

;-) ;-) ;-)

...
...
Reported-by: Becky Bruce bgillbruce@gmail.com
Reported-by: Subodh Nijsure snijsure@grid-net.com
Reported-by: Paul Walmsley paul@pwsan.com
Signed-off-by: Paul E. McKenney paul.mckenney@linaro.org
Signed-off-by: Paul E. McKenney paulmck@linux.vnet.ibm.com
Tested-by: Paul Walmsley paul@pwsan.com # OMAP3730, OMAP4430
Cc: stable@vger.kernel.org

Reported-by/Tested-by: Linus Walleij linus.walleij@linaro.org

I trust you to get this to Mr. Torvalds in no time ;-)

The good news is that it is already in -tip, and if Ingo's testing of it goes well, it should hit mainline soon. The corresponding bad news is that this means that I no longer am able to add your Reported-by and Tested-by...

I don't see this fix in 3.7, did it get superseded by something else?

I ask, because I've heard reports of the problem still being present and on investigation see this patch isn't in mainline Linux.

-- Tixy

Jon Medhurst (Tixy)

2:16 p.m.

On Fri, 2012-10-26 at 14:55 +0100, Jon Medhurst (Tixy) wrote:

...

I don't see this fix in 3.7, did it get superseded by something else?

Sorry for the noise, I'm going blind, it is in there...

commit a10d206ef1a83121ab7430cb196e0376a7145b22
Author: Paul E. McKenney paul.mckenney@linaro.org
Date: Sat Sep 22 13:55:30 2012 -0700

rcu: Fix day-one dyntick-idle stall-warning bug

...

I ask, because I've heard reports of the problem still being present and on investigation see this patch isn't in mainline Linux.

So it's a different problem, possibly TC2 specific.

-- Tixy

Paul E. McKenney

2:35 p.m.

On Fri, Oct 26, 2012 at 03:16:22PM +0100, Jon Medhurst (Tixy) wrote:

...

On Fri, 2012-10-26 at 14:55 +0100, Jon Medhurst (Tixy) wrote:

...
I don't see this fix in 3.7, did it get superseded by something else?

Sorry for the noise, I'm going blind, it is in there...

commit a10d206ef1a83121ab7430cb196e0376a7145b22
Author: Paul E. McKenney paul.mckenney@linaro.org
Date: Sat Sep 22 13:55:30 2012 -0700
rcu: Fix day-one dyntick-idle stall-warning bug

You had me worried for a bit there. ;-)

...

...
I ask, because I've heard reports of the problem still being present and on investigation see this patch isn't in mainline Linux.

So it's a different problem, possibly TC2 specific.

OK. If you want me to look at it, please send specifics.

Thanx, Paul

linaro-kernel@lists.linaro.org

participants (5)

John Stultz
Jon Medhurst (Tixy)
Linus Walleij
Paul E. McKenney
Pawel Moll