Hi,
On Friday 22 Mar 2019 at 13:01:13 (-0500), Zachariah Kennedy wrote:
Good day,
First, I apologize for the information overload below. I just wanted to make sure to provide as much detail as possible.
I do kernel development as a hobby for XDA and other communities. For a given device that we support, we tend to follow the same sched changes as Google and pull in changes from upstream. Among the devices we develop for are the OnePlus 6 and 6T, which use the same SD845 SoC as the P3. Long story short, I started to notice an issue that we did not have on our OnePlus 5 (SD835, same as the Pixel 2). When boosting top-app/schedtune.boost to 50, for example, both the small and big cores appear to be boosted. With a boost of 50, this means both clusters run at 1766MHz and above. But this behavior is inconsistent. After a reboot, sometimes only the big cluster appears to be affected by the stune boost, but after hours or days of uptime the issue occurs again and both clusters are affected.
So I started kernel tracing to try to narrow down a culprit. One of the differences I noticed between a trace that captured the bug and a trace where the bug isn't occurring is the boost margin assigned to each CPU. I could tell this bug occurs when the boosted task could easily fit on a big core, given the low utilization and margin assigned to the big cores. I could not discern any other noticeable cause. Below is an example of what I am referring to. Also note that I had a friend with a Pixel 3 confirm it had the problem as well before writing here, and the logs/info below are taken from a P3.
Kernel version:
Linux version 4.9.124-gdff097c-ab5226088 (android-build@abfarm934) (Android clang version 5.0.1 (https://us3-mirror-android.googlesource.com/toolchain/clang 00e4a5a67eb7d626653c23780ff02367ead74955) (https://us3-mirror-android.googlesource.com/toolchain/llvm ef376ecb7d9c1460216126d102bb32fc5f73800d) (based on LLVM 5.0.1svn)) #0 SMP PREEMPT Fri Jan 11 21:28:03 UTC 2019
Extracted from PQ2A.190305.002 factory images
Events used: sched_boost_cpu sched:sched_find_best_target sched:sched_boost_task sched:sched_load_balance sched:sched_migrate_task sched:sched_switch sched:sched_wakeup_new sched:sched_waking sched:sched_task_util sched:sched_tune_boostgroup_update sched:sched_tune_tasks_update
Bug (no interaction):
<idle>-0 [000] d.H9 12042.557751: sched_boost_cpu: cpu=0 util=22 margin=100
<idle>-0 [000] d.H9 12042.557756: sched_boost_cpu: cpu=1 util=26 margin=0
<idle>-0 [000] d.H9 12042.557759: sched_boost_cpu: cpu=2 util=0 margin=0
<idle>-0 [000] d.H9 12042.557761: sched_boost_cpu: cpu=3 util=4 margin=0
<idle>-0 [000] d.H9 12042.557764: sched_boost_cpu: cpu=4 util=0 margin=0
<idle>-0 [000] d.H9 12042.557767: sched_boost_cpu: cpu=5 util=0 margin=0
<idle>-0 [000] d.H9 12042.557770: sched_boost_cpu: cpu=6 util=0 margin=102
<idle>-0 [000] d.H9 12042.557772: sched_boost_cpu: cpu=7 util=0 margin=102
Bug during interactions:
<...>-640 [000] d.H8 12044.737723: sched_boost_cpu: cpu=0 util=33 margin=495
<...>-640 [000] d.H8 12044.737725: sched_boost_cpu: cpu=1 util=22 margin=0
<...>-640 [000] d.H8 12044.737727: sched_boost_cpu: cpu=2 util=47 margin=0
<...>-640 [000] d.H8 12044.737728: sched_boost_cpu: cpu=3 util=12 margin=0
<...>-640 [000] d.H8 12044.737730: sched_boost_cpu: cpu=4 util=119 margin=0
<...>-640 [000] d.H8 12044.737732: sched_boost_cpu: cpu=5 util=83 margin=0
<...>-640 [000] d.H8 12044.737736: sched_boost_cpu: cpu=6 util=0 margin=512
<...>-640 [000] d.H8 12044.737738: sched_boost_cpu: cpu=7 util=2 margin=511
No bug during interactions:
ndroid.settings-10354 [004] d.hb 117.480930: sched_boost_cpu: cpu=0 util=50 margin=0
ndroid.settings-10354 [004] d.hb 117.480932: sched_boost_cpu: cpu=1 util=184 margin=0
ndroid.settings-10354 [004] d.hb 117.480933: sched_boost_cpu: cpu=2 util=66 margin=0
ndroid.settings-10354 [004] d.hb 117.480934: sched_boost_cpu: cpu=3 util=46 margin=0
ndroid.settings-10354 [004] d.hb 117.480935: sched_boost_cpu: cpu=4 util=119 margin=452
ndroid.settings-10354 [004] d.hb 117.480936: sched_boost_cpu: cpu=5 util=2 margin=0
ndroid.settings-10354 [004] d.hb 117.480936: sched_boost_cpu: cpu=6 util=112 margin=0
ndroid.settings-10354 [004] d.hb 117.480937: sched_boost_cpu: cpu=7 util=7 margin=0
When I took a closer look at what was running on the small cores, I noticed something else. With the event:
sched_tune_tasks_update
I noticed a pattern with tasks that had "boost=0 max_boost=50". When the bug is occurring, you see small cores running tasks with "boost=0 max_boost=50", and when it is not, only big cores have tasks with "boost=0 max_boost=50". I have not been able to track down why these tasks, which received a boost at some point, sometimes run on small cores. I also cannot discern why this issue has been inconsistent. I wanted to give a heads up and would appreciate any assistance or information anyone here might have with regards to a possible solution. Thanks for the great work! I look forward to hearing back from you guys.
I have uploaded a number of the captured traces to my G Drive here:
https://drive.google.com/open?id=11YTlGezAMllmlVr9CtMNCz56CTfrJGBd
If there is a better method, please let me know.
Thanks for reporting all this :-)
The schedtune boost level is per cgroup, which means all tasks in the (say) top-app group get the same boost. In case of interactions (a user touching the screen), userspace raises the boost level temporarily to 50, and that applies to _all_ top-app tasks. Most of these tasks typically run on big cores, but this is not guaranteed. Top-app tasks can end up on little cores if there is no idle big CPU when they wake up for example -- we really want to start them ASAP after wake up, even if that means using a little CPU. This scenario is supposed to be infrequent, but certainly not impossible in practice.
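To make the per-cgroup behaviour a bit more concrete, here is a minimal Python sketch (not kernel code) of how the per-CPU boost can be modelled: each CPU takes the highest boost among the groups that currently have runnable tasks on it, and the margin is a share of that CPU's spare capacity. The formula margin = (1024 - util) * boost / 100 is an assumption inferred from the sched_boost_cpu values quoted above (for instance util=0 with margin=512, or util=33 with margin=495, at boost 50). The function names and the group table are hypothetical.

```python
SCHED_CAPACITY_SCALE = 1024  # capacity of the biggest CPU at max frequency

def cpu_boost(runnable_per_group, group_boost):
    """Highest boost among groups with at least one runnable task on this CPU."""
    active = [group_boost[g] for g, n in runnable_per_group.items() if n > 0]
    return max(active, default=0)

def boost_margin(util, boost_pct):
    """Assumed model: a positive boost adds a share of the spare capacity."""
    return (SCHED_CAPACITY_SCALE - util) * boost_pct // 100

# Hypothetical snapshot: boost value per cgroup during an interaction.
group_boost = {"root": 0, "top-app": 50, "background": 0}

# Number of runnable tasks per cgroup on one little and one big CPU.
little_cpu = {"root": 1, "top-app": 1, "background": 2}
big_cpu = {"root": 1, "top-app": 0, "background": 0}

little_margin = boost_margin(33, cpu_boost(little_cpu, group_boost))
big_margin = boost_margin(119, cpu_boost(big_cpu, group_boost))
print(little_margin, big_margin)  # 495 0
```

In this model a single runnable top-app task is enough to give a little CPU the full top-app boost, regardless of what else runs there. Under the same assumed formula, the margins in the no-interaction trace (100 at util=22, 102 at util=0) would correspond to a boost of 10.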
So, if you have top-app tasks running on both clusters at any given moment, the boost level is applied to both clusters. It's not really a 'bug' I would say, as this is sort of expected. I guess one thing you could try is to check in which scenarios you get top-app tasks on the little side. Are there big CPUs idle when the task wakes up? If yes, then maybe there is indeed something that needs fixing ... Otherwise, this is basically a design choice :-)
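If it helps, a small script along these lines could pick out the moments where a little CPU carries a non-zero margin, which are the timestamps worth checking for idle big CPUs. This is only a rough sketch: it assumes the sched_boost_cpu line format shown in your traces, and that CPUs 0-3 form the little cluster on SD845 (adjust LITTLE_CPUS if that is wrong for your device). The function name is made up for illustration.

```python
import re

LITTLE_CPUS = {0, 1, 2, 3}  # assumption: little cluster on SD845

# Matches e.g. "... 12044.737723: sched_boost_cpu: cpu=0 util=33 margin=495"
EVENT_RE = re.compile(
    r"(?P<ts>\d+\.\d+): sched_boost_cpu: "
    r"cpu=(?P<cpu>\d+) util=(?P<util>\d+) margin=(?P<margin>-?\d+)"
)

def boosted_little_cpus(lines):
    """Yield (timestamp, cpu, util, margin) for little CPUs with a non-zero margin."""
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue  # not a sched_boost_cpu event
        cpu, margin = int(m["cpu"]), int(m["margin"])
        if cpu in LITTLE_CPUS and margin != 0:
            yield float(m["ts"]), cpu, int(m["util"]), margin

sample = [
    "<...>-640 [000] d.H8 12044.737723: sched_boost_cpu: cpu=0 util=33 margin=495",
    "<...>-640 [000] d.H8 12044.737725: sched_boost_cpu: cpu=1 util=22 margin=0",
    "<...>-640 [000] d.H8 12044.737736: sched_boost_cpu: cpu=6 util=0 margin=512",
]
hits = list(boosted_little_cpus(sample))
print(hits)  # [(12044.737723, 0, 33, 495)]
```

The same generator can be fed an open trace file instead of the sample list. Around each reported timestamp, the sched_waking and sched_switch events should show whether a big CPU was idle when the top-app task woke up.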
I hope that makes sense.
Thanks, Quentin