On 06/04/2015 11:41 AM, Viresh Kumar wrote:
On 04-06-15, 11:38, Preeti U Murthy wrote:
And a crash at the cpufreq worker thread again due to data access exception when I change governors in parallel on a single core.
cpu 0x3: Vector: 300 (Data Access) at [c000000fedb538f0] pc: c000000000856750: od_dbs_timer+0x60/0x1e0 lr: c0000000000f489c: process_one_work+0x24c/0x910 sp: c000000fedb53b70 msr: 9000000100009033 dar: 10 dsisr: 40000000 current = 0xc000000fe3d128e0 paca = 0xc000000007da1c80 softe: 0 irq_happened: 0x01 pid = 17227, comm = kworker/3:1
With the backtrace being:
[c000000fedb53be0] c0000000000f489c process_one_work+0x24c/0x910 [c000000fedb53c90] c0000000000f50dc worker_thread+0x17c/0x540 [c000000fedb53d20] c0000000000fed70 kthread+0x120/0x140 [c000000fedb53e30] c000000000009678 ret_from_kernel_thread+0x5c/0x64
But the kernel stays sane longer than before with the patchset. The above crash happens around 15 seconds after the test begins, while earlier it wouldn't survive 2 seconds even.
I haven't attempted to solve the race between the worker threads and governor-callbacks yet. What I have tried to solve is the race between different callbacks. And you shouldn't see a race there for now. For example a race between INIT/EXIT/START/STOP/LIMITS.
Your fix may not be complete and here is why. The reason we see the crash is because we have *only* attempted to serialize calls to cpufreq_governor_dbs() and not attempted to serialize *entire logical sequence of operations*. Let's take a look at what is happening as a consequence.
CPU0 CPU1 store_scaling_governor() __cpufreq_remove_dev_finish() __cpufreq_governor(CPUFREQ_GOV_STOP) __cpufreq_governor(CPUFREQ_GOV_START) policy->governor_enabled = false cpufreq_governor_dbs() policy->governor_enabled = true mutex_lock() gov_cancel_work() cpufreq_governor_dbs() wait on lock
may call gov_queue_work() if (!policy->enabled) : fails and we end up queuing work
mutex_unlock() mutex_lock() gov_queue_work() mutex_unlock()
__cpufreq_governor(CPUFREQ_GOV_POLICY_EXIT) mutex_lock() cpufreq_governor_dbs() kfree(dbs_data)
timer fires and od_dbs_timer/cs_dbs_timer() runs References governor data structures which are freed
The issue as I see it is one set of operations must be allowed to run *completely* before another begins. When store_scaling_governor() says STOP, all governor operations must be stopped, till the time store_scaling_governor() itself gives permission to restart. Somebody else, in this case __cpufreq_remove_dev_finish() cannot overrule this if it arrives late. This is what is happening above.
So if store_scaling_governor() arrives first, STOP|EXIT|START|LIMIT must complete before START|LIMIT of __cpufreq_remove_dev_finish() is allowed to run. So it is just not about serializing, its about *what needs to be serialized*.
Regards Preeti U Murthy