Hello,
I have two questions:
1- I was wondering what the expected semantics of "flush_cache_all" are on a big.LITTLE architecture.
I can see that the implementation of this function in the Linux kernel does the following:
a- Read the value of LoC (level of coherency).
b- Flush each level of cache up to that LoC value using the DCCISW coprocessor operation.
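For readers who want to see what those two steps look like in terms of register fields, here is a minimal sketch in portable C. It is not the kernel's code: the function names are made up for illustration, and the CLIDR value used in the usage note is just a sample. It assumes the ARMv7-A layout, where CLIDR holds LoUIS in bits [23:21] and LoC in bits [26:24], and where a set/way operand packs the way into the top bits, the set above the line-offset bits, and the zero-based level into bits [3:1].

```c
#include <assert.h>
#include <stdint.h>

/* CLIDR fields (ARMv7-A): LoUIS = bits [23:21], LoC = bits [26:24]. */
unsigned clidr_louis(uint32_t clidr) { return (clidr >> 21) & 0x7u; }
unsigned clidr_loc(uint32_t clidr)   { return (clidr >> 24) & 0x7u; }

/* Cache type of level n (1-based), 3 bits per level starting at bit 0.
 * Values >= 2 mean a data or unified cache exists at that level. */
unsigned clidr_ctype(uint32_t clidr, unsigned level)
{
    assert(level >= 1 && level <= 7);
    return (clidr >> (3u * (level - 1u))) & 0x7u;
}

/* Smallest n such that (1 << n) >= x, for x >= 1. */
unsigned ceil_log2(uint32_t x)
{
    unsigned n = 0;
    assert(x >= 1);
    while (n < 31 && ((uint32_t)1 << n) < x)
        n++;
    return n;
}

/* Build the operand that a set/way operation such as DCCISW expects:
 *   way   in bits [31 : 32-A], A = ceil(log2(number of ways))
 *   set   starting at bit L,   L = log2(line length in bytes)
 *   level (zero-based) in bits [3:1]
 * 'level' is 1-based here, to match how people usually say "L1", "L2". */
uint32_t set_way_operand(unsigned level, uint32_t way, uint32_t set,
                         uint32_t num_ways, uint32_t line_bytes)
{
    unsigned a = ceil_log2(num_ways);
    unsigned l = ceil_log2(line_bytes);
    /* A direct-mapped cache has no way bits; avoid a shift by 32. */
    uint32_t way_bits = a ? (way << (32u - a)) : 0u;

    assert(level >= 1 && level <= 7 && way < num_ways);
    return way_bits | (set << l) | ((uint32_t)(level - 1u) << 1);
}
```

With a sample CLIDR of 0x0A200023, clidr_loc() gives 2, so a loop would walk levels 1 and 2 and issue one operation per (way, set) pair with the operand built above. Nothing in that operand names an address or another CPU, which is why the operation is local to the cache hierarchy of the CPU that executes it.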
My expectation would be that if this is executed on one of the processors of the big cluster, it should flush all L1 and L2 caches on this cluster and then signal the cache cleaning operation to the CCI interconnect, which would propagate it downstream to the LITTLE cluster. This would mean that at the end all caches are flushed.
Is that the proper semantics of this operation?
Or is it only going to affect this CPU and no other CPUs in the cluster (and consequently no CPUs in the other cluster)? If that's the case, does this mean that I have to do the cache flushing per CPU?
2- Is there a difference in semantics between flushing each cache level until I reach the level of coherency (using the DCCISW operation) and flushing only the first cache to the point of coherency (using the DCCIMVAC operation)?
Thanks.
On Mon, Mar 10, 2014 at 10:44:05AM +0000, karim.allah.ahmed@gmail.com wrote:
I have two questions:
1- I was wondering what should be the expected semantics of "flush_cache_all" on a Big.LITTLE architecture.
I can see that the implementation of this function under linux kernel is doing the following:
a- Read the value of LoC ( level of coherency ) b- Flush each level of cache to that LoC value using DCCISW co-processor register.
My expectation would be that if this is executed on one of the processors of the Big cluster it should flush all L1 and L2 caches on this cluster and then signal the CCI interconnect of the cache cleaning operation and then the CCI interconnect would propagate this signal downstream to the LITTLE cluster. This will mean that at the end all cache will be flushed.
I am not sure exactly how the CCI behaves here, but cache flushing by set/way (as the flush_cache_all function does) is not safe on SMP (independent of big.LITTLE). It should only be used in certain contexts like suspend/resume, where we have more control over cache line migration between CPUs/clusters.
Is that the proper semantics of this operation ?
or it's only going to affect this CPU and no other CPUs in the cluster ( and consequently no other CPUs on the other cluster ). And if that's the case, does this mean that I've to do the cache flushing per_cpu ?
The safe thing is to assume that it only affects a single CPU (and as an optimisation we use flush_cache_louis, which does the L1 cache only). When the whole cluster is going down and we know that only one CPU is running, we can use flush_cache_all for that cluster, but it does not affect the caches in the other cluster.
Per-CPU cache flushing isn't useful either when all the CPUs are active, since cache lines can still migrate (unless you use something like stop_machine, disable the MMU on all CPUs, and do the flushing after the MMUs have been disabled).
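The rule described above can be written down as a tiny decision function. This is only a sketch of the reasoning, not kernel code: the enum, the function name and its parameters are hypothetical, and a real power-down path has to do considerably more (disable caching on the outgoing CPU, leave coherency, and so on).

```c
#include <stdbool.h>

/* Hypothetical names, used only to illustrate the rule in the text. */
enum flush_scope {
    FLUSH_LOUIS, /* this CPU's caches only (what flush_cache_louis covers) */
    FLUSH_ALL    /* every level up to LoC (what flush_cache_all covers)    */
};

/* Pick the set/way flush scope for a CPU that is about to go down.
 * Flushing the whole cluster hierarchy by set/way is only reasonable
 * when the cluster is going down and this is the last CPU running in
 * it, so no other CPU can pull lines back in behind our back. */
enum flush_scope pick_flush_scope(unsigned running_cpus_in_cluster,
                                  bool cluster_going_down)
{
    if (cluster_going_down && running_cpus_in_cluster == 1)
        return FLUSH_ALL;
    return FLUSH_LOUIS;
}
```

In either case the result only says how deep the flush goes within the local cluster; neither choice touches the other cluster's caches.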
2- and Is there a difference in semantics between flushing each cache till I reach the Level of coherency ( using DCCISW register ) and flushing the first cache only to the point of coherency ( using DCCIMVAC register ) ?
The difference is that the MVA operation is guaranteed to work on SMP since it is broadcast to the other CPUs in hardware. The set/way (SW) operations are not.
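To make the contrast concrete, maintenance by MVA is driven by addresses rather than by cache geometry: you walk the buffer one cache line at a time and issue one operation per line. The sketch below shows only that address walk in portable C. The function name and the callback type are assumptions for illustration; on ARMv7 the callback would issue DCCIMVAC (MCR p15, 0, Rt, c7, c14, 1) for each line, followed by a DSB once the loop has finished.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Called once per cache line; 'line_addr' is aligned to the line size. */
typedef void (*line_op_fn)(uintptr_t line_addr, void *ctx);

/* Visit every cache line that overlaps [start, start + size).
 * 'line_bytes' must be a power of two. Returns the number of lines
 * visited. Address wrap-around at the top of the address space is
 * not handled in this sketch. */
unsigned for_each_cache_line(uintptr_t start, uintptr_t size,
                             uintptr_t line_bytes,
                             line_op_fn op, void *ctx)
{
    uintptr_t addr, end;
    unsigned count = 0;

    assert(line_bytes != 0 && (line_bytes & (line_bytes - 1)) == 0);
    if (size == 0)
        return 0;

    addr = start & ~(line_bytes - 1); /* round down to a line boundary */
    end = start + size;
    for (; addr < end; addr += line_bytes) {
        if (op)
            op(addr, ctx); /* e.g. clean+invalidate this line by MVA */
        count++;
    }
    return count;
}
```

For example, a 0x100-byte buffer starting at 0x1010 with 64-byte lines touches five lines, starting at 0x1000 and ending with the line at 0x1100. Because each operation names an address, the hardware can forward it to the other CPUs in the same shareability domain, which has no equivalent for a (level, set, way) triple.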
On Mon, Mar 10, 2014 at 10:52 AM, Catalin Marinas catalin.marinas@arm.com wrote:
The safe thing is to assume that it only affects a single CPU (and as an optimisation we use a flush_cache_louis which does the L1 cache only). When the whole cluster is going down and we know that only one CPU is running, we can use flush_cache_all for that cluster but it does not affect the caches in the other cluster.
I see. I assumed that flush_cache_all would be seen by the other cluster as well! For example, if my system already declares L2 as the LoC, my understanding is that once I have flushed my caches up to L2, the CCI (or something) should be signalled to propagate this flush downstream to the other cluster, in order to maintain the semantics of the level of coherency in the ARM TRM. Is this a correct understanding of LoC? Maybe I should explicitly notify the other observers as well after flushing to refresh their view of memory (as you said, by flushing using MVA), otherwise they might see stale data?
The CCI already has signals to propagate cache maintenance operations downstream; I just didn't know when they are invoked and when they're not!
Thanks Catalin for your reply.
Date: Mon, 10 Mar 2014 11:29:13 +0000
Subject: Re: linux kernel flush_cache_all behaviour on a Big.LITTLE system
From: karim.allah.ahmed@gmail.com
To: catalin.marinas@arm.com
CC: linaro-dev@lists.linaro.org
No. Your assumption is correct for DCCIMVAC. But flush_cache_all uses DCCISW. It only affects a single CPU. Other processors and clusters won't observe it.
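A toy model may help to picture the difference. The sketch below is not how the hardware or the CCI works internally; it only mimics the observable effect, under the assumption that every CPU has a private cache and that all CPUs sit in the same shareability domain. All names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NCPUS  4
#define TOY_NLINES 8

/* toy_dirty[cpu][line] is true when that CPU's private cache holds a
 * dirty copy of the memory line with index 'line'. */
static bool toy_dirty[TOY_NCPUS][TOY_NLINES];

void toy_mark_dirty(int cpu, int line)
{
    assert(cpu >= 0 && cpu < TOY_NCPUS && line >= 0 && line < TOY_NLINES);
    toy_dirty[cpu][line] = true;
}

/* Set/way style: walks the executing CPU's own cache and nothing else. */
void toy_clean_by_set_way(int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NCPUS);
    for (int line = 0; line < TOY_NLINES; line++)
        toy_dirty[cpu][line] = false;
}

/* MVA style: names one line, and the operation is broadcast, so every
 * CPU's copy of that line is cleaned regardless of who issued it. */
void toy_clean_by_mva(int line)
{
    assert(line >= 0 && line < TOY_NLINES);
    for (int cpu = 0; cpu < TOY_NCPUS; cpu++)
        toy_dirty[cpu][line] = false;
}

/* Number of dirty lines left anywhere in the system. */
int toy_count_dirty(void)
{
    int n = 0;
    for (int cpu = 0; cpu < TOY_NCPUS; cpu++)
        for (int line = 0; line < TOY_NLINES; line++)
            n += toy_dirty[cpu][line] ? 1 : 0;
    return n;
}
```

If CPU 0 holds line 1 dirty and CPU 2 holds line 3 dirty, toy_clean_by_set_way(0) leaves CPU 2's line dirty, while a following toy_clean_by_mva(3) issued from any CPU cleans it.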