Hi Mathieu,
In case you don't remember, I work in the ARM architecture team with responsibility for debug and CoreSight. We have met before at Linaro Connect and discussed the Linaro support for CoreSight.
I have to say it's all looking very encouraging, with progress being made on CoreSight and trace support. So that's all good!
Now, I don't normally read the Linux mailing lists, but Al Grant pointed me at this patch to add CoreSight STM support [1].
[1] http://www.spinics.net/lists/arm-kernel/msg479457.html
There're some points that confuse me about this patch. Since I don't follow the mailing lists, Will Deacon suggested I mail you directly. In particular, its behaviour is different for STMv1 vs. STM-500, as well as being different to the Intel driver [2].
[2] https://github.com/torvalds/linux/blob/master/drivers/hwtracing/intel_th/sth...
E.g. consider a call for the following:
uint64 data[128];                    // A 64b aligned pointer
stm.packet(16, &((char *)data)[1]);  // Send 16 bytes, unaligned
The Intel 32bit driver (sth_stm_driver) will:
- Send a D32 packet consisting of data[4..1] (assuming they don't fault misaligned addresses). Because size > 8, it rounds size down to 4.
- Ignore the other data.
- Only ever generates a single packet.
- For other sizes, rounds size down to a power of 2 and returns the number of bytes written.
The Intel 64bit driver (sth_stm_driver) will:
- Do nothing because size > 8.
- Only ever generates zero or one packets.
- For size <= 8, rounds size down to a power of 2 and returns the number of bytes written.
The CoreSight 32bit driver (stm_send) will:
- Send a D1=data[1], D2=data[3..2], D4=data[7..4], D4=data[11..8], D4=data[15..12], D1=data[16] stream.
- This is a very inefficient use of bandwidth.
The CoreSight 64bit driver (stm_send_64bit) will:
- Send a D8=ZeroExtend(data[7..1]), D8=data[15..8], D8=ZeroExtend(data[16])
- This function only ever sends D8 packets.
- There is no way for the decoder to work out what the original data was.
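To illustrate the last point, here is a minimal sketch (not the driver code) of why zero-extending a short tail into a D8 packet is ambiguous. D8 follows the notation in the list above, i.e. an 8-byte packet. `pack_d8` is a hypothetical helper standing in for what stm_send_64bit appears to do, and placing byte 0 in the least significant byte is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build a 64-bit D8 payload from `len` bytes (1..8),
 * zero-extending short inputs. Byte 0 lands in the least significant byte
 * (an assumption made for this illustration). */
static uint64_t pack_d8(const unsigned char *src, size_t len)
{
    uint64_t payload = 0;

    assert(len >= 1 && len <= 8);
    for (size_t i = 0; i < len; i++)
        payload |= (uint64_t)src[i] << (8 * i);
    return payload;
}

/* Returns 1 if a 1-byte message and an 8-byte message that happens to end
 * in zeros produce the same payload, i.e. a decoder that only sees the D8
 * packet cannot tell how many bytes were originally sent. */
static int d8_is_ambiguous(void)
{
    const unsigned char one_byte[1] = { 0x41 };
    const unsigned char eight_bytes[8] = { 0x41, 0, 0, 0, 0, 0, 0, 0 };

    return pack_d8(one_byte, 1) == pack_d8(eight_bytes, 8);
}
```

Both inputs collapse to the same payload, so the original length is lost unless it is conveyed some other way.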
I think this function is only called from within the generic driver [3], though. It looks like stm_write() calls it a chunk at a time, with a chunk being at most 4/8 bytes, depending on the capabilities of the STM, and is expecting it to send only a single packet. This means that the code in stm_send/stm_send_64bit to deal with odd sized packets and misaligned addresses looks redundant/wrong.
[3] https://github.com/torvalds/linux/blob/master/drivers/hwtracing/stm/core.c
It looks like there might be support for other sources to link to the driver, but I could only find the stm_console when I looked.
Of course, this is based on my limited understanding of how this is used, and the current generic STM and Intel drivers. It might be that these changes have been agreed and Intel plan to change their driver to match (as I said, I don't generally follow the mailing lists). However, the different 64b and 32b behaviors on the ARM version are weird, and the unaligned pointer handling looks wrong too.
(The different behaviors on the Intel version aren't my problem. :-)
It might also be that I am reverse engineering the behaviour incorrectly.
The other point (which Will has raised on the mailing lists in the past [4]) is this code:
#ifndef CONFIG_64BIT
static inline void __raw_writeq(u64 val, volatile void __iomem *addr)
{
        asm volatile("strd %1, %0"
                     : "+Qo" (*(volatile u64 __force *)addr)
                     : "r" (val));
}
#undef writeq_relaxed
#define writeq_relaxed(v, c) __raw_writeq((__force u64) cpu_to_le64(v), c)
#endif
This isn't guaranteed to work on the ARM 32 bit architectures. The STM might receive a 64-bit write, or might receive a pair of 32-bit writes to the two addressed words *in either order*. The upshot is that this is not a valid way of writing to the STM. (The data reordering is a killer.)
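A small host-side model of the problem, offered only as a sketch: `mock_stm` and its functions are hypothetical, and the model assumes the STM turns each 32-bit write it receives into one packet. If the 64-bit store reaches the device as two 32-bit writes, the two possible orders give different packet streams for the same source value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an STM stimulus port: it records each 32-bit
 * write it receives as one packet, in arrival order. */
struct mock_stm {
    uint32_t word[2];
    size_t count;
};

static void mock_write32(struct mock_stm *stm, uint32_t val)
{
    assert(stm->count < 2);
    stm->word[stm->count++] = val;
}

/* Model a 64-bit store that reaches the device as two 32-bit writes.
 * `low_first` selects which word arrives first; per the discussion
 * above, software cannot rely on either order. */
static void mock_split_write64(struct mock_stm *stm, uint64_t val, int low_first)
{
    uint32_t lo = (uint32_t)val;
    uint32_t hi = (uint32_t)(val >> 32);

    mock_write32(stm, low_first ? lo : hi);
    mock_write32(stm, low_first ? hi : lo);
}

/* Returns 1 if the two arrival orders yield different packet streams. */
static int split_order_changes_trace(uint64_t val)
{
    struct mock_stm a = { {0, 0}, 0 };
    struct mock_stm b = { {0, 0}, 0 };

    mock_split_write64(&a, val, 1);
    mock_split_write64(&b, val, 0);
    return a.word[0] != b.word[0] || a.word[1] != b.word[1];
}
```

Unless the two halves happen to be equal, the decoder sees a different stream depending on an order the software does not control.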
The driver appears to use this if there is an STM-500 in an AArch32 system. This is because the code interrogates the STM to decide whether it supports 64-bit accesses. It should either (a) not do so, and refuse 64-bit data if AArch32, or (b) use some property of the system to decide. I would still frown on (b) because the architecture makes it clear that this is UNPREDICTABLE, meaning you're not supposed to rely on it and the device isn't allowed to advertise its behaviour.
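Option (a) could look something like the following sketch. The function names and parameters are hypothetical; how the real driver learns that the STM accepts 64-bit accesses, and whether the kernel is a 64-bit build, is left to the actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy helper: largest data packet, in bytes, that the
 * driver should ever generate. 64-bit packets are used only when the STM
 * accepts 64-bit accesses AND the kernel is a 64-bit build; a 32-bit
 * kernel always falls back to 4-byte packets, whatever the STM reports. */
static size_t stm_max_packet_bytes(bool stm_supports_64bit, bool kernel_is_64bit)
{
    if (stm_supports_64bit && kernel_is_64bit)
        return 8;
    return 4;
}

/* Clamp a requested size to what may be sent in one packet. */
static size_t stm_clamp_packet(size_t requested, size_t max_bytes)
{
    assert(max_bytes == 4 || max_bytes == 8);
    return requested > max_bytes ? max_bytes : requested;
}
```

With this shape, an STM-500 in an AArch32 system is simply treated as a 32-bit device, so the strd-based writeq is never needed.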
[4] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-October/297379.ht...
I hope this is useful.
With kind regards,
Mike.
--
Michael Williams Principal Engineer ARM Limited www.arm.com The Architecture For The Digital World
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Hello Michael,
I didn't have time to get to your email today - you're first on tomorrow's priority list.
Thanks, Mathieu
Hi Michael,
Let me try to explain some points based on my current understanding.
On Fri, Feb 5, 2016 at 6:25 AM, Mathieu Poirier mathieu.poirier@linaro.org wrote:
Hello Michael,
I didn't have time to get to your email today - you're first on tomorrow's priority list.
Thanks, Mathieu
On 4 February 2016 at 08:24, Michael Williams Michael.Williams@arm.com wrote:
The Intel 32bit driver (sth_stm_driver) will:
- Send a D32 packet consisting of data[4..1] (assuming they don't fault misaligned addresses). Because size > 8, it rounds size down to 4.
- Ignore the other data.
There's a loop [5] that sends the packets, so the other data is not ignored. In this case, four D32 packets are sent to the STM buffer.
[5] http://lxr.free-electrons.com/source/drivers/hwtracing/stm/core.c#L391
- Only ever generates a single packet.
- For other sizes, rounds size down to power of 2 and returns number of bytes written.
For other sizes on the 32bit driver, it sends one D32 at a time until fewer than 4 bytes remain to be sent; then, if 2 or more bytes remain, it sends a D16; and if one byte still remains, it sends a D8.
The 64bit driver follows a similar process, except that the largest packet sent at a time is 8 bytes instead of 4.
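As a sketch of the splitting described above (hypothetical helper names, not the code in stm/core.c): each step picks the largest power-of-two packet that fits in the remaining data, capped at the driver's maximum, and the caller loops until nothing is left.

```c
#include <assert.h>
#include <stddef.h>

/* Largest power of two <= min(remaining, max_bytes); max_bytes is 4 or 8. */
static size_t next_packet_size(size_t remaining, size_t max_bytes)
{
    size_t size = max_bytes;

    assert(remaining > 0);
    assert(max_bytes == 4 || max_bytes == 8);
    while (size > remaining)
        size >>= 1;
    return size;
}

/* Fill `sizes` with the packet sizes (in bytes) used to send `count` bytes
 * and return how many packets that takes. At most `cap` entries are
 * written; pass NULL and 0 to only count the packets. */
static size_t split_into_packets(size_t count, size_t max_bytes,
                                 size_t *sizes, size_t cap)
{
    size_t n = 0;

    while (count > 0) {
        size_t size = next_packet_size(count, max_bytes);

        if (n < cap)
            sizes[n] = size;
        n++;
        count -= size;
    }
    return n;
}
```

Using the bit-width names from this reply: 16 bytes with a 4-byte maximum become four D32 packets, and 7 bytes become D32, D16, D8.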
The Intel 64bit driver (sth_stm_driver) will:
- Do nothing because size > 8.
As explained above, it will send 8 bytes at a time until fewer than 8 bytes remain, and then D32, D16 and D8 packets as needed.
- Only ever generates zero or one packets.
- For size <= 8, rounds size down to power of 2 and returns number of bytes written.
The CoreSight 32bit driver (stm_send) will:
- Send a D1=data[1], D2=data[3..2], D4=data[7..4], D4=data[11..8], D4=data[15..12], D1=data[16] stream.
- This is very inefficient use of bandwidth.
The CoreSight 64bit driver (stm_send_64bit) will:
- Send a D8=ZeroExtend(data[7..1]), D8=data[15..8] D8=ZeroExtend(data[16])
- This function only ever sends D8 packets.
- There is no way for the decoder to work out what the original data was.
How about processing packets on 64bit the same way as on 32bit? Can the decoder recover the original data from the packets generated the way we do it on a CoreSight 32bit system?
I think this function is only called from within the generic driver [3], though. It looks like stm_write() calls it a chunk at a time, with a chunk being at most 4/8 bytes, depending on the capabilities of the STM, and is expecting it to send only a single packet. This means that the code in stm_send/stm_send_64bit to deal with odd sized packets and misaligned addresses looks redundant/wrong.
Yes, this is indeed something that has to be optimized/revised. I will spend time looking into it. Thanks for pointing this out.
[3] https://github.com/torvalds/linux/blob/master/drivers/hwtracing/stm/core.c
It looks like there might be support for other sources to link to the driver, but I could only find the stm_console when I looked.
Yes, I'm working on an integration of the Linux kernel Ftrace subsystem with CoreSight STM; there's another STM source to serve this feature. If you're interested in this, please let me know and I will cc you when sending out the patches for public review.
Hope these are helpful.
Thanks, Chunyan
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
Hi Chunyan,
Thanks for the reply.
The Intel 32bit driver (sth_stm_driver) will:
- Send a D32 packet consisting of data[4..1] (assuming they don't fault misaligned addresses). Because size > 8, it rounds size down to 4.
- Ignore the other data.
There's a loop [5] that sends the packets, so the other data is not ignored. In this case, four D32 packets are sent to the STM buffer.
[5] http://lxr.free-electrons.com/source/drivers/hwtracing/stm/core.c#L391
Yes; I noticed that.
- Only ever generates a single packet.
- For other sizes, rounds size down to power of 2 and returns number of bytes written.
For other sizes on the 32bit driver, it sends one D32 at a time until fewer than 4 bytes remain to be sent; then, if 2 or more bytes remain, it sends a D16; and if one byte still remains, it sends a D8.
The 64bit driver follows a similar process, except that the largest packet sent at a time is 8 bytes instead of 4.
The Intel 64bit driver (sth_stm_driver) will:
- Do nothing because size > 8.
As explained above, it will send 8 bytes at a time until fewer than 8 bytes remain, and then D32, D16 and D8 packets as needed.
Yes; and the code in stm/core.c will never call it with >8, so the difference against the 32bit driver isn't exposed.
I guess part of the problem I have looking at it is that the API to stm_data::packet isn't fully described in stm.h.
- Only ever generates zero or one packets.
- For size <= 8, rounds size down to power of 2 and returns number of bytes written.
The CoreSight 32bit driver (stm_send) will:
- Send a D1=data[1], D2=data[3..2], D4=data[7..4], D4=data[11..8], D4=data[15..12], D1=data[16] stream.
- This is very inefficient use of bandwidth.
The CoreSight 64bit driver (stm_send_64bit) will:
- Send a D8=ZeroExtend(data[7..1]), D8=data[15..8] D8=ZeroExtend(data[16])
- This function only ever sends D8 packets.
- There is no way for the decoder to work out what the original data was.
How about processing packets on 64bit the same way as on 32bit? Can the decoder recover the original data from the packets generated the way we do it on a CoreSight 32bit system?
Yes; the processing should be the same. It should send the largest packet it can (singular) and return the number of bytes it sent. That is, like the STH driver.
If you're concerned about unaligned pointers, you can realign the data. The overhead of that isn't going to be onerous compared to the saving from fewer Device memory writes, and it means you're generating less trace (i.e. you're less likely to lose trace, or stall waiting for the STM).
(See also the comment below about using 64-bit packets in 32-bit state.)
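A sketch of that contract, with the realignment done by copying into an aligned local. Everything here is hypothetical scaffolding (the `mock_port` type stands in for the memory-mapped stimulus port); it only shows the shape: one packet per call, the largest size that fits, and the number of bytes consumed as the return value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a stimulus port: records the last packet. */
struct mock_port {
    uint64_t payload;
    size_t size;      /* size of the last packet in bytes: 1, 2, 4 or 8 */
    size_t packets;   /* number of packets written so far */
};

/* Send exactly one packet from `buf` (which may be unaligned) and return
 * the number of bytes consumed. max_bytes is 4 or 8. */
static size_t send_one_packet(struct mock_port *port, const void *buf,
                              size_t len, size_t max_bytes)
{
    uint64_t aligned = 0;
    size_t size = max_bytes;

    assert(max_bytes == 4 || max_bytes == 8);
    if (len == 0)
        return 0;
    while (size > len)
        size >>= 1;

    /* Copy into an aligned local so no unaligned wide load is needed. */
    memcpy(&aligned, buf, size);

    port->payload = aligned;
    port->size = size;
    port->packets++;
    return size;
}

/* Caller-side loop: keep calling until all data is consumed.
 * Returns the number of packets generated. */
static size_t send_all(struct mock_port *port, const void *buf,
                       size_t len, size_t max_bytes)
{
    const unsigned char *p = buf;
    size_t before = port->packets;

    while (len > 0) {
        size_t sent = send_one_packet(port, p, len, max_bytes);

        p += sent;
        len -= sent;
    }
    return port->packets - before;
}

/* The example from the start of the thread: 16 bytes starting one byte
 * into a 64-bit aligned array. */
static size_t example_unaligned_16(size_t max_bytes)
{
    uint64_t data[128] = { 0 };
    struct mock_port port = { 0, 0, 0 };

    return send_all(&port, (const unsigned char *)data + 1, 16, max_bytes);
}
```

In this model the unaligned 16-byte example costs four packets with a 4-byte maximum and two with an 8-byte maximum, instead of the six packets listed for stm_send earlier.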
I think this function is only called from within the generic driver [3], though. It looks like stm_write() calls it a chunk at a time, with a chunk being at most 4/8 bytes, depending on the capabilities of the STM, and is expecting it to send only a single packet. This means that the code in stm_send/stm_send_64bit to deal with odd sized packets and misaligned addresses looks redundant/wrong.
Yes, this is indeed something that has to be optimized/revised. I will spend time looking into it. Thanks for pointing this out.
[3] https://github.com/torvalds/linux/blob/master/drivers/hwtracing/stm/core.c
It looks like there might be support for other sources to link to the driver, but I could only find the stm_console when I looked.
Yes, I'm working on an integration of the Linux kernel Ftrace subsystem with CoreSight STM; there's another STM source to serve this feature.
That would certainly be interesting for our STM users.
If you're interested in this, please let me know, I will cc you when sending out the patches for public review.
Hope these are helpful.
Thanks, Chunyan
On Fri, Feb 5, 2016 at 5:15 PM, Michael Williams Michael.Williams@arm.com wrote:
Hi Chunyan,
Thanks for the reply.
Yes; the processing should be the same. It should send the largest packet it can (singular) and return the number of bytes it sent. That is, like the STH driver.
Got you. I will revise this part according to your comments.
If you're concerned about unaligned pointers, you can realign the data. The overhead of that isn't going to be onerous compared to the saving from fewer Device memory writes, and it means you're generating less trace (i.e. you're less likely to lose trace, or stall waiting for the STM).
(See also the comment below about using 64-bit packets in 32-bit state.)
Yes, I will spend time looking into that after the holidays, since Chinese New Year is coming. Your comments are really a great help.
Thank you, Chunyan
I think this function is only called from within the generic driver
[3], though. It looks like stm_write() calls it a chunk at a time, with a chunk being at most 4/8 bytes, depending on the capabilities of the STM, and is expecting it to send only a single packet. This means that the code in stm_send/stm_send_64bit to deal with odd sized packets and misaligned addresses looks redundant/wrong.
Yes, this is indeed something that have to be optimized/revised. I will spend time to look into it. Thanks for pointing this.
[3] https://github.com/torvalds/linux/blob/master/drivers/hwtracing/stm/c ore.c
It looks like there might be support for other sources to link to the
driver, but I could only find the stm_console when I looked.
Yes, I'm working on a integration of Linux Kernel Ftrace subsystem with CoreSight STM, there's another STM source to serve this feature.
That would certainly be interesting for our STM users.
If you're interested in this, please let me know, I will cc you when sending out the patches for public review.
Hope these are helpful.
Thanks, Chunyan
Of course, this is based on my limited understanding of how this is
used, and the current generic STM and Intel drivers. It might be that these changes have been agreed and Intel plan to change their driver to match (as I said, I don't generally follow the mailing lists). However, the different 64b and 32b behaviors on the ARM version are weird, and the unaligned pointer handling looks wrong too.
(The different behaviors on the Intel version isn't my problem. :-)
It might also be that I am reverse engineering the behaviour
incorrectly.
The other point (which Will has raised on the mailing lists in the
past [4]) is this code:
#ifndef CONFIG_64BIT static inline void __raw_writeq(u64 val, volatile void __iomem *addr) { asm volatile("strd %1, %0" : "+Qo" (*(volatile u64 __force *)addr) : "r" (val)); } #undef writeq_relaxed #define writeq_relaxed(v, c)__raw_writeq((__force u64) cpu_to_le64(v), c) #endif
This isn't guaranteed to work on the ARM 32 bit architectures. The STM might receive a 64-bit write, or might receive a pair of 32-bit writes to the two addressed words *in either order*. The upshot is that this is not a valid way of writing to the STM. (The data reordering is a killer.)
The driver appears to use this if there is an STM-500 in an AArch32
system. This is because the code interrogates the STM to decide whether it supports 64-bit accesses. It should either (a) not do so, and refuse 64-bit data if AArch32, or (b) use some property of the system to decide. I would still frown on (b) because the architecture makes it clear that this is UNPREDICTABLE, meaning you're not supposed to rely on it and the device isn't allowed to advertise its behaviour.
[4] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-October/29 7379.html
I hope this is useful.
With kind regards,
Mike.
Michael Williams Principal Engineer ARM Limited www.arm.com The Architecture For The Digital World IMPORTANT NOTICE: The contents of this email and any attachments are confidential and
may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
Good morning Michael,
On 4 February 2016 at 08:24, Michael Williams Michael.Williams@arm.com wrote:
Hi Mathieu,
In case you don't remember, I work in the ARM architecture team with responsibility for debug and CoreSight. We have met before at Linaro Connect and discussed the Linaro support for CoreSight.
Yes, I remember - it was in San Francisco last year.
I have to say it's all looking very encouraging, with progress being made on CoreSight and trace support. So that's all good!
Good to hear, we are working hard.
Now, I don't normally read the Linux mailing lists, but Al Grant pointed me at this patch to add CoreSight STM support [1].
[1] http://www.spinics.net/lists/arm-kernel/msg479457.html
There're some points that confuse me about this patch. Since I don't follow the mailing lists, Will Deacon suggested I mail you directly. In particular, its behaviour is different for STMv1 vs. STM-500, as well as being different to the Intel driver [2].
First and foremost it is important to understand that this driver is _not_ tailored for STM500. I don't have HW with an STM500 and as such can't test the integration. Any code that seems to imply the contrary is coincidental.
[2] https://github.com/torvalds/linux/blob/master/drivers/hwtracing/intel_th/sth...
E.g. consider a call for the following:
    uint64 data[128];                      // A 64b aligned pointer
    stm.packet(16, &((char *)data)[1]);    // Send 16 bytes, unaligned
The Intel 32bit driver (sth_stm_driver) will:
- Send a D32 packet consisting of data[4..1] (assuming they don't fault misaligned addresses). Because size > 8, it rounds size down to 4.
- Ignore the other data.
- Only ever generates a single packet.
- For other sizes, rounds size down to power of 2 and returns number of bytes written.
The Intel 64bit driver (sth_stm_driver) will:
- Do nothing because size > 8.
- Only ever generates a zero or one packets.
- For size <= 8, rounds size down to power of 2 and returns number of bytes written.
The CoreSight 32bit driver (stm_send) will:
- Send a D1=data[1], D2=data[3..2], D4=data[7..4], D4=data[11..8], D4=data[15..12], D1=data[16] stream.
- This is very inefficient use of bandwidth.
The CoreSight 64bit driver (stm_send_64bit) will:
- Send a D8=ZeroExtend(data[7..1]), D8=data[15..8] D8=ZeroExtend(data[16])
- This function only ever sends D8 packets.
- There is no way for the decoder to work out what the original data was.
I think this function is only called from within the generic driver [3], though. It looks like stm_write() calls it a chunk at a time, with a chunk being at most 4/8 bytes, depending on the capabilities of the STM, and is expecting it to send only a single packet. This means that the code in stm_send/stm_send_64bit to deal with odd sized packets and misaligned addresses looks redundant/wrong.
[3] https://github.com/torvalds/linux/blob/master/drivers/hwtracing/stm/core.c
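The chunk-at-a-time calling pattern described above can be sketched as follows. This is a minimal illustration and not the code in core.c or in either driver: the names round_down_pow2 and send_buffer are hypothetical, the packet callback is replaced by a stand-in that only records the width it would have sent, and pointer alignment is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n (n >= 1) down to a power of two, capped at
 * max_width, which must itself be a power of two (for example 4 or 8). */
static size_t round_down_pow2(size_t n, size_t max_width)
{
    size_t w = 1;

    assert(n >= 1);
    assert(max_width >= 1 && (max_width & (max_width - 1)) == 0);
    while (w * 2 <= n && w * 2 <= max_width)
        w *= 2;
    return w;
}

/* Hypothetical caller loop: offer the callback at most max_width bytes at a
 * time and advance by however many bytes it reports as consumed. The width
 * of each packet is recorded in widths[]; the return value is the number of
 * packets generated. */
static size_t send_buffer(size_t count, size_t max_width,
                          size_t *widths, size_t widths_len)
{
    size_t pos = 0, packets = 0;

    while (pos < count && packets < widths_len) {
        size_t chunk = count - pos;
        size_t used;

        if (chunk > max_width)
            chunk = max_width;
        /* Stand-in for the driver's packet callback: one packet per call. */
        used = round_down_pow2(chunk, max_width);
        widths[packets++] = used;
        pos += used;
    }
    return packets;
}
```

Under these assumptions a 16-byte buffer with a 4-byte maximum yields four 4-byte packets, and a 7-byte buffer yields packets of 4, 2 and 1 bytes. The callback never needs to handle odd sizes itself, because the caller only advances by the number of bytes the callback reports.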
It looks like there might be support for other sources to link to the driver, but I could only find the stm_console when I looked.
Of course, this is based on my limited understanding of how this is used, and the current generic STM and Intel drivers. It might be that these changes have been agreed and Intel plan to change their driver to match (as I said, I don't generally follow the mailing lists). However, the different 64b and 32b behaviors on the ARM version are weird, and the unaligned pointer handling looks wrong too.
This is the kind of feedback we need. It would be great if you could work with us to make things better. Comments and requests for modifications have to be made on the public mailing list. Otherwise people will ask (and rightly so) why things were modified from one version to another. Chunyan will CC you on her next submission - please reply to all.
(The different behaviors of the Intel version aren't my problem. :-)
It might also be that I am reverse engineering the behaviour incorrectly.
The other point (which Will has raised on the mailing lists in the past [4]) is this code:
The documentation is very _unclear_ on this topic. Catalin, Will and I couldn't make sense of it and we decided to leave things as is.
    #ifndef CONFIG_64BIT
    static inline void __raw_writeq(u64 val, volatile void __iomem *addr)
    {
            asm volatile("strd %1, %0"
                         : "+Qo" (*(volatile u64 __force *)addr)
                         : "r" (val));
    }
    #undef writeq_relaxed
    #define writeq_relaxed(v, c) __raw_writeq((__force u64) cpu_to_le64(v), c)
    #endif
This isn't guaranteed to work on the ARM 32 bit architectures. The STM might receive a 64-bit write, or might receive a pair of 32-bit writes to the two addressed words *in either order*. The upshot is that this is not a valid way of writing to the STM. (The data reordering is a killer.)
Perfect, so it's officially out on 32-bit architectures. Chunyan, please remove it (for 32 bit) in your next release.
The driver appears to use this if there is an STM-500 in an AArch32 system. This is because the code interrogates the STM to decide whether it supports 64-bit accesses. It should either (a) not do so, and refuse 64-bit data if AArch32, or (b) use some property of the system to decide. I would still frown on (b) because the architecture makes it clear that this is UNPREDICTABLE, meaning you're not supposed to rely on it and the device isn't allowed to advertise its behaviour.
[4] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-October/297379.ht...
I hope this is useful.
With kind regards,
I'm really happy to have received comments on this driver. As mentioned above we would welcome more reviews on the upcoming patchsets.
Thanks, Mathieu
On 4 February 2016 at 08:24, Michael Williams Michael.Williams@arm.com wrote:
[...]
This isn't guaranteed to work on the ARM 32 bit architectures. The STM might receive a 64-bit write, or might receive a pair of 32-bit writes to the two addressed words *in either order*. The upshot is that this is not a valid way of writing to the STM. (The data reordering is a killer.)
The driver appears to use this if there is an STM-500 in an AArch32 system. This is because the code interrogates the STM to decide whether it supports 64-bit accesses. It should either (a) not do so, and refuse 64-bit data if AArch32, or (b) use some property of the system to decide. I would still frown on (b) because the architecture makes it clear that this is UNPREDICTABLE, meaning you're not supposed to rely on it and the device isn't allowed to advertise its behaviour.
Hey Michael,
The above two paragraphs have been running around in my head all weekend long and I thought it best to clarify things before going any further with regards to upstreaming the driver.
First I did some soul searching in the documentation and found the following:
On page 32 of document [1], it is mentioned that an STM's fundamental data size can be either 32 or 64 bit. On page 3-12 of document [2], it is mentioned that an STM's fundamental data size can only be 32 bit.
From there I have the following questions:
1) Can an STM fitted on a 32 bit system have a fundamental data size of 64 bit?
2) Can an STM fitted on a 64 bit system have a fundamental data size of 32 bit?
3) Can an STM fitted on a 64 bit system have a fundamental data size of 64 bit?
4) In all of the above cases will STMPIDR0[7:0] still read 0x962?
Clarifying the above 4 questions will go a long way.
Many thanks, Mathieu
[1]. ARM IHI 0054B (ID092613) [2]. ARM DDI 0444B (ID010111)
Hey Michael,
Have you had time to look into this? I'm afraid that without the information upstreaming of the STM driver can't move ahead.
Thanks, Mathieu.
Hi Mathieu,
As this is a blocker I pinged a mail to one of the CoreSight architects in case MikeW was not available.
Answers below after my comments.
I would add to this:
a) The difference you are seeing in the two docs is that one doc is the architecture doc - listing all possible fundamental data sizes, the other is an implementation of that architecture with the data size fixed @ 32 bit.
b) Also as I am sure you know, strictly speaking you need to read STMPIDR1[3:0]+ STMPIDR0[7:0] to get the 0x962 (original 32 bit STM implementation) or 0x963 (STM-500 64 bit STM part).
This component ID is not however the best method for determining the fundamental data size. The architectural register STMFEAT2R[15:12] contains this value (DSize). It is feasible that another STM part that is architecturally compliant could come along with a different ID register value, but would have to have the correct entry in STMFEAT2R. This register also contains the information on the availability of guaranteed / invariant timing transactions which I imagine are also useful for the driver.
c) As seen below and from MikeWs comments it is evident that a 64 bit write should not be attempted on a 32 bit system or a 64 bit system running in AArch32 mode. In my view what the driver (or higher level software perhaps) does depends on the purpose/ API definition for the 64 bit write. If you want to generate a 64 bit STPv2 packet (D64[M][TS]) then the driver should refuse and error at this point. It may be acceptable to generate 2x32 bit packets under some circumstances, if the API definition allows for this.
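The two register reads mentioned in (b) can be sketched as below. This is a minimal illustration that only performs the bit extraction described in the text; the helper names and the STM_PART_* macros are hypothetical, and the register values are assumed to have been read from the device already. What each DSize value means is left to the architecture document.

```c
#include <stdint.h>

/* Part numbers quoted in this thread. */
#define STM_PART_ORIGINAL 0x962u   /* original 32-bit STM implementation */
#define STM_PART_STM500   0x963u   /* STM-500, 64-bit STM part */

/* Hypothetical helper: assemble the 12-bit part number from
 * STMPIDR1[3:0] (upper nibble) and STMPIDR0[7:0] (lower byte). */
static inline uint32_t stm_part_number(uint32_t pidr0, uint32_t pidr1)
{
    return ((pidr1 & 0xFu) << 8) | (pidr0 & 0xFFu);
}

/* Hypothetical helper: extract the DSize field, STMFEAT2R[15:12]. */
static inline uint32_t stm_feat2r_dsize(uint32_t feat2r)
{
    return (feat2r >> 12) & 0xFu;
}
```

Keying the driver on the DSize field rather than on the part number follows the point made in (b): a future compliant part could carry a different ID but would still have to report its data size in STMFEAT2R.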
Regards
Mike
========================================================================
We have 2 implementations of the STM Architecture:
a) STM, as documented in ARM DDI 0444B ([2] below). This is a 32-bit STM so has the 32-bit fundamental data size.
b) STM-500, as documented in ARM DDI 0528B. This is a 64-bit STM so has the 64-bit fundamental data size.
You are likely to see either STM in 32 or 64-bit systems. We'd have preferred to only see the 64-bit STM in 64-bit systems, but it's not worked out that way unfortunately (e.g. Juno).
1) Can an STM fitted on a 32 bit system have a fundamental data size of 64 bit?
Yes, but see my comments further down.
2) Can an STM fitted on a 64 bit system have a fundamental data size of 32 bit?
Yes.
3) Can an STM fitted on a 64 bit system have a fundamental data size of 64 bit?
Yes.
4) In all of the above cases will STMPIDR0[7:0] still read 0x962?
No. If you see a 64-bit STM-500 it has the part number of 0x963.
The purpose of the fundamental data size indication in the STMs is to indicate whether the STM will take a 64-bit access and generate a D64:
- A 32-bit STM will probably generate 2xD32 packets from a 64-bit access, although this should not be relied upon.
- A 64-bit STM will take a 64-bit access and generate a D64.
Note that there is no guaranteed way for a 32-bit system (either an ARMv7 core, or an ARMv8 core running in AArch32) to generate a 64-bit access to any STM. This means that even if you find a 64-bit STM in a 32-bit system, you should treat it as a 32-bit STM, and only perform 32-bit accesses. Treating a 64-bit STM as a 32-bit STM is fully compatible (i.e. code written to run on a 32-bit STM will work exactly the same on a 64-bit STM).
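The rule in the paragraph above can be sketched as a single width-selection function. This is a minimal sketch and not the driver's code: the function name and the cpu_is_aarch64 parameter are hypothetical, and the encoding of DSize (0 for 32-bit, 1 for 64-bit) is an assumption that should be checked against ARM IHI 0054B.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: choose the widest access, in bytes, that the driver
 * should issue to the STM stimulus ports.
 *
 * feat2r          raw value read from STMFEAT2R
 * cpu_is_aarch64  true only when the kernel itself runs in AArch64
 */
static size_t stm_access_width(uint32_t feat2r, bool cpu_is_aarch64)
{
    uint32_t dsize = (feat2r >> 12) & 0xFu;   /* DSize, STMFEAT2R[15:12] */

    /* A 32-bit system has no guaranteed way to produce a single 64-bit
     * access, so a 64-bit STM is treated as a 32-bit STM there. */
    if (!cpu_is_aarch64)
        return 4;

    /* Assumed encoding: 1 means 64-bit; anything else falls back to 32. */
    return dsize == 1u ? 8 : 4;
}
```

Falling back to 4 bytes in every doubtful case relies on the compatibility statement above: code written for a 32-bit STM is said to behave the same on a 64-bit STM.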
Mike Leach +44 (0)1254 893911 (Direct) Principal Engineer +44 (0)1254 893900 (Main) Arm Blackburn Design Centre +44 (0)1254 893901 (Fax) Belthorn House Walker Rd mailto:mike.leach@arm.com Guide Blackburn BB1 2QE
[1]. ARM IHI 0054B (ID092613) [2]. ARM DDI 0444B (ID010111)
Hi Mike (L),
I think the function stm_fundamental_data_size() checks the FEAT2R register for 64-bit support at the STM. The issue is, as you say, that both the STM and CPU have to support the 64-bit data access for this to work.
My understanding of this driver layer is that it is being asked to write a single packet of a size, and it writes the largest size it supports. The caller needs to understand what the driver is going to do; the STM generic driver appears to do this. There do still seem to be some differences cf the Intel STH driver, though, but it should behave the same if the calls into the driver follow some (undocumented) constraints.
Sorry, I don't have time to say much more right now.
Mike (W).
Mike and Michael,
Thanks for the detailed explanation - very helpful and enlightening.
I think the statement that sums it all up is this one:
"This means that even if you find a 64-bit STM in a 32-bit system, you should treat it as a 32-bit STM, and only perform 32-bit accesses."
The condition is easy to check for and provides an all-encompassing solution.
Mathieu
On 11 February 2016 at 03:04, Mike Leach Mike.Leach@arm.com wrote:
Hi Mathieu,
As this is a blocker I pinged a mail to one of the CoreSight architects in case MikeW was not available.
Answers below after my comments.
I would add to this:
a) The difference you are seeing in the two docs is that one doc is the architecture doc - listing all possible fundamental data sizes, the other is an implementation of that architecture with the data size fixed @ 32 bit.
b) Also as I am sure you know, strictly speaking you need to read STMPIDR1[3:0]+ STMPIDR0[7:0] to get the 0x962 (original 32 bit STM implementation) or 0x963 (STM-500 64 bit STM part).
This component ID is not however the best method for determining the fundamental data size. The architectural register STMFEAT2R[15:12] contains this value (DSize). It is feasible that another STM part that is architecturally compliant could come along with a different ID register value, but would have to have the correct entry in STMFEAT2R. This register also contains the information on the availability of guaranteed / invariant timing transactions which I imagine are also useful for the driver.
c) As seen below and from MikeWs comments it is evident that a 64 bit write should not be attempted on a 32 bit system or a 64 bit system running in AArch32 mode. In my view what the driver (or higher level software perhaps) does depends on the purpose/ API definition for the 64 bit write. If you want to generate a 64 bit STPv2 packet (D64[M][TS]) then the driver should refuse and error at this point. It may be acceptable to generate 2x32 bit packets under some circumstances, if the API definition allows for this.
Regards
Mike
=================================================================================== We have 2 implementations of the STM Architecture: a) STM, as documented in ARM DDI 0444B ([2] below). This is a 32-bit STM so has the 32-bit fundamental data size. b) STM-500 as documented in ARM DDI 0528B. This is a 64-bit STM so has the 64-bit fundamental data size.
You are likely to see either STM in 32 or 64-bit systems. We'd have preferred to only see the 64-bit STM in 64-bit systems, but it's not worked out that way unfortunately (e.g. Juno).
- Can an STM fitted on a 32 bit system have a fundamental data size
of 64 bit?
Yes, but see my comments further down.
- Can an STM fitted on a 64 bit system have a fundamental data size of
32 bit?
Yes.
- Can an STM fitted on a 64 bit system have a fundamental data size of
64 bit?
Yes.
- In all of the above cases will STMPIDR0[7:0] still read 0x962?
No. If you see a 64-bit STM-500 it has the part number of 0x963.
The purpose of the fundamental data size indication in the STMs is to indicate whether the STM will take a 64-bit access and generate a D64:
- A 32-bit STM will probably generate 2xD32 packets from a 64-bit access, although this should not be relied upon.
- A 64-bit STM will take a 64-bit access and generate a D64.
Note that there is no guaranteed way for a 32-bit system (either an ARMv7 core, or an ARMv8 core running in AArch32) to generate a 64-bit access to any STM. This means that even if you find a 64-bit STM in a 32-bit system, you should treat it as a 32-bit STM, and only perform 32-bit accesses. Treating a 64-bit STM as a 32-bit STM is fully compatible (i.e. code written to run on a 32-bit STM will work exactly the same on a 64-bit STM).
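Put as code, the rule above might look like the following sketch. The function name and parameters are invented for illustration; cpu_is_aarch64 stands for whatever the driver uses to know it is running in AArch64 state, and stm_bytes is the fundamental data size reported by the STM.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Largest access width (in bytes) that should be used towards the STM.
 * A 32-bit system (ARMv7, or ARMv8 in AArch32) always gets 4, even if
 * the STM itself is a 64-bit part. Unknown sizes also fall back to 4.
 */
static unsigned stm_effective_access_bytes(unsigned stm_bytes,
                                           bool cpu_is_aarch64)
{
    if (!cpu_is_aarch64)
        return 4;
    return stm_bytes == 8 ? 8 : 4;
}
```

This covers the four combinations discussed above: only a 64-bit STM in an AArch64 system ends up using 64-bit accesses.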
Mike Leach                      +44 (0)1254 893911 (Direct)
Principal Engineer              +44 (0)1254 893900 (Main)
Arm Blackburn Design Centre     +44 (0)1254 893901 (Fax)
Belthorn House
Walker Rd                       mailto:mike.leach@arm.com
Guide
Blackburn BB1 2QE
-----Original Message-----
From: CoreSight [mailto:coresight-bounces@lists.linaro.org] On Behalf Of Mathieu Poirier
Sent: 10 February 2016 18:16
To: Michael Williams
Cc: coresight@lists.linaro.org
Subject: Re: [PATCH V2 0/6] Introduce CoreSight STM support
Hey Michael,
Have you had time to look into this? I'm afraid that without the information, upstreaming of the STM driver can't move ahead.
Thanks, Mathieu.
[...]
This isn't guaranteed to work on the ARM 32 bit architectures. The STM might receive a 64-bit write, or might receive a pair of 32-bit writes to the two addressed words *in either order*. The upshot is that this is not a valid way of writing to the STM. (The data reordering is a killer.)
The driver appears to use this if there is an STM-500 in an AArch32 system.
This is because the code interrogates the STM to decide whether it supports 64-bit accesses. It should either (a) not do so, and refuse 64-bit data if AArch32, or (b) use some property of the system to decide. I would still frown on (b) because the architecture makes it clear that this is UNPREDICTABLE, meaning you're not supposed to rely on it and the device isn't allowed to advertise its behaviour.
Hey Michael,
The above two paragraphs have been running around in my head all weekend long and I thought it best to clarify things before going any further with regards to upstreaming the driver.
First I did some soul searching in the documentation and found the following:
On page 32 of document [1], it is mentioned that an STM's fundamental data size can be either 32 or 64 bit. On page 3-12 of document [2], it is mentioned that an STM's fundamental data size can only be 32 bit.
From there I have the following questions:
1) Can an STM fitted on a 32 bit system have a fundamental data size of 64 bit?
2) Can an STM fitted on a 64 bit system have a fundamental data size of 32 bit?
3) Can an STM fitted on a 64 bit system have a fundamental data size of 64 bit?
4) In all of the above cases will STMPIDR0[7:0] still read 0x962?
Clarifying the above 4 questions will go a long way.
Many thanks, Mathieu
[1]. ARM IHI 0054B (ID092613)
[2]. ARM DDI 0444B (ID010111)
[4] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-October/297379.html
I hope this is useful.
With kind regards,
Mike.
Michael Williams
Principal Engineer
ARM Limited
www.arm.com
The Architecture For The Digital World

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight