On 4 February 2016 at 08:24, Michael Williams Michael.Williams@arm.com wrote:
Hi Mathieu,
In case you don't remember, I work in the ARM architecture team with responsibility for debug and CoreSight. We have met before at Linaro Connect and discussed the Linaro support for CoreSight.
I have to say it's all looking very encouraging, with progress being made on CoreSight and trace support. So that's all good!
Now, I don't normally read the Linux mailing lists, but Al Grant pointed me at this patch to add CoreSight STM support [1].
[1] http://www.spinics.net/lists/arm-kernel/msg479457.html
There're some points that confuse me about this patch. Since I don't follow the mailing lists, Will Deacon suggested I mail you directly. In particular, its behaviour is different for STMv1 vs. STM-500, as well as being different to the Intel driver [2].
[2] https://github.com/torvalds/linux/blob/master/drivers/hwtracing/intel_th/sth...
E.g. consider a call for the following:
uint64 data[128]; // A 64b aligned pointer stm.packet(16, &((char *)data)[1]); // Send 16 bytes, unaligned
The Intel 32bit driver (sth_stm_driver) will:
- Send a D32 packet consisting of data[4..1] (assuming they don't fault misaligned addresses). Because size > 8, it rounds size down to 4.
- Ignore the other data.
- Only ever generates a single packet.
- For other sizes, rounds size down to power of 2 and returns number of bytes written.
The Intel 64bit driver (sth_stm_driver) will:
- Do nothing because size > 8.
- Only ever generates a zero or one packets.
- For size <= 8, rounds size down to power of 2 and returns number of bytes written.
The CoreSight 32bit driver (stm_send) will:
- Send a D1=data[1], D2=data[3..2], D4=data[7..4], D4=data[11..8], D4=data[15..12], D1=data[16] stream.
- This is very inefficient use of bandwidth.
The CoreSight 64bit driver (stm_send_64bit) will:
- Send a D8=ZeroExtend(data[7..1]), D8=data[15..8] D8=ZeroExtend(data[16])
- This function only ever sends D8 packets.
- There is no way for the decoder to work out what the original data was.
I think this function is only called from within the generic driver [3], though. It looks like stm_write() calls it a chunk at a time, with a chunk being at most 4/8 bytes, depending on the capabilities of the STM, and is expecting it to send only a single packet. This means that the code in stm_send/stm_send_64bit to deal with odd sized packets and misaligned addresses looks redundant/wrong.
[3] https://github.com/torvalds/linux/blob/master/drivers/hwtracing/stm/core.c
It looks like there might be support for other sources to link to the driver, but I could only find the stm_console when I looked.
Of course, this is based on my limited understanding of how this is used, and the current generic STM and Intel drivers. It might be that these changes have been agreed and Intel plan to change their driver to match (as I said, I don't generally follow the mailing lists). However, the different 64b and 32b behaviors on the ARM version are weird, and the unaligned pointer handling looks wrong too.
(The different behaviors on the Intel version isn't my problem. :-)
It might also be that I am reverse engineering the behaviour incorrectly.
The other point (which Will has raised on the mailing lists in the past [4]) is this code:
#ifndef CONFIG_64BIT static inline void __raw_writeq(u64 val, volatile void __iomem *addr) { asm volatile("strd %1, %0" : "+Qo" (*(volatile u64 __force *)addr) : "r" (val)); } #undef writeq_relaxed #define writeq_relaxed(v, c)__raw_writeq((__force u64) cpu_to_le64(v), c) #endif
This isn't guaranteed to work on the ARM 32 bit architectures. The STM might receive a 64-bit write, or might receive a pair of 32-bit writes to the two addressed words *in either order*. The upshot is that this is not a valid way of writing to the STM. (The data reordering is a killer.)
The driver appears to use this if there is an STM-500 in an AArch32 system. This is because the code interrogates the STM to decide whether it supports 64-bit accesses. It should either (a) not do so, and refuse 64-bit data if AArch32, or (b) use some property of the system to decide. I would still frown on (b) because the architecture makes it clear that this is UNPREDICTABLE, meaning you're not supposed to rely on it and the device isn't allowed to advertise its behaviour.
Hey Michael,
The above two paragraphs have been running around in my head all weekend long and I thought it best to clarify things before going any further with regards to upstreaming the driver.
First I did some soul searching in the documentation and found the following:
On page 32 of document [1], it is mentioned that an STM's fundamental data size can be either 32 or 64 bit. On page 3-12 of document [2], it is mentioned that an STM's fundamental data size can only be 32 bit.
From there have the following questions:
1) Can an STM fitted on a 32 bit system have a fundamental data size of 64 bit? 2) Can an STM fitted on a 64 bit system have a fundamental data size of 32 bit? 3) Can an STM fitted on a 64 bit system have a fundamental data size of 64 bit? 4) In all of the above cases will STMPIDR0[7:0] still read 0x962?
Clarifying the above 4 questions will go a long way.
Many thanks, Mathieu
[1]. ARM IHI 0054B (ID092613) [2]. ARM DDI 0444B (ID010111)
[4] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-October/297379.ht...
I hope this is useful.
With kind regards,
Mike.
Michael Williams Principal Engineer ARM Limited www.arm.com The Architecture For The Digital World IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.