On Mon, Jan 24, 2011 at 05:54:18PM -0000, Pawel Moll wrote:
$ git am 0001-arm-Improve-MMC-performance-on-Versatile-Express.patch
/ # uname -a
Linux (none) 2.6.38-rc2+ #12 SMP Mon Jan 24 17:41:53 GMT 2011 armv7l GNU/Linux
/ # cat /dev/sda > /dev/null &
/ # dd if=/dev/mmcblk0 of=/dev/null bs=64k count=128
128+0 records in
128+0 records out
8388608 bytes (8.0MB) copied, 1.501529 seconds, 5.3MB/s
THERE IS NO ISSUE WITH MMCI. THE HARDWARE DOES NOT NEED "FIXING".
I'm really pleased you think so. It's always nice to have happy customers :-)
If you're flooding the system with USB traffic, enlarging the FIFO size won't help. Making the FIFO larger just relaxes the _interrupt_ _latency_ requirements. It doesn't mean you can cope with the amount of data being transferred.
if (cycles-to-transfer-usb-packets + cycles-to-transfer-MMC-data > available-cpu-cycles)
        you_are_stuffed(even_with_larger_fifo);
So I'm not surprised that running USB and MMC at full speed results in MMC losing out. You will find that even with a larger FIFO, MMC will still lose out.
Why? The ISP1761 can store the packets without complaint, and the host CPU can read them when it's ready. As soon as one packet has been read off by the host, the next packet is probably already sitting there waiting for the host CPU to read it. When the ISP1761 buffer becomes full, it can tell the device to stop sending data.
The result of that is the ISP1761 will be able to transfer data as fast as the host CPU can possibly go - to the exclusion of the MMC interface. No amount of FIFO (well, except several KB to cover the _largest_ size of MMC transfer we request) will resolve the problem of USB and MMC sharing the same CPU.
If you're flooding the system with USB traffic, enlarging the FIFO size won't help. Making the FIFO larger just relaxes the _interrupt_ _latency_ requirements. It doesn't mean you can cope with the amount of data being transferred.
On VE both the ISP and the MMCI share the same static memory interface, which naturally limits their throughput. The ISP has a limited transfer buffer (about 60kB) and will not start new transfers until there is some space available. And of course USB devices are unable to initiate _any_ transactions on their own, so the host can simply stall the bus at its discretion.
In my example, when I'm reading as much data from a USB device as possible, the ISP buffer will fill up at some point and the interrupt will be asserted. Now the handler will perform its duties. I'm guessing (not sure, I'm just unable to read this code today) that it gathers all the received transfers and schedules new descriptors, then returns. And now it's the MMC's opportunity - if it can "survive" without being starved until this moment (and, as the experiments suggest, that is about 1.3ms in the worst-case scenario), it has a chance to grab the CPU's attention before the USB transfers finish and the USB interrupt is asserted again. Failing that, I'll try to bump up the MMC interrupt priority.
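As a rough sketch of what that 1.3ms window implies for the FIFO depth (the ~500kHz card clock and 1-bit bus width below are assumptions, roughly matching the figures quoted elsewhere in this thread, not measurements):

/* Illustrative only: how deep would the FIFO need to be to ride out
 * a given interrupt-latency window without overrunning?
 */
#include <stdio.h>

int main(void)
{
        double clock_hz  = 500e3;   /* assumed card clock */
        int    bus_bits  = 1;       /* assumed data bus width */
        double latency_s = 1.3e-3;  /* worst-case window measured above */

        double bytes_per_sec = clock_hz * bus_bits / 8.0;
        double fifo_bytes    = bytes_per_sec * latency_s;

        printf("incoming rate: %.1f KiB/s\n", bytes_per_sec / 1024.0);
        printf("FIFO needed to cover %.1fms: ~%.0f bytes\n",
               latency_s * 1e3, fifo_bytes);
        return 0;
}

Under those assumptions the window works out to roughly 80 bytes arriving while the CPU is busy elsewhere - somewhat more than the stock 16-word (64-byte) MMCI FIFO can absorb, which would explain the overruns.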
But it's all theory right now, just what I'm hoping to achieve, and - with all due respect - it's not your decision whether or not we perform this exercise. And there are virtually no changes to the driver required - all we have to do is add an additional "struct variant_data", so you need not be concerned about horrible driver hacks.
Of course you may be absolutely right and we won't succeed in making it much better. But even in that case we should be able to clock the card faster, which will bring a performance gain on its own.
Good night!
Paweł
On Mon, Jan 24, 2011 at 07:59:03PM +0000, Pawel Moll wrote:
If you're flooding the system with USB traffic, enlarging the FIFO size won't help. Making the FIFO larger just relaxes the _interrupt_ _latency_ requirements. It doesn't mean you can cope with the amount of data being transferred.
On VE both the ISP and the MMCI share the same static memory interface,
What has that to do with it? If the static memory controller were the bottleneck, don't you think that two CPUs running in parallel, one reading data from the ISP1761 and the other reading the MMCI, would suffer bus starvation? Your "HACK HACK HACK" patch shows that clearly isn't the case.
You've already told me that you've measured the ISP1761 interrupt handler taking 1.3ms to empty data off the chip. If that's 60K of data, that's a data rate of around 47MiB/s.
At a 521kHz transfer rate, it takes about 490us for the MMCI to half-fill its FIFO, or 980us to fill it completely. It takes (measured) about 6-9us to unload 32 bytes of data from the FIFO.
Translating the CPU read rate, that's a data rate of around 4MiB/s.
So I put it to you that there's plenty of bus bandwidth present to service both the ISP1761 and MMCI. What we're lacking is CPU bandwidth.
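(For reference, the arithmetic behind those figures as a small sketch, using only the numbers quoted above; the results land in the same ballpark as the quoted rates, give or take rounding and MB versus MiB:)

#include <stdio.h>

int main(void)
{
        /* ISP1761: roughly 60K of data drained per 1.3ms interrupt handler */
        double usb_rate = (60.0 * 1024.0) / 1.3e-3;

        /* MMCI PIO: 32 bytes unloaded per 6-9us of CPU time */
        double mmc_best  = 32.0 / 6e-6;
        double mmc_worst = 32.0 / 9e-6;

        printf("ISP1761 drain rate : ~%.0f MB/s\n", usb_rate / 1e6);
        printf("MMCI PIO rate      : ~%.1f to %.1f MB/s\n",
               mmc_worst / 1e6, mmc_best / 1e6);
        return 0;
}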
I guess you haven't thought about moving MMCI to an adaptive clocking solution? What I'm suggesting is: halve the clock rate on a FIFO error and retry; increase the clock rate on each successful transfer, up to the maximum provided by the core MMC code.
That should _significantly_ increase the achievable PIO data rate way beyond what a deeper FIFO could ever hope to achieve, and will allow it to adapt to situations where you load the system up beyond the maximum latency which the MMCI can stand.
This would benefit a whole range of platforms, improving performance across the board, which as you've already indicated merely going for a deeper FIFO would be unable to do.
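Very roughly, the sort of thing being suggested could look like this (a sketch with made-up names, not the actual mmci.c code; the hooks, fields and step size are all assumptions):

/* Hypothetical adaptive-clock state for a PL18x-style host driver.
 * The idea sketched above: halve the clock on a FIFO over/underrun and
 * retry the request, then creep back up on each successful transfer,
 * never exceeding the rate requested by the MMC core.
 */
struct adaptive_clk {
        unsigned int cur_hz;    /* rate currently programmed into the host */
        unsigned int max_hz;    /* maximum handed down by the MMC core */
        unsigned int min_hz;    /* floor so we never stall completely */
};

static void adaptive_clk_fifo_error(struct adaptive_clk *ac)
{
        /* FIFO over/underrun: back off hard before retrying the request */
        ac->cur_hz /= 2;
        if (ac->cur_hz < ac->min_hz)
                ac->cur_hz = ac->min_hz;
}

static void adaptive_clk_success(struct adaptive_clk *ac)
{
        /* Successful transfer: step back towards the core's maximum */
        ac->cur_hz += ac->cur_hz / 8;
        if (ac->cur_hz > ac->max_hz)
                ac->cur_hz = ac->max_hz;
}

How aggressively to step back up (the /8 above is arbitrary) is exactly the rate-selection part that would need tuning.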
On Mon, Jan 24, 2011 at 11:10:50PM +0000, Russell King - ARM Linux wrote:
I guess you haven't thought about moving MMCI to an adaptive clocking solution? What I'm suggesting is: halve the clock rate on a FIFO error and retry; increase the clock rate on each successful transfer, up to the maximum provided by the core MMC code.
That should _significantly_ increase the achievable PIO data rate way beyond what a deeper FIFO could ever hope to achieve, and will allow it to adapt to situations where you load the system up beyond the maximum latency which the MMCI can stand.
And to prove the point, I have MMCI running at up to 4Mbps, an 8-fold increase over what the current fixed upper-rate implementation does. The adaptive rate implementation is just a proof of concept at the moment and requires further work to improve the rate selection algorithm.
The real solution to this is for there to be proper working DMA support implemented on ARM platforms, rather than requiring every peripheral to be driven via CPU intensive PIO all the time.
And to prove the point, I have MMCI running at up to 4Mbps, an 8-fold increase over what the current fixed upper-rate implementation does. The adaptive rate implementation is just a proof of concept at the moment and requires further work to improve the rate selection algorithm.
Great, I'm terribly glad you managed to have a go at this (I honestly wanted to, but simply had no time). I'm looking forward to seeing the patches and will be more than happy to backport them for the sake of the Linaro guys using 2.6.35 and 2.6.37 right now.
On our side we did extend the FIFO and performed some tests (not very extensive yet, though). The change seems not to break anything and to help in the pathological (heavy USB traffic) scenario.
When I get your changes and some official FPGA release, I'll try to push the bandwidth limits even further - hopefully the changes will complement each other. Of course there is absolutely no need to use the modified version if one opts not to ;-) It's just a line in the board configuration file, selecting one FPGA image or the other - they have different "configuration" values in the peripheral ID so the driver can distinguish them.
The real solution to this is for there to be proper working DMA support implemented on ARM platforms,
In the case of VE this is all about getting an engine into the test chips, which didn't happen for A9 (the request lines are routed between the motherboard and the tile, and the IO FPGA can - theoretically - use the MMCI requests). As far as I'm told this cell is simply huge (silicon-wise) and therefore it's the first candidate to be cut when area is scarce... Anyway, I've spoken to the guys around here and asked them to keep the problem in mind, so we may get something in the next releases.
Cheers!
Paweł
On Tue, Feb 01, 2011 at 01:34:59PM +0000, Pawel Moll wrote:
And to prove the point, I have MMCI running at up to 4Mbps, an 8-fold increase over what the current fixed upper-rate implementation does. The adaptive rate implementation is just a proof of concept at the moment and requires further work to improve the rate selection algorithm.
Great, I'm terribly glad you managed to have a go at this (I honestly wanted to, but simply had no time). I'm looking forward to seeing the patches and will be more than happy to backport them for the sake of the Linaro guys using 2.6.35 and 2.6.37 right now.
On our side we did extend the FIFO and performed some tests (not very extensive yet, though). The change seems not to break anything and to help in the pathological (heavy USB traffic) scenario.
When I get your changes and some official FPGA release, I'll try to push the bandwidth limits even further - hopefully the changes will complement each other.
You can't push it any further without increasing the CPU/bus clock rates. My measurements show that it takes the CPU in the region of 6-9us to unload 32 bytes from the FIFO, which gives a theoretical limit of 2.8 to 4.2Mbps, depending on how the platform booted (on some boots it's consistently in the order of 6us, on others it's consistently around 9us).
The real solution to this is for there to be proper working DMA support implemented on ARM platforms,
In the case of VE this is all about getting an engine into the test chips, which didn't happen for A9 (the request lines are routed between the motherboard and the tile, and the IO FPGA can - theoretically - use the MMCI requests). As far as I'm told this cell is simply huge (silicon-wise) and therefore it's the first candidate to be cut when area is scarce... Anyway, I've spoken to the guys around here and asked them to keep the problem in mind, so we may get something in the next releases.
Bear in mind that PL18x + PL08x doesn't work. Catalin forwarded my concerns over this to ARM Support - where I basically asked how to program the hardware up to DMA a single 64K transfer off an MMC card into a set of scattered memory locations.
I've yet to have a response, so I'll take it that it's just not possible (the TRMs say as much).
The problem is that for a transfer, the MMCI produces BREQ * n + LBREQ, and the DMAC will only listen for a LBREQ if it's in peripheral flow control. If it's in peripheral flow control, then it ignores the transfer length field in the control register, only moving to the next LLI when it sees LBREQ or LSREQ.
It ignores LBREQ and LSREQ in DMAC flow control mode. You can DMA almost all the data to/from the MMCI, but you miss the last half-fifo-size worth of data. While you can unload that manually for a read, you can't load it manually for a write.
With peripheral flow control, you can only DMA the requested data to a single contiguous buffer without breaking the MMC request into much smaller chunks. As Peter Pearse's PL08x code seems to suggest, the maximum size of those chunks is 1K.
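To put a number on what that chunking means (an illustrative sketch only; the 1K chunk limit is the figure quoted above, and the per-chunk overhead is simply the usual cost of issuing each transfer as its own MMC data command):

#include <stdio.h>

int main(void)
{
        unsigned int request_bytes = 64 * 1024; /* one scattered 64K MMC read */
        unsigned int chunk_bytes   = 1024;      /* max chunk size under
                                                   peripheral flow control,
                                                   per the PL08x code above */
        unsigned int chunks        = request_bytes / chunk_bytes;

        printf("A %uK request becomes %u separate transfers of <=%u bytes,\n",
               request_bytes / 1024, chunks, chunk_bytes);
        printf("each carrying its own command/setup overhead on the card.\n");
        return 0;
}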
This seems to be a fundamental problem with the way each primecell has been designed.
So, I do hope that someone decides to implement something more reasonable if Versatile Express were to get a DMA controller. If it's another PL08x then it isn't worth it - it won't work.
So, I do hope that someone decides to implement something more reasonable if Versatile Express were to get a DMA controller. If it's another PL08x then it isn't worth it - it won't work.
More likely it will be a DMA-330, formerly known as PL330. But as I said - it's all about the test chip that the Core Tile carries (whether it's feasible to squeeze the engine inside).
Alternatively one could use a Logic Tile (FPGA-based) in the second daughterboard site with a DMA engine implementation, but I don't expect many users willing to pay $$$$ just for this reason :-)
Cheers!
Paweł