On Tue, Feb 01, 2011 at 01:34:59PM +0000, Pawel Moll wrote:
> > And to prove the point, I have MMCI running at up to 4Mbps, an 8-fold increase over what the current fixed upper-rate implementation does. The adaptive rate implementation is just a proof of concept at the moment and requires further work to improve the rate selection algorithm.
> Great, I'm terribly glad you managed to have a go at this (I honestly wanted to, but simply had no time). I'm looking forward to seeing the patches and will be more than happy to backport them for the sake of the Linaro guys using 2.6.35 and 2.6.37 right now.
> On our side we did extend the FIFO and performed some tests (not very extensive yet, though). The change doesn't seem to break anything and helps in the pathological (heavy USB traffic) scenario.
> When I get your changes and some official FPGA release, I'll try to push the bandwidth limits even further - hopefully the changes will complement each other.
You can't push it any further without increasing the CPU/bus clock rates. My measurements show that it takes the CPU in the region of 6-9us to unload 32 bytes from the FIFO, which gives a theoretical limit of 2.8 to 4.2Mbps depending on how the platform booted (on some boots it's consistently in the region of 6us, on others it's consistently around 9us).
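To illustrate where that time goes: on every RX half-full interrupt, the PIO path has to do something of this shape (a simplified sketch in the spirit of mmci.c's read path, not the actual code - mmci_drain_halffifo and MCI_HALFFIFO are made-up names):

#include <linux/io.h>

#define MMCIFIFO	0x080	/* data FIFO offset, per the PL180 TRM */
#define MCI_HALFFIFO	8	/* 8 words = 32 bytes per half-FIFO */

/*
 * Roughly what the PIO read path does on each RX half-full interrupt.
 * The 6-9us measured above is the cost of taking the interrupt and
 * running this, so at 32 bytes per interrupt it is this cost, not the
 * card clock, which bounds the achievable rate.
 */
static void mmci_drain_halffifo(void __iomem *base, u32 *buf)
{
	readsl(base + MMCIFIFO, buf, MCI_HALFFIFO);
}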
> > The real solution to this is for there to be proper working DMA support implemented on ARM platforms,
> In the case of VE this is all about getting an engine into the test chips, which didn't happen for A9 (the request lines are routed between the motherboard and the tile, and the IO FPGA can - theoretically - use the MMCI requests). From what I'm told this cell is simply huge (silicon-wise) and is therefore the first candidate to be cut when area is scarce... Anyway, I've spoken to the guys around here and asked them to keep the problem in mind, so we may get something with the next releases.
Bear in mind that PL18x + PL08x doesn't work. Catalin forwarded my concerns over this to ARM Support, where I basically asked how to program the hardware up to DMA a single 64K transfer off an MMC card into a set of scattered memory locations.
I've yet to have a response, so I'll take it that it's just not possible, despite what the TRMs suggest.
The problem is that for a transfer, the MMCI produces n burst requests (BREQ) followed by a last burst request (LBREQ), and the DMAC will only listen for an LBREQ if it's in peripheral flow control. If it's in peripheral flow control, then it ignores the transfer length field in the control register, only moving on to the next LLI when it sees an LBREQ or LSREQ.
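For reference, the flow control setting is the FlowCntrl field in the PL080 channel configuration register - something like the following, going from my reading of the TRM (the names are mine and the bit positions are worth double-checking):

/*
 * PL080 channel configuration register, FlowCntrl field (bits 13:11),
 * as I read the TRM - double-check before relying on this.  Only the
 * "DMAC is flow controller" encodings honour the transfer size field
 * in the channel control register; the "peripheral is flow controller"
 * encodings ignore it and wait for LBREQ/LSREQ instead.
 */
#define PL080_FLOW_MEM2MEM	(0x0 << 11)	/* DMAC flow control */
#define PL080_FLOW_MEM2PER	(0x1 << 11)	/* DMAC flow control */
#define PL080_FLOW_PER2MEM	(0x2 << 11)	/* DMAC flow control */
#define PL080_FLOW_PER2PER	(0x3 << 11)	/* DMAC flow control */
#define PL080_FLOW_PER2PER_DST	(0x4 << 11)	/* dst peripheral ctrl */
#define PL080_FLOW_MEM2PER_PER	(0x5 << 11)	/* peripheral flow ctrl */
#define PL080_FLOW_PER2MEM_PER	(0x6 << 11)	/* peripheral flow ctrl */
#define PL080_FLOW_PER2PER_SRC	(0x7 << 11)	/* src peripheral ctrl */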
In DMAC flow control mode, on the other hand, it ignores LBREQ and LSREQ. You can DMA almost all of the data to/from the MMCI, but you miss the last half-FIFO's worth of data. While you can unload that manually for a read, you can't load it manually for a write.
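The read-side workaround amounts to something like this (a sketch only - mmci_drain_residue is a made-up name, and the register/bit definitions are from my reading of the PL180 TRM):

#include <linux/io.h>

#define MMCISTATUS	0x034	/* status register, per the PL180 TRM */
#define MMCIFIFO	0x080	/* data FIFO */
#define MCI_RXDATAAVLBL	(1 << 21)	/* RxDataAvlbl status bit */

/*
 * After the DMA transfer completes, whatever the DMAC left behind in
 * the FIFO (up to half a FIFO's worth in DMAC flow control mode) has
 * to be pulled out by the CPU.  This only works for reads; there is
 * no equivalent way to push the tail of a write into the FIFO at the
 * right moment.
 */
static void mmci_drain_residue(void __iomem *base, u32 *buf,
			       unsigned int words)
{
	while (words--) {
		while (!(readl(base + MMCISTATUS) & MCI_RXDATAAVLBL))
			cpu_relax();
		*buf++ = readl(base + MMCIFIFO);
	}
}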
So with peripheral flow control, you can only DMA the requested data into a single contiguous buffer unless you break the MMC request up into much smaller chunks - and as Peter Pearse's PL08x code seems to suggest, the maximum size of those chunks is 1K.
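To show what that chunking means in practice, a rough sketch (pl08x_do_one_xfer is a hypothetical stand-in for programming and running one bounded transfer):

#include <linux/kernel.h>
#include <linux/scatterlist.h>

#define MAX_CHUNK	1024	/* the 1K limit mentioned above */

/* Hypothetical helper: program and run one bounded DMA transfer. */
void pl08x_do_one_xfer(dma_addr_t addr, unsigned int len);

/*
 * Each scatterlist entry has to be split into runs of at most 1K and
 * each run issued as its own transfer, because with the peripheral as
 * flow controller the DMAC won't advance through an LLI chain on
 * transfer-size boundaries.
 */
static void xfer_sg_chunked(struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		dma_addr_t addr = sg_dma_address(sg);
		unsigned int left = sg_dma_len(sg);

		while (left) {
			unsigned int chunk = min_t(unsigned int,
						   left, MAX_CHUNK);

			pl08x_do_one_xfer(addr, chunk);
			addr += chunk;
			left -= chunk;
		}
	}
}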
This seems to be a fundamental problem with the way each primecell has been designed.
So, I do hope that someone decides to implement something more reasonable if Versatile Express is ever to get a DMA controller. If it's another PL08x then it isn't worth it - it won't work.