On Thursday 28 April 2011, Per Forlin wrote:
> For reads on the other hand it looks like this:
>
> root@(none):/ dd if=/dev/mmcblk0 of=/dev/null bs=4k count=256
> 256+0 records in
> 256+0 records out
> root@(none):/ dmesg
> [mmc_queue_thread] req d954cec0 blocks 32
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req d954cec0 blocks 64
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req d954cde8 blocks 128
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req d954cec0 blocks 256
> [mmc_queue_thread] req (null) blocks 0
>
> There is never more than one read request in the mmc block queue. All the mmc request preparations are therefore serialized, and the cost of this is roughly 10% lower bandwidth (verified on the ARM platforms ux500 and Pandaboard).
After some offline discussions, I went back to look at your mail, and I think the explanation is much simpler than you expected:
You have only a single process reading blocks synchronously, so each round trip goes all the way to user space. The block layer does some readahead, so for the first read it starts reading 32 blocks (16 KB) instead of just 8 (4 KB), but then the user process just sits waiting for data. After the mmc driver has finished reading the entire 32 blocks, the user needs a little time to consume them from the page cache in 4 KB chunks (4 syscalls), during which the block layer has no clue about what the user wants to do next.
The readahead scales up to 256 blocks, but there is still only one reader, so you never have additional requests in the queue.
Try running multiple readers in parallel, e.g.
for i in 1 2 3 4 5 ; do
        dd if=/dev/mmcblk0 bs=16k count=256 iflag=direct skip=$((i * 1024)) &
done
Arnd