Hi,
I am working on minimizing the latency between two block requests in the mmc framework. The approach is simple: if there is more than one request in the block queue, the 2nd request is prepared while the 1st request is being transferred. When the 1st request completes, the 2nd request can then start with minimal latency cost.
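For reference, a minimal sketch of the idea. The helper names (mmc_prepare_req, mmc_wait_for_transfer, mmc_start_transfer) are hypothetical placeholders for the prepare/issue/complete steps, not the actual patch:

	static void mmc_queue_thread_loop(struct mmc_queue *mq)
	{
		struct request *cur = NULL, *next;

		do {
			next = blk_fetch_request(mq->queue);
			if (next)
				mmc_prepare_req(mq, next);	/* sg/DMA prep runs in parallel with cur */
			if (cur)
				mmc_wait_for_transfer(mq, cur);	/* complete the ongoing transfer */
			if (next)
				mmc_start_transfer(mq, next);	/* issue with minimal gap */
			cur = next;
		} while (cur);
	}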
For writes this works fine:

root@(none):/ dd of=/dev/mmcblk0p2 if=/dev/zero bs=4k count=2560
2560+0 records in
2560+0 records out
root@(none):/ dmesg
[mmc_queue_thread] req d97a2ac8 blocks 1024
[mmc_queue_thread] req d97a2ba0 blocks 1024
[mmc_queue_thread] req d97a2c78 blocks 1024
[mmc_queue_thread] req d97a2d50 blocks 1024
[mmc_queue_thread] req d97a2e28 blocks 1024
[mmc_queue_thread] req d97a2f00 blocks 1024
[mmc_queue_thread] req d954c9b0 blocks 1024
[mmc_queue_thread] req d954c800 blocks 1024
[mmc_queue_thread] req d954c728 blocks 1024
[mmc_queue_thread] req d954c650 blocks 1024
[mmc_queue_thread] req d954c578 blocks 1024
[mmc_queue_thread] req d954c4a0 blocks 1024
[mmc_queue_thread] req d954c8d8 blocks 1024
[mmc_queue_thread] req d954c3c8 blocks 1024
[mmc_queue_thread] req d954c2f0 blocks 1024
[mmc_queue_thread] req d954c218 blocks 1024
[mmc_queue_thread] req d954c140 blocks 1024
[mmc_queue_thread] req d954c068 blocks 1024
[mmc_queue_thread] req d954cde8 blocks 1024
[mmc_queue_thread] req d954cec0 blocks 1024

The mmc block queue never runs empty, so all the mmc request preparations can run in parallel with an ongoing transfer.
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req (null) blocks 0

"req (null)" indicates that there are no requests pending in the mmc block queue. This is expected here, since there are no more requests to process.
For reads, on the other hand, it looks like this:

root@(none):/ dd if=/dev/mmcblk0 of=/dev/null bs=4k count=256
256+0 records in
256+0 records out
root@(none):/ dmesg
[mmc_queue_thread] req d954cec0 blocks 32
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d954cec0 blocks 64
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d954cde8 blocks 128
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d954cec0 blocks 256
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d954cde8 blocks 256
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d954cec0 blocks 256
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d954cde8 blocks 256
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d954cec0 blocks 256
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d954cde8 blocks 256
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d954cec0 blocks 256
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d954cde8 blocks 256
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d954cec0 blocks 256
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req (null) blocks 0

There is never more than one read request in the mmc block queue, so all the mmc request preparations are serialized. The cost of this is roughly 10% lower bandwidth (verified on the ARM platforms ux500 and Pandaboard).
page_not_up_to_date:
		/* Get exclusive access to the page ... */
		error = lock_page_killable(page);
I looked at the code in do_generic_file_read(). lock_page_killable waits until the current readahead is completed. Is it possible to configure the readahead to push multiple read requests to the block device queue?
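For context, the surrounding logic in do_generic_file_read() looks roughly like this (paraphrased and heavily abbreviated from mm/filemap.c of this era; error handling and reference counting omitted):

	/* paraphrased sketch, not verbatim kernel code */
	page = find_get_page(mapping, index);
	if (!page) {
		/* nothing cached yet: start synchronous readahead */
		page_cache_sync_readahead(mapping, ra, filp,
					  index, last_index - index);
		page = find_get_page(mapping, index);
	}
	if (PageReadahead(page))
		/* hit the lookahead mark: start the next readahead window */
		page_cache_async_readahead(mapping, ra, filp,
					   page, index, last_index - index);
	if (!PageUptodate(page)) {
page_not_up_to_date:
		/* I/O still in flight: sleep until the read completes */
		error = lock_page_killable(page);
	}

The question is essentially whether the async readahead path can be made aggressive enough that more than one request is outstanding at the block layer while the reader sleeps here.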
Thanks, Per
On Thursday 28 April 2011, Per Forlin wrote:
> For reads, on the other hand, it looks like this:
>
> root@(none):/ dd if=/dev/mmcblk0 of=/dev/null bs=4k count=256
> 256+0 records in
> 256+0 records out
> root@(none):/ dmesg
> [mmc_queue_thread] req d954cec0 blocks 32
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req d954cec0 blocks 64
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req d954cde8 blocks 128
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req d954cec0 blocks 256
> [mmc_queue_thread] req (null) blocks 0
>
> There is never more than one read request in the mmc block queue, so all the mmc request preparations are serialized. The cost of this is roughly 10% lower bandwidth (verified on the ARM platforms ux500 and Pandaboard).
After some offline discussions, I went back to look at your mail, and I think the explanation is much simpler than you expected:
You have only a single process reading blocks synchronously, so the round trip goes all the way to user space. The block layer does some readahead, so it will start reading 32 blocks instead of just 8 (4KB) for the first read, but then the user process just sits waiting for data. After the mmc driver has finished reading the entire 32 blocks (16KB), the user needs a little time to read them from the page cache in 4 KB chunks (4 syscalls), during which the block layer has no clue about what the user wants to do next.
The readahead scales up to 256 blocks, but there is still only one reader, so you never have additional requests in the queue.
Try running multiple readers in parallel, e.g.
for i in 1 2 3 4 5 ; do dd if=/dev/mmcblk0 bs=16k count=256 iflag=direct skip=$[$i * 1024] & done
Arnd
On 3 May 2011 15:16, Arnd Bergmann <arnd@arndb.de> wrote:
> On Thursday 28 April 2011, Per Forlin wrote:
> > For reads, on the other hand, it looks like this:
> >
> > root@(none):/ dd if=/dev/mmcblk0 of=/dev/null bs=4k count=256
> > 256+0 records in
> > 256+0 records out
> > root@(none):/ dmesg
> > [mmc_queue_thread] req d954cec0 blocks 32
> > [mmc_queue_thread] req (null) blocks 0
> > [mmc_queue_thread] req (null) blocks 0
> > [mmc_queue_thread] req d954cec0 blocks 64
> > [mmc_queue_thread] req (null) blocks 0
> > [mmc_queue_thread] req d954cde8 blocks 128
> > [mmc_queue_thread] req (null) blocks 0
> > [mmc_queue_thread] req d954cec0 blocks 256
> > [mmc_queue_thread] req (null) blocks 0
> >
> > There is never more than one read request in the mmc block queue, so all the mmc request preparations are serialized. The cost of this is roughly 10% lower bandwidth (verified on the ARM platforms ux500 and Pandaboard).
>
> After some offline discussions, I went back to look at your mail, and I think the explanation is much simpler than you expected:
>
> You have only a single process reading blocks synchronously, so the round trip goes all the way to user space. The block layer does some readahead, so it will start reading 32 blocks instead of just 8 (4KB) for the first read, but then the user process just sits waiting for data. After the mmc driver has finished reading the entire 32 blocks (16KB), the user needs a little time to read them from the page cache in 4 KB chunks (4 syscalls), during which the block layer has no clue about what the user wants to do next.
>
> The readahead scales up to 256 blocks, but there is still only one reader, so you never have additional requests in the queue.
>
> Try running multiple readers in parallel, e.g.
>
> for i in 1 2 3 4 5 ; do dd if=/dev/mmcblk0 bs=16k count=256 iflag=direct skip=$[$i * 1024] & done
Yes, you are right about this. If I run with multiple read threads, there are multiple requests waiting in the mmc block queue.
page_not_up_to_date:
		/* Get exclusive access to the page ... */
		error = lock_page_killable(page);

I looked at the code in do_generic_file_read(). lock_page_killable waits until the current readahead is completed. Is it possible to configure the readahead to push multiple read requests to the block device queue?
When I first looked at this I used: dd if=/dev/mmcblk0 of=/dev/null bs=1M count=4. If bs is larger than the readahead size, the execution loops in do_generic_file_read(), reading 512 until 1M is read. The second time through this loop it waits on lock_page_killable.
If bs=16k, the execution won't get stuck at lock_page_killable.
> Arnd
Thanks, Per
On Tuesday 03 May 2011 20:54:43 Per Forlin wrote:
> page_not_up_to_date:
> 		/* Get exclusive access to the page ... */
> 		error = lock_page_killable(page);
>
> I looked at the code in do_generic_file_read(). lock_page_killable waits until the current readahead is completed. Is it possible to configure the readahead to push multiple read requests to the block device queue?
I believe sleeping in __lock_page_killable is the best possible scenario. Most cards I've seen work best when you use at least 64KB reads, so it will be faster to wait there than to read smaller units.
> When I first looked at this I used: dd if=/dev/mmcblk0 of=/dev/null bs=1M count=4. If bs is larger than the readahead size, the execution loops in do_generic_file_read(), reading 512 until 1M is read. The second time through this loop it waits on lock_page_killable.
>
> If bs=16k, the execution won't get stuck at lock_page_killable.
Submitting small 512-byte read requests is a real problem when the underlying page size is 16 KB. If your interpretation is right, we should probably find a way to make it read larger chunks on flash media.
Arnd
On 3 May 2011 22:02, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 03 May 2011 20:54:43 Per Forlin wrote:
> > page_not_up_to_date:
> > 		/* Get exclusive access to the page ... */
> > 		error = lock_page_killable(page);
> >
> > I looked at the code in do_generic_file_read(). lock_page_killable waits until the current readahead is completed. Is it possible to configure the readahead to push multiple read requests to the block device queue?
>
> I believe sleeping in __lock_page_killable is the best possible scenario. Most cards I've seen work best when you use at least 64KB reads, so it will be faster to wait there than to read smaller units.
>
> > When I first looked at this I used: dd if=/dev/mmcblk0 of=/dev/null bs=1M count=4. If bs is larger than the readahead size, the execution loops in do_generic_file_read(), reading 512 until 1M is read. The second time through this loop it waits on lock_page_killable.
> >
> > If bs=16k, the execution won't get stuck at lock_page_killable.
>
> Submitting small 512-byte read requests is a real problem when the underlying page size is 16 KB. If your interpretation is right, we should probably find a way to make it read larger chunks on flash media.
Sorry a typo. I missed out a "k" :) It reads 512k until 1M.
> Arnd
Per
On Tuesday 03 May 2011, Per Forlin wrote:
> > Submitting small 512-byte read requests is a real problem when the underlying page size is 16 KB. If your interpretation is right, we should probably find a way to make it read larger chunks on flash media.
>
> Sorry a typo. I missed out a "k" :) It reads 512k until 1M.
Ok, much better then.
Arnd
On 3 May 2011 22:02, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 03 May 2011 20:54:43 Per Forlin wrote:
> > page_not_up_to_date:
> > 		/* Get exclusive access to the page ... */
> > 		error = lock_page_killable(page);
> >
> > I looked at the code in do_generic_file_read(). lock_page_killable waits until the current readahead is completed. Is it possible to configure the readahead to push multiple read requests to the block device queue?
>
> I believe sleeping in __lock_page_killable is the best possible scenario. Most cards I've seen work best when you use at least 64KB reads, so it will be faster to wait there than to read smaller units.
Sleeping is ok, but I don't want the read execution to stop (mmc going idle when there is actually more to read). I made an interesting discovery when I forced the host max_req_size to 64k. The reads now look like this:

dd if=/dev/mmcblk0 of=/dev/null bs=4k count=256
[mmc_queue_thread] req d955f9b0 blocks 32
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d955f9b0 blocks 64
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d955f8d8 blocks 128
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req d955f9b0 blocks 128
[mmc_queue_thread] req d955f800 blocks 128
[mmc_queue_thread] req d955f8d8 blocks 128
[do_generic_file_read] lock_page_killable-wait sec 0 nsec 7811230
[mmc_queue_thread] req d955fec0 blocks 128
[mmc_queue_thread] req d955f800 blocks 128
[do_generic_file_read] lock_page_killable-wait sec 0 nsec 7811492
[mmc_queue_thread] req d955f9b0 blocks 128
[mmc_queue_thread] req d967cd30 blocks 128
[do_generic_file_read] lock_page_killable-wait sec 0 nsec 7810848
[mmc_queue_thread] req d967cc58 blocks 128
[mmc_queue_thread] req d967cb80 blocks 128
[do_generic_file_read] lock_page_killable-wait sec 0 nsec 7810654
[mmc_queue_thread] req d967caa8 blocks 128
[mmc_queue_thread] req d967c9d0 blocks 128
[mmc_queue_thread] req d967c8f8 blocks 128
[do_generic_file_read] lock_page_killable-wait sec 0 nsec 7810652
[mmc_queue_thread] req d967c820 blocks 128
[mmc_queue_thread] req d967c748 blocks 128
[do_generic_file_read] lock_page_killable-wait sec 0 nsec 7810952
[mmc_queue_thread] req d967c670 blocks 128
[mmc_queue_thread] req d967c598 blocks 128
[mmc_queue_thread] req d967c4c0 blocks 128
[mmc_queue_thread] req d967c3e8 blocks 128
[mmc_queue_thread] req (null) blocks 0
[mmc_queue_thread] req (null) blocks 0

The mmc queue never runs empty until the end of the transfer. The requests are 128 blocks (the 64k limit set in the mmc host driver) compared to 256 blocks before. This will not improve performance much, since the transfers are now smaller than before: the latency is minimal, but the extra transfers cause more mmc cmd overhead. I added prints for the wait time in lock_page_killable too. I wonder if I can achieve a non-empty mmc block queue without compromising the mmc host driver performance.
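For reference, capping the request size in the host driver amounts to something like the following in the host's probe function (a sketch only; max_req_size, max_blk_size and max_blk_count are real struct mmc_host fields, but the surrounding probe code is omitted):

	/* Limit each block-layer request to 64 KiB so that large
	 * reads are split into several MMC transfers. */
	mmc->max_req_size = 64 * 1024;
	mmc->max_blk_size = 512;
	mmc->max_blk_count = mmc->max_req_size / mmc->max_blk_size; /* 128 */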
Regards, Per
On 4 May 2011 21:13, Per Forlin <per.forlin@linaro.org> wrote:
> On 3 May 2011 22:02, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Tuesday 03 May 2011 20:54:43 Per Forlin wrote:
> > > page_not_up_to_date:
> > > 		/* Get exclusive access to the page ... */
> > > 		error = lock_page_killable(page);
> > >
> > > I looked at the code in do_generic_file_read(). lock_page_killable waits until the current readahead is completed. Is it possible to configure the readahead to push multiple read requests to the block device queue?
> >
> > I believe sleeping in __lock_page_killable is the best possible scenario. Most cards I've seen work best when you use at least 64KB reads, so it will be faster to wait there than to read smaller units.
>
> Sleeping is ok, but I don't want the read execution to stop (mmc going idle when there is actually more to read). I made an interesting discovery when I forced the host max_req_size to 64k. The reads now look like this:
>
> dd if=/dev/mmcblk0 of=/dev/null bs=4k count=256
> [mmc_queue_thread] req d955f9b0 blocks 32
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req d955f9b0 blocks 64
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req d955f8d8 blocks 128
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req d955f9b0 blocks 128
> [mmc_queue_thread] req d955f800 blocks 128
> [mmc_queue_thread] req d955f8d8 blocks 128
> [do_generic_file_read] lock_page_killable-wait sec 0 nsec 7811230
> [mmc_queue_thread] req d955fec0 blocks 128
> [mmc_queue_thread] req d955f800 blocks 128
> [do_generic_file_read] lock_page_killable-wait sec 0 nsec 7811492
> [mmc_queue_thread] req d955f9b0 blocks 128
> [mmc_queue_thread] req d967cd30 blocks 128
> [do_generic_file_read] lock_page_killable-wait sec 0 nsec 7810848
> [mmc_queue_thread] req d967cc58 blocks 128
> [mmc_queue_thread] req d967cb80 blocks 128
> [do_generic_file_read] lock_page_killable-wait sec 0 nsec 7810654
> [mmc_queue_thread] req d967caa8 blocks 128
> [mmc_queue_thread] req d967c9d0 blocks 128
> [mmc_queue_thread] req d967c8f8 blocks 128
> [do_generic_file_read] lock_page_killable-wait sec 0 nsec 7810652
> [mmc_queue_thread] req d967c820 blocks 128
> [mmc_queue_thread] req d967c748 blocks 128
> [do_generic_file_read] lock_page_killable-wait sec 0 nsec 7810952
> [mmc_queue_thread] req d967c670 blocks 128
> [mmc_queue_thread] req d967c598 blocks 128
> [mmc_queue_thread] req d967c4c0 blocks 128
> [mmc_queue_thread] req d967c3e8 blocks 128
> [mmc_queue_thread] req (null) blocks 0
> [mmc_queue_thread] req (null) blocks 0
>
> The mmc queue never runs empty until the end of the transfer. The requests are 128 blocks (the 64k limit set in the mmc host driver) compared to 256 blocks before. This will not improve performance much, since the transfers are now smaller than before: the latency is minimal, but the extra transfers cause more mmc cmd overhead. I added prints for the wait time in lock_page_killable too. I wonder if I can achieve a non-empty mmc block queue without compromising the mmc host driver performance.
There is actually a performance increase, from 16.5 MB/s to 18.4 MB/s, when lowering max_req_size to 64k. I ran a dd test on a Pandaboard using a 2.6.39-rc5 kernel.
First case, where the block queue runs empty after every request:

root@(none):/ dd if=/dev/mmcblk0p3 of=/dev/null bs=4k count=25600
25600+0 records in
25600+0 records out
104857600 bytes (100.0MB) copied, 6.061107 seconds, 16.5MB/s
Second case, where omap_hsmmc is modified to force the request size to half (128 blocks instead of 256). The queue never runs empty:

dd if=/dev/mmcblk0p3 of=/dev/null bs=4k count=25600
25600+0 records in
25600+0 records out
104857600 bytes (100.0MB) copied, 5.423362 seconds, 18.4MB/s
Regards, Per
On Saturday 07 May 2011, Per Forlin wrote:
> > The mmc queue never runs empty until the end of the transfer. The requests are 128 blocks (the 64k limit set in the mmc host driver) compared to 256 blocks before. This will not improve performance much, since the transfers are now smaller than before: the latency is minimal, but the extra transfers cause more mmc cmd overhead. I added prints for the wait time in lock_page_killable too. I wonder if I can achieve a non-empty mmc block queue without compromising the mmc host driver performance.
>
> There is actually a performance increase, from 16.5 MB/s to 18.4 MB/s, when lowering max_req_size to 64k. I ran a dd test on a Pandaboard using a 2.6.39-rc5 kernel.
I've noticed with a number of cards that using 64k writes is faster than any other size. What I could not figure out yet is whether this is a common hardware optimization for MS Windows (which always uses 64K I/O when it can), or if it's a software effect and we can actually make it go faster with Linux by tuning for other sizes.
Arnd
On 8 May 2011 17:09, Arnd Bergmann <arnd@arndb.de> wrote:
> On Saturday 07 May 2011, Per Forlin wrote:
> > > The mmc queue never runs empty until the end of the transfer. The requests are 128 blocks (the 64k limit set in the mmc host driver) compared to 256 blocks before. This will not improve performance much, since the transfers are now smaller than before: the latency is minimal, but the extra transfers cause more mmc cmd overhead. I added prints for the wait time in lock_page_killable too. I wonder if I can achieve a non-empty mmc block queue without compromising the mmc host driver performance.
> >
> > There is actually a performance increase, from 16.5 MB/s to 18.4 MB/s, when lowering max_req_size to 64k. I ran a dd test on a Pandaboard using a 2.6.39-rc5 kernel.
>
> I've noticed with a number of cards that using 64k writes is faster than any other size. What I could not figure out yet is whether this is a common hardware optimization for MS Windows (which always uses 64K I/O when it can), or if it's a software effect and we can actually make it go faster with Linux by tuning for other sizes.
Thanks for the tip, I will keep that in mind. In this case the increase in performance is due to parallel cache handling. I did a test where I set the host max_req_size back to 128k (the same size as in the first, low-performance test) and increased the readahead to 256k:

root@(none):/ echo 256 > sys/devices/platform/omap/omap_hsmmc.0/mmc_host/mmc0/mmc0:80ca/block/mmcblk0/queue/read_ahead_kb
root@(none):/ dd if=/dev/mmcblk0p3 of=/dev/null bs=4k count=25600
25600+0 records in
25600+0 records out
104857600 bytes (100.0MB) copied, 5.138585 seconds, 19.5MB/s
Regards, Per