On Sunday 27 March 2011 20:54:58 Michael Monnerie wrote:
On Sunday, 27 March 2011 Arnd Bergmann wrote:
# Write 256 MB using 4 MB blocks
dd if=/dev/zero of=/dev/sde bs=4M oflag=direct count=64
# Write 256 MB using 32 KB blocks
dd if=/dev/zero of=/dev/sde1 bs=32K oflag=direct count=8192
My guess is that in the first case, you get to around 8 MB/s, while the second one should be a bit better.
"A bit" is an understatement:

# dd if=/dev/zero of=/dev/sde1 bs=4M oflag=direct count=64
64+0 records in
64+0 records out
268435456 bytes (268 MB) copied, 33.054 s, 8.1 MB/s
# dd if=/dev/zero of=/dev/sde1 bs=32K oflag=direct count=8192
8192+0 records in
8192+0 records out
268435456 bytes (268 MB) copied, 17.7888 s, 15.1 MB/s
Ok, I see. The big difference is an indication that the theory about the cache was really wrong, and it's actually using the non-power-of-two segments.
./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=$[3072] --erasesize=$[12 * 1024 * 1024]
If this is the case, the stick will frequently have to do garbage collection, which makes it slower than it could be.
Why would it be so? You mean because nobody actually writes in 12MB chunks?
When flashbench assumes that all erase blocks are power-of-two aligned while they really are not, it sometimes writes into two erase blocks at once. This forces the stick to garbage-collect both erase blocks the next time the user writes to another one.
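A quick arithmetic sketch of that effect, under the assumption of a 12 MiB true erase size: a write that is only 4 MiB-aligned can straddle an erase-block boundary and thus touch two erase blocks at once.

```shell
# Sketch: which erase blocks does a write touch, assuming the true erase
# size is 12 MiB but the write is only aligned to 4 MiB?
EB=$(( 12 * 1024 * 1024 ))      # assumed true erase-block size
OFF=$(( 8 * 1024 * 1024 ))      # write start: 4 MiB-aligned, not 12 MiB-aligned
LEN=$(( 8 * 1024 * 1024 ))      # write length
first=$(( OFF / EB ))
last=$(( (OFF + LEN - 1) / EB ))
echo "write touches erase blocks $first..$last"   # -> 0..1, i.e. two blocks
```

Both erase blocks then become candidates for garbage collection, which is where the slowdown comes from.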
It seems like writing in at least 768KiB chunks gives best performance, which are 24x 32KB. Maybe there are 3x8bit chips on it?
Actually, the most likely explanation is that there are just one or two chips, but those use 3-bit MLC NAND (also called TLC). In this flash memory, each transistor can store three bits. Since most flash chips have a power-of-two number of usable transistors per erase block, you get erase blocks of 3*2^n bytes.
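As a pure-arithmetic illustration of that 3*2^n pattern (no device needed), the plausible erase-block sizes can simply be enumerated; they match the 768 KiB / 1.5 MiB / 3 MiB / 6 MiB / 12 MiB rows seen in the flashbench output:

```shell
# Enumerate candidate TLC erase-block sizes of the form 3*2^n KiB.
for n in 8 9 10 11 12; do
    echo "$(( 3 * (1 << n) )) KiB"
done
# -> 768 KiB, 1536 KiB, 3072 KiB, 6144 KiB, 12288 KiB
```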
See http://www.centon.com/flash-products/chiptype for an explanation.
# ./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=$[3072] --erasesize=$[12 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  24M/s   23.6M/s 23.5M/s 23.7M/s 23.5M/s 23.6M/s
6MiB   23.4M/s 23.4M/s 23.5M/s 23.6M/s 23.2M/s 23.2M/s
3MiB   23.4M/s 23.4M/s 23.3M/s 23.4M/s 23.4M/s 23.4M/s
1.5MiB 24M/s   23.8M/s 23.6M/s 23.8M/s 24.1M/s 23.8M/s
768KiB 24.1M/s 24.2M/s 24.6M/s 24.7M/s 24M/s   24M/s
384KiB 21.6M/s 21.7M/s 21.7M/s 21.6M/s 22.1M/s 22.3M/s
192KiB 20.5M/s 20.4M/s 20.2M/s 20.6M/s 20.9M/s 20.9M/s
96KiB  25.8M/s 26.4M/s 26.2M/s 25.9M/s 25.7M/s 26.3M/s
48KiB  20.5M/s 21M/s   19.9M/s 19.8M/s 20M/s   20.6M/s
24KiB  7.85M/s 7.68M/s 8.06M/s 7.57M/s 7.62M/s 7.56M/s
12KiB  2.04M/s 2.16M/s 2.15M/s 2.17M/s 2.16M/s 2.15M/s
6KiB   1.19M/s 1.17M/s 1.14M/s 1.13M/s 1.16M/s 1.17M/s
3KiB   643K/s  642K/s  625K/s  634K/s  653K/s  650K/s
Ok, very nice. I consider this a proof that what I explained is actually what's going on here. So we know that the underlying erase block size is really not 4 MB but rather 12 MB or a smaller value of 3*2^n.
Note that the number for 96 KB is actually higher than the one for 768 KB that you pointed out. We already know that 32 KB is much faster than 16 KB here, and 96 KB is the lowest value of the form 3*2^n KB that is a multiple of 32 KB.
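That last claim is easy to check mechanically (pure arithmetic, no device involved):

```shell
# Find the smallest value of the form 3*2^n (in KB) that is a multiple of 32 KB.
n=0
while [ $(( (3 * (1 << n)) % 32 )) -ne 0 ]; do
    n=$(( n + 1 ))
done
echo "$(( 3 * (1 << n) )) KB"   # -> prints "96 KB" (n=5)
```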
I've seen a similar behaviour on another drive, but did not think it was common enough to handle it in flashbench yet. I am now updating the tool to work with this case.
So I re-tested with dd:

# dd if=/dev/zero of=/dev/sde1 bs=768K oflag=direct count=341
320+0 records in
320+0 records out
251658240 bytes (252 MB) copied, 11.4378 s, 22.0 MB/s
# dd if=/dev/zero of=/dev/sde1 bs=32K oflag=direct count=8192
8192+0 records in
8192+0 records out
268435456 bytes (268 MB) copied, 16.545 s, 16.2 MB/s
# dd if=/dev/zero of=/dev/sde1 bs=768K oflag=direct count=346
346+0 records in
346+0 records out
272105472 bytes (272 MB) copied, 11.1504 s, 24.4 MB/s
Yes, writing 768KiB chunks seems the best idea.
Ok. 96 KB shouldn't be much slower than this, but there is a little additional overhead from the extra write accesses when you try that.
Given the new data, I'd like to ask you to do one more test run of the --open-au test, if you still have the energy.
Please run
flashbench --open-au --random --open-au-nr=5 --erasesize=$[3 * 1024 * 1024] \
    /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
flashbench --open-au --random --open-au-nr=5 --erasesize=$[6 * 1024 * 1024] \
    /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
flashbench --open-au --random --open-au-nr=5 --erasesize=$[12 * 1024 * 1024] \
    /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
flashbench --open-au --random --open-au-nr=6 --erasesize=$[3 * 1024 * 1024] \
    /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
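The four invocations above are one corner of a sweep over (--erasesize, --open-au-nr). A sketch of that sweep, which only echoes the commands so it is safe to run without the device attached:

```shell
# Illustration only: print the flashbench parameter combinations to probe
# (erase size in MiB, number of open allocation units).
for combo in "3 5" "6 5" "12 5" "3 6"; do
    es_mib=${combo% *}
    nr=${combo#* }
    echo "flashbench --open-au --random --open-au-nr=$nr" \
         "--erasesize=$(( es_mib * 1024 * 1024 )) /dev/sde" \
         "--blocksize=$(( 24 * 1024 )) --offset=$(( 24 * 1024 * 1024 ))"
done
```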
The goal is to find the minimum value for --erasesize= and the maximum value for --open-au-nr= that can give the full bandwidth of 20+MB/s, which I'm guessing will be 12 MB and 5, but it would be helpful to know for sure.
You will need to pull the latest version of flashbench, in which I have added the --offset argument parsing.
In any case, it's a good idea to align the partition to the start of the erase block (6 or 12 MB), not the 4 MB that I told you before.
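A sketch of how one might verify that alignment, assuming 512-byte sectors and a hypothetical start sector; on a real system the value would come from e.g. /sys/block/sde/sde1/start:

```shell
# Sketch with hypothetical numbers: is the partition start sector on a
# 12 MiB erase-block boundary?
START=24576                                    # hypothetical: 12 MiB into the device
SECTORS_PER_EB=$(( 12 * 1024 * 1024 / 512 ))   # 24576 sectors per 12 MiB block
if [ $(( START % SECTORS_PER_EB )) -eq 0 ]; then
    echo "partition start is erase-block aligned"
else
    echo "partition start is NOT aligned"
fi
```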
Arnd