On Monday 28 March 2011, Michael Monnerie wrote:
OK, here are the latest results:
Very nice. Some interpretations from me:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  23.9M/s
6MiB   7.71M/s
3MiB   13.1M/s
1.5MiB 13.3M/s
768KiB 13.4M/s
384KiB 10.2M/s
192KiB 20.3M/s
96KiB  24.6M/s
48KiB  19.4M/s
24KiB  7.53M/s
It seems that there are only two cases where the stick can reach the maximum throughput, writing whole 12 MB blocks and writing small numbers of 96 KB blocks at once.
# ./flashbench --open-au --random --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  24M/s
6MiB   10.2M/s
3MiB   16.1M/s
1.5MiB 5.89M/s
768KiB 4.49M/s
384KiB 4.45M/s
192KiB 4.36M/s
96KiB  4.06M/s
48KiB  3.44M/s
24KiB  2.84M/s
There is a very noticeable degradation compared to linear access, but it's not devastating. Your earlier tests have shown that it can do 4*4MB random access patterns, so I assume it treats them differently, writing the random data to another buffer that occasionally gets written back, hence the slowdown.
# ./flashbench --open-au --random --open-au-nr=2 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  9.93M/s
6MiB   13.8M/s
3MiB   10.4M/s
1.5MiB 5.34M/s
768KiB 3M/s
384KiB 2.04M/s
192KiB 1.37M/s
96KiB  666K/s
^C
The first line still includes the garbage collection left over from the previous run, so it comes out slower than the 25 MB/s it could otherwise reach. It's an unfortunate side effect that each measurement depends on what the previous one was.
What you can see very clearly here is that smaller sizes get exponentially slower, so this hits the worst-case scenario. Obviously, the stick can not do random access to more than one erase block. It would be possible to find the exact parameters, but I think this is enough knowledge for now.
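If you ever want to pin that down exactly, a sweep over --open-au-nr with --random set would show where the random-access performance collapses. This is just a sketch of what I mean, reusing the flags from your runs above, not something I have run on this stick:

for n in 1 2 3 4; do
	./flashbench --open-au --random --open-au-nr=$n \
		--erasesize=$[12 * 1024 * 1024] --blocksize=$[24 * 1024] \
		--offset=$[24 * 1024 * 1024] /dev/sdd
done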
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[3 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
1.5MiB 6M/s
768KiB 16.8M/s
384KiB 3.17M/s
192KiB 4.97M/s
96KiB  19.4M/s
48KiB  3.43M/s
24KiB  5.99M/s
Ok, pretty random behavior, as expected: this uses 1.5 MB erase blocks, so it only needs to do the garbage collection sometimes, but then it has to copy the rest of the old erase block (10.5 MB) into the new one.
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[3 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB   5.9M/s
1.5MiB 5.94M/s
768KiB 5.81M/s
384KiB 15.6M/s
192KiB 5.62M/s
96KiB  5.88M/s
48KiB  4.15M/s
24KiB  3.22M/s
Same for 3 MB, but here it seems that it copies the remaining 9 MB most of the time. It's very possible that it uses 4 MB of the 12 MB for random access, so it's fast in one out of four cases here.
This would be a very smart strategy, because each 12 MB erase block will have 4 MB that are very fast and 8 MB that are slow, based on the way that MLC flash works.
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[6 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB   23.9M/s
1.5MiB 12M/s
768KiB 5.87M/s
384KiB 4.87M/s
192KiB 10.7M/s
96KiB  5.74M/s
48KiB  9.82M/s
24KiB  3.71M/s
I should really implement better parsing for the command line in flashbench. The notation $[6 * 512 * 1024] is really just the bash way of computing the number 3145728, and it's exactly the same as $[3 * 1024 * 1024]. I always wanted to be able to say --erasesize=3M, similar to how dd works, but I haven't done that yet.
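To illustrate, both spellings expand to the same number in the shell before flashbench ever sees them:

$ echo $[6 * 512 * 1024] $[3 * 1024 * 1024]
3145728 3145728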
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[6 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   4.4M/s
3MiB   14.6M/s
1.5MiB 14.6M/s
768KiB 14.7M/s
384KiB 13.8M/s
192KiB 12.9M/s
96KiB  14.9M/s
48KiB  12.1M/s
24KiB  6.41M/s
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   14.4M/s
3MiB   14.4M/s
1.5MiB 14.7M/s
768KiB 14.6M/s
384KiB 13.6M/s
192KiB 12.9M/s
96KiB  14.9M/s
48KiB  12M/s
24KiB  6.35M/s
These two are also the same. As you can see here, you get almost exactly half the speed of writing the full 12 MB. This happens because for every 6 MB of new data, the stick also has to copy the other 6 MB of old data into the new erase block, so it moves 12 MB of flash for 6 MB of user data.
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  18.1M/s
6MiB   23.6M/s
3MiB   23.8M/s
1.5MiB 24.2M/s
768KiB 24.6M/s
384KiB 22.2M/s
192KiB 20.8M/s
96KiB  24.9M/s
48KiB  19.4M/s
24KiB  7.52M/s
So this last result seems like the clear winner.
Excellent. Interestingly, the command line is the same as in the very first test, where it got the good performance only sometimes. Sometimes these sticks need a few runs before they get into the optimum case, especially when you are alternating between random and linear access. It may have remembered that a specific erase block is typically used for random access and optimized for that. Only after it has been written linearly a few times does it get into the fast case.
Testing with it:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  23.9M/s
6MiB   9.87M/s
3MiB   13M/s
1.5MiB 13.4M/s
768KiB 13.5M/s
384KiB 10.4M/s
192KiB 20.5M/s
96KiB  25.1M/s
48KiB  20M/s
24KiB  7.73M/s
Relatively slow again, just like the first measurement.
# ./flashbench --open-au --open-au-nr=2 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  12.9M/s
6MiB   23.9M/s
3MiB   24M/s
1.5MiB 24.4M/s
768KiB 25.1M/s
384KiB 21.8M/s
192KiB 20.3M/s
96KiB  24.7M/s
48KiB  19.2M/s
24KiB  7.71M/s
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  23.9M/s
6MiB   23.9M/s
3MiB   24.1M/s
1.5MiB 24.5M/s
768KiB 24.6M/s
384KiB 21.8M/s
192KiB 20.4M/s
96KiB  25.8M/s
48KiB  19.7M/s
24KiB  7.77M/s
Ok.
# ./flashbench --open-au --open-au-nr=4 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  24.1M/s
6MiB   18M/s
3MiB   9.08M/s
1.5MiB 4.55M/s
768KiB 2.28M/s
^C
Seems au-nr=3 is quite good.
Yes, the drop is extremely obvious here; as expected, the performance halves with every row.
I also retested for erasesize 512:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   15.9M/s
3MiB   11.1M/s
1.5MiB 14.6M/s
768KiB 14.7M/s
384KiB 13.8M/s
192KiB 13.3M/s
96KiB  15.1M/s
48KiB  13.1M/s
24KiB  6.54M/s
# ./flashbench --open-au --open-au-nr=2 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   15.2M/s
3MiB   14.4M/s
1.5MiB 14.5M/s
768KiB 14.5M/s
384KiB 13.7M/s
192KiB 13.2M/s
96KiB  14.7M/s
48KiB  12.4M/s
24KiB  6.31M/s
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   16.6M/s
3MiB   14.4M/s
1.5MiB 14.6M/s
768KiB 14.7M/s
384KiB 13.7M/s
192KiB 13.2M/s
96KiB  15.3M/s
48KiB  12.6M/s
24KiB  6.22M/s
# ./flashbench --open-au --open-au-nr=4 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   14.4M/s
3MiB   8.05M/s
1.5MiB 4.28M/s
768KiB 2.2M/s
384KiB 1.11M/s
^C
This is not so good.
As I explained above, this just means that the test assumes a 6 MB erase block, so it basically halves the performance. It's still valuable information that the drop-off happens at 4 erase blocks as well, as expected.
I'm sorry that this is all so complicated, but I don't have a stick with this behaviour myself.
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
So that is the best result. Funny, we're at au-nr=3 now. This stick really likes a threesome ;-)
And what does all this mean now? Is there any FAT/NTFS/other FS layout I could optimize it with? Does the 12MB partition start still hold? I guess so.
Yes, definitely. If there are any file system characteristics that depend on multi-megabyte sizes, they should be aligned with 12 MB.
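For the partition itself, that mainly means starting it on a multiple of 12 MB. As a concrete sketch (assuming you repartition the stick from scratch; the device name and single-partition layout are just placeholders), starting at 24 MiB keeps everything behind the partition start aligned to the 12 MB erase blocks:

# parted -s /dev/sdd mklabel msdos
# parted -s /dev/sdd mkpart primary 24MiB 100%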
I guess with XFS I would use the mount options logbufs=8,logbsize=128k,largeio,delaylog and make the filesystem with
mkfs.xfs -s size=4k -b size=4k -d su=96k,sw=1
In a lot of options I can only use power-of-2 increments, so I benched with this:
With XFS, I would recommend using 32 KB blocks (-b size=32k) and sectors (-s size=32k), possibly 16 KB sectors if space is valuable, but not below.
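Putting that together with the stripe values from your mkfs line above, a possible invocation would be something like this (the partition name is only a placeholder):

# mkfs.xfs -s size=32k -b size=32k -d su=96k,sw=1 /dev/sdd1

One thing to double-check: Linux can only mount an XFS file system whose block size is no larger than the page size (usually 4 KB), so do a quick test mount before settling on 32 KB blocks.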
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[16 * 1024 * 1024] /dev/sdd --blocksize=$[32 * 1024] --offset=$[32 * 1024 * 1024]
sched_setscheduler: Operation not permitted
16MiB  24M/s
8MiB   23.9M/s
4MiB   10.2M/s
2MiB   5.62M/s
1MiB   13.4M/s
512KiB 12.2M/s
256KiB 4.48M/s
128KiB 3.26M/s
64KiB  9.93M/s
32KiB  16.9M/s
Not too bad at the 32K size. I wanted to format NTFS with 64k clusters, but from these results I'd say 32K would be better. Or can't I tell from these results?
The results are meaningless if you pass the wrong erasesize; they will be different with every run.
The blocksize you pass to the --open-au test run is just the point where it stops.
Still, 32 KB is the optimum size here: you want the smallest possible size that still gives you good performance, and all tests above have shown that 16 KB performs worse than 32 KB.
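If you do format it as NTFS, a minimal sketch would be the following (mkntfs takes the cluster size in bytes, and the partition name is again just a placeholder):

# mkntfs -Q -c 32768 /dev/sdd1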
Arnd