On Monday 28 March 2011, Michael Monnerie wrote:
OK, here are the latest results:
Very nice. Some interpretations from me:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  23.9M/s
6MiB   7.71M/s
3MiB   13.1M/s
1.5MiB 13.3M/s
768KiB 13.4M/s
384KiB 10.2M/s
192KiB 20.3M/s
96KiB  24.6M/s
48KiB  19.4M/s
24KiB  7.53M/s
It seems that there are only two cases where the stick can reach the maximum throughput, writing whole 12 MB blocks and writing small numbers of 96 KB blocks at once.
# ./flashbench --open-au --random --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  24M/s
6MiB   10.2M/s
3MiB   16.1M/s
1.5MiB 5.89M/s
768KiB 4.49M/s
384KiB 4.45M/s
192KiB 4.36M/s
96KiB  4.06M/s
48KiB  3.44M/s
24KiB  2.84M/s
There is a very noticeable degradation compared to linear access, but it's not devastating. Your earlier tests have shown that it can do 4*4MB random access patterns, so I assume it treats them differently, writing the random data to another buffer that occasionally gets written back, hence the slowdown.
# ./flashbench --open-au --random --open-au-nr=2 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  9.93M/s
6MiB   13.8M/s
3MiB   10.4M/s
1.5MiB 5.34M/s
768KiB 3M/s
384KiB 2.04M/s
192KiB 1.37M/s
96KiB  666K/s
^C
The first line still includes the garbage collection left over from the previous run, so it comes out slower than the 25 MB/s it could otherwise reach. It's an unfortunate side effect that each measurement depends on what the previous one was.
What you can see very clearly here is that smaller sizes get exponentially slower, so this hits the worst-case scenario. Obviously, the stick can not do random access to more than one erase block. It would be possible to find the exact parameters, but I think this is enough knowledge for now.
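If you ever want to pin that down exactly, a sweep over --open-au-nr with --random set would show where the random-access performance collapses. This is just a sketch of what I mean, reusing the flags from your runs above, not something I have run on this stick:

for n in 1 2 3 4; do
	./flashbench --open-au --random --open-au-nr=$n \
		--erasesize=$[12 * 1024 * 1024] --blocksize=$[24 * 1024] \
		--offset=$[24 * 1024 * 1024] /dev/sdd
done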
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[3 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
1.5MiB 6M/s
768KiB 16.8M/s
384KiB 3.17M/s
192KiB 4.97M/s
96KiB  19.4M/s
48KiB  3.43M/s
24KiB  5.99M/s
Ok, pretty random behavior, as expected: this uses 1.5 MB erase blocks, so it only needs to do the garbage collection sometimes, but then it has to copy the rest of the old erase block (10.5 MB) into the new one.
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[3 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB   5.9M/s
1.5MiB 5.94M/s
768KiB 5.81M/s
384KiB 15.6M/s
192KiB 5.62M/s
96KiB  5.88M/s
48KiB  4.15M/s
24KiB  3.22M/s
Same for 3 MB, but here it seems that it copies the remaining 9 MB most of the time. It's very possible that it uses 4 MB of the 12 MB for random access, so it's fast in one out of four cases here.
This would be a very smart strategy, because each 12 MB erase block will have 4 MB that are very fast and 8 MB that are slow, based on the way that MLC flash works.
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[6 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB   23.9M/s
1.5MiB 12M/s
768KiB 5.87M/s
384KiB 4.87M/s
192KiB 10.7M/s
96KiB  5.74M/s
48KiB  9.82M/s
24KiB  3.71M/s
I should really implement better parsing for the command line in flashbench. The notation $[6 * 512 * 1024] is really just the bash way of computing the number 3145728, and it's exactly the same as $[3 * 1024 * 1024]. I always wanted to be able to say --erasesize=3M, similar to how dd works, but I haven't done that yet.
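To illustrate, both spellings expand to the same number in the shell before flashbench ever sees them:

$ echo $[6 * 512 * 1024] $[3 * 1024 * 1024]
3145728 3145728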
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[6 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   4.4M/s
3MiB   14.6M/s
1.5MiB 14.6M/s
768KiB 14.7M/s
384KiB 13.8M/s
192KiB 12.9M/s
96KiB  14.9M/s
48KiB  12.1M/s
24KiB  6.41M/s
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   14.4M/s
3MiB   14.4M/s
1.5MiB 14.7M/s
768KiB 14.6M/s
384KiB 13.6M/s
192KiB 12.9M/s
96KiB  14.9M/s
48KiB  12M/s
24KiB  6.35M/s
These two are also the same. As you can see here, you get almost exactly half the speed of writing the full 12 MB. This happens because for every 6 MB of new data, the stick also has to copy the other 6 MB of old data into the new erase block, so it moves 12 MB of flash for 6 MB of user data.
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  18.1M/s
6MiB   23.6M/s
3MiB   23.8M/s
1.5MiB 24.2M/s
768KiB 24.6M/s
384KiB 22.2M/s
192KiB 20.8M/s
96KiB  24.9M/s
48KiB  19.4M/s
24KiB  7.52M/s
So this last result seems like the clear winner.
Excellent. Interestingly, the command line is the same as in the very first test, where it got the good performance only sometimes. Sometimes these sticks need a few runs before they get into the optimum case, especially when you are alternating between random and linear access. It may have remembered that a specific erase block is typically used for random access and optimized for that. Only after it has been written linearly a few times does it get into the fast case.
Testing with it:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  23.9M/s
6MiB   9.87M/s
3MiB   13M/s
1.5MiB 13.4M/s
768KiB 13.5M/s
384KiB 10.4M/s
192KiB 20.5M/s
96KiB  25.1M/s
48KiB  20M/s
24KiB  7.73M/s
Relatively slow again, just like the first measurement.
# ./flashbench --open-au --open-au-nr=2 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  12.9M/s
6MiB   23.9M/s
3MiB   24M/s
1.5MiB 24.4M/s
768KiB 25.1M/s
384KiB 21.8M/s
192KiB 20.3M/s
96KiB  24.7M/s
48KiB  19.2M/s
24KiB  7.71M/s
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  23.9M/s
6MiB   23.9M/s
3MiB   24.1M/s
1.5MiB 24.5M/s
768KiB 24.6M/s
384KiB 21.8M/s
192KiB 20.4M/s
96KiB  25.8M/s
48KiB  19.7M/s
24KiB  7.77M/s
Ok.
# ./flashbench --open-au --open-au-nr=4 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB  24.1M/s
6MiB   18M/s
3MiB   9.08M/s
1.5MiB 4.55M/s
768KiB 2.28M/s
^C
Seems au-nr=3 is quite good.
Yes, the drop is extremely obvious here; as expected, the performance halves with every row.
I also retested for erasesize 512:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   15.9M/s
3MiB   11.1M/s
1.5MiB 14.6M/s
768KiB 14.7M/s
384KiB 13.8M/s
192KiB 13.3M/s
96KiB  15.1M/s
48KiB  13.1M/s
24KiB  6.54M/s
# ./flashbench --open-au --open-au-nr=2 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   15.2M/s
3MiB   14.4M/s
1.5MiB 14.5M/s
768KiB 14.5M/s
384KiB 13.7M/s
192KiB 13.2M/s
96KiB  14.7M/s
48KiB  12.4M/s
24KiB  6.31M/s
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   16.6M/s
3MiB   14.4M/s
1.5MiB 14.6M/s
768KiB 14.7M/s
384KiB 13.7M/s
192KiB 13.2M/s
96KiB  15.3M/s
48KiB  12.6M/s
24KiB  6.22M/s
# ./flashbench --open-au --open-au-nr=4 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB   14.4M/s
3MiB   8.05M/s
1.5MiB 4.28M/s
768KiB 2.2M/s
384KiB 1.11M/s
^C
This is not so good.
As I explained above, this just means that the test assumes a 6 MB erase block, so it basically halves the performance. It's still valuable information that the drop-off happens at 4 erase blocks as well, as expected.
I'm sorry that this is all so complicated, but I don't have a stick with this behaviour myself.
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
So that is the best result. Funny, we're at au-nr=3 now. This stick really likes a threesome ;-)
And what does all this mean now? Is there any FAT/NTFS/other FS layout I could optimize it with? Does the 12MB partition start still hold? I guess so.
Yes, definitely. If there are any file system characteristics that depend on multi-megabyte sizes, they should be aligned with 12 MB.
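For the partition itself, that mainly means starting it on a multiple of 12 MB. As a concrete sketch (assuming you repartition the stick from scratch; the device name and single-partition layout are just placeholders), starting at 24 MiB keeps everything behind the partition start aligned to the 12 MB erase blocks:

# parted -s /dev/sdd mklabel msdos
# parted -s /dev/sdd mkpart primary 24MiB 100%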
I guess with XFS I would use the mount options logbufs=8,logbsize=128k,largeio,delaylog and make the filesystem with
mkfs.xfs -s size=4k -b size=4k -d su=96k,sw=1
In a lot of options I can only use power-of-2 increments, so I benched with this:
With XFS, I would recommend using 32 KB blocks (-b size=32k) and sectors (-s size=32k), possibly 16 KB sectors if space is valuable, but not below.
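Putting that together with the stripe values from your mkfs line above, a possible invocation would be something like this (the partition name is only a placeholder):

# mkfs.xfs -s size=32k -b size=32k -d su=96k,sw=1 /dev/sdd1

One thing to double-check: Linux can only mount an XFS file system whose block size is no larger than the page size (usually 4 KB), so do a quick test mount before settling on 32 KB blocks.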
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[16 * 1024 * 1024] /dev/sdd --blocksize=$[32 * 1024] --offset=$[32 * 1024 * 1024]
sched_setscheduler: Operation not permitted
16MiB  24M/s
8MiB   23.9M/s
4MiB   10.2M/s
2MiB   5.62M/s
1MiB   13.4M/s
512KiB 12.2M/s
256KiB 4.48M/s
128KiB 3.26M/s
64KiB  9.93M/s
32KiB  16.9M/s
Not too bad at the 32K size. I wanted to format NTFS with 64k clusters, but from these results I'd say 32K would be better. Or can't I tell from these results?
The results are meaningless if you pass the wrong erasesize; they will be different with every run.
The blocksize you pass to the --open-au test run is just the point where it stops.
Still, 32 KB is the optimum size here: you want the smallest possible size that still gives you good performance, and all tests above have shown that 16 KB performs worse than 32 KB.
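If you do format it as NTFS, a minimal sketch would be the following (mkntfs takes the cluster size in bytes, and the partition name is again just a placeholder):

# mkntfs -Q -c 32768 /dev/sdd1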
Arnd