On Thursday 24 March 2011, Michael Monnerie wrote:
We could speak German, I guess?
Yes, but I'd prefer to keep the discussion on the mailing list if interesting results come up.
On Donnerstag, 24. März 2011 Arnd Bergmann wrote:
Are these results reproducible? It's not at all clear to me that the erase size is really 4 MB, it could also be 1 MB for instance, especially if your stick is from before 2010. You can rerun the command with --count=100 or more, or with larger block sizes to get a better feeling.
# ./flashbench -a /dev/sde --blocksize=4096 --count=100
sched_setscheduler: Operation not permitted
align 536870912  pre 703µs  on 798µs  post 666µs  diff 114µs
align 268435456  pre 700µs  on 753µs  post 681µs  diff 62.8µs
align 134217728  pre 786µs  on 783µs  post 645µs  diff 67.4µs
align 67108864   pre 723µs  on 772µs  post 675µs  diff 73.3µs
align 33554432   pre 701µs  on 762µs  post 674µs  diff 74.4µs
align 16777216   pre 701µs  on 741µs  post 685µs  diff 48.3µs
align 8388608    pre 695µs  on 749µs  post 668µs  diff 66.8µs
align 4194304    pre 706µs  on 791µs  post 671µs  diff 102µs
align 2097152    pre 674µs  on 707µs  post 692µs  diff 23.9µs
align 1048576    pre 683µs  on 726µs  post 701µs  diff 34.3µs
align 524288     pre 671µs  on 726µs  post 717µs  diff 32.3µs
align 262144     pre 698µs  on 713µs  post 687µs  diff 20.7µs
align 131072     pre 682µs  on 740µs  post 704µs  diff 46.9µs
align 65536      pre 687µs  on 727µs  post 697µs  diff 35.1µs
align 32768      pre 712µs  on 722µs  post 692µs  diff 19.9µs
align 16384      pre 667µs  on 699µs  post 674µs  diff 27.9µs
align 8192       pre 702µs  on 770µs  post 686µs  diff 75.8µs
Ok, this is much clearer: I'm pretty sure that it's either 4 MB or 8 MB, based on this result. Note how all diff values below 4 MB are smaller than all diff values above 4 MB. Something strange is going on at 4 MB, so it's not clear whether it belongs to the upper or lower half.
# ./flashbench -a /dev/sde --blocksize=8192 --count=50
sched_setscheduler: Operation not permitted
align 536870912  pre 879µs  on 899µs  post 858µs  diff 30.9µs
align 268435456  pre 847µs  on 906µs  post 842µs  diff 61.1µs
align 134217728  pre 996µs  on 968µs  post 836µs  diff 52µs
align 67108864   pre 811µs  on 839µs  post 757µs  diff 55µs
align 33554432   pre 871µs  on 908µs  post 847µs  diff 48.7µs
align 16777216   pre 854µs  on 914µs  post 818µs  diff 78µs
align 8388608    pre 851µs  on 908µs  post 850µs  diff 58.2µs
align 4194304    pre 892µs  on 880µs  post 908µs  diff -20511ns
align 2097152    pre 828µs  on 849µs  post 885µs  diff -7708ns
align 1048576    pre 874µs  on 886µs  post 862µs  diff 17.9µs
align 524288     pre 852µs  on 869µs  post 915µs  diff -15025ns
align 262144     pre 844µs  on 895µs  post 940µs  diff 2.73µs
align 131072     pre 848µs  on 884µs  post 907µs  diff 6.1µs
align 65536      pre 837µs  on 857µs  post 840µs  diff 18µs
align 32768      pre 831µs  on 864µs  post 861µs  diff 17.7µs
align 16384      pre 855µs  on 841µs  post 826µs  diff 201ns
Could it be that some Linux cache is in the way?
No, all I/O is done with O_DIRECT, which completely bypasses the page cache. In theory, there could be a cache on the stick, but that's rarely the case. The USB protocol adds some jitter here, and for some reason you cannot use real-time scheduling, which makes the results less accurate.
Another helpful indication would be the output of 'fdisk -lu /dev/sde': it will show the start of the partition and the size of the drive; both should be a multiple of the erase block size.
# fdisk -lu /dev/sde
Disk /dev/sde: 16.2 GB, 16240345088 bytes
255 heads, 63 sectors/track, 1974 cylinders, total 31719424 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x925df597
Ok, 16240345088 bytes is 121 times 128 MiB, so it's certainly not using 4128 KiB blocks or some other strange unit.
   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               63    31712309    15856123+   7  HPFS/NTFS/exFAT
I used sde rather than sde1 for the tests, so that shouldn't matter. I'll change the partition to start at 2048 or whatever I find the erase size to be.
Be careful here. The original FAT layout is typically made specifically for this stick, so it's probably a good idea to make a backup of the first few blocks.
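For instance, something along these lines should do (the output file name and the 16 MB count are just placeholders; a few MB covering the partition table and the FAT area is plenty):
dd if=/dev/sde of=sde-first-blocks.img bs=1M count=16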
Finally, a third way is to look at a gnuplot chart on the output of
flashbench -s -o output.plot /dev/sde --scatter-order=10 --scatter-span=2 --blocksize=8192
gnuplot -p -e 'plot "output.plot"'
On many drives, the boundaries between erase blocks show up as spikes in the chart.
Phew, two lines, more or less, and spikes everywhere (see attached).
Right, nothing to see here yet. It only shows the first 8 MB, so if the spike is every 8 MB, it won't show up here. Use --scatter-order=12 to show more. Also, the jitter from the USB protocol may hide some details, which might get better with a larger --count= value, but that would make the test run much longer.
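For example, the same commands as before with only the order bumped up:
flashbench -s -o output.plot /dev/sde --scatter-order=12 --scatter-span=2 --blocksize=8192
gnuplot -p -e 'plot "output.plot"'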
It's probably good enough to assume that the size is actually 8 MB or 4 MB, and keep going from there.
Also, please post the USB ID and name output from 'lsusb' for reference.
# lsusb -v
Bus 001 Device 005: ID 1b1c:1a90
Ok.
Now with 16 I got it slower, still no visible border:
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] --blocksize=$[64 * 1024] /dev/sde --open-au-nr=4
sched_setscheduler: Operation not permitted
4MiB 7.3M/s  2MiB 11.2M/s  1MiB 15.7M/s  512KiB 19.2M/s  256KiB 14.5M/s  128KiB 12.8M/s  64KiB 18.8M/s
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] --blocksize=$[64 * 1024] /dev/sde --open-au-nr=16
sched_setscheduler: Operation not permitted
4MiB 9.77M/s  2MiB 3.41M/s  1MiB 7.38M/s  512KiB 5.69M/s
^C (took too long)
If it takes too long, that's a good indication that something interesting is happening ;-)
Seems that the stick wants to hide its internals?
Not completely unusual, but somewhat harder than most.
I've committed some updates now that might help make the --open-au results a bit clearer, and faster.
What I would recommend now is to update to the latest git version (just uploaded), and then try increasing numbers of --open-au-nr= with 4 MB erasesize, until you hit the cutoff. Try the same with --random, for the last fast one and the first slow one:
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=5
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=6
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=7
...
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=N
=> N is the number of 4 MB blocks that can not be handled
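If typing these gets tedious, the sweep can also be scripted; a minimal bash sketch (adjust the range as needed):
for n in 5 6 7 8 9 10; do
        ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=$n
done
Stop as soon as one run becomes drastically slower; that count is N.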
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=(N-1) --random
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=N --random
=> Verify that the (N-1) --random case is still fast
./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=((N-1)/2) --random
./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=(N-1) --random
./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=(N-1) --random
Arnd
On Donnerstag, 24. März 2011 Arnd Bergmann wrote:
On Thursday 24 March 2011, Michael Monnerie wrote:
We could speak German, I guess?
Yes, but I'd prefer to keep the discussion on the mailing list if interesting results come up.
OK, so I guess I'll answer to the list too.
and for some reason you cannot use real-time scheduling, which makes the results less accurate.
This is openSUSE 11.4 with kernel 2.6.37.1-1.2-desktop. I could retry with a standard 2.6.37.4 if that would help.
Be careful here. The original FAT layout is typically made specifically for this stick, so it's probably a good idea to make a backup of the first few blocks.
Too late. It's been NTFS for a long time; I converted it because FAT32 only supports files up to 2 GB, which is too small for my files. I use this stick to copy DVDs onto it, then plug it into the TV and watch from there. No DVD player available.
Right, nothing to see here yet. It only shows the first 8 MB, so if the spike is every 8 MB, it won't show up here. Use --scatter-order=12 to show more. Also, the jitter from the USB protocol may hide some details, which might get better with a larger --count= value, but that would make the test run much longer.
It's probably good enough to assume that the size is actually 8 MB or 4 MB, and keep going from there.
OK
I've committed some updates now that might help make the --open-au results a bit clearer, and faster.
What I would recommend now is to update to the latest git version (just uploaded), and then try increasing numbers of --open-au-nr= with 4 MB erasesize, until you hit the cutoff. Try the same with --random, for the last fast one and the first slow one:
OK I did, but with your new version I get seriously different results than before:
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=5
sched_setscheduler: Operation not permitted
4MiB 7.66M/s  2MiB 3.42M/s  1MiB 8.15M/s  512KiB 7.99M/s  256KiB 381K/s  128KiB 189K/s  64KiB 7.6M/s  32KiB 6.75M/s  16KiB 46.8K/s
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=4
sched_setscheduler: Operation not permitted
4MiB 7.6M/s  2MiB 6.75M/s  1MiB 6.32M/s  512KiB 8.42M/s  256KiB 5.97M/s  128KiB 4.17M/s  64KiB 6.07M/s  32KiB 7.13M/s  16KiB 4.67M/s
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=3
sched_setscheduler: Operation not permitted
4MiB 8.94M/s  2MiB 6.95M/s  1MiB 7.02M/s  512KiB 6.91M/s  256KiB 6.82M/s  128KiB 5.37M/s  64KiB 6.96M/s  32KiB 5.92M/s  16KiB 4.4M/s
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=2
sched_setscheduler: Operation not permitted
4MiB 10.5M/s  2MiB 6.91M/s  1MiB 7.03M/s  512KiB 6.92M/s  256KiB 6.75M/s  128KiB 5.02M/s  64KiB 6.91M/s  32KiB 6.01M/s  16KiB 4.68M/s
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=1
sched_setscheduler: Operation not permitted
4MiB 23.5M/s  2MiB 6.73M/s  1MiB 7.05M/s  512KiB 6.93M/s  256KiB 6.73M/s  128KiB 4.19M/s  64KiB 6.71M/s  32KiB 6.08M/s  16KiB 4.6M/s
=> N is the number of 4 MB blocks that can not be handled
So I guess 5 would be my number now?
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=(N-1) --random
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=N --random
=> Verify that the (N-1) --random case is still fast
OK, looks good:
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=4 --random
sched_setscheduler: Operation not permitted
4MiB 8.28M/s  2MiB 6.98M/s  1MiB 7.04M/s  512KiB 6.99M/s  256KiB 6.82M/s  128KiB 5.59M/s  64KiB 7.01M/s  32KiB 4.42M/s  16KiB 4.69M/s
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=5 --random
sched_setscheduler: Operation not permitted
4MiB 7.17M/s  2MiB 5.17M/s  1MiB 6.77M/s  512KiB 3.66M/s  256KiB 746K/s  128KiB 373K/s  64KiB 647K/s  32KiB 328K/s  16KiB 93.3K/s
./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=((N-1)/2) --random
./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=(N-1) --random
./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=(N-1) --random
I guess the last should have been open-au-nr=(N)? Anyway, I didn't run the last test, as the second one already shows a drastic reduction:
# ./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=2 --random
sched_setscheduler: Operation not permitted
8MiB 14.4M/s  4MiB 7.56M/s  2MiB 6.7M/s  1MiB 6.97M/s  512KiB 6.86M/s  256KiB 6.67M/s  128KiB 4.53M/s  64KiB 6.61M/s  32KiB 5.8M/s  16KiB 4.62M/s
# ./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=4 --random
sched_setscheduler: Operation not permitted
8MiB 11.2M/s  4MiB 9.27M/s  2MiB 3.42M/s  1MiB 5.18M/s  512KiB 2.76M/s  256KiB 747K/s  128KiB 373K/s  64KiB 344K/s
(cancelled here, confirmed it's slow)
So would it be right to guess it's an 8 MB erase size with 2 AUs? If so, what should an ideal partition table look like? Is NTFS tunable for this stick? I've used a block size of 64k to keep performance up. Too bad my TV doesn't understand XFS or a similar FS.
On Friday 25 March 2011 02:07:46 Michael Monnerie wrote:
On Donnerstag, 24. März 2011 Arnd Bergmann wrote:
On Thursday 24 March 2011, Michael Monnerie wrote: Be careful here. The original FAT layout is typically made specifically for this stick, so it's probably a good idea to make a backup of the first few blocks.
Too late. It's been NTFS for a long time; I converted it because FAT32 only supports files up to 2 GB, which is too small for my files. I use this stick to copy DVDs onto it, then plug it into the TV and watch from there. No DVD player available.
Ok, I see. In that case, I'd recommend creating a new partition table with the first partition aligned to the erase block size.
I've committed some updates now that might help make the --open-au results a bit clearer, and faster.
What I would recommend now is to update to the latest git version (just uploaded), and then try increasing numbers of --open-au-nr= with 4 MB erasesize, until you hit the cutoff. Try the same with --random, for the last fast one and the first slow one:
OK I did, but with your new version I get seriously different results than before:
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=5
sched_setscheduler: Operation not permitted
4MiB 7.66M/s  2MiB 3.42M/s  1MiB 8.15M/s  512KiB 7.99M/s  256KiB 381K/s  128KiB 189K/s  64KiB 7.6M/s  32KiB 6.75M/s  16KiB 46.8K/s
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=4
sched_setscheduler: Operation not permitted
4MiB 7.6M/s  2MiB 6.75M/s  1MiB 6.32M/s  512KiB 8.42M/s  256KiB 5.97M/s  128KiB 4.17M/s  64KiB 6.07M/s  32KiB 7.13M/s  16KiB 4.67M/s
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=3
sched_setscheduler: Operation not permitted
4MiB 8.94M/s  2MiB 6.95M/s  1MiB 7.02M/s  512KiB 6.91M/s  256KiB 6.82M/s  128KiB 5.37M/s  64KiB 6.96M/s  32KiB 5.92M/s  16KiB 4.4M/s
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=2
sched_setscheduler: Operation not permitted
4MiB 10.5M/s  2MiB 6.91M/s  1MiB 7.03M/s  512KiB 6.92M/s  256KiB 6.75M/s  128KiB 5.02M/s  64KiB 6.91M/s  32KiB 6.01M/s  16KiB 4.68M/s
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=1
sched_setscheduler: Operation not permitted
4MiB 23.5M/s  2MiB 6.73M/s  1MiB 7.05M/s  512KiB 6.93M/s  256KiB 6.73M/s  128KiB 4.19M/s  64KiB 6.71M/s  32KiB 6.08M/s  16KiB 4.6M/s
=> N is the number of 4 MB blocks that can not be handled
So I guess 5 would be my number now?
Exactly.
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=(N-1) --random
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=N --random
=> Verify that the (N-1) --random case is still fast
OK, looks good:
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=4 --random
sched_setscheduler: Operation not permitted
4MiB 8.28M/s  2MiB 6.98M/s  1MiB 7.04M/s  512KiB 6.99M/s  256KiB 6.82M/s  128KiB 5.59M/s  64KiB 7.01M/s  32KiB 4.42M/s  16KiB 4.69M/s
# ./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=5 --random
sched_setscheduler: Operation not permitted
4MiB 7.17M/s  2MiB 5.17M/s  1MiB 6.77M/s  512KiB 3.66M/s  256KiB 746K/s  128KiB 373K/s  64KiB 647K/s  32KiB 328K/s  16KiB 93.3K/s
Ok, so random access is just as fast as linear access. Good!
./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=((N-1)/2) --random
./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=(N-1) --random
./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=(N-1) --random
I guess the last should have been open-au-nr=(N)? Anyway, I didn't run the last test, as the second one already shows a drastic reduction:
# ./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=2 --random
sched_setscheduler: Operation not permitted
8MiB 14.4M/s  4MiB 7.56M/s  2MiB 6.7M/s  1MiB 6.97M/s  512KiB 6.86M/s  256KiB 6.67M/s  128KiB 4.53M/s  64KiB 6.61M/s  32KiB 5.8M/s  16KiB 4.62M/s
# ./flashbench -O --erasesize=$[8 * 1024 * 1024] /dev/sde --open-au-nr=4 --random
sched_setscheduler: Operation not permitted
8MiB 11.2M/s  4MiB 9.27M/s  2MiB 3.42M/s  1MiB 5.18M/s  512KiB 2.76M/s  256KiB 747K/s  128KiB 373K/s  64KiB 344K/s
(cancelled here, confirmed it's slow)
So would it be right to guess it's an 8 MB erase size with 2 AUs?
Actually, it confirms that the erase size is 4 MB. We've already established that it can have 4 but not 5 erase blocks open.
The last test showed that when flashbench tries to write to 2*8MB, it's fast, which was expected because that can be done using four 4MB erase blocks. If an erase block were really 8 MB, you would also have been able to write four of them efficiently, but the 4*8MB run above had to be cancelled because it got so slow.
If so, what should an ideal partition table look like?
# create the partition table
echo 8192,,7 | sudo sfdisk -uS -L -f /dev/sde
# show the new table
sudo fdisk /dev/sdd -l -u
====>
   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1             8192      XXXXXX      XXXXXX    7  NTFS/HPFS
Warning: Partition 1 does not end on cylinder boundary.
<===
Before you do that, could I ask you to run
flashbench --findfat --fat-nr=6 /dev/sde --blocksize=512
and post the results here?
That will add the two missing pieces of information for the survey, the page size and the location of the FAT. I've already entered the other data into the wiki.
Is NTFS tunable for this stick? I've used a block size of 64k to keep performance up. Too bad my TV doesn't understand XFS or a similar FS.
I don't know anything about NTFS, sorry.
Arnd
On Freitag, 25. März 2011 Arnd Bergmann wrote:
If so, what should an ideal partition table look like?
# create the partition table
echo 8192,,7 | sudo sfdisk -uS -L -f /dev/sde
Thanks, did that.
Before you do that, could I ask you to run
flashbench --findfat --fat-nr=6 /dev/sde --blocksize=512
and post the results here?
# ./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=512
sched_setscheduler: Operation not permitted
4MiB    7.94M/s  7.97M/s  7.95M/s  7.98M/s  7.85M/s  7.95M/s
2MiB    5.57M/s  5.55M/s  5.56M/s  5.53M/s  5.57M/s  5.55M/s
1MiB    24.3M/s  24M/s    7M/s     6.93M/s  6.96M/s  6.88M/s
512KiB  23.6M/s  23.9M/s  6.09M/s  6.06M/s  6.11M/s  6.07M/s
256KiB  4.41M/s  4.43M/s  4.44M/s  4.42M/s  4.45M/s  4.39M/s
128KiB  4.17M/s  4.15M/s  4.16M/s  4.16M/s  3.09M/s  3.09M/s
64KiB   21.5M/s  4.59M/s  4.52M/s  22M/s    4.57M/s  4.67M/s
32KiB   16.8M/s  16M/s    16.4M/s  16.1M/s  16.2M/s  16.4M/s
16KiB   4.62M/s  4.49M/s  4.45M/s  4.41M/s  4.49M/s  4.43M/s
8KiB    1.93M/s  1.94M/s  1.96M/s  2.01M/s  1.95M/s  2.01M/s
4KiB    970K/s   971K/s   1.01M/s  973K/s   990K/s   1.02M/s
2KiB    467K/s   481K/s   468K/s   467K/s   468K/s   472K/s
1KiB    235K/s   234K/s   236K/s   232K/s   230K/s   225K/s
512B    120K/s   118K/s   118K/s   118K/s   117K/s   120K/s
I was wondering why the 1M, 512K, and 64K cases were faster, so I reran the test twice for the bigger areas:
# ./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=512
sched_setscheduler: Operation not permitted
4MiB    23.1M/s  23M/s    22.8M/s  22.9M/s  7.85M/s  7.91M/s
2MiB    6.67M/s  5.58M/s  5.58M/s  5.56M/s  5.57M/s  5.55M/s
1MiB    23.3M/s  24M/s    6.98M/s  7.02M/s  6.95M/s  6.89M/s
512KiB  23.6M/s  23.5M/s  6.11M/s  6.13M/s  6.07M/s  6.08M/s
256KiB  4.46M/s  4.44M/s  4.41M/s  4.41M/s  4.47M/s  4.42M/s
128KiB  4.15M/s  4.1M/s   4.12M/s  4.14M/s  3.09M/s  3.09M/s
64KiB   21.4M/s  21.9M/s  4.49M/s  4.58M/s  4.57M/s  4.52M/s
32KiB   16.2M/s  17M/s    16.4M/s  17.1M/s  17.1M/s  17M/s
^C
# ./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=512
sched_setscheduler: Operation not permitted
4MiB    7.21M/s  22.8M/s  23.4M/s  22.5M/s  7.88M/s  7.88M/s
2MiB    7.24M/s  6.72M/s  5.57M/s  5.58M/s  6.68M/s  6.68M/s
1MiB    24.2M/s  24.7M/s  6.91M/s  6.88M/s  6.95M/s  7M/s
512KiB  23.6M/s  23.8M/s  6.14M/s  6.04M/s  6.9M/s   6.94M/s
256KiB  4.48M/s  4.45M/s  4.45M/s  4.45M/s  6.75M/s  6.61M/s
128KiB  4.15M/s  4.15M/s  3.09M/s  3.08M/s  4.13M/s  4.14M/s
64KiB   22.7M/s  22.3M/s  4.54M/s  4.53M/s  6.72M/s  6.59M/s
32KiB   15.5M/s  15.7M/s  16.1M/s  16.5M/s  5.93M/s  5.71M/s
^C
Seems there's a big variation in the 4M, 1M, 512K and 64K tests. Is that normal?
That will add the two missing pieces of information for the survey, the page size and the location of the FAT. I've already entered the other data into the wiki.
And what would I read from those results? Is the 1M,512K and 64K result so much faster because the stick is FAT optimized?
On Sunday 27 March 2011 11:15:38 Michael Monnerie wrote:
On Freitag, 25. März 2011 Arnd Bergmann wrote:
# ./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=512
sched_setscheduler: Operation not permitted
4MiB    7.94M/s  7.97M/s  7.95M/s  7.98M/s  7.85M/s  7.95M/s
2MiB    5.57M/s  5.55M/s  5.56M/s  5.53M/s  5.57M/s  5.55M/s
1MiB    24.3M/s  24M/s    7M/s     6.93M/s  6.96M/s  6.88M/s
512KiB  23.6M/s  23.9M/s  6.09M/s  6.06M/s  6.11M/s  6.07M/s
256KiB  4.41M/s  4.43M/s  4.44M/s  4.42M/s  4.45M/s  4.39M/s
128KiB  4.17M/s  4.15M/s  4.16M/s  4.16M/s  3.09M/s  3.09M/s
64KiB   21.5M/s  4.59M/s  4.52M/s  22M/s    4.57M/s  4.67M/s
32KiB   16.8M/s  16M/s    16.4M/s  16.1M/s  16.2M/s  16.4M/s
16KiB   4.62M/s  4.49M/s  4.45M/s  4.41M/s  4.49M/s  4.43M/s
8KiB    1.93M/s  1.94M/s  1.96M/s  2.01M/s  1.95M/s  2.01M/s
4KiB    970K/s   971K/s   1.01M/s  973K/s   990K/s   1.02M/s
2KiB    467K/s   481K/s   468K/s   467K/s   468K/s   472K/s
1KiB    235K/s   234K/s   236K/s   232K/s   230K/s   225K/s
512B    120K/s   118K/s   118K/s   118K/s   117K/s   120K/s
Ok. The important part here is the effective page size, which is almost certainly 32 KB. Writing any smaller blocks is always much worse here. Writing larger blocks has interesting side-effects, so there may be more to it than I thought at first.
I was wondering why the 1M, 512K, and 64K cases were faster, so I reran the test twice for the bigger areas:
# ./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=512
sched_setscheduler: Operation not permitted
4MiB    23.1M/s  23M/s    22.8M/s  22.9M/s  7.85M/s  7.91M/s
2MiB    6.67M/s  5.58M/s  5.58M/s  5.56M/s  5.57M/s  5.55M/s
1MiB    23.3M/s  24M/s    6.98M/s  7.02M/s  6.95M/s  6.89M/s
512KiB  23.6M/s  23.5M/s  6.11M/s  6.13M/s  6.07M/s  6.08M/s
256KiB  4.46M/s  4.44M/s  4.41M/s  4.41M/s  4.47M/s  4.42M/s
128KiB  4.15M/s  4.1M/s   4.12M/s  4.14M/s  3.09M/s  3.09M/s
64KiB   21.4M/s  21.9M/s  4.49M/s  4.58M/s  4.57M/s  4.52M/s
32KiB   16.2M/s  17M/s    16.4M/s  17.1M/s  17.1M/s  17M/s
^C
# ./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=512
sched_setscheduler: Operation not permitted
4MiB    7.21M/s  22.8M/s  23.4M/s  22.5M/s  7.88M/s  7.88M/s
2MiB    7.24M/s  6.72M/s  5.57M/s  5.58M/s  6.68M/s  6.68M/s
1MiB    24.2M/s  24.7M/s  6.91M/s  6.88M/s  6.95M/s  7M/s
512KiB  23.6M/s  23.8M/s  6.14M/s  6.04M/s  6.9M/s   6.94M/s
256KiB  4.48M/s  4.45M/s  4.45M/s  4.45M/s  6.75M/s  6.61M/s
128KiB  4.15M/s  4.15M/s  3.09M/s  3.08M/s  4.13M/s  4.14M/s
64KiB   22.7M/s  22.3M/s  4.54M/s  4.53M/s  6.72M/s  6.59M/s
32KiB   15.5M/s  15.7M/s  16.1M/s  16.5M/s  5.93M/s  5.71M/s
^C
Seems there's a big variation in the 4M, 1M, 512K and 64K tests. Is that normal?
No, it's not normal, but it can be explained by cache effects:
The data transfer rate on the USB interface into the cache is probably 23 MB/s, but the stick can only write about 8 MB/s continuously using >32 KB blocks, so as soon as the cache is full, any further block gets much slower.
You can verify if this is the case by writing a lot of data to the stick:
# Write 256 MB using 4 MB blocks
dd if=/dev/zero of=/dev/sde bs=4M oflag=direct count=64
# Write 256 MB using 32 KB blocks
dd if=/dev/zero of=/dev/sde bs=32K oflag=direct count=8192
My guess is that in the first case, you get to around 8 MB/s, while the second one should be a bit better.
This kind of caching obviously makes it harder to get good data out of the stick, but on the other hand greatly improves the performance in real-world scenarios. Most other sticks I've seen don't have it.
Another possible explanation is that the stick actually uses 12 MB erase blocks, not 4 MB (flashbench -a only checks power-of-two values). You can test this by using a multiple of three for both erase and block size in the tests:
./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=$[3072] --erasesize=$[12 * 1024 * 1024]
If this is the case, the stick will frequently have to do garbage collection, which makes it slower than it could be.
That will add the two missing pieces of information for the survey, the page size and the location of the FAT. I've already entered the other data into the wiki.
And what would I read from those results? Is the 1M,512K and 64K result so much faster because the stick is FAT optimized?
No. However, the 32 KiB effect is definitely due to the FAT optimization, because 32 KiB is the largest cluster size supported by most FAT32 implementations.
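For reference, if you ever wanted to go back to FAT32 with matching clusters, something like this should do it (assuming 512-byte logical sectors, so 64 sectors per cluster give 32 KiB):
mkfs.vfat -F 32 -s 64 /dev/sde1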
Arnd
On Sonntag, 27. März 2011 Arnd Bergmann wrote:
# Write 256 MB using 4 MB blocks
dd if=/dev/zero of=/dev/sde bs=4M oflag=direct count=64
# Write 256 MB using 32 KB blocks
dd if=/dev/zero of=/dev/sde1 bs=32K oflag=direct count=8192
My guess is that in the first case, you get to around 8 MB/s, while the second one should be a bit better.
"A bit" is underestimation: # dd if=/dev/zero of=/dev/sde1 bs=4M oflag=direct count=64 64+0 records in 64+0 records out 268435456 bytes (268 MB) copied, 33.054 s, 8.1 MB/s # dd if=/dev/zero of=/dev/sde1 bs=32K oflag=direct count=8192 8192+0 records in 8192+0 records out 268435456 bytes (268 MB) copied, 17.7888 s, 15.1 MB/s
./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=$[3072] --erasesize=$[12 * 1024 * 1024]
If this is the case, the stick will frequently have to do garbage collection, which makes it slower than it could be.
Why would it be so? You mean because nobody actually writes in 12MB chunks?
It seems like writing in at least 768KiB chunks gives best performance, which are 24x 32KB. Maybe there are 3x8bit chips on it?
# ./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=$[3072] --erasesize=$[12 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB   24M/s    23.6M/s  23.5M/s  23.7M/s  23.5M/s  23.6M/s
6MiB    23.4M/s  23.4M/s  23.5M/s  23.6M/s  23.2M/s  23.2M/s
3MiB    23.4M/s  23.4M/s  23.3M/s  23.4M/s  23.4M/s  23.4M/s
1.5MiB  24M/s    23.8M/s  23.6M/s  23.8M/s  24.1M/s  23.8M/s
768KiB  24.1M/s  24.2M/s  24.6M/s  24.7M/s  24M/s    24M/s
384KiB  21.6M/s  21.7M/s  21.7M/s  21.6M/s  22.1M/s  22.3M/s
192KiB  20.5M/s  20.4M/s  20.2M/s  20.6M/s  20.9M/s  20.9M/s
96KiB   25.8M/s  26.4M/s  26.2M/s  25.9M/s  25.7M/s  26.3M/s
48KiB   20.5M/s  21M/s    19.9M/s  19.8M/s  20M/s    20.6M/s
24KiB   7.85M/s  7.68M/s  8.06M/s  7.57M/s  7.62M/s  7.56M/s
12KiB   2.04M/s  2.16M/s  2.15M/s  2.17M/s  2.16M/s  2.15M/s
6KiB    1.19M/s  1.17M/s  1.14M/s  1.13M/s  1.16M/s  1.17M/s
3KiB    643K/s   642K/s   625K/s   634K/s   653K/s   650K/s
So I re-tested with dd:
# dd if=/dev/zero of=/dev/sde1 bs=768K oflag=direct count=341
320+0 records in
320+0 records out
251658240 bytes (252 MB) copied, 11.4378 s, 22.0 MB/s
# dd if=/dev/zero of=/dev/sde1 bs=32K oflag=direct count=8192
8192+0 records in
8192+0 records out
268435456 bytes (268 MB) copied, 16.545 s, 16.2 MB/s
# dd if=/dev/zero of=/dev/sde1 bs=768K oflag=direct count=346
346+0 records in
346+0 records out
272105472 bytes (272 MB) copied, 11.1504 s, 24.4 MB/s
Yes, writing 768KiB chunks seems the best idea.
On Sunday 27 March 2011 20:54:58 Michael Monnerie wrote:
On Sonntag, 27. März 2011 Arnd Bergmann wrote:
# Write 256 MB using 4 MB blocks
dd if=/dev/zero of=/dev/sde bs=4M oflag=direct count=64
# Write 256 MB using 32 KB blocks
dd if=/dev/zero of=/dev/sde1 bs=32K oflag=direct count=8192
My guess is that in the first case, you get to around 8 MB/s, while the second one should be a bit better.
"A bit" is underestimation: # dd if=/dev/zero of=/dev/sde1 bs=4M oflag=direct count=64 64+0 records in 64+0 records out 268435456 bytes (268 MB) copied, 33.054 s, 8.1 MB/s # dd if=/dev/zero of=/dev/sde1 bs=32K oflag=direct count=8192 8192+0 records in 8192+0 records out 268435456 bytes (268 MB) copied, 17.7888 s, 15.1 MB/s
Ok, I see. The big difference is an indication that the theory about the cache was really wrong, and it's actually using the non-power-of-two segments.
./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=$[3072] --erasesize=$[12 * 1024 * 1024]
If this is the case, the stick will frequently have to do garbage collection, which makes it slower than it could be.
Why would it be so? You mean because nobody actually writes in 12MB chunks?
When flashbench assumes that all erase blocks are power-of-two aligned while they are really not, you sometimes get into the case where it writes into two erase blocks at once. This requires the stick to garbage-collect both erase blocks the next time the user writes to another one.
It seems like writing in at least 768KiB chunks gives best performance, which are 24x 32KB. Maybe there are 3x8bit chips on it?
Actually, the most likely explanation is that there are just one or two chips, but those use 3-bit MLC NAND (also called TLC). In this flash memory, each transistor can store three bits. Since there is a power-of-two number of usable transistors in each erase block in most flash chips, you get 3*2^n bytes.
See http://www.centon.com/flash-products/chiptype for an explanation.
# ./flashbench --findfat --fat-nr=6 /dev/sde --blocksize=$[3072] --erasesize=$[12 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB   24M/s    23.6M/s  23.5M/s  23.7M/s  23.5M/s  23.6M/s
6MiB    23.4M/s  23.4M/s  23.5M/s  23.6M/s  23.2M/s  23.2M/s
3MiB    23.4M/s  23.4M/s  23.3M/s  23.4M/s  23.4M/s  23.4M/s
1.5MiB  24M/s    23.8M/s  23.6M/s  23.8M/s  24.1M/s  23.8M/s
768KiB  24.1M/s  24.2M/s  24.6M/s  24.7M/s  24M/s    24M/s
384KiB  21.6M/s  21.7M/s  21.7M/s  21.6M/s  22.1M/s  22.3M/s
192KiB  20.5M/s  20.4M/s  20.2M/s  20.6M/s  20.9M/s  20.9M/s
96KiB   25.8M/s  26.4M/s  26.2M/s  25.9M/s  25.7M/s  26.3M/s
48KiB   20.5M/s  21M/s    19.9M/s  19.8M/s  20M/s    20.6M/s
24KiB   7.85M/s  7.68M/s  8.06M/s  7.57M/s  7.62M/s  7.56M/s
12KiB   2.04M/s  2.16M/s  2.15M/s  2.17M/s  2.16M/s  2.15M/s
6KiB    1.19M/s  1.17M/s  1.14M/s  1.13M/s  1.16M/s  1.17M/s
3KiB    643K/s   642K/s   625K/s   634K/s   653K/s   650K/s
Ok, very nice. I consider this a proof that what I explained is actually what's going on here. So we know that the underlying erase block size is really not 4 MB but rather 12 MB or a smaller value of 3*2^n.
Note that the number for 96 KB is actually higher than the one for 768 KB that you pointed out. We already know that 32 KB is much faster than 16 KB here, and 96 is the lowest multiple of 32 in 3*2^n.
I've seen a similar behaviour on another drive, but did not think it was common enough to handle it in flashbench yet. I am now updating the tool to work with this case.
So I re-tested with dd:
# dd if=/dev/zero of=/dev/sde1 bs=768K oflag=direct count=341
320+0 records in
320+0 records out
251658240 bytes (252 MB) copied, 11.4378 s, 22.0 MB/s
# dd if=/dev/zero of=/dev/sde1 bs=32K oflag=direct count=8192
8192+0 records in
8192+0 records out
268435456 bytes (268 MB) copied, 16.545 s, 16.2 MB/s
# dd if=/dev/zero of=/dev/sde1 bs=768K oflag=direct count=346
346+0 records in
346+0 records out
272105472 bytes (272 MB) copied, 11.1504 s, 24.4 MB/s
Yes, writing 768KiB chunks seems the best idea.
Ok. 96 KB shouldn't be much slower than this, but there is a little additional overhead for the extra write accesses when you try that.
Given the new data, I'd like to ask you to do one more test run of the --open-au test, if you still have the energy.
Please run
flashbench --open-au --random --open-au-nr=5 --erasesize=$[3 * 1024 * 1024] \
        /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
flashbench --open-au --random --open-au-nr=5 --erasesize=$[6 * 1024 * 1024] \
        /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
flashbench --open-au --random --open-au-nr=5 --erasesize=$[12 * 1024 * 1024] \
        /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
flashbench --open-au --random --open-au-nr=6 --erasesize=$[3 * 1024 * 1024] \
        /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
The goal is to find the minimum value for --erasesize= and the maximum value for --open-au-nr= that can give the full bandwidth of 20+MB/s, which I'm guessing will be 12 MB and 5, but it would be helpful to know for sure.
You will need to pull the latest version of flashbench, in which I have added the --offset argument parsing.
In any case, it's a good idea to align the partition to the start of the erase block (6 or 12 MB), not the 4 MB that I told you before.
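With the sfdisk command from before, a 12 MB aligned start would look something like this (24576 sectors * 512 bytes = 12 MiB):
echo 24576,,7 | sudo sfdisk -uS -L -f /dev/sde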
Arnd
On Sonntag, 27. März 2011 Arnd Bergmann wrote:
Note that the number for 96 KB is actually higher than the one for 768 KB that you pointed out. We already know that 32 KB is much faster than 16 KB here, and 96 is the lowest multiple of 32 in 3*2^n.
Right:
# dd if=/dev/zero of=/dev/sde1 bs=96K oflag=direct count=2730
2730+0 records in
2730+0 records out
268369920 bytes (268 MB) copied, 10.4203 s, 25.8 MB/s
So this is the Red Bull for my stick ;-)
Please run
# ./flashbench --open-au --random --open-au-nr=5 --erasesize=$[3 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB 17.2M/s  1.5MiB 6.3M/s  768KiB 2.53M/s  384KiB 1.17M/s  192KiB 568K/s  96KiB 283K/s  48KiB 140K/s  24KiB 69.5K/s
# ./flashbench --open-au --random --open-au-nr=5 --erasesize=$[6 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 8.99M/s  3MiB 7.43M/s  1.5MiB 4.47M/s  768KiB 2.22M/s  384KiB 1.11M/s  192KiB 558K/s  96KiB 282K/s  48KiB 141K/s  24KiB 70.1K/s
# ./flashbench --open-au --random --open-au-nr=5 --erasesize=$[12 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 13.1M/s  6MiB 14.9M/s  3MiB 7.74M/s  1.5MiB 4.39M/s  768KiB 2.39M/s  384KiB 1.12M/s  192KiB 560K/s  96KiB 283K/s  48KiB 142K/s  24KiB 70.6K/s
# nice --20 ./flashbench --open-au --random --open-au-nr=6 --erasesize=$[3 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB 9.01M/s  1.5MiB 6.26M/s  768KiB 2.56M/s  384KiB 1.19M/s  192KiB 576K/s  96KiB 286K/s  48KiB 142K/s  24KiB 70.3K/s
Then I tried some other values too:
# nice --20 ./flashbench --open-au --random --open-au-nr=4 --erasesize=$[3 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB 7.64M/s  1.5MiB 6.73M/s  768KiB 2.59M/s
^C
# nice --20 ./flashbench --open-au --random --open-au-nr=4 --erasesize=$[12 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 22.8M/s  6MiB 15.7M/s  3MiB 7.76M/s  1.5MiB 4.45M/s  768KiB 2.46M/s  384KiB 1.13M/s  192KiB 562K/s  96KiB 284K/s  48KiB 142K/s  24KiB 70.6K/s
# nice --20 ./flashbench --open-au --random --open-au-nr=3 --erasesize=$[12 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 12.5M/s  6MiB 15.4M/s  3MiB 9.6M/s  1.5MiB 5.11M/s  768KiB 2.75M/s
^C
# nice --20 ./flashbench --open-au --random --open-au-nr=2 --erasesize=$[12 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 9.28M/s  6MiB 11.7M/s  3MiB 10.4M/s
^C
The goal is to find the minimum value for --erasesize= and the maximum value for --open-au-nr= that can give the full bandwidth of 20+MB/s, which I'm guessing will be 12 MB and 5, but it would be helpful to know for sure.
OK, seems au=4 and erasesize=12M is the best.
In any case, it's a good idea to align the partition to the start of the erase block (6 or 12 MB), not the 4 MB that I told you before.
Recreated with 12MB offset, thx.
On Monday 28 March 2011, Michael Monnerie wrote:
# ./flashbench --open-au --random --open-au-nr=5 --erasesize=$[3 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB 17.2M/s  1.5MiB 6.3M/s  768KiB 2.53M/s  384KiB 1.17M/s  192KiB 568K/s  96KiB 283K/s  48KiB 140K/s  24KiB 69.5K/s
D'oh. I thought your stick can do 5 chunks, but looking at the earlier results, it can only do 4. Thankfully, you've tested that as well.
# nice --20 ./flashbench --open-au --random --open-au-nr=4 --erasesize=$[12 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 22.8M/s  6MiB 15.7M/s  3MiB 7.76M/s  1.5MiB 4.45M/s  768KiB 2.46M/s  384KiB 1.13M/s  192KiB 562K/s  96KiB 284K/s  48KiB 142K/s  24KiB 70.6K/s
The goal is to find the minimum value for --erasesize= and the maximum value for --open-au-nr= that can give the full bandwidth of 20+MB/s, which I'm guessing will be 12 MB and 5, but it would be helpful to know for sure.
OK, seems au=4 and erasesize=12M is the best.
No, unfortunately my prediction was wrong. What I meant was that we should look for the case where it does not get slower in the later rows, i.e. the 96 KiB row should be closer to 25 MB/s, where here it is only 284K/s -- clearly not "fast".
I'm pretty sure that you will see the fast case with
# one 12-MB chunk
./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] \
        /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
and probably also for
# one 12-MB chunk in randomized order
./flashbench --open-au --random --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] \
        /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
but apparently not for
# two 12-MB chunks
./flashbench --open-au --random --open-au-nr=2 --erasesize=$[12 * 1024 * 1024] \
        /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
12MiB 9.28M/s  6MiB 11.7M/s  3MiB 10.4M/s
Plus, you've shown earlier that the card can do four 4MB-chunks with
# four 4-MB chunks
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=4
4MiB 7.6M/s  2MiB 6.75M/s  1MiB 6.32M/s  512KiB 8.42M/s
...
(this was slower than 10 MB/s, but did not degrade with smaller blocks)
For your information, in the 4x4MB case, the respective blocks are
|0 |16 |32 |48 |64 |72
| | | | |XX| | | |XX| | | |XX| | | |XX| | | | | | |
While 2X12MB is
|0 |16 |32 |48 |64 |80
| | | | | | |XX|XX|XX| | | | | | | | | |XX|XX|XX| | |
The 4*6MB test is
|0 |16 |32 |48 |64 |80
| | | | | | |XX|X | | | | |XX|X | | | | |XX|X | | | | |XX|X
I think it would still be good to have more data points, trying to get the stick into the 20-25MB/s range with multiple erase blocks open, either in random or linear mode. If this stick uses a form of log-structured writes, the linear numbers will be better than the random ones, as long as you use the correct erasesize.
Are you still motivated?
I'd suggest starting out with one erase block (--open-au-nr=1) and without --random, trying all possible --erasesize values from $[3 * 512 * 1024] to $[12 * 1024 * 1024], to see if one or more get you the best-case performance. Out of those that are fast, try increasing the number of erase blocks until you hit the limit, and/or add --random.
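A possible bash loop for that first sweep, using the same block size and offset as before (untested, adjust the device name as needed):
for es in $[3 * 512 * 1024] $[3 * 1024 * 1024] $[6 * 1024 * 1024] $[12 * 1024 * 1024]; do
        ./flashbench --open-au --open-au-nr=1 --erasesize=$es \
                /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
done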
I'm sorry that this is all so complicated, but I don't have a stick with this behaviour myself.
Arnd
On Montag, 28. März 2011 Arnd Bergmann wrote:
On Monday 28 March 2011, Michael Monnerie wrote:
# ./flashbench --open-au --random --open-au-nr=5 --erasesize=$[3 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB 17.2M/s  1.5MiB 6.3M/s  768KiB 2.53M/s  384KiB 1.17M/s  192KiB 568K/s  96KiB 283K/s  48KiB 140K/s  24KiB 69.5K/s
D'oh. I thought your stick can do 5 chunks, but looking at the earlier results, it can only do 4. Thankfully, you've tested that as well.
# nice --20 ./flashbench --open-au --random --open-au-nr=4 --erasesize=$[12 * 1024 * 1024] /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 22.8M/s  6MiB 15.7M/s  3MiB 7.76M/s  1.5MiB 4.45M/s  768KiB 2.46M/s  384KiB 1.13M/s  192KiB 562K/s  96KiB 284K/s  48KiB 142K/s  24KiB 70.6K/s
The goal is to find the minimum value for --erasesize= and the maximum value for --open-au-nr= that can give the full bandwidth of 20+MB/s, which I'm guessing will be 12 MB and 5, but it would be helpful to know for sure.
OK, seems au=4 and erasesize=12M is the best.
No, unfortunately my prediction was wrong. What I meant was that we should look for the case where it does not get slower in the later rows, i.e. the 96 KiB row should be closer to 25 MB/s, where here it is only 284K/s -- clearly not "fast".
I'm pretty sure that you will see the fast case with
# one 12-MB chunk
./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] \
        /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
and probably also for
# one 12-MB chunk in randomized order
./flashbench --open-au --random --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] \
        /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
but apparently not for
# two 12-MB chunks
./flashbench --open-au --random --open-au-nr=2 --erasesize=$[12 * 1024 * 1024] \
        /dev/sde --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
12MiB 9.28M/s  6MiB 11.7M/s  3MiB 10.4M/s
Plus, you've shown earlier that the card can do four 4MB-chunks with
# four 4-MB chunks
./flashbench -O --erasesize=$[4 * 1024 * 1024] /dev/sde --open-au-nr=4
4MiB 7.6M/s  2MiB 6.75M/s  1MiB 6.32M/s  512KiB 8.42M/s
...
(this was slower than 10 MB/s, but did not degrade with smaller blocks)
OK, here the latest results:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 23.9M/s  6MiB 7.71M/s  3MiB 13.1M/s  1.5MiB 13.3M/s  768KiB 13.4M/s  384KiB 10.2M/s  192KiB 20.3M/s  96KiB 24.6M/s  48KiB 19.4M/s  24KiB 7.53M/s
# ./flashbench --open-au --random --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 24M/s  6MiB 10.2M/s  3MiB 16.1M/s  1.5MiB 5.89M/s  768KiB 4.49M/s  384KiB 4.45M/s  192KiB 4.36M/s  96KiB 4.06M/s  48KiB 3.44M/s  24KiB 2.84M/s
# ./flashbench --open-au --random --open-au-nr=2 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 9.93M/s  6MiB 13.8M/s  3MiB 10.4M/s  1.5MiB 5.34M/s  768KiB 3M/s  384KiB 2.04M/s  192KiB 1.37M/s  96KiB 666K/s
^C
Are you still motivated?
Yeah, performance tuning is always fun.
I'd suggest starting out with one erase block (--open-au-nr=1) and without --random, trying all possible --erasesize values from $[3 * 512 * 1024] to $[12 * 1024 * 1024], to see if one or more get you the best-case performance. Out of those that are fast, try increasing the number of erase blocks until you hit the limit, and/or add --random.
OK, I scripted this, and removed those which aren't possible:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[3 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
1.5MiB 6M/s  768KiB 16.8M/s  384KiB 3.17M/s  192KiB 4.97M/s  96KiB 19.4M/s  48KiB 3.43M/s  24KiB 5.99M/s
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[3 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB 5.9M/s  1.5MiB 5.94M/s  768KiB 5.81M/s  384KiB 15.6M/s  192KiB 5.62M/s  96KiB 5.88M/s  48KiB 4.15M/s  24KiB 3.22M/s
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[6 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB 23.9M/s  1.5MiB 12M/s  768KiB 5.87M/s  384KiB 4.87M/s  192KiB 10.7M/s  96KiB 5.74M/s  48KiB 9.82M/s  24KiB 3.71M/s
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[6 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 4.4M/s  3MiB 14.6M/s  1.5MiB 14.6M/s  768KiB 14.7M/s  384KiB 13.8M/s  192KiB 12.9M/s  96KiB 14.9M/s  48KiB 12.1M/s  24KiB 6.41M/s
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 14.4M/s  3MiB 14.4M/s  1.5MiB 14.7M/s  768KiB 14.6M/s  384KiB 13.6M/s  192KiB 12.9M/s  96KiB 14.9M/s  48KiB 12M/s  24KiB 6.35M/s
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 18.1M/s  6MiB 23.6M/s  3MiB 23.8M/s  1.5MiB 24.2M/s  768KiB 24.6M/s  384KiB 22.2M/s  192KiB 20.8M/s  96KiB 24.9M/s  48KiB 19.4M/s  24KiB 7.52M/s
So this last result seems like the clear winner. Testing with it:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 23.9M/s  6MiB 9.87M/s  3MiB 13M/s  1.5MiB 13.4M/s  768KiB 13.5M/s  384KiB 10.4M/s  192KiB 20.5M/s  96KiB 25.1M/s  48KiB 20M/s  24KiB 7.73M/s
# ./flashbench --open-au --open-au-nr=2 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 12.9M/s  6MiB 23.9M/s  3MiB 24M/s  1.5MiB 24.4M/s  768KiB 25.1M/s  384KiB 21.8M/s  192KiB 20.3M/s  96KiB 24.7M/s  48KiB 19.2M/s  24KiB 7.71M/s
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 23.9M/s  6MiB 23.9M/s  3MiB 24.1M/s  1.5MiB 24.5M/s  768KiB 24.6M/s  384KiB 21.8M/s  192KiB 20.4M/s  96KiB 25.8M/s  48KiB 19.7M/s  24KiB 7.77M/s
# ./flashbench --open-au --open-au-nr=4 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 24.1M/s  6MiB 18M/s  3MiB 9.08M/s  1.5MiB 4.55M/s  768KiB 2.28M/s
^C
Seems au-nr=3 is quite good. I also retested with erasesize $[12 * 512 * 1024]:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 15.9M/s  3MiB 11.1M/s  1.5MiB 14.6M/s  768KiB 14.7M/s  384KiB 13.8M/s  192KiB 13.3M/s  96KiB 15.1M/s  48KiB 13.1M/s  24KiB 6.54M/s
# ./flashbench --open-au --open-au-nr=2 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 15.2M/s  3MiB 14.4M/s  1.5MiB 14.5M/s  768KiB 14.5M/s  384KiB 13.7M/s  192KiB 13.2M/s  96KiB 14.7M/s  48KiB 12.4M/s  24KiB 6.31M/s
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 16.6M/s  3MiB 14.4M/s  1.5MiB 14.6M/s  768KiB 14.7M/s  384KiB 13.7M/s  192KiB 13.2M/s  96KiB 15.3M/s  48KiB 12.6M/s  24KiB 6.22M/s
# ./flashbench --open-au --open-au-nr=4 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 14.4M/s  3MiB 8.05M/s  1.5MiB 4.28M/s  768KiB 2.2M/s  384KiB 1.11M/s
^C
This is not so good.
I'm sorry that this is all so complicated, but I don't have a stick with this behaviour myself.
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
So that is the best result. Funny, we're at au-nr=3 now. This stick really likes a threesome ;-)
And what does all this mean now? Is there any FAT/NTFS/other FS layout I could optimize it with? Does the 12MB partition start still hold? I guess so.
I guess with XFS I would use the mount options logbufs=8,logbsize=128k,largeio,delaylog and make the filesystem with mkfs.xfs -s size=4k -b size=4k -d su=96k,sw=1. For a lot of options I can only use power-of-2 increments, so I benched with this:
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[16 * 1024 * 1024] /dev/sdd --blocksize=$[32 * 1024] --offset=$[32 * 1024 * 1024]
sched_setscheduler: Operation not permitted
16MiB 24M/s  8MiB 23.9M/s  4MiB 10.2M/s  2MiB 5.62M/s  1MiB 13.4M/s  512KiB 12.2M/s  256KiB 4.48M/s  128KiB 3.26M/s  64KiB 9.93M/s  32KiB 16.9M/s
Not too bad in the 32K size. I wanted to format NTFS with 64k clusters, but from these results I'd say 32K would be better. Or can't I tell from those results?
On Monday 28 March 2011, Michael Monnerie wrote:
OK, here the latest results:
Very nice. Some interpretations from me:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 23.9M/s  6MiB 7.71M/s  3MiB 13.1M/s  1.5MiB 13.3M/s  768KiB 13.4M/s  384KiB 10.2M/s  192KiB 20.3M/s  96KiB 24.6M/s  48KiB 19.4M/s  24KiB 7.53M/s
It seems that there are only two cases where the stick can reach the maximum throughput, writing whole 12 MB blocks and writing small numbers of 96 KB blocks at once.
# ./flashbench --open-au --random --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 24M/s  6MiB 10.2M/s  3MiB 16.1M/s  1.5MiB 5.89M/s  768KiB 4.49M/s  384KiB 4.45M/s  192KiB 4.36M/s  96KiB 4.06M/s  48KiB 3.44M/s  24KiB 2.84M/s
There is a very noticeable degradation compared to linear access, but it's not devastating. Your earlier tests have shown that it can do 4*4MB random access patterns, so I assume it treats them differently by writing random data to another buffer that occasionally gets written back, hence the slowdown.
# ./flashbench --open-au --random --open-au-nr=2 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 9.93M/s  6MiB 13.8M/s  3MiB 10.4M/s  1.5MiB 5.34M/s  768KiB 3M/s  384KiB 2.04M/s  192KiB 1.37M/s  96KiB 666K/s
^C
The first line includes the garbage-collection from the previous run, so it's slower than 25 MB/s. It's an unfortunate side-effect that each measurement depends on what the previous one was.
What you can see very clearly here is that smaller sizes get exponentially slower, so this hits the worst-case scenario. Obviously, the stick can not do random access to more than one erase block. It would be possible to find the exact parameters, but I think this is enough knowledge for now.
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[3 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
1.5MiB 6M/s  768KiB 16.8M/s  384KiB 3.17M/s  192KiB 4.97M/s  96KiB 19.4M/s  48KiB 3.43M/s  24KiB 5.99M/s
Ok, pretty random behavior, as expected: this uses 1.5 MB erase blocks, so it only needs to do the garbage collection sometimes, but then it has to copy the rest of the old erase block (10.5 MB) into the new one.
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[3 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB 5.9M/s  1.5MiB 5.94M/s  768KiB 5.81M/s  384KiB 15.6M/s  192KiB 5.62M/s  96KiB 5.88M/s  48KiB 4.15M/s  24KiB 3.22M/s
Same for 3 MB, but here it seems that it copies the 9 MB most of the time. It's very possible that it uses 4 MB of the 12 MB for random access, so it's fast in one out of four cases here.
This would be a very smart strategy, because each 12 MB erase block will have 4 MB that are very fast and 8 MB that are slow, based on the way that MLC flash works.
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[6 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
3MiB 23.9M/s  1.5MiB 12M/s  768KiB 5.87M/s  384KiB 4.87M/s  192KiB 10.7M/s  96KiB 5.74M/s  48KiB 9.82M/s  24KiB 3.71M/s
I should really implement better parsing for the command line in flashbench. The notation $[6 * 512 * 1024] is really just the bash way of computing the number 3145728, and it's exactly the same as $[3 * 1024 * 1024]. I always wanted to be able to say --erasesize=3M, similar to how dd works, but I haven't done that yet.
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[6 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 4.4M/s  3MiB 14.6M/s  1.5MiB 14.6M/s  768KiB 14.7M/s  384KiB 13.8M/s  192KiB 12.9M/s  96KiB 14.9M/s  48KiB 12.1M/s  24KiB 6.41M/s
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 14.4M/s  3MiB 14.4M/s  1.5MiB 14.7M/s  768KiB 14.6M/s  384KiB 13.6M/s  192KiB 12.9M/s  96KiB 14.9M/s  48KiB 12M/s  24KiB 6.35M/s
These two are also the same. As you can see here, you get almost exactly half the speed as writing the full 12 MB. This happens because you write new data to half the erase block, and it copies the other half from the old data.
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 18.1M/s  6MiB 23.6M/s  3MiB 23.8M/s  1.5MiB 24.2M/s  768KiB 24.6M/s  384KiB 22.2M/s  192KiB 20.8M/s  96KiB 24.9M/s  48KiB 19.4M/s  24KiB 7.52M/s
So this last result seems like the clear winner.
Excellent. Interestingly, the command line is the same as in the very first test where it got the good performance only sometimes. Sometimes, these sticks need a few runs before they get into the optimum case, especially when you are alternating between random and linear access. It may have remembered that a specific erase block is typically used for random access and optimized for that. Only when it's written linearly a few times does it get into the fast case.
Testing with it:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 23.9M/s  6MiB 9.87M/s  3MiB 13M/s  1.5MiB 13.4M/s  768KiB 13.5M/s  384KiB 10.4M/s  192KiB 20.5M/s  96KiB 25.1M/s  48KiB 20M/s  24KiB 7.73M/s
Relatively slow again, just like the first measurement.
# ./flashbench --open-au --open-au-nr=2 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 12.9M/s  6MiB 23.9M/s  3MiB 24M/s  1.5MiB 24.4M/s  768KiB 25.1M/s  384KiB 21.8M/s  192KiB 20.3M/s  96KiB 24.7M/s  48KiB 19.2M/s  24KiB 7.71M/s
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 23.9M/s  6MiB 23.9M/s  3MiB 24.1M/s  1.5MiB 24.5M/s  768KiB 24.6M/s  384KiB 21.8M/s  192KiB 20.4M/s  96KiB 25.8M/s  48KiB 19.7M/s  24KiB 7.77M/s
Ok.
# ./flashbench --open-au --open-au-nr=4 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
12MiB 24.1M/s  6MiB 18M/s  3MiB 9.08M/s  1.5MiB 4.55M/s  768KiB 2.28M/s
^C
Seems au-nr=3 is quite good.
Yes, the drop is extremely obvious here; as expected, the performance halves with every row.
I also retested with erasesize $[12 * 512 * 1024]:
# ./flashbench --open-au --open-au-nr=1 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 15.9M/s  3MiB 11.1M/s  1.5MiB 14.6M/s  768KiB 14.7M/s  384KiB 13.8M/s  192KiB 13.3M/s  96KiB 15.1M/s  48KiB 13.1M/s  24KiB 6.54M/s
# ./flashbench --open-au --open-au-nr=2 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 15.2M/s  3MiB 14.4M/s  1.5MiB 14.5M/s  768KiB 14.5M/s  384KiB 13.7M/s  192KiB 13.2M/s  96KiB 14.7M/s  48KiB 12.4M/s  24KiB 6.31M/s
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 16.6M/s  3MiB 14.4M/s  1.5MiB 14.6M/s  768KiB 14.7M/s  384KiB 13.7M/s  192KiB 13.2M/s  96KiB 15.3M/s  48KiB 12.6M/s  24KiB 6.22M/s
# ./flashbench --open-au --open-au-nr=4 --erasesize=$[12 * 512 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
sched_setscheduler: Operation not permitted
6MiB 14.4M/s  3MiB 8.05M/s  1.5MiB 4.28M/s  768KiB 2.2M/s  384KiB 1.11M/s
^C
This is not so good.
As I explained above, this just means that it assumes a 6 MB erase block, so it basically halves the performance. It's still valuable information that the drop-off happens at 4 erase blocks as well, as expected.
I'm sorry that this is all so complicated, but I don't have a stick with this behaviour myself.
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[12 * 1024 * 1024] /dev/sdd --blocksize=$[24 * 1024] --offset=$[24 * 1024 * 1024]
So that is the best result. Funny, we're at au-nr=3 now. This stick really likes a threesome ;-)
And what does all this mean now? Is there any FAT/NTFS/other FS layout I could optimize it with? Does the 12MB partition start still hold? I guess so.
Yes, definitely. If there are any file system characteristics that depend on multi-megabyte sizes, they should be aligned with 12 MB.
I guess with XFS I would use the mount options logbufs=8,logbsize=128k,largeio,delaylog and make the filesystem with mkfs.xfs -s size=4k -b size=4k -d su=96k,sw=1. For a lot of options I can only use power-of-2 increments, so I benched with this:
With XFS, I would recommend using 32 KB blocks (-b size=32k) and sectors (-s size=32k), possibly 16 KB sectors if space is valuable, but nothing below that.
# ./flashbench --open-au --open-au-nr=3 --erasesize=$[16 * 1024 * 1024] /dev/sdd --blocksize=$[32 * 1024] --offset=$[32 * 1024 * 1024]
sched_setscheduler: Operation not permitted
16MiB 24M/s  8MiB 23.9M/s  4MiB 10.2M/s  2MiB 5.62M/s  1MiB 13.4M/s  512KiB 12.2M/s  256KiB 4.48M/s  128KiB 3.26M/s  64KiB 9.93M/s  32KiB 16.9M/s
Not too bad in the 32K size. I wanted to format NTFS with 64k clusters, but from these results I'd say 32K would be better. Or can't I tell from those results?
The results are meaningless if you pass the incorrect erasesize, they will be different with every run.
The blocksize you pass to the --open-au test run is just the point where it stops.
Still, 32 KB is the optimum size here: you want the smallest possible size that still gives good performance, and all tests above have shown that 16 KB performs worse than 32 KB.
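If your mkntfs supports setting the cluster size, that would mean formatting with 32 KiB clusters, roughly like this (option names may vary between ntfsprogs versions, and adjust the device name):
mkntfs -c 32768 -f /dev/sde1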
Arnd
On Dienstag, 29. März 2011 Arnd Bergmann wrote:
With XFS, I would recommend using 32 KB blocks (-b size=32k) and sectors (-s size=32k), possibly 16 kb sectors if space is valuable, but not below.
The problem is: XFS on Linux currently only supports pagesize or smaller blocks
So 4K blocksize is the maximum. :-(
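I guess I'll stay with the 4k/96k layout from above then (mkfs.xfs -s size=4k -b size=4k -d su=96k,sw=1), so at least the allocation stripe matches the 96k fast write size.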