On Tuesday 26 June 2012, Andrew Bradford wrote:
This is a bit of a long stream of consciousness flashbench mail. Sorry for the length :)
Not a problem.
I've seen a bunch of devices that try to guess the type of I/O that is being done on a given erase block and then optimize for that kind of access at later times. It seems your device belongs in that category. This is a good thing in theory, but it makes it much harder to find out what the device actually does.
The last time I had one of these, I could usually reset the behavior for each erase block by doing long linear writes across those blocks, e.g. doing "dd if=/dev/zero of=/dev/sdc bs=8M count=100".
Ok. Here are some interesting results from dd. I would have expected the transfer rate to be fairly constant between runs of dd, but it isn't always.
[andrew@mythdvr flashbench]$ sudo dd if=/dev/zero of=/dev/sdb bs=8M count=100
100+0 records in
100+0 records out
838860800 bytes (839 MB) copied, 80.2316 s, 10.5 MB/s
[andrew@mythdvr flashbench]$ sudo dd if=/dev/zero of=/dev/sdb bs=8M count=50
50+0 records in
50+0 records out
419430400 bytes (419 MB) copied, 18.1989 s, 23.0 MB/s
[andrew@mythdvr flashbench]$ sudo dd if=/dev/zero of=/dev/sdb bs=8M count=50
50+0 records in
50+0 records out
419430400 bytes (419 MB) copied, 67.1159 s, 6.2 MB/s
[andrew@mythdvr flashbench]$ sudo dd if=/dev/zero of=/dev/sdb bs=8M count=50
50+0 records in
50+0 records out
419430400 bytes (419 MB) copied, 21.5482 s, 19.5 MB/s
[andrew@mythdvr flashbench]$ sudo dd if=/dev/zero of=/dev/sdb bs=8M count=100
100+0 records in
100+0 records out
838860800 bytes (839 MB) copied, 67.2624 s, 12.5 MB/s
With 'dd' you have to be careful to use the 'oflag=direct' argument. Otherwise part of the data may still be in the page cache waiting for writeback.
There is another effect that I've seen before but don't think is happening here, which is that writing only zeroes to the device is faster than writing other data because it just marks the erase block as unused, as an erase command would do.
Another way to get around it is to frequently change the --offset value. In case the device remembers the last 10 blocks that had random I/O patterns in the past, you could cycle through 24MB, 48MB, 72MB, 96MB, ...
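The offset cycling suggested above can be sketched as a small helper. This is only an illustration: the function name `cycle_offsets` and its parameters are made up here and are not part of flashbench, and the 24MB step is simply the value mentioned in the mail.

```python
MIB = 1024 * 1024

def cycle_offsets(step_mib=24, count=4, start_mib=24):
    """Return `count` byte offsets spaced `step_mib` MiB apart.

    Hypothetical helper: it only produces values that could be passed
    to flashbench's --offset option, it does not run flashbench.
    """
    return [(start_mib + i * step_mib) * MIB for i in range(count)]

if __name__ == "__main__":
    for off in cycle_offsets():
        # Print the byte value together with the MiB figure it stands for.
        print(f"--offset={off}  ({off // MIB} MiB)")
```

Using a different offset for each run means a device that remembers access patterns per erase block sees a fresh region every time.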
Even after the above dd's (and a handful more I didn't paste), I still can't get back to the fast-slow-fast-slow performance I saw before when testing with an offset of 1MiB before an assumed erase block bound:
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[23*1024*1024]
8MiB    12.1M/s
4MiB    7.72M/s
2MiB    7.68M/s
1MiB    5.52M/s
512KiB  4.19M/s
256KiB  4.34M/s
128KiB  4.87M/s
64KiB   3.14M/s
32KiB   3.31M/s
16KiB   1.56M/s
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[47*1024*1024]
8MiB    18.9M/s
4MiB    10.4M/s
2MiB    9.15M/s
1MiB    5.27M/s
512KiB  3.8M/s
^C
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[71*1024*1024]
8MiB    29.3M/s
4MiB    10.2M/s
2MiB    8.48M/s
1MiB    4.95M/s
512KiB  4.35M/s
256KiB  3.84M/s
128KiB  3.04M/s
^C
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[479*1024*1024]
8MiB    16.4M/s
4MiB    12.5M/s
2MiB    9.53M/s
1MiB    5.14M/s
512KiB  3.94M/s
256KiB  4.12M/s
128KiB  3.78M/s
64KiB   3.17M/s
^C
Maybe it's learning write styles for the entire device rather than per erase block?
Possible but not all that likely I think.
I'm going to start over with testing, forget what I've done before, and see what conclusions I can make.
[andrew@mythdvr flashbench]$ sudo ./flashbench -a /dev/sdb --blocksize=2048
align 8589934592 pre 1.25ms on 1.74ms post 1.31ms diff 463µs
align 4294967296 pre 1.29ms on 1.87ms post 1.36ms diff 540µs
align 2147483648 pre 1.28ms on 1.87ms post 1.36ms diff 545µs
align 1073741824 pre 1.26ms on 1.67ms post 1.3ms diff 393µs
align 536870912 pre 1.29ms on 1.72ms post 1.36ms diff 399µs
align 268435456 pre 1.25ms on 1.72ms post 1.36ms diff 413µs
align 134217728 pre 1.26ms on 1.72ms post 1.36ms diff 410µs
align 67108864 pre 1.29ms on 1.72ms post 1.36ms diff 395µs
align 33554432 pre 1.29ms on 1.72ms post 1.36ms diff 395µs
align 16777216 pre 1.29ms on 1.72ms post 1.33ms diff 411µs
align 8388608 pre 1.29ms on 1.72ms post 1.36ms diff 399µs
align 4194304 pre 1.31ms on 1.36ms post 1.36ms diff 28.4µs
align 2097152 pre 1.34ms on 1.36ms post 1.36ms diff 10.7µs
align 1048576 pre 1.36ms on 1.36ms post 1.36ms diff 3.88µs
align 524288 pre 1.36ms on 1.37ms post 1.37ms diff 6.75µs
align 262144 pre 1.36ms on 1.36ms post 1.37ms diff -1465ns
align 131072 pre 1.33ms on 1.37ms post 1.37ms diff 20µs
align 65536 pre 1.36ms on 1.37ms post 1.36ms diff 4.85µs
align 32768 pre 1.32ms on 1.37ms post 1.37ms diff 21.5µs
align 16384 pre 1.31ms on 1.44ms post 1.36ms diff 103µs
align 8192 pre 1.27ms on 1.53ms post 1.33ms diff 232µs
align 4096 pre 1.32ms on 1.4ms post 1.32ms diff 77.9µs
Very clear 8MiB erase block there.
Yes.
[andrew@mythdvr flashbench]$ sudo ./flashbench -a /dev/sdb --blocksize=$[3*1024]
align 6442450944 pre 1.26ms on 1.86ms post 1.29ms diff 588µs
align 3221225472 pre 1.2ms on 1.74ms post 1.26ms diff 512µs
align 1610612736 pre 1.27ms on 1.82ms post 1.37ms diff 496µs
align 805306368 pre 1.2ms on 1.74ms post 1.29ms diff 495µs
align 402653184 pre 1.19ms on 1.73ms post 1.29ms diff 487µs
align 201326592 pre 1.21ms on 1.74ms post 1.3ms diff 492µs
align 100663296 pre 1.21ms on 1.75ms post 1.3ms diff 494µs
align 50331648 pre 1.21ms on 1.75ms post 1.3ms diff 493µs
align 25165824 pre 1.19ms on 1.75ms post 1.3ms diff 506µs
align 12582912 pre 1.26ms on 1.3ms post 1.3ms diff 17.7µs
align 6291456 pre 1.26ms on 1.29ms post 1.3ms diff 12.5µs
align 3145728 pre 1.23ms on 1.48ms post 1.3ms diff 214µs
align 1572864 pre 1.27ms on 1.3ms post 1.3ms diff 18µs
align 786432 pre 1.26ms on 1.3ms post 1.3ms diff 19.5µs
align 393216 pre 1.25ms on 1.3ms post 1.3ms diff 26.8µs
align 196608 pre 1.29ms on 1.3ms post 1.3ms diff 4.7µs
align 98304 pre 1.25ms on 1.3ms post 1.3ms diff 24.8µs
align 49152 pre 1.26ms on 1.3ms post 1.3ms diff 19.9µs
align 24576 pre 1.26ms on 1.46ms post 1.3ms diff 178µs
align 12288 pre 1.26ms on 1.29ms post 1.24ms diff 43.2µs
align 6144 pre 1.3ms on 1.24ms post 1.42ms diff -114594
24MiB and 3MiB both show up. But since 6MiB and 12MiB are both faster than 3MiB, I'm confused.
Right, it's not completely clear. I would still assume 8MB erase blocks from this run.
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --findfat --erasesize=$[8*1024*1024] --blocksize=$[8*1024] --fat-nr=6 --count=100
8MiB    33.1M/s  32.8M/s  32.6M/s  32.8M/s  32.8M/s  32.7M/s
4MiB    32.7M/s  32.6M/s  32.8M/s  32.6M/s  32.8M/s  32.8M/s
2MiB    32.5M/s  32.5M/s  32.5M/s  32.8M/s  32.6M/s  32.6M/s
1MiB    33.2M/s  33.1M/s  33M/s    33.1M/s  33.1M/s  33.3M/s
512KiB  32.5M/s  32.7M/s  32.4M/s  32M/s    30.9M/s  32.2M/s
256KiB  31.2M/s  31M/s    31.2M/s  30.9M/s  30.8M/s  30.4M/s
128KiB  30.1M/s  30M/s    30.1M/s  29.9M/s  30.2M/s  29.7M/s
64KiB   32.1M/s  31.9M/s  31.9M/s  31.5M/s  31.6M/s  32M/s
32KiB   28.5M/s  28.4M/s  26.8M/s  27.2M/s  27.8M/s  28.1M/s
16KiB   20.5M/s  20.8M/s  21.1M/s  8.8M/s   12M/s    12M/s
8KiB    6.13M/s  6.14M/s  7.67M/s  5.66M/s  6.41M/s  6.4M/s
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --findfat --erasesize=$[3*1024*1024] --blocksize=$[12*1024] --fat-nr=9 --count=100
3MiB    32.1M/s  32.7M/s  31.2M/s  32.6M/s  32.8M/s  31.4M/s  32M/s    32.8M/s  31.4M/s
1.5MiB  32.2M/s  33.3M/s  31.9M/s  33.1M/s  33.2M/s  31.9M/s  33.2M/s  33.3M/s  32.7M/s
768KiB  32.6M/s  32.9M/s  31.5M/s  32.5M/s  32.6M/s  31.3M/s  33M/s    33.2M/s  31.7M/s
384KiB  31.1M/s  32M/s    30.1M/s  30.6M/s  30.7M/s  29.4M/s  30.7M/s  30.9M/s  31.9M/s
192KiB  31.2M/s  32.1M/s  30.7M/s  32.4M/s  32.5M/s  30.6M/s  32M/s    32.1M/s  30.7M/s
96KiB   31.9M/s  32.6M/s  31M/s    33.4M/s  33.6M/s  31.7M/s  32.6M/s  32.9M/s  30.9M/s
48KiB   12.7M/s  28.8M/s  12.6M/s  28.9M/s  28.9M/s  12.4M/s  29.1M/s  28.3M/s  12.4M/s
24KiB   22.1M/s  22.6M/s  21.9M/s  22.2M/s  22.4M/s  20.8M/s  21.7M/s  22.2M/s  21.1M/s
12KiB   17.6M/s  17.3M/s  16.3M/s  18.2M/s  17.9M/s  18M/s    9.6M/s   18.4M/s  3.74M/s
Seems like 8MiB erase block is consistent for the special fat area, assuming I'm reading these right.
Well, look at the 48KB row, which is slow in columns 3 and 6.
|0    |3    |6    |9    |12   |15   |18   |21   |
|0              |8              |16             |
The slow ones are those that cross an 8MB boundary. I don't think there is any special area on this device.
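The boundary argument can be checked with a few lines of arithmetic. The sketch below assumes, as the diagram does, that the columns are consecutive 3MiB windows starting at offset 0; the function name `columns_crossing` is made up for this illustration.

```python
def columns_crossing(window_mib, boundary_mib, columns):
    """List the 1-based columns whose window straddles a boundary.

    Column n is assumed to cover [(n - 1) * window_mib, n * window_mib).
    A window straddles a boundary when a multiple of boundary_mib lies
    strictly inside it, i.e. not at its start or end.
    """
    crossing = []
    for col in range(1, columns + 1):
        start = (col - 1) * window_mib
        end = col * window_mib
        # First multiple of boundary_mib that is strictly greater than start.
        next_boundary = (start // boundary_mib + 1) * boundary_mib
        if next_boundary < end:
            crossing.append(col)
    return crossing

if __name__ == "__main__":
    # 3MiB windows against 8MiB boundaries, nine columns as in --fat-nr=9.
    print(columns_crossing(3, 8, 9))
```

Under these assumptions the windows at 6MiB to 9MiB and 15MiB to 18MiB are the ones that straddle an 8MiB boundary, which corresponds to columns 3 and 6.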
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=1
8MiB    21.5M/s
4MiB    22.1M/s
2MiB    32.6M/s
1MiB    33M/s
512KiB  32.3M/s
256KiB  31M/s
128KiB  29.7M/s
64KiB   30.9M/s
32KiB   25.7M/s
16KiB   19.3M/s
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=2
8MiB    16.7M/s
4MiB    21.1M/s
2MiB    32.2M/s
1MiB    31.9M/s
512KiB  29.4M/s
256KiB  27.2M/s
128KiB  24.2M/s
64KiB   18.9M/s
32KiB   12.6M/s
16KiB   3.17M/s
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=3
8MiB    32.9M/s
4MiB    33.7M/s
2MiB    33.4M/s
1MiB    19.2M/s
512KiB  18.8M/s
256KiB  17.7M/s
128KiB  25.3M/s
64KiB   13.3M/s
32KiB   4.74M/s
16KiB   1.99M/s
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=4
8MiB    33M/s
4MiB    33.3M/s
2MiB    33.2M/s
1MiB    13.1M/s
512KiB  15.4M/s
256KiB  28.7M/s
128KiB  11M/s
64KiB   7.31M/s
32KiB   5.87M/s
16KiB   1.92M/s
But! Without an offset, I may be starting these --open-au measurements within the special fat area. In flashbench.c, starting at line 517, is:
	/* start 16 MB into the device, to skip FAT, round up to full erase blocks */
	if (offset == -1ull)
		offset = (1024 * 1024 * 16 + erasesize - 1) / erasesize * erasesize;
Which I read as "if no offset, set offset to (in my case) 16MiB". And if my special fat area is 21 or 24MiB, that could be giving misleading results for open-au tests. So, it seems that any open-au tests I've done without an offset larger than 24MiB are possibly suspect.
Correct.
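The rounding in the quoted flashbench.c snippet can be reproduced in isolation to see which default offset results from a given erase size. This is a re-statement of that one expression in Python, not flashbench code; `default_offset` is a name chosen for this sketch.

```python
MIB = 1024 * 1024

def default_offset(erasesize):
    """Round 16 MiB up to the next multiple of `erasesize` (in bytes).

    Mirrors the integer expression quoted from flashbench.c:
    (1024 * 1024 * 16 + erasesize - 1) / erasesize * erasesize
    """
    return (16 * MIB + erasesize - 1) // erasesize * erasesize

if __name__ == "__main__":
    for size_mib in (8, 3):
        off = default_offset(size_mib * MIB)
        print(f"erasesize {size_mib} MiB -> default offset {off // MIB} MiB")
```

With an 8MiB erase size the default is 16MiB, and with 3MiB it is 18MiB. Both are below 24MiB, which is why runs without an explicit --offset are the ones called into question.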
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=1 --offset=$[240*1024*1024]
[sudo] password for andrew:
8MiB    33M/s
4MiB    32.5M/s
2MiB    32.5M/s
1MiB    33.3M/s
512KiB  32.2M/s
256KiB  30.6M/s
128KiB  29.1M/s
64KiB   29.7M/s
32KiB   24.3M/s
16KiB   18.8M/s
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=2 --offset=$[240*1024*1024]
8MiB    19.2M/s
4MiB    18.7M/s
2MiB    32M/s
1MiB    30.9M/s
512KiB  28.4M/s
256KiB  24.4M/s
128KiB  21.8M/s
64KiB   17.2M/s
32KiB   11.1M/s
16KiB   2.49M/s
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=3 --offset=$[240*1024*1024]
8MiB    32.7M/s
4MiB    33.3M/s
2MiB    33.3M/s
1MiB    19.2M/s
512KiB  18.6M/s
256KiB  17.7M/s
128KiB  25.2M/s
64KiB   13.3M/s
32KiB   4.74M/s
16KiB   1.98M/s
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[240*1024*1024]
8MiB    32.8M/s
4MiB    33.3M/s
2MiB    32.6M/s
1MiB    13M/s
512KiB  15.3M/s
256KiB  28.4M/s
128KiB  10.9M/s
64KiB   7.2M/s
32KiB   5.86M/s
16KiB   1.92M/s
But that doesn't look much different than without an offset. So maybe not that big of a deal. 3 open-au still looks good, 4 is falling fast.
Agreed.
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=1 --offset=$[240*1024*1024] --random
8MiB    32.3M/s
4MiB    32.6M/s
2MiB    13.5M/s
1MiB    6.36M/s
512KiB  9.29M/s
256KiB  28.9M/s
128KiB  11.1M/s
64KiB   6.05M/s
32KiB   12.4M/s
16KiB   2.48M/s
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=2 --offset=$[240*1024*1024] --random
8MiB    32.6M/s
4MiB    33.5M/s
2MiB    32.7M/s
1MiB    10.8M/s
512KiB  6.68M/s
256KiB  7.03M/s
128KiB  3.73M/s
64KiB   4.41M/s
32KiB   3.36M/s
16KiB   1.7M/s
And just 1 open-au for random to get decent performance.
Right, but again it's sometimes fast for 1 erase block and sometimes slow, indicating that two physical erase blocks are used to back that one logical erase block in random access mode.
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[239*1024*1024]
8MiB    25.8M/s
4MiB    11.8M/s
2MiB    8.46M/s
1MiB    6.47M/s
512KiB  4.37M/s
256KiB  4.77M/s
128KiB  4.48M/s
64KiB   2.86M/s
32KiB   2.27M/s
16KiB   1.68M/s
[andrew@mythdvr flashbench]$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[240*1024*1024]
8MiB    21.8M/s
4MiB    33.3M/s
2MiB    32.9M/s
1MiB    13.1M/s
512KiB  15.4M/s
256KiB  28.8M/s
128KiB  11M/s
64KiB   7.3M/s
32KiB   5.88M/s
16KiB   1.92M/s
So 240MiB is looking like a boundary!
Yep.
Switched machines for further testing but was able to confirm previous results first.
andrew@bigbox:~/flashbench$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[8*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[241*1024*1024]
8MiB    11.8M/s
4MiB    14M/s
2MiB    6.48M/s
1MiB    8.71M/s
512KiB  4.9M/s
256KiB  4.71M/s
128KiB  5.6M/s
64KiB   2.93M/s
32KiB   3.18M/s
16KiB   811K/s
241MiB offset is slow, like 239MiB.
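A quick way to see which candidate erase block sizes each of the tested offsets is aligned to is a modulo check. The helper name `aligned_to` and the candidate list (3MiB and 8MiB, the two sizes discussed in this thread) are choices made for this sketch only.

```python
def aligned_to(offset_mib, candidates_mib=(3, 8)):
    """Return the candidate sizes (in MiB) that evenly divide the offset."""
    return [c for c in candidates_mib if offset_mib % c == 0]

if __name__ == "__main__":
    for off in (239, 240, 241):
        print(f"{off} MiB is aligned to: {aligned_to(off)}")
```

One thing the arithmetic shows is that 240MiB is a multiple of both 3MiB and 8MiB, so the fast result at 240 with slow neighbours at 239 and 241 is consistent with either size and does not by itself separate the two.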
Trying smaller erase block sizes:
andrew@bigbox:~/flashbench$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[4*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[241*1024*1024]
4MiB    12.1M/s
2MiB    16M/s
1MiB    34.9M/s
512KiB  33.8M/s
256KiB  31.2M/s
128KiB  27.1M/s
64KiB   20.9M/s
32KiB   2.12M/s
16KiB   1.77M/s
andrew@bigbox:~/flashbench$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[2*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[241*1024*1024]
2MiB    34.9M/s
1MiB    34.6M/s
512KiB  33.6M/s
256KiB  31.9M/s
128KiB  28.1M/s
64KiB   23.4M/s
32KiB   10.5M/s
16KiB   1.25M/s
So both 4MiB and 2MiB look like they're still within that same erase block and not crossing a bound with a 241 offset. So erase block size should be larger than 4MiB, possibly ruling out the 3MiB theory. But the fact that at 241MiB offset and 4MiB eraseblock the 4MiB, 2MiB, and 32KiB are slow, makes me question how much I trust that 8MiB is really the eraseblock size.
I would assume that those are just random artifacts of the device having to clean up the physical erase blocks occasionally when you don't write the entire 8MB block all the time. Doing a 2 MB write on an 8MB physical erase block is semi-random I/O.
andrew@bigbox:~/flashbench$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[4*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[242*1024*1024]
4MiB    15.9M/s
2MiB    34.8M/s
1MiB    34.3M/s
512KiB  33.6M/s
256KiB  31.4M/s
128KiB  27.1M/s
64KiB   20.8M/s
32KiB   2.38M/s
^C
andrew@bigbox:~/flashbench$ sudo ./flashbench /dev/sdb --open-au --erasesize=$[4*1024*1024] --blocksize=$[16*1024] --open-au-nr=4 --offset=$[245*1024*1024]
4MiB    34M/s
2MiB    5.72M/s
1MiB    7.84M/s
512KiB  10.9M/s
256KiB  4.9M/s
128KiB  3.13M/s
64KiB   6.3M/s
32KiB   1.29M/s
16KiB   1.51M/s
And a 245MiB offset is slow again with 4MiB eraseblock.
I think I'm going to go with 8MiB erase blocks on this. It's going to be an ext4 disk and I'll partition at 24MiB bounds, just in case 3MiB is really the proper eraseblock.
Right.
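The 24MiB partition alignment follows from taking the least common multiple of the two candidate erase block sizes. A minimal sketch of that arithmetic, assuming 512-byte logical sectors (an assumption, check the actual device) and using made-up helper names:

```python
from math import gcd

MIB = 1024 * 1024
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def partition_start_sector(index, align_mib):
    """Start sector of the index-th alignment boundary (index 1 = first)."""
    return index * align_mib * MIB // SECTOR_BYTES

if __name__ == "__main__":
    align_mib = lcm(3, 8)  # covers both the 3MiB and the 8MiB theory
    print(f"alignment: {align_mib} MiB")
    print(f"first aligned start sector: {partition_start_sector(1, align_mib)}")
```

A partition starting on a 24MiB boundary is erase-block aligned whether the real size turns out to be 3MiB or 8MiB.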
There are a few other things you can do:
* Set the stride= and stripe-width= to 8MB. The RAID configuration has a lot in common with flash media, so it can only improve performance if you do that.
* Use a separate partition for an external journal and align that to 8MB as well. Otherwise the journal might not be erase block aligned (it might do that automatically when you set the stripe-width as above, don't know yet).
* Consider using btrfs instead of ext4. According to our research, 3 or 4 erase blocks is not really enough for ext4, but btrfs can cope with this unless you do a lot of sync() operations, which require more and would work better on ext4.
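For the stride suggestion, mke2fs takes these values in filesystem blocks rather than bytes, so 8MB has to be converted first. The sketch below assumes a 4KiB filesystem block size (not stated in this thread) and the `stride` and `stripe_width` extended options as documented in the mke2fs man page; the helper name is made up, and it only builds the option string without running anything.

```python
MIB = 1024 * 1024

def ext4_stride_options(erase_block_bytes, fs_block_bytes=4096):
    """Build an mke2fs -E option string for one erase block per stripe.

    Assumption: fs_block_bytes is the ext4 block size that will be
    used (4096 here); both values are expressed in filesystem blocks.
    """
    if erase_block_bytes % fs_block_bytes != 0:
        raise ValueError("erase block must be a multiple of the fs block size")
    blocks = erase_block_bytes // fs_block_bytes
    return f"stride={blocks},stripe_width={blocks}"

if __name__ == "__main__":
    # Print the option string for 8MiB erase blocks and 4KiB fs blocks.
    print("-E " + ext4_stride_options(8 * MIB))
```

The resulting string would be passed to mke2fs along with the matching block size. Whether the journal then ends up erase-block aligned is left open in the mail above and is not something this sketch verifies.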
Arnd