FSG_NUM_BUFFERS is set to 2 as default. Usually 2 buffers are enough to establish a good double buffering pipeline. But when dealing with expensive request preparation (i.e. dma_map) there may be benefits of increasing the number of buffers. There is an extra cost for every first request, the others are prepared in parallell with an ongoing transfer. Every time all buffers are consumed there is an additional cost for the next first request. Increasing the number of buffers decreases the risk of running out of buffers.
Test set up
 * Running dd on the host reading from the mass storage device
 * cmdline: dd if=/dev/sdb of=/dev/null bs=4k count=$((256*100))
 * Caches are dropped on the host and on the device before each run
Measurements on a Snowball board with the ondemand governor active.
FSG_NUM_BUFFERS 2
104857600 bytes (105 MB) copied, 5.62173 s, 18.7 MB/s
104857600 bytes (105 MB) copied, 5.61811 s, 18.7 MB/s
104857600 bytes (105 MB) copied, 5.57817 s, 18.8 MB/s
104857600 bytes (105 MB) copied, 5.57769 s, 18.8 MB/s
104857600 bytes (105 MB) copied, 5.59654 s, 18.7 MB/s
104857600 bytes (105 MB) copied, 5.58948 s, 18.8 MB/s
FSG_NUM_BUFFERS 4
104857600 bytes (105 MB) copied, 5.26839 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.2691 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.2711 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27174 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27261 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27135 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27249 s, 19.9 MB/s
There may not be one optimal number for all boards. That is the reason for adding the number to Kconfig.
Signed-off-by: Per Forlin per.forlin@linaro.org
---
 drivers/usb/gadget/Kconfig          |   15 +++++++++++++++
 drivers/usb/gadget/storage_common.c |    4 ++--
 2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/gadget/Kconfig b/drivers/usb/gadget/Kconfig
index 029e288..bbd17f3 100644
--- a/drivers/usb/gadget/Kconfig
+++ b/drivers/usb/gadget/Kconfig
@@ -96,6 +96,21 @@ config USB_GADGET_VBUS_DRAW
 	   This value will be used except for system-specific gadget
 	   drivers that have more specific information.
+config USB_GADGET_STORAGE_NUM_BUFFERS
+	int "Number of storage pipline buffers"
+	range 2 64
+	default 2
+	help
+	   Usually 2 buffers are enough to establish a good
+	   double buffering pipeline. But when dealing with expensive
+	   request preparation (i.e. dma_map) there may be benefits of
+	   increasing the number of buffers. There is an extra cost for
+	   every first request, the others are prepared in parallell with
+	   an ongoing transfer. Every time all buffers are consumed there is
+	   an additional cost for the next first request. Increasing the number
+	   of buffers decreases the risk of running out of buffers.
+	   If unsure, say 2.
+
 config USB_GADGET_SELECTED
 	boolean
diff --git a/drivers/usb/gadget/storage_common.c b/drivers/usb/gadget/storage_common.c
index 1fa4f70..512d9cf 100644
--- a/drivers/usb/gadget/storage_common.c
+++ b/drivers/usb/gadget/storage_common.c
@@ -262,8 +262,8 @@ static struct fsg_lun *fsg_lun_from_dev(struct device *dev)
 #define EP0_BUFSIZE	256
 #define DELAYED_STATUS	(EP0_BUFSIZE + 999)	/* An impossibly large value */
-/* Number of buffers we will use. 2 is enough for double-buffering */
-#define FSG_NUM_BUFFERS	2
+/* Number of buffers we will use. 2 is usually enough for double-buffering */
+#define FSG_NUM_BUFFERS	CONFIG_USB_GADGET_STORAGE_NUM_BUFFERS
 /* Default size of buffer length. */
 #define FSG_BUFLEN	((u32)16384)
On Mon, 08 Aug 2011 02:22:14 +0200, Per Forlin per.forlin@linaro.org wrote:
FSG_NUM_BUFFERS is set to 2 as default. Usually 2 buffers are enough to establish a good double buffering pipeline. But when dealing with expensive request preparation (i.e. dma_map) there may be benefits of increasing the number of buffers. There is an extra cost for every first request, the others are prepared in parallell with an ongoing transfer. Every time all buffers are consumed there is an additional cost for the next first request. Increasing the number of buffers decreases the risk of running out of buffers.
[... benchmark snip ...]
There may not be one optimal number for all boards. That is the reason for adding the number to Kconfig,
It could actually be turned into a run-time configuration, but that's a bit more intrusive, so adding Kconfig should be enough for now.
Signed-off-by: Per Forlin per.forlin@linaro.org
Acked-by: Michal Nazarewicz mina86@mina86.com
Some not very relevant comments below:
diff --git a/drivers/usb/gadget/Kconfig b/drivers/usb/gadget/Kconfig
index 029e288..bbd17f3 100644
--- a/drivers/usb/gadget/Kconfig
+++ b/drivers/usb/gadget/Kconfig
@@ -96,6 +96,21 @@ config USB_GADGET_VBUS_DRAW
 	   This value will be used except for system-specific gadget
 	   drivers that have more specific information.
+config USB_GADGET_STORAGE_NUM_BUFFERS
- int "Number of storage pipline buffers"
“pipeline”, missing “e”.
+	range 2 64
+	default 2
+	help
+	   Usually 2 buffers are enough to establish a good
+	   double buffering pipeline. But when dealing with expensive
s/double//. By definition, “double” buffering uses two buffers, doesn't it? So if we change the number of buffers, it's no longer “double”.
+	   request preparation (i.e. dma_map) there may be benefits of
+	   increasing the number of buffers. There is an extra cost for
+	   every first request, the others are prepared in parallell with
“parallel”, too many “l”s.
+	   an ongoing transfer. Every time all buffers are consumed there is
+	   an additional cost for the next first request. Increasing the number
“Next first request” sounds a bit illogical to me.
+	   of buffers decreases the risk of running out of buffers.
+	   If unsure, say 2.
 config USB_GADGET_SELECTED
 	boolean

diff --git a/drivers/usb/gadget/storage_common.c b/drivers/usb/gadget/storage_common.c
index 1fa4f70..512d9cf 100644
--- a/drivers/usb/gadget/storage_common.c
+++ b/drivers/usb/gadget/storage_common.c
@@ -262,8 +262,8 @@ static struct fsg_lun *fsg_lun_from_dev(struct device *dev)
 #define EP0_BUFSIZE	256
 #define DELAYED_STATUS	(EP0_BUFSIZE + 999)	/* An impossibly large value */

-/* Number of buffers we will use. 2 is enough for double-buffering */
-#define FSG_NUM_BUFFERS	2
+/* Number of buffers we will use. 2 is usually enough for double-buffering */
s/double-buffering/good buffering pipeline/?
+#define FSG_NUM_BUFFERS	CONFIG_USB_GADGET_STORAGE_NUM_BUFFERS

 /* Default size of buffer length. */
 #define FSG_BUFLEN	((u32)16384)
Hi Michal,
2011/8/8 Michal Nazarewicz mina86@mina86.com:
On Mon, 08 Aug 2011 02:22:14 +0200, Per Forlin per.forlin@linaro.org wrote:
FSG_NUM_BUFFERS is set to 2 as default. Usually 2 buffers are enough to establish a good double buffering pipeline. But when dealing with expensive request preparation (i.e. dma_map) there may be benefits of increasing the number of buffers. There is an extra cost for every first request, the others are prepared in parallell with an ongoing transfer. Every time all buffers are consumed there is an additional cost for the next first request. Increasing the number of buffers decreases the risk of running out of buffers.
[... benchmark snip ...]
There may not be one optimal number for all boards. That is the reason for adding the number to Kconfig,
It could actually be turned into a run-time configuration, but that's a bit more intrusive,
I considered this too but couldn't see how this would be done nicely.
Signed-off-by: Per Forlin per.forlin@linaro.org
Acked-by: Michal Nazarewicz mina86@mina86.com
Some not very relevant comments below:
Thanks for these pointers. I'll update.
diff --git a/drivers/usb/gadget/Kconfig b/drivers/usb/gadget/Kconfig
index 029e288..bbd17f3 100644
--- a/drivers/usb/gadget/Kconfig
+++ b/drivers/usb/gadget/Kconfig
@@ -96,6 +96,21 @@ config USB_GADGET_VBUS_DRAW
 	   This value will be used except for system-specific gadget
 	   drivers that have more specific information.
+config USB_GADGET_STORAGE_NUM_BUFFERS
- int "Number of storage pipline buffers"
“pipeline”, missing “e”.
Will fix.
+	range 2 64
+	default 2
+	help
+	   Usually 2 buffers are enough to establish a good
+	   double buffering pipeline. But when dealing with expensive
s/double//. By definition, “double” buffering uses two buffers, doesn't it? So if we change the number of buffers, it's no longer “double”.
Good point!
+	   request preparation (i.e. dma_map) there may be benefits of
+	   increasing the number of buffers. There is an extra cost for
+	   every first request, the others are prepared in parallell with
“parallel”, too many “l”s.
Will fix
+	   an ongoing transfer. Every time all buffers are consumed there is
+	   an additional cost for the next first request. Increasing the number
“Next first request” sounds a bit illogical to me.
I'll remove "next" to avoid confusion.
+	   of buffers decreases the risk of running out of buffers.
+	   If unsure, say 2.
 config USB_GADGET_SELECTED
 	boolean

diff --git a/drivers/usb/gadget/storage_common.c b/drivers/usb/gadget/storage_common.c
index 1fa4f70..512d9cf 100644
--- a/drivers/usb/gadget/storage_common.c
+++ b/drivers/usb/gadget/storage_common.c
@@ -262,8 +262,8 @@ static struct fsg_lun *fsg_lun_from_dev(struct device *dev)
 #define EP0_BUFSIZE	256
 #define DELAYED_STATUS	(EP0_BUFSIZE + 999)	/* An impossibly large value */

-/* Number of buffers we will use. 2 is enough for double-buffering */
-#define FSG_NUM_BUFFERS	2
+/* Number of buffers we will use. 2 is usually enough for double-buffering */
s/double-buffering/good buffering pipeline/?
Will fix.
+#define FSG_NUM_BUFFERS	CONFIG_USB_GADGET_STORAGE_NUM_BUFFERS

 /* Default size of buffer length. */
 #define FSG_BUFLEN	((u32)16384)
--
Best regards,
Michal "mina86" Nazarewicz (email/xmpp: mnazarewicz@google.com)
On Mon, 08 Aug 2011 02:22:14 +0200, Per Forlin wrote:
There may not be one optimal number for all boards. That is the reason for adding the number to Kconfig,
2011/8/8 Michal Nazarewicz mina86@mina86.com:
It could actually be turned into a run-time configuration, but that's a bit more intrusive,
On Mon, 08 Aug 2011 12:50:01 +0200, Per Forlin wrote:
I considered this too but couldn't see how this would be done nicely.
In f_mass_storage.c, the buffers are stored in the fsg_common structure as a fixed-size array. Instead, this could be replaced by a pointer plus a buffhds_count field holding the number of buffers.
The buffhds_count would have to be initialised in fsg_common_init() from a (new) field stored in the fsg_config structure, which in turn is in most cases filled with values from fsg_module_parameters, which correspond to the module parameters defined by the FSG_MODULE_PARAMETERS() macro.
The buffers are allocated in fsg_common_init() (see line 2815). All that would need to change is to use the value stored in fsg_common instead of FSG_NUM_BUFFERS, and to first allocate space for the common->buffhds array.
Similarly, fsg_common_release() would have to get the number of buffers from the fsg_common structure and free common->buffhds after freeing each individual buffer.
As far as I can tell, the other places where FSG_NUM_BUFFERS is used would need only a simple s/FSG_NUM_BUFFERS/common->buffhds_count/g.
If you also care about file_storage.c, it's almost the same, except that the module parameter juggling is a bit easier.
But as I've said previously, I'm not saying that's the way it should be done right now. I'm just describing how it could be done, should someone need to change the number of buffers via a module parameter rather than by recompiling the module.
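To make that concrete, here is a rough sketch of the allocation side. This is only an illustration of the description above: the cfg->num_buffers field is made up for the example, and the real fsg_common_init()/fsg_common_release() differ in the details.

	struct fsg_common {
		/* ... */
		struct fsg_buffhd	*buffhds;	/* was: buffhds[FSG_NUM_BUFFERS] */
		unsigned int		buffhds_count;
		/* ... */
	};

	/* In fsg_common_init(): size the array at run time. */
	common->buffhds_count = cfg->num_buffers;
	common->buffhds = kcalloc(common->buffhds_count,
				  sizeof(*common->buffhds), GFP_KERNEL);
	if (!common->buffhds)
		goto error_release;

	/* In fsg_common_release(): free each buffer, then the array itself. */
	for (i = 0; i < common->buffhds_count; ++i)
		kfree(common->buffhds[i].buf);
	kfree(common->buffhds);

Everything else would keep accessing the buffers through common->buffhds exactly as it does today with the fixed array.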
On Mon, 8 Aug 2011, Per Forlin wrote:
FSG_NUM_BUFFERS is set to 2 as default. Usually 2 buffers are enough to establish a good double buffering pipeline. But when dealing with expensive request preparation (i.e. dma_map) there may be benefits of increasing the number of buffers. There is an extra cost for every first request, the others are prepared in parallell with an ongoing transfer. Every time all buffers are consumed there is an additional cost for the next first request. Increasing the number of buffers decreases the risk of running out of buffers.
The reasoning here is not clear. Throughput is limited on the device side mostly by the time required to prepare the buffer, transfer the data, and copy the data to/from the backing storage. Using more buffers won't affect this directly. Putting this another way, if the time required to prepare a buffer and copy the data is longer than the time needed to transfer the data over the USB link, the throughput won't be optimal no matter how many buffers you use.
On the other hand, adding buffers will smooth out irregularities and latencies caused by scheduling delays when other tasks are running at the same time. Still, I would expect the benefit to be fairly small and to diminish rapidly as more buffers are added.
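To put the same argument as a rough formula: in steady state the throughput is bounded by roughly FSG_BUFLEN / max(t_prepare + t_copy, t_usb). Adding buffers cannot raise that bound; it only helps absorb variation in the prepare/copy time so that the USB side does not sit idle.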
Test set up
- Running dd on the host reading from the mass storage device
- cmdline: dd if=/dev/sdb of=/dev/null bs=4k count=$((256*100))
- Caches are dropped on the host and on the device before each run
Measurements on a Snowball board with the ondemand governor active.
FSG_NUM_BUFFERS 2
104857600 bytes (105 MB) copied, 5.62173 s, 18.7 MB/s
104857600 bytes (105 MB) copied, 5.61811 s, 18.7 MB/s
104857600 bytes (105 MB) copied, 5.57817 s, 18.8 MB/s
104857600 bytes (105 MB) copied, 5.57769 s, 18.8 MB/s
104857600 bytes (105 MB) copied, 5.59654 s, 18.7 MB/s
104857600 bytes (105 MB) copied, 5.58948 s, 18.8 MB/s
FSG_NUM_BUFFERS 4
104857600 bytes (105 MB) copied, 5.26839 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.2691 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.2711 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27174 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27261 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27135 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27249 s, 19.9 MB/s
Okay, 6% is a worthwhile improvement, though not huge. Did you try 6 or 8 buffers? I bet going beyond 4 makes very little difference.
There may not be one optimal number for all boards. That is the reason for adding the number to Kconfig,
Signed-off-by: Per Forlin per.forlin@linaro.org
 drivers/usb/gadget/Kconfig          |   15 +++++++++++++++
 drivers/usb/gadget/storage_common.c |    4 ++--
 2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/gadget/Kconfig b/drivers/usb/gadget/Kconfig
index 029e288..bbd17f3 100644
--- a/drivers/usb/gadget/Kconfig
+++ b/drivers/usb/gadget/Kconfig
@@ -96,6 +96,21 @@ config USB_GADGET_VBUS_DRAW
 	   This value will be used except for system-specific gadget
 	   drivers that have more specific information.
+config USB_GADGET_STORAGE_NUM_BUFFERS
- int "Number of storage pipline buffers"
+	range 2 64
64 is definitely way into the overkill range. I'd be very tempted to leave the maximum set at 4. You might even consider setting it to 4 always -- although each buffer requires 16 KB of contiguous kernel memory so minimizing the number of buffers is always desirable.
Alan Stern
On 8 August 2011 16:34, Alan Stern stern@rowland.harvard.edu wrote:
On Mon, 8 Aug 2011, Per Forlin wrote:
FSG_NUM_BUFFERS is set to 2 as default. Usually 2 buffers are enough to establish a good double buffering pipeline. But when dealing with expensive request preparation (i.e. dma_map) there may be benefits of increasing the number of buffers. There is an extra cost for every first request, the others are prepared in parallell with an ongoing transfer. Every time all buffers are consumed there is an additional cost for the next first request. Increasing the number of buffers decreases the risk of running out of buffers.
The reasoning here is not clear. Throughput is limited on the device side mostly by the time required to prepare the buffer, transfer the data, and copy the data to/from the backing storage. Using more buffers won't affect this directly. Putting this another way, if the time required to prepare a buffer and copy the data is longer than the time needed to transfer the data over the USB link, the throughput won't be optimal no matter how many buffers you use.
On the other hand, adding buffers will smooth out irregularities and latencies caused by scheduling delays when other tasks are running at the same time. Still, I would expect the benefit to be fairly small and to diminish rapidly as more buffers are added.
Test set up
 * Running dd on the host reading from the mass storage device
 * cmdline: dd if=/dev/sdb of=/dev/null bs=4k count=$((256*100))
 * Caches are dropped on the host and on the device before each run
Measurements on a Snowball board with the ondemand governor active.
FSG_NUM_BUFFERS 2
104857600 bytes (105 MB) copied, 5.62173 s, 18.7 MB/s
104857600 bytes (105 MB) copied, 5.61811 s, 18.7 MB/s
104857600 bytes (105 MB) copied, 5.57817 s, 18.8 MB/s
104857600 bytes (105 MB) copied, 5.57769 s, 18.8 MB/s
104857600 bytes (105 MB) copied, 5.59654 s, 18.7 MB/s
104857600 bytes (105 MB) copied, 5.58948 s, 18.8 MB/s
FSG_NUM_BUFFERS 4
104857600 bytes (105 MB) copied, 5.26839 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.2691 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.2711 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27174 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27261 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27135 s, 19.9 MB/s
104857600 bytes (105 MB) copied, 5.27249 s, 19.9 MB/s
Okay, 6% is a worthwhile improvement, though not huge. Did you try 6 or 8 buffers? I bet going beyond 4 makes very little difference.
On my board 4 buffers are enough. More buffers will make no difference.
Background study
I started by running dd to measure performance on my target side, simply to see what the maximum bandwidth would be: 20 MiB/s on my board. Then I started the gadget mass storage on the device and ran the same test from the PC host side: 18.7 MiB/s. I guessed this might be due to serialized cache handling, so I tried removing the dma_map call (replacing it with virt_to_dma). This is just a dummy test to see whether dma_map is causing the performance drop. Without dma_map I get 20 MiB/s, so the loss appears to be due to the dma_map call. The dma_map only adds latency for the first request.
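For reference, the "dummy test" was nothing more elaborate than bypassing the mapping in the request-preparation path, roughly like this (only a sketch: the exact call site, the is_in flag and the error handling depend on the controller driver, and virt_to_dma() is ARM-specific):

	/* Normal path: map the request buffer for DMA (includes cache maintenance). */
	req->dma = dma_map_single(dev, req->buf, req->length,
				  is_in ? DMA_TO_DEVICE : DMA_FROM_DEVICE);

	/* Dummy test: skip the mapping and cache work, just translate the address.
	 * Not correct in general; only used to measure what dma_map costs. */
	req->dma = virt_to_dma(dev, req->buf);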
Studying the flow of buffers I can see that both buffers are filled with data from VFS, then both are transmitted over USB: burst-like behaviour. More buffers will add "smoothness" and prevent the bursts from affecting performance negatively. With 4 buffers there were very few cases where all 4 buffers were empty during a transfer.
There may not be one optimal number for all boards. That is the reason for adding the number to Kconfig,
Signed-off-by: Per Forlin per.forlin@linaro.org
 drivers/usb/gadget/Kconfig          |   15 +++++++++++++++
 drivers/usb/gadget/storage_common.c |    4 ++--
 2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/gadget/Kconfig b/drivers/usb/gadget/Kconfig
index 029e288..bbd17f3 100644
--- a/drivers/usb/gadget/Kconfig
+++ b/drivers/usb/gadget/Kconfig
@@ -96,6 +96,21 @@ config USB_GADGET_VBUS_DRAW
 	   This value will be used except for system-specific gadget
 	   drivers that have more specific information.
+config USB_GADGET_STORAGE_NUM_BUFFERS
- int "Number of storage pipline buffers"
+	range 2 64
64 is definitely way into the overkill range. I'd be very tempted to leave the maximum set at 4. You might even consider setting it to 4 always -- although each buffer requires 16 KB of contiguous kernel memory so minimizing the number of buffers is always desirable.
I agree, 64 is not a realistic number. Max 4 is fine with me.
Alan Stern
On Mon, 8 Aug 2011, Per Forlin wrote:
Okay, 6% is a worthwhile improvement, though not huge. Did you try 6 or 8 buffers? I bet going beyond 4 makes very little difference.
On my board 4 buffers are enough. More buffers will make no difference.
Background study
I started by running dd to measure performance on my target side, simply to see what the maximum bandwidth would be: 20 MiB/s on my board. Then I started the gadget mass storage on the device and ran the same test from the PC host side: 18.7 MiB/s. I guessed this might be due to serialized cache handling, so I tried removing the dma_map call (replacing it with virt_to_dma). This is just a dummy test to see whether dma_map is causing the performance drop. Without dma_map I get 20 MiB/s, so the loss appears to be due to the dma_map call. The dma_map only adds latency for the first request.
What exactly do you mean by "first request"? g_usb_storage uses two usb_request structures for bulk-IN transfers; are they what you're talking about? Or are you talking about calls to usb_ep_queue()? Or something else?
Your test showed an extra latency of about 0.4 seconds when using dma_map. All that latency comes from a single request (the first one)? That doesn't sound right.
Studying the flow of buffers I can see that both buffers are filled with data from VFS, then both are transmitted over USB: burst-like behaviour.
Actually the first buffer is filled from VFS, then it is transmitted over USB. While the transmission is in progress, the second buffer is filled from VFS. Each time a transmission ends, the next buffer is queued for transmission -- unless it hasn't been filled yet, in which case it is queued when it gets filled.
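(For anyone following along in the source: the per-buffer bookkeeping behind this is the fsg_buffhd structure in storage_common.c, roughly paraphrased below; see the file for the exact definition.)

	enum fsg_buffer_state {
		BUF_STATE_EMPTY = 0,	/* free; may be filled from VFS or bulk-out */
		BUF_STATE_FULL,		/* filled, waiting to be queued or consumed */
		BUF_STATE_BUSY		/* a transfer on this buffer is in flight */
	};

	struct fsg_buffhd {
		void			*buf;		/* FSG_BUFLEN bytes */
		enum fsg_buffer_state	state;
		struct fsg_buffhd	*next;		/* circular chain of buffer heads */
		struct usb_request	*inreq;		/* bulk-IN request */
		struct usb_request	*outreq;	/* bulk-OUT request */
		/* ... */
	};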
What do you mean by "burst like behavior"?
More buffers will add "smoothness" and prevent the bursts from affecting performance negatively. With 4 buffers there were very few cases where all 4 buffers were empty during a transfer.
In what sense do the bursts affect performance? It makes sense that having all the buffers empty from time to time would reduce throughput. But as long as each buffer gets filled and queued before the previous buffer has finished transmitting, it shouldn't matter how "bursty" the behavior is.
Alan Stern
On 8 August 2011 20:45, Alan Stern stern@rowland.harvard.edu wrote:
On Mon, 8 Aug 2011, Per Forlin wrote:
Okay, 6% is a worthwhile improvement, though not huge. Did you try 6 or 8 buffers? I bet going beyond 4 makes very little difference.
On my board 4 buffers are enough. More buffers will make no difference.
Background study
I started by running dd to measure performance on my target side, simply to see what the maximum bandwidth would be: 20 MiB/s on my board. Then I started the gadget mass storage on the device and ran the same test from the PC host side: 18.7 MiB/s. I guessed this might be due to serialized cache handling, so I tried removing the dma_map call (replacing it with virt_to_dma). This is just a dummy test to see whether dma_map is causing the performance drop. Without dma_map I get 20 MiB/s, so the loss appears to be due to the dma_map call. The dma_map only adds latency for the first request.
What exactly do you mean by "first request"?
When both buffers are empty. The first request is filled up by VFS and then prepared. At this point there is no ongoing transfer over USB, so it costs more, since the preparation can't run in parallel with an ongoing transfer. Every time the two buffers run empty there is an extra cost for the first request in the next series of requests. Why the data isn't delivered by VFS in time I don't know.
Your test showed an extra latency of about 0.4 seconds when using dma_map. All that latency comes from a single request (the first one)? That doesn't sound right.
This "first request scenario" I refer to happens very often.
Studying the flow of buffers I can see that both buffers are filled with data from VFS, then both are transmitted over USB: burst-like behaviour.
Actually the first buffer is filled from VFS, then it is transmitted over USB. While the transmission is in progress, the second buffer is filled from VFS. Each time a transmission ends, the next buffer is queued for transmission -- unless it hasn't been filled yet, in which case it is queued when it gets filled.
What do you mean by "burst like behavior"?
Instead of getting refills from VFS smoothly, the refills come in bursts. VFS fills up the two buffers, then USB manages to transmit both buffers before VFS refills new data. Roughly 20% of the time VFS refills the data before USB has consumed it all; 80% of the time USB manages to consume all the data before VFS refills the buffers. The burst-like effects are probably due to power saving, which adds latency in the system.
More buffers will add "smoothness" and prevent the bursts from affecting performance negatively. With 4 buffers there were very few cases where all 4 buffers were empty during a transfer.
In what sense do the bursts affect performance? It makes sense that having all the buffers empty from time to time would reduce throughput. But as long as each buffer gets filled and queued before the previous buffer has finished transmitting, it shouldn't matter how "bursty" the behavior is.
This is true. In my system the buffer doesn't get filled up before the previous has finished. With 4 buffers it works fine.
Thanks, Per
On Mon, 8 Aug 2011, Per Forlin wrote:
On 8 August 2011 20:45, Alan Stern stern@rowland.harvard.edu wrote:
On Mon, 8 Aug 2011, Per Forlin wrote:
Okay, 6% is a worthwhile improvement, though not huge. Did you try 6 or 8 buffers? I bet going beyond 4 makes very little difference.
On my board 4 buffers are enough. More buffers will make no difference.
Background study
I started by running dd to measure performance on my target side, simply to see what the maximum bandwidth would be: 20 MiB/s on my board. Then I started the gadget mass storage on the device and ran the same test from the PC host side: 18.7 MiB/s. I guessed this might be due to serialized cache handling, so I tried removing the dma_map call (replacing it with virt_to_dma). This is just a dummy test to see whether dma_map is causing the performance drop. Without dma_map I get 20 MiB/s, so the loss appears to be due to the dma_map call. The dma_map only adds latency for the first request.
What exactly do you mean by "first request"?
When both buffers are empty. The first request is filled up by VFS and then prepared. At this point there is no ongoing transfer over USB, so it costs more, since the preparation can't run in parallel with an ongoing transfer. Every time the two buffers run empty there is an extra cost for the first request in the next series of requests. Why the data isn't delivered by VFS in time I don't know.
Okay, that makes sense. But what connection does this have with dma_map? Don't you also have situations where both buffers are empty when you replace dma_map with virt_to_dma?
Instead of getting refills from VFS smoothly, the refills come in bursts. VFS fills up the two buffers, then USB manages to transmit both buffers before VFS refills new data. Roughly 20% of the time VFS refills the data before USB has consumed it all; 80% of the time USB manages to consume all the data before VFS refills the buffers. The burst-like effects are probably due to power saving, which adds latency in the system.
This is true. In my system the buffer doesn't get filled up before the previous has finished. With 4 buffers it works fine.
Adding buffers to smooth out the bursty VFS behavior is a reasonable thing to do. But you should improve the patch description and the comments; the real reason for reduced throughput is VFS's behavior. Something like what you just wrote here would be fine.
And reduce the maximum number of buffers to 4. When you've made those changes, I'll Ack the patch.
Alan Stern
On 8 August 2011 22:23, Alan Stern stern@rowland.harvard.edu wrote:
On Mon, 8 Aug 2011, Per Forlin wrote:
On 8 August 2011 20:45, Alan Stern stern@rowland.harvard.edu wrote:
On Mon, 8 Aug 2011, Per Forlin wrote:
Okay, 6% is a worthwhile improvement, though not huge. Did you try 6 or 8 buffers? I bet going beyond 4 makes very little difference.
On my board 4 buffers are enough. More buffers will make no difference.
Background study
I started by running dd to measure performance on my target side, simply to see what the maximum bandwidth would be: 20 MiB/s on my board. Then I started the gadget mass storage on the device and ran the same test from the PC host side: 18.7 MiB/s. I guessed this might be due to serialized cache handling, so I tried removing the dma_map call (replacing it with virt_to_dma). This is just a dummy test to see whether dma_map is causing the performance drop. Without dma_map I get 20 MiB/s, so the loss appears to be due to the dma_map call. The dma_map only adds latency for the first request.
What exactly do you mean by "first request"?
When both buffers are empty. The first request is filled up by VFS and then prepared. At this point there is no ongoing transfer over USB, so it costs more, since the preparation can't run in parallel with an ongoing transfer. Every time the two buffers run empty there is an extra cost for the first request in the next series of requests. Why the data isn't delivered by VFS in time I don't know.
Okay, that makes sense. But what connection does this have with dma_map? Don't you also have situations where both buffers are empty when you replace dma_map with virt_to_dma?
All I do here is remove the cost of dma_map in order to verify whether dma_map is responsible for the delay. If the performance is good without dma_map, I know the extra cost is due to dma_map running serially instead of in parallel. If the performance were bad even without dma_map, something else would be causing it. It doesn't affect how VFS feeds data to the buffers.
Instead of getting refills from VFS smoothly, the refills come in bursts. VFS fills up the two buffers, then USB manages to transmit both buffers before VFS refills new data. Roughly 20% of the time VFS refills the data before USB has consumed it all; 80% of the time USB manages to consume all the data before VFS refills the buffers. The burst-like effects are probably due to power saving, which adds latency in the system.
This is true. In my system the buffer doesn't get filled up before the previous has finished. With 4 buffers it works fine.
Adding buffers to smooth out the bursty VFS behavior is a reasonable thing to do. But you should improve the patch description and the comments; the real reason for reduced throughput is VFS's behavior. Something like what you just wrote here would be fine.
I agree. The main reason is to compensate for bursty VFS behavior.
And reduce the maximum number of buffers to 4. When you've made those changes, I'll Ack the patch.
I'll do that.
Thanks for your time and patience, Per