Hi,
I am working on the blueprint https://blueprints.launchpad.net/linux-linaro/+spec/other-storage-performanc.... Currently I am investigating performance for DMA vs PIO on eMMC.
Pros and cons for DMA on MMC:
 + Offloads the CPU
 + Fewer interrupts: a single interrupt per transfer instead of hundreds or even thousands
 + Power save: DMA consumes less power than the CPU
 - Less bandwidth / throughput compared to PIO (CPU)
The reason for introducing double buffering in the MMC framework is to address the throughput issue for DMA on MMC. The assumption is that the CPU and DMA have higher throughput than the MMC / SD-card. My hypothesis is that the difference in performance between PIO mode and DMA mode for MMC is due to the latency of preparing a DMA job. If the next DMA job could be prepared while the current one is ongoing, this latency would be reduced. The biggest part of preparing a DMA job is cache maintenance. In my case I run on U5500 (mach-ux500), which has both L1 and L2 caches. The host mmc driver in use is the mmci driver (PL180).
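To make the idea concrete, here is a minimal sketch of what "preparing the next job" means for a host driver like mmci. The helper and struct names are hypothetical (this is not the actual patch); only dma_map_sg(), mmc_dev() and the mmc/scatterlist types are existing kernel API. The point is that the L1/L2 cache maintenance happens inside dma_map_sg(), so mapping request N+1 while request N is still on the bus takes that cost off the critical path.

/*
 * Minimal sketch of the double-buffering idea. Helper and struct
 * names are hypothetical, not the actual patch.
 */
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/mmc/core.h>
#include <linux/mmc/host.h>

struct mmci_next_job {
	struct mmc_data *data;
	int sg_len;			/* result of dma_map_sg() */
};

/*
 * Called as soon as the next request is known, while the current DMA
 * transfer is still running. dma_map_sg() is where the L1/L2 cache
 * clean/invalidate is performed, so doing it here hides that latency.
 */
static void mmci_prepare_next_job(struct mmc_host *mmc,
				  struct mmc_data *data,
				  struct mmci_next_job *job)
{
	enum dma_data_direction dir = (data->flags & MMC_DATA_READ) ?
				      DMA_FROM_DEVICE : DMA_TO_DEVICE;

	job->data = data;
	job->sg_len = dma_map_sg(mmc_dev(mmc), data->sg, data->sg_len, dir);
}

When the current transfer completes, the host only has to hand the pre-mapped sg list to the DMAC instead of paying the mapping cost at that point.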
I have made a hack in both the MMC framework and mmci as a proof of concept, and run IOZone to get measurements that support the hypothesis. The next step, if the results are promising, will be to clean up my work and send out patches for review.
The DMAC in ux500 supports two modes, LOG and PHY:
 LOG - many logical channels are multiplexed on top of one physical channel
 PHY - one channel per physical channel
The LOG and PHY modes have different latency, both HW- and SW-wise; one could almost treat them as two different DMACs. To get a wider test scope I have tested using both modes.
Summary of the results:

* It is optional for the mmc host driver to utilize the 2-buf support;
  2-buf in the framework requires no change in the host drivers
  (see the sketch below).
* IOZone shows no performance hit on existing drivers (*) when adding
  2-buf to the framework but not to the host driver.
  (*) So far I have only tested one driver.
* The performance gain for DMA using 2-buf is probably proportional to
  the cache maintenance time. The faster the card is, the more
  significant the cache maintenance part becomes, and vice versa.
* For U5500 with 2-buf, performance for DMA is:
  Throughput, DMA vanilla vs DMA 2-buf:
    read:  +5-10 %
    write: +0-3 %
  CPU load, CPU (PIO) vs DMA 2-buf:
    read large data:  minus 10-20 units of %
    read small data:  same as PIO
    write: same load as PIO (why?)
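As a sketch of how the 2-buf support can be kept optional for host drivers (the hook names below are hypothetical and only illustrate the principle, not the final API): the core would only call the prepare/cleanup hooks when the host driver actually provides them, so unmodified host drivers keep today's behaviour.

/*
 * Sketch only: assumes two new, optional callbacks are added to
 * struct mmc_host_ops (hypothetical names):
 *
 *	void (*pre_req)(struct mmc_host *host, struct mmc_request *mrq);
 *	void (*post_req)(struct mmc_host *host, struct mmc_request *mrq,
 *			 int err);
 *
 * The core invokes them only when the host driver has set them, so
 * host drivers that do not implement 2-buf are left untouched.
 */
static void mmc_pre_req(struct mmc_host *host, struct mmc_request *mrq)
{
	if (host->ops->pre_req)
		host->ops->pre_req(host, mrq);
}

static void mmc_post_req(struct mmc_host *host, struct mmc_request *mrq,
			 int err)
{
	if (host->ops->post_req)
		host->ops->post_req(host, mrq, err);
}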
Here follow two of the IOZone measurements, comparing MMC with and without double buffering. The rest can be found in the attached text files.
=== Performance CPU compared with DMA vanilla kernel ===
Absolute diff: MMC-VANILLA-CPU -> MMC-VANILLA-DMA-LOG
(throughput columns in KB/s; the "cpu:" columns are the CPU-load diff in units of % for the same six tests)
                                               random  random
   KB  reclen   write rewrite    read  reread    read   write
51200 4 -14 -8 -1005 -988 -679 -1 cpu: -0.0 -0.1 -0.8 -0.9 -0.7 +0.0
51200 8 -35 -34 -1763 -1791 -1327 +0 cpu: +0.0 -0.1 -0.9 -1.2 -0.7 +0.0
51200 16 +6 -38 -2712 -2728 -2225 +0 cpu: -0.1 -0.0 -1.6 -1.2 -0.7 -0.0
51200 32 -10 -79 -3640 -3710 -3298 -1 cpu: -0.1 -0.2 -1.2 -1.2 -0.7 -0.0
51200 64 +31 -16 -4401 -4533 -4212 -1 cpu: -0.2 -0.2 -0.6 -1.2 -1.2 -0.0
51200 128 +58 -58 -4749 -4776 -4532 -4 cpu: -0.2 -0.0 -1.2 -1.1 -1.2 +0.1
51200 256 +192 +283 -5343 -5347 -5184 +13 cpu: +0.0 +0.1 -1.2 -0.6 -1.2 +0.0
51200 512 +232 +470 -4663 -4690 -4588 +171 cpu: +0.1 +0.1 -4.5 -3.9 -3.8 -0.1
51200 1024 +250 +68 -3151 -3318 -3303 +122 cpu: -0.1 -0.5 -14.0 -13.5 -14.0 -0.1
51200 2048 +224 +401 -2708 -2601 -2612 +161 cpu: -1.7 -1.3 -18.4 -19.5 -17.8 -0.5
51200 4096 +194 +417 -2380 -2361 -2520 +242 cpu: -1.3 -1.6 -19.4 -19.9 -19.4 -0.6
51200 8192 +228 +315 -2279 -2327 -2291 +270 cpu: -1.0 -0.9 -20.8 -20.3 -21.0 -0.6
51200 16384 +254 +289 -2260 -2232 -2269 +308 cpu: -0.8 -0.8 -20.5 -19.9 -21.5 -0.4
=== Performance CPU compared with DMA with MMC double buffering ===
Absolute diff: MMC-VANILLA-CPU -> MMC-MMCI-2-BUF-DMA-LOG
                                               random  random
   KB  reclen   write rewrite    read  reread    read   write
51200 4 -7 -11 -533 -513 -365 +0 cpu: -0.0 -0.1 -0.5 -0.7 -0.4 +0.0
51200 8 -19 -28 -916 -932 -671 +0 cpu: -0.0 -0.0 -0.3 -0.6 -0.2 +0.0
51200 16 +14 -13 -1467 -1479 -1203 +1 cpu: +0.0 -0.1 -0.7 -0.7 -0.2 -0.0
51200 32 +61 +24 -2008 -2088 -1853 +4 cpu: -0.3 -0.2 -0.7 -0.7 -0.2 -0.0
51200 64 +130 +84 -2571 -2692 -2483 +5 cpu: +0.0 -0.4 -0.1 -0.7 -0.7 +0.0
51200 128 +275 +279 -2760 -2747 -2607 +19 cpu: -0.1 +0.1 -0.7 -0.6 -0.7 +0.1
51200 256 +558 +503 -3455 -3429 -3216 +55 cpu: -0.1 +0.1 -0.8 -0.1 -0.8 +0.0
51200 512 +608 +820 -2476 -2497 -2504 +154 cpu: +0.2 +0.5 -3.3 -2.1 -2.7 +0.0
51200 1024 +652 +493 -818 -977 -1023 +291 cpu: +0.0 -0.1 -13.2 -12.8 -13.3 +0.1
51200 2048 +654 +809 -241 -218 -242 +501 cpu: -1.5 -1.2 -16.9 -18.2 -17.0 -0.2
51200 4096 +482 +908 -80 +82 -154 +633 cpu: -1.4 -1.2 -19.1 -18.4 -18.6 -0.2
51200 8192 +643 +810 +199 +186 +182 +675 cpu: -0.8 -0.7 -19.8 -19.2 -19.5 -0.7
51200 16384 +684 +724 +275 +323 +269 +724 cpu: -0.6 -0.7 -19.2 -18.6 -19.8 -0.2