[Linaro-mm-sig] Memory region attribute bits and multiple mappings
daniel.vetter at ffwll.ch
Wed Apr 20 06:45:29 UTC 2011
On Wed, Apr 20, 2011 at 8:20 AM, Arnd Bergmann <arnd at arndb.de> wrote:
> I find it hard to believe that flushing cache lines for the entire buffer
> once you are done writing is more expensive than doing uncached accesses
> all the time. That would mean that the CPU is either really good at doing
> uncached writes or that the cache management operations are really bad.
> Has anyone actually measured this?
Random data point from drm/i915: The first approach was to use cacheable
buffer objects and flush them before the gpu uses them (the gpu bypasses cpu
caches). It was a royal pita, even though we've been tracking the dirtiness
of the object on a per-page basis.
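To make the per-page tracking concrete, here is a minimal userspace sketch, not the actual i915 code. The names (tracked_buffer, tracked_write, flush_dirty_pages), the 4096-byte page size and the 64-page limit are all assumptions for illustration, and the flush itself is a stub that only counts calls, standing in for an architecture-specific cache flush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE_BYTES 4096u  /* assumed page size */
#define MAX_PAGES 64u          /* hypothetical limit: one dirty bit per page in a uint64_t */

struct tracked_buffer {
    uint8_t *data;    /* cacheable cpu-side storage */
    size_t npages;    /* number of pages backing the object */
    uint64_t dirty;   /* bit p is set when page p was written by the cpu */
};

/* Stub standing in for a real cache flush of one page; it only counts calls. */
static size_t flushed_pages;
static void flush_page_stub(const uint8_t *page)
{
    (void)page;
    flushed_pages++;
}

/* Write through the cached mapping and remember which pages were touched. */
static void tracked_write(struct tracked_buffer *b, size_t off,
                          const void *src, size_t len)
{
    assert(b->npages <= MAX_PAGES);
    assert(len > 0 && off + len <= b->npages * PAGE_SIZE_BYTES);
    memcpy(b->data + off, src, len);
    size_t first = off / PAGE_SIZE_BYTES;
    size_t last = (off + len - 1) / PAGE_SIZE_BYTES;
    for (size_t p = first; p <= last; p++)
        b->dirty |= (uint64_t)1 << p;
}

/* Before handing the object to the gpu: flush only the dirty pages.
 * Returns the number of pages flushed and clears the dirty mask. */
static size_t flush_dirty_pages(struct tracked_buffer *b)
{
    size_t n = 0;
    for (size_t p = 0; p < b->npages; p++) {
        if (b->dirty & ((uint64_t)1 << p)) {
            flush_page_stub(b->data + p * PAGE_SIZE_BYTES);
            n++;
        }
    }
    b->dirty = 0;
    return n;
}
```

Even in this toy form every cpu write path has to go through tracked_write, which hints at why the bookkeeping gets painful in a real driver.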
The next approach was to use uc mappings of the gtt, which is fully coherent
with gpu access (minus texture/render caches on the gpu, but that's a different
issue). When doing many modify cycles we're using temporary malloced buffers
and memcpy the data to uc when we're done. The problem with that approach is
that we have to shoot down the ptes when switching to gpu access
(one reason is simply correctness, the other that it's easier when the kernel
ensures coherency). For streaming stuff, this turns out to be way too expensive.
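The staging pattern for many modify cycles can be sketched roughly as follows. This is an illustration under assumptions, not driver code: modify_via_staging is a hypothetical name, uc_dst is an ordinary pointer standing in for a uc mapping, and the "modification" is just a byte increment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Do all read-modify-write passes in a cached, malloced staging buffer and
 * touch the (assumed) uc destination only once, with a sequential memcpy.
 * Returns 0 on success, -1 if the staging buffer cannot be allocated. */
static int modify_via_staging(unsigned char *uc_dst, size_t len, unsigned rounds)
{
    assert(uc_dst != NULL && len > 0);

    unsigned char *tmp = calloc(len, 1);  /* cached staging copy, zeroed */
    if (!tmp)
        return -1;

    /* Many modify cycles: cheap, because they hit the cpu cache. */
    for (unsigned r = 0; r < rounds; r++)
        for (size_t i = 0; i < len; i++)
            tmp[i] = (unsigned char)(tmp[i] + 1);

    /* One pass of writes to the uc mapping when we're done. */
    memcpy(uc_dst, tmp, len);
    free(tmp);
    return 0;
}
```

The point is only that the uc side sees len bytes of writes in total, independent of the number of rounds.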
The current approach is to copy over the data with read/write ioctls. On cpus
that support it, we can use special streaming memcpy (in the kernel), making uc
reads about as fast as cached reads. But even for writes it's up to 10
(depending upon circumstances) and actually shows up on opengl benchmarks
when e.g. switching just vbo upload to it.
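For illustration, one way a streaming read can be written on x86 is with SSE4.1 non-temporal loads. This is an assumption about the technique, not necessarily what the i915 ioctl path does; copy_from_uncached is a hypothetical name, and the sketch falls back to plain memcpy when SSE4.1 is not enabled at compile time or the source is not 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>  /* _mm_stream_load_si128 (movntdqa) */
#endif

/* Copy len bytes from src (imagined to be a uc/wc mapping) to dst. */
static void copy_from_uncached(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE4_1__)
    if (((uintptr_t)src & 15u) == 0) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        /* 16 bytes per iteration: non-temporal load, unaligned store. */
        while (len >= 16) {
            __m128i v = _mm_stream_load_si128((__m128i *)s);
            _mm_storeu_si128((__m128i *)d, v);
            s += 16;
            d += 16;
            len -= 16;
        }
        memcpy(d, s, len);  /* copy the remaining tail, if any */
        return;
    }
#endif
    memcpy(dst, src, len);  /* portable fallback */
}
```

On ordinary cached memory the non-temporal load behaves like a normal load, so this only demonstrates the shape of the loop, not the speedup.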
The next approach, which is still in dev, is to use the gpu to blit
between uc and cached. This looks promising for tightly integrating sw fallbacks
(in e.g. X).
We're also going to great lengths to cache objects (to avoid flushing caches and
because putting the mappings into the gtt is costly). Luckily we don't have to
change the pte caching attribute (pat in x86 speak) for that. radeon and nouveau
have to do this and for that reason there's a uc page allocator (and
cache) in drm.
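The idea behind such an allocator cache can be sketched in a few lines. This is a toy userspace model, not the drm implementation: uc_pool, uc_pool_get and uc_pool_put are hypothetical names, malloc stands in for the page allocator, and the expensive caching attribute change is modelled by a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAPACITY 8      /* hypothetical number of pages kept around */
#define POOL_PAGE_SIZE 4096  /* assumed page size */

struct uc_pool {
    void *free_pages[POOL_CAPACITY];  /* pages that are already uc */
    size_t nfree;
    size_t attr_changes;  /* counts simulated caching attribute changes */
};

/* Hand out a uc page, preferring one whose attribute was already changed. */
static void *uc_pool_get(struct uc_pool *p)
{
    assert(p != NULL);
    if (p->nfree > 0)
        return p->free_pages[--p->nfree];  /* reuse: no attribute change */

    void *page = malloc(POOL_PAGE_SIZE);
    if (page)
        p->attr_changes++;  /* stand-in for switching cached -> uc */
    return page;
}

/* Return a page; keep it uc in the pool if there is room. */
static void uc_pool_put(struct uc_pool *p, void *page)
{
    assert(p != NULL && page != NULL);
    if (p->nfree < POOL_CAPACITY) {
        p->free_pages[p->nfree++] = page;
        return;
    }
    p->attr_changes++;  /* stand-in for switching uc -> cached */
    free(page);
}
```

A get/put/get sequence costs one attribute change instead of three, which is the whole benefit of keeping the pool.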
Might be worthwhile to move that up into generic layers - arm seems to need
something like this, too.
daniel.vetter at ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch