Re: [Linaro-mm-sig] Memory region attribute bits and multiple mappings

20 Apr 2011


On Wed, Apr 20, 2011 at 8:20 AM, Arnd Bergmann <arnd@arndb.de> wrote:
...
> I find it hard to believe that flushing cache lines for the entire buffer
> once you are done writing is more expensive than doing uncached accesses
> all the time. That would mean that the CPU is either really good at doing
> uncached writes or that the cache management operations are really bad.
> Has anyone actually measured this?

Random data point from drm/i915: the first approach was to use cacheable
buffer objects and flush them before the gpu uses them (the gpu bypasses cpu
caches). It was a royal pita, even though we've been tracking the dirtiness
of the object on a per-page basis.

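The per-page dirtiness tracking mentioned above could look roughly like the
sketch below. It is an illustration only, not the i915 code: struct bo_dirty,
MAX_PAGES, bo_init(), bo_mark_dirty(), bo_flush_for_gpu() and
fake_flush_page() are made-up names, and the flush is a callback that stands
in for the real cache flush of one page.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAGES 1024 /* hypothetical upper bound on object size, in pages */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical buffer object bookkeeping: one dirty bit per page. */
struct bo_dirty {
    size_t npages;
    unsigned long bits[MAX_PAGES / BITS_PER_WORD];
};

void bo_init(struct bo_dirty *bo, size_t npages)
{
    assert(npages <= MAX_PAGES);
    bo->npages = npages;
    memset(bo->bits, 0, sizeof(bo->bits));
}

/* Record that the cpu wrote to the given page through a cached mapping. */
void bo_mark_dirty(struct bo_dirty *bo, size_t page)
{
    assert(page < bo->npages);
    bo->bits[page / BITS_PER_WORD] |= 1UL << (page % BITS_PER_WORD);
}

/*
 * Before the gpu uses the object, flush only the pages marked dirty and
 * clear their bits. flush_page is an assumed hook for the real per-page
 * cache flush. Returns the number of pages flushed.
 */
size_t bo_flush_for_gpu(struct bo_dirty *bo,
                        void (*flush_page)(size_t page, void *ctx),
                        void *ctx)
{
    size_t flushed = 0;

    for (size_t page = 0; page < bo->npages; page++) {
        unsigned long *word = &bo->bits[page / BITS_PER_WORD];
        unsigned long mask = 1UL << (page % BITS_PER_WORD);

        if (*word & mask) {
            flush_page(page, ctx);
            *word &= ~mask;
            flushed++;
        }
    }
    return flushed;
}

/* Stand-in flush that only counts calls; ctx points to a size_t counter. */
void fake_flush_page(size_t page, void *ctx)
{
    (void)page;
    (*(size_t *)ctx)++;
}
```

A caller marks pages as it writes them and calls bo_flush_for_gpu() once
before submission; pages that were never written are skipped.
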
The next approach was to use uc mappings of the gtt, which is fully coherent
with gpu access (minus texture/render caches on the gpu, but that's a different
issue). When doing many modify cycles we're using temporary malloced buffers
and memcpy the data to uc when we're done. The problem with that approach is
that we have to shoot down the ptes when switching over to gpu access
(one reason is simply correctness, the other that it's easier when the kernel
ensures coherency). For streaming stuff, this turns out to be way too expensive.

The current approach is to copy over the data with read/write ioctls. On cpus
that support it, we can use a special streaming memcpy (in the kernel), making
uc reads about as fast as cached reads. But even for writes it's up to 10 times
faster (depending upon circumstances) and actually shows up on opengl
benchmarks when e.g. switching just vbo upload to it.

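For readers unfamiliar with streaming copies, the sketch below shows the
general idea for the write side in userspace C: non-temporal stores that go
around the cpu cache. It is not the kernel routine referred to above.
streaming_copy() is a made-up name, the SSE2 intrinsics are only used when
the compiler targets x86 with SSE2, and everything else falls back to plain
memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy len bytes from src to dst, using non-temporal stores where possible. */
void streaming_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

#if defined(__SSE2__)
    /* Copy the unaligned head so that d becomes 16-byte aligned. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    /* Main loop: unaligned load, aligned non-temporal store. */
    while (len >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        len -= 16;
    }

    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
#endif

    /* Tail, or the whole copy when SSE2 is not available. */
    memcpy(d, s, len);
}
```

The destination here is ordinary memory, so the sketch only demonstrates the
mechanics; any speedup like the one quoted above depends on the mapping type
and the hardware.
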
The next approach, which is still in development, is to use the gpu to blit
between uncached and cached. This looks promising for tightly integrating sw
fallbacks (in e.g. X).

We're also going to great lengths to cache objects (to avoid flushing caches and
because putting the mappings into the gtt is costly). Luckily we don't have to
change the pte caching attribute (pat in x86 speak) for that. radeon and nouveau
have to do this, and for that reason there's a uc page allocator (and
cache) in drm.
Might be worthwhile to move that up into generic layers - arm seems to need
something like this, too.
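
The idea behind a uc page allocator with a cache can be sketched as a small
pool that keeps already-uncached pages around, so the expensive caching
attribute change is paid only when the pool runs dry or overflows. This is
an illustration, not the drm allocator: struct uc_pool, UC_PAGE_SIZE,
POOL_MAX and the uc_pool_*() functions are made-up names, malloc stands in
for page allocation, and the attribute change is only counted, not performed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define UC_PAGE_SIZE 4096 /* assumed page size */
#define POOL_MAX 16       /* hypothetical cap on pooled pages */

struct uc_pool {
    void *free_pages[POOL_MAX]; /* pages that are already uncached */
    size_t nfree;
    size_t attr_changes;        /* simulated caching attribute changes */
};

/* Get an uncached page, reusing a pooled one when possible. */
void *uc_pool_get(struct uc_pool *pool)
{
    if (pool->nfree > 0)
        return pool->free_pages[--pool->nfree]; /* cheap path */

    void *page = malloc(UC_PAGE_SIZE);
    if (page != NULL)
        pool->attr_changes++; /* stands in for the cached -> uc switch */
    return page;
}

/* Return a page; keep it uncached in the pool if there is room. */
void uc_pool_put(struct uc_pool *pool, void *page)
{
    assert(page != NULL);
    if (pool->nfree < POOL_MAX) {
        pool->free_pages[pool->nfree++] = page;
        return;
    }
    pool->attr_changes++; /* stands in for the uc -> cached switch */
    free(page);
}

/* Drain the pool, switching every page back before freeing it. */
void uc_pool_fini(struct uc_pool *pool)
{
    while (pool->nfree > 0) {
        pool->attr_changes++;
        free(pool->free_pages[--pool->nfree]);
    }
}
```

With a zero-initialized pool, a get/put/get sequence pays for one attribute
change instead of three.
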
-Daniel
-- 
Daniel Vetter
daniel.vetter@ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch
