On Wed, Apr 20, 2011 at 8:20 AM, Arnd Bergmann arnd@arndb.de wrote:
> I find it hard to believe that flushing cache lines for the entire buffer once you are done writing is more expensive than doing uncached accesses all the time. That would mean that the CPU is either really good at doing uncached writes or that the cache management operations are really bad. Has anyone actually measured this?
Random data-point from drm/i915: The first approach was to use cacheable buffer objects and flush them before the gpu uses them (the gpu bypasses cpu caches). It was a royal pita, even though we've been tracking the dirtiness of the object on a per-page basis.
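To make the per-page tracking concrete, here is a minimal userspace sketch, not the actual i915 code. The names `struct buffer_object`, `bo_write`, `bo_flush_for_gpu` and `flush_page` are made up for illustration, and `flush_page` is only a counting stand-in for whatever cache flush the hardware needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT      12
#define PAGE_SIZE_BYTES ((size_t)1 << PAGE_SHIFT)
#define MAX_PAGES       64   /* one dirty bit per page in a 64-bit mask */

/* Hypothetical buffer object: cacheable backing store plus a dirty bitmap. */
struct buffer_object {
    uint8_t *data;
    size_t   npages;
    uint64_t dirty;
};

/* Stand-in for the real per-page cache flush; it only counts calls. */
static size_t pages_flushed;
static void flush_page(const uint8_t *page)
{
    (void)page;
    pages_flushed++;
}

/* CPU write through the cached mapping: copy, then mark touched pages dirty. */
static void bo_write(struct buffer_object *bo, size_t off,
                     const void *src, size_t len)
{
    assert(bo->npages <= MAX_PAGES);
    assert(len > 0 && off + len <= bo->npages * PAGE_SIZE_BYTES);
    memcpy(bo->data + off, src, len);
    for (size_t p = off >> PAGE_SHIFT; p <= (off + len - 1) >> PAGE_SHIFT; p++)
        bo->dirty |= (uint64_t)1 << p;
}

/* Before handing the object to the gpu: flush only the dirty pages.
 * Returns the number of pages flushed. */
static size_t bo_flush_for_gpu(struct buffer_object *bo)
{
    size_t n = 0;
    for (size_t p = 0; p < bo->npages; p++) {
        if (bo->dirty & ((uint64_t)1 << p)) {
            flush_page(bo->data + p * PAGE_SIZE_BYTES);
            n++;
        }
    }
    bo->dirty = 0;
    return n;
}
```

Even in this toy form every write path has to remember to update the bitmap, which hints at why the bookkeeping gets painful in a real driver.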
The next approach was to use uc mappings of the gtt, which is fully coherent with gpu access (minus texture/render caches on the gpu, but that's a different issue). When doing many modify cycles we're using temporary malloced buffers and memcpy the data to uc when we're done. The problem with that approach is that we have to shoot down the ptes when switching over to gpu access (one reason is simply correctness, the other is that it's easier when the kernel ensures coherency). For streaming stuff, this turns out to be way too expensive.
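The staging pattern for many modify cycles looks roughly like the sketch below. This is an illustration under assumptions: `upload_via_staging` is a made-up name, `uc_dst` stands in for a pointer into an uncached gtt mapping (here it is just ordinary memory), and the modify loop is a placeholder for real work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Do all read-modify-write passes in a cached temporary buffer, then push
 * the result to the (assumed uncached) destination with a single memcpy.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
static int upload_via_staging(uint8_t *uc_dst, size_t len, unsigned passes)
{
    uint8_t *tmp = malloc(len);
    if (!tmp)
        return -1;
    memset(tmp, 0, len);

    /* Placeholder workload: every pass reads and rewrites each byte.
     * Done directly on a uc mapping, each of these reads would be slow. */
    for (unsigned p = 0; p < passes; p++)
        for (size_t i = 0; i < len; i++)
            tmp[i] = (uint8_t)(tmp[i] + i + p);

    /* One sequential, write-only pass over the uncached destination. */
    memcpy(uc_dst, tmp, len);
    free(tmp);
    return 0;
}
```

The point of the pattern is that the uc mapping only ever sees one linear write, never a read.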
The current approach is to copy over the data with read/write ioctls. On cpus that support it, we can use a special streaming memcpy (in the kernel), making uc reads about as fast as cached reads. But even for writes it's up to 10 times faster (depending upon circumstances) and actually shows up on opengl benchmarks when e.g. switching just vbo upload to it.
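For a rough idea of what a streaming copy can look like, here is a userspace sketch using SSE2 non-temporal stores. This is an assumption for illustration only: the mail does not say which instructions the kernel path uses, the read side on uc memory would need streaming loads instead, and `streaming_copy` is a made-up name. Without SSE2 the sketch falls back to plain memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy len bytes from src to dst, bypassing the cache on the store side
 * where possible. Non-temporal stores need a 16-byte aligned destination,
 * so an unaligned dst (and any tail shorter than 16 bytes) uses memcpy. */
static void streaming_copy(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
    uint8_t *d = dst;
    const uint8_t *s = src;

    if (((uintptr_t)d & 15) == 0) {
        while (len >= 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)s);
            _mm_stream_si128((__m128i *)d, v);  /* non-temporal store */
            d += 16;
            s += 16;
            len -= 16;
        }
        _mm_sfence();  /* order the streaming stores before later writes */
    }
    memcpy(d, s, len);  /* tail, or everything if dst was unaligned */
#else
    memcpy(dst, src, len);
#endif
}
```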
The next approach which is still in dev is to use the gpu to blit between uncached and cached. This looks promising for tightly integrating sw fallbacks (in e.g. X).
We're also going to great lengths to cache objects (to avoid flushing caches and because putting the mappings into the gtt is costly). Luckily we don't have to change the pte caching attribute (pat in x86 speak) for that. radeon and nouveau have to do this, and for that reason there's a uc page allocator (and cache) in drm.
Might be worthwhile to move that up into generic layers - arm seems to need something like this, too.
-Daniel