Hey all,
While we are working out requirements, I was hoping to get some more information about another related issue that keeps coming up on mailing lists and in discussions.
ARM has stated that if you have the same physical memory mapped with two different sets of attribute bits, you get undefined behavior. I think it's going to be a requirement that some of the memory allocated via the unified memory manager is mapped uncached. However, because all of memory is mapped cached into the unity map at boot, we already have two mappings with different attributes. I want to understand the mechanism of the problem, because none of the solutions I can come up with are particularly nice. I'd also like to know exactly which architectures are affected, since the fix may be costly in performance, memory or both. Can someone at ARM explain to me why this causes a problem? I have a theory, but it's mostly a guess. I especially want to understand if it's still a problem if we never access the memory via the mapping in the unity map. I know speculative prefetching is part of the issue, so I assume older architectures without that feature don't exhibit this behaviour.
If we really need all mappings of physical memory to have the same cache attribute bits, I see three workarounds:
1- set aside memory at boot that never gets mapped by the kernel (a rough sketch of this follows the list of options). The unified memory manager can then ensure there's only one mapping at a time. Obvious drawbacks here are that you have to statically partition your system into memory you want accessible to the unified memory manager and memory you don't. This may not be that big a deal, since most current solutions, pmem, cmem, et al, basically do this. I can say that on Android devices running on a high resolution display (720p and above) we're easily talking about needing 256M of memory or more to dedicate to this.
2- use highmem pages only for the unified memory manager. Highmem pages only get mapped on demand. This has some performance costs when the kernel allocates other metadata in highmem. Most embedded systems still don't have enough memory to need highmem, though I'm guessing that'll follow the current trend and shift in the next couple of years.
3- fix up the unity mapping so the attribute bits match those desired by the unified memory manager. This could be done by removing pages from the unity map. It's complicated by the fact that the unity map makes use of large pages, sections and supersections to reduce TLB pressure. I don't think this is impossible if we restrict the set of contexts from which it can happen, but I'm imagining that we will also need to maintain some kind of pool of memory we've moved from cached to uncached since the process is likely to be expensive. Quite likely we will have to iterate over processes and update all their top level page tables.
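To make option 1 a bit more concrete, here's roughly what I have in mind for the boot-time carve-out. This is an untested sketch, and CARVEOUT_BASE/CARVEOUT_SIZE are just placeholders for a platform-specific region:

#include <linux/memblock.h>
#include <linux/init.h>

/* Placeholders for a platform-specific region; real values would come
 * from the board file or the bootloader. */
#define CARVEOUT_BASE   0x88000000UL
#define CARVEOUT_SIZE   (256UL << 20)   /* 256M, per the numbers above */

static void __init carveout_reserve(void)
{
        /*
         * Removing the range before the kernel builds its linear (unity)
         * map means no cached alias ever exists; the unified memory
         * manager later creates the only mapping, cached or uncached as
         * it sees fit.
         */
        memblock_remove(CARVEOUT_BASE, CARVEOUT_SIZE);
}

On ARM this would be called from the machine's early reserve hook, and the unified memory manager would later ioremap() or vmap() pieces of the region with whatever attributes it wants.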
These all have drawbacks, so I'd like to really understand the problem before pursuing them. Can the linaro folks find someone who can explain the problem in more detail?
Thanks, Rebecca
Rebecca,
We have some of the same issues with our Tiler implementation. We want to use uncached allocations to simplify coherency concerns and performance issues with flush/invalidate. However, as you stated, there really is not a solution out there that gives us what we need without serious drawbacks.
Looking around at the other architectures, I noticed that the ia64 arch has an uncached allocator where they convert pages from cached to uncached. If we had something similar it would keep us from having to resort to option 1 or 2 below. The ia64 uncached allocator is located in arch/ia64/kernel/uncached.c.
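Just to give an idea of the general shape of that approach, a heavily simplified sketch (this is not the actual ia64 code, and arch_make_page_uncached() is a made-up placeholder for the architecture-specific cache flush and attribute change) would be:

#include <linux/genalloc.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Placeholder: flush the page out of all cache levels and rewrite its
 * kernel mapping with uncached attributes. */
extern int arch_make_page_uncached(struct page *page);

static struct gen_pool *uc_pool;   /* created elsewhere with gen_pool_create() */

static int uc_pool_grow(void)
{
        struct page *page = alloc_page(GFP_KERNEL);

        if (!page)
                return -ENOMEM;
        if (arch_make_page_uncached(page)) {
                __free_page(page);
                return -EIO;
        }
        /* Hand the now-uncached page to a simple pool allocator. */
        return gen_pool_add(uc_pool, (unsigned long)page_address(page),
                            PAGE_SIZE, -1);
}

/* Callers would then grab uncached memory with gen_pool_alloc(uc_pool, size). */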
Regards,
Andy
On Tue, Apr 19, 2011 at 3:06 PM, Rebecca Schultz Zavin rebecca@android.com wrote:
<snip>
I took a quick look at that, it seems to implement option 1, put aside memory at boot. Once we've got memory not in the unity map, I think it's pretty easy to stick an allocator in front of it, and as long as there's only the one mapping, you can safely swap that mapping between cached and uncached.
Rebecca
On Tue, Apr 19, 2011 at 1:25 PM, Gross, Andy andy.gross@ti.com wrote:
<snip>
On 19.04.2011 22:06, Rebecca Schultz Zavin wrote:
ARM has stated that if you have the same physical memory mapped with two different sets of attribute bits you get undefined behavior.
I didn't know this was undefined behavior. But we at the Rockbox project do this on several targets without problems (cached and uncached, e.g. for simplified DMA-related code).
Best regards.
Until recently the dma api in the kernel was doing it too. Apparently it's related to speculative prefetching, so as far as my understanding goes, it's not an issue on Cortex-A8 where the feature isn't present. However, there have been a couple of cases of data corruption due to it on A9. It's also possible that the architecture licensees' implementations -- like the qualcomm snapdragons -- don't actually have a bug when this occurs. I'm trying to collect the info on exactly which ones exhibit a problem. Be aware though, it's a race condition; just because you never saw it doesn't mean you couldn't have.
Rebecca
On Tue, Apr 19, 2011 at 1:53 PM, Thomas Martitz <thomas.martitz@student.htw-berlin.de> wrote:
<snip>
On 19.04.2011 23:11, Rebecca Schultz Zavin wrote:
Until recently the dma api in the kernel was doing it too. Apparently it's related to speculative prefetching, so as far as my understand goes, not an issue on Cortex-A8 where the feature isn't present. However, there have been a couple of cases of data corruption due to it on A9.
A8 and A9 are too new for us :) We deal with ARMv4-v6 mostly.
Be aware though, it's a race condition, just because you never saw it doesn't mean you couldn't have.
That's true.
Best regards.
On Tue, 19 Apr 2011, Thomas Martitz wrote:
On 19.04.2011 23:11, Rebecca Schultz Zavin wrote:
Until recently the dma api in the kernel was doing it too. Apparently it's related to speculative prefetching, so as far as my understand goes, not an issue on Cortex-A8 where the feature isn't present. However, there have been a couple of cases of data corruption due to it on A9.
A8 and A9 are too new for us :) We deal with ARMv4-v6 mostly.
Anything before ARMv6 is fine and you don't have to worry.
I know at least one ARMv6 implementation (the Marvell one) where speculative prefetching might be expected. No idea about the ARMv6 cores from ARM Ltd. We may generally assume speculative prefetching on all ARMv7 implementations.
Nicolas
On Tue, Apr 19 2011, Rebecca Schultz Zavin wrote:
Until recently the dma api in the kernel was doing it too. Apparently it's related to speculative prefetching, so as far as my understand goes, not an issue on Cortex-A8 where the feature isn't present. However, there have been a couple of cases of data corruption due to it on A9. It's also possible that the architecture licensee's implementations -- like the qualcomm snapdragons -- don't actually have a bug when this occurs. I'm trying to collect the info on exactly which exhibit a problem. Be aware though, it's a race condition, just because you never saw it doesn't mean you couldn't have.
Newer Qualcomm cores will have a problem with this as well.
David
On Tuesday 19 April 2011 22:06:50 Rebecca Schultz Zavin wrote:
Hey all,
While we are working out requirements, I was hoping to get some more information about another related issue that keeps coming up on mailing lists and in discussions.
Thanks for the summary and getting this started!
ARM has stated that if you have the same physical memory mapped with two different sets of attribute bits you get undefined behavior. I think it's going to be a requirement that some of the memory allocated via the unified memory manager is mapped uncached.
This may be a stupid question, but do we have an agreement that it is actually a requirement to have uncached mappings? With the streaming DMA mapping API, it should be possible to work around noncoherent DMA by flushing the caches at the right times, which probably results in better performance than simply doing noncached mappings. What is the specific requirement for noncached memory regions?
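For reference, the pattern I mean looks roughly like the sketch below (dev, buf and len are placeholders); the cache maintenance is limited to the buffer in question and to the points where ownership actually changes:

#include <linux/dma-mapping.h>

static int send_buffer(struct device *dev, void *buf, size_t len)
{
        dma_addr_t dma;

        /* CPU fills the normal cached buffer first ... */

        /* ... then ownership moves to the device; the map call cleans
         * the cache lines covering buf on a noncoherent system. */
        dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, dma))
                return -ENOMEM;

        /* device DMA runs here */

        dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
        return 0;
}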
However, because all of memory is mapped cached into the unity map at boot, we already have two mappings with different attributes. I want to understand the mechanism of the problem, because none of the solutions I can come up with are particularly nice. I'd also like to know exactly which architectures are affected, since the fix may be costly in performance, memory or both. Can someone at ARM explain to me why this causes a problem. I have a theory, but it's mostly a guess. I especially want to understand if it's still a problem if we never access the memory via the mapping in the unity map. I know speculative prefetching is part of the issue, so I assume older architectures without that feature don't exhibit this behaviour
In general (not talking about ARM in particular), Linux does not support mapping RAM pages with conflicting cache attributes. E.g. on certain powerpc CPUs, you get a checkstop if you try to bypass the cache when there is already an active cache line for it.
This is a variant of the cache aliasing problem we see with virtually indexed caches: You may end up with multiple cache lines for the same physical address, with different contents. The results are unpredictable, so most CPU architectures explicitly forbid this.
If we really need all mappings of physical memory to have the same cache attribute bits, I see three workarounds:
1- set aside memory at boot that never gets mapped by the kernel. The unified memory manager can then ensure there's only one mapping at a time. Obvious drawbacks here are that you have to statically partition your system into memory you want accessible to the unified memory manager and memory you don't. This may not be that big a deal, since most current solutions, pmem, cmem, et al basically do this. I can say that on Android devices running on a high resolution display (720p and above) we're easily talking about needing 256M of memory or more to dedicate to this.
Right, I believe this is what people generally do to avoid the problem, but I wouldn't call it a solution.
2- use highmem pages only for the unified memory manager. Highmem pages only get mapped on demand. This has some performance costs when the kernel allocates other metadata in highmem. Most embedded systems still don't have enough memory to need highmem, though I'm guessing that'll follow the current trend and shift in the next couple of years.
We are very close to needing highmem on a lot of systems, and in Linaro we generally assume that it's there. For instance, Acer has announced an Android tablet that has a full gigabyte of RAM, so they are most likely using highmem already.
There is a significant overhead in simply enabling highmem on a system where you don't need it, but it also makes it possible to use the memory for page cache that would otherwise be wasted when there is no active user of the reserved memory.
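To illustrate why highmem helps here, a minimal sketch (assuming the pages come from alloc_page(GFP_HIGHUSER) and are tracked as a struct page array): a highmem page has no permanent kernel mapping, so the mapping created on demand is the only one and can carry whatever attributes we want.

#include <linux/highmem.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>

static void *map_buffer_uncached(struct page **pages, unsigned int count)
{
        /* No cached alias exists in the linear map for highmem pages,
         * so a non-cacheable vmap() does not conflict with anything. */
        return vmap(pages, count, VM_MAP, pgprot_noncached(PAGE_KERNEL));
}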
3- fix up the unity mapping so the attribute bits match those desired by the unified memory manger. This could be done by removing pages from the unity map. It's complicated by the fact that the unity map makes use of large pages, sections and supersections to reduce tlb pressure. I don't think this is impossible if we restrict the set of contexts from which it can happen, but I'm imagining that we will also need to maintain some kind of pool of memory we've moved from cached to uncached since the process is likely to be expensive. Quite likely we will have to iterate over processes and update all their top level page tables.
Would it get simpler if we only allow entire supersections to be moved into the uncached memory allocator?
Arnd
On Tue, Apr 19, 2011 at 2:23 PM, Arnd Bergmann arnd@arndb.de wrote:
<snip>
This may be a stupid question, but do we have an agreement that it is actually a requirement to have uncached mappings? With the streaming DMA mapping API, it should be possible to work around noncoherent DMA by flushing the caches at the right times, which probably results in better performance than simply doing noncached mappings. What is the specific requirement for noncached memory regions?
That was my original plan, but our graphics folks and those at our partner companies basically have me convinced that the common case is for userspace to stream data into memory, say copying an image into a texture, and never read from it or touch it again. The alternative will mean a lot of cache flushes for small memory regions, which in and of itself becomes a performance problem. I think we want to optimize for this case, rather than the much less likely case of read-modify-write to these buffers.
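(For concreteness, the kind of userspace mapping I mean is roughly the sketch below -- buffer_pfn is a placeholder for whatever the allocator tracked for this buffer -- and in practice we'd use write-combine rather than fully uncached so the streaming writes still get merged:)

#include <linux/fs.h>
#include <linux/mm.h>

static unsigned long buffer_pfn;        /* placeholder: set by the allocator */

static int buffer_mmap(struct file *file, struct vm_area_struct *vma)
{
        /* Write-combined: no cache pollution, but writes still coalesce
         * in the write buffer, which is what the texture/vertex upload
         * path wants. */
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

        return remap_pfn_range(vma, vma->vm_start, buffer_pfn,
                               vma->vm_end - vma->vm_start,
                               vma->vm_page_prot);
}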
<snip>
In general (not talking about ARM in particular), Linux does not support mapping RAM pages with conflicting cache attributes. E.g. on certain powerpc CPUs, you get a checkstop if you try to bypass the cache when there is already an active cache line for it.
This is a variant of the cache aliasing problem we see with virtually indexed caches: You may end up with multiple cache lines for the same physical address, with different contents. The results are unpredictable, so most CPU architectures explicitly forbid this.
I think the extra wrinkle here is that the presence of the unity mapping as cached, even if you never access it, causes a problem. I totally understand why you wouldn't want to access mappings with different attributes, but just having them hang around seems like it shouldn't in general be a problem. How does powerpc handle it when you need an uncached page for dma?
<snip>
Would it get simpler if we only allow entire supersections to be moved into the uncached memory allocator?
I've thought about it. It adds the requirement that we need to be able to make a relatively high-order, 16M (supersection-sized) allocation at runtime, which is one of the problems many of these memory managers were originally introduced to solve. Even if we could solve that cleanly with something like compaction, I'm not sure just modifying those attribute bits would be that much easier than rewriting the page table for that section. Either way we would want some way to put the cache attributes back under memory pressure, or we're back to solution 1. I have some idea how to modify the ARM page tables in hardware to do this, but figuring out how that connects with the page tables in linux to do this safely at runtime is where things will get fun.
Rebecca
On Tue, Apr 19, 2011 at 4:37 PM, Rebecca Schultz Zavin rebecca@android.com wrote:
On Tue, Apr 19, 2011 at 2:23 PM, Arnd Bergmann arnd@arndb.de wrote:
This may be a stupid question, but do we have an agreement that it is actually a requirement to have uncached mappings? With the streaming DMA mapping API, it should be possible to work around noncoherent DMA by flushing the caches at the right times, which probably results in better performance than simply doing noncached mappings. What is the specific requirement for noncached memory regions?
That was my original plan, but our graphics folks and those at our partner companies basically have me convinced that the common case is for userspace to stream data into memory, say copying an image into a texture, and never read from it or touch it again. The alternative will mean a lot of cache flushes for small memory regions, in and of itself this becomes a performance problem. I think we want to optimize for this case, rather than the much less likely case of read-modify-write to these buffers.
This was why I made the comment earlier about wanting fault handlers for these buffers, so we could track if the buffer is dirty, and potentially what parts of the buffer are dirty.. so in the optimal case where no sw is touching the buffer you don't incur the penalty, but at the same time you still get decent performance in the case where userspace does want to touch the buffer. So I think in most cases where people are scared of cached, it is actually quite ok.
That said, in case of tiled buffers on o4, because of the virtual -> tiled physical -> physical address translation that goes on, some small distance of prefetch might be touching something far away.. or in hyperspace.. there are ways to deal with that so I am not ready to rule out cached.. but I'm a bit less comfortable with cached for these type of buffers without spending some time to think it through.. (we avoided the issue so far by just making them uncached).
BR, -R
On Tuesday 19 April 2011 23:37:48 Rebecca Schultz Zavin wrote:
On Tue, Apr 19, 2011 at 2:23 PM, Arnd Bergmann arnd@arndb.de wrote:
This may be a stupid question, but do we have an agreement that it is actually a requirement to have uncached mappings? With the streaming DMA mapping API, it should be possible to work around noncoherent DMA by flushing the caches at the right times, which probably results in better performance than simply doing noncached mappings. What is the specific requirement for noncached memory regions?
That was my original plan, but our graphics folks and those at our partner companies basically have me convinced that the common case is for userspace to stream data into memory, say copying an image into a texture, and never read from it or touch it again. The alternative will mean a lot of cache flushes for small memory regions, in and of itself this becomes a performance problem. I think we want to optimize for this case, rather than the much less likely case of read-modify-write to these buffers.
I find it hard to believe that flushing cache lines for the entire buffer once you are done writing is more expensive than doing uncached accesses all the time. That would mean that the CPU is either really good at doing uncached writes or that the cache management operations are really bad. Has anyone actually measured this?
This is a variant of the cache aliasing problem we see with virtually indexed caches: You may end up with multiple cache lines for the same physical address, with different contents. The results are unpredictable, so most CPU architectures explicitly forbid this.
I think the extra wrinkle here is the presence of the unity mapping as cached, even if you never access it, causes a problem. I totally understand why you wouldn't want to access mappings with different attributes, but just having them hang around seems like it shouldn't in general be a problem. How does powerpc handle it when you need an uncached page for dma?
You don't need uncached pages for DMA; on powerpc, Linux only supports systems that are coherent, which has pretty much solved the problem by forcing hardware designers to do it the easy way rather than requiring the software to add extra overhead to work around it.
Arnd
On Wed, Apr 20, 2011 at 8:20 AM, Arnd Bergmann arnd@arndb.de wrote:
I find it hard to believe that flushing cache lines for the entire buffer once you are done writing is more expensive than doing uncached accesses all the time. That would mean that the CPU is either really good at doing uncached writes or that the cache management operations are really bad. Has anyone actually measured this?
Random data-point from drm/i915: The first approach was to use cacheable buffer objects and flush them before the gpu uses them (gpu bypasses cpu caches). It was a royal pita, even though we were tracking the dirtiness of the object on a per-page basis.
The next approach was to use uc mappings of the gtt, which is fully coherent with gpu access (minus texture/render caches on the gpu, but that's a different issue). When doing many modify cycles we're using temporary malloced buffers and memcpy the data to uc when we're done. The problem with that approach is that we have to shoot down the ptes when switching to gpu access (one reason is simply correctness, the other that it's easier when the kernel ensures coherency). For streaming stuff, this turns out to be way too expensive.
The current approach is to copy over the data with read/write ioctls. On cpus that support it, we can use a special streaming memcpy (in the kernel), making uc reads about as fast as cached reads. But even for writes it's up to 10 times faster (depending upon circumstances) and actually shows up on opengl benchmarks when e.g. switching just vbo upload to it.
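(If anyone's curious, the idea looks roughly like the user-space sketch below -- not our actual kernel code -- using sse4.1 non-temporal loads; it assumes 16-byte aligned pointers, a length that's a multiple of 64 and a cpu with sse4.1:)

#include <immintrin.h>
#include <stddef.h>

/* Copy out of a write-combined (uncached) mapping using streaming loads,
 * which fetch WC memory in bursts instead of one slow uncached read at a
 * time.  Alignment/tail handling and the cpuid check are omitted. */
static void copy_from_wc(void *dst, const void *src, size_t len)
{
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t i;

        for (i = 0; i < len / 16; i += 4) {
                __m128i r0 = _mm_stream_load_si128((__m128i *)(s + i + 0));
                __m128i r1 = _mm_stream_load_si128((__m128i *)(s + i + 1));
                __m128i r2 = _mm_stream_load_si128((__m128i *)(s + i + 2));
                __m128i r3 = _mm_stream_load_si128((__m128i *)(s + i + 3));
                _mm_store_si128(d + i + 0, r0);
                _mm_store_si128(d + i + 1, r1);
                _mm_store_si128(d + i + 2, r2);
                _mm_store_si128(d + i + 3, r3);
        }
}

(Built with -msse4.1; a real version falls back to plain memcpy when the cpu can't do this.)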
The next approach which is still in dev is to use the gpu to blit between uncached and cached. This looks promising for tightly integrating sw fallbacks (in e.g. X).
We're also going to great lengths to cache objects (to avoid flushing caches and because putting the mappings into the gtt is costly). Luckily we don't have to change the pte caching attribute (pat in x86 speak) for that. radeon and nouveau have to do this, and for that reason there's a uc page allocator (and cache) in drm.
Might be worthwhile to move that up into generic layers - arm seems to need something like this, too. -Daniel
This is a variant of the cache aliasing problem we see with virtually indexed caches: You may end up with multiple cache lines for the same physical address, with different contents. The results are unpredictable, so most CPU architectures explicitly forbid this.
I think the extra wrinkle here is the presence of the unity mapping as cached, even if you never access it, causes a problem. I totally understand why you wouldn't want to access mappings with different attributes, but just having them hang around seems like it shouldn't in general be a problem. How does powerpc handle it when you need an uncached page for dma?
You don't need uncached pages for DMA, Linux only supports systems that are coherent on powerpc, which has pretty much solved the problem by forcing hardware designers to do it the easy way, rather than requiring the software add extra overhead to work around it.
That's not entirely true; I'm nearly certain I've spent a bit of time with Ben tracking down non-coherent PCI issues on powerpc.
Dave.
On Tuesday 19 April 2011 23:37:48 Rebecca Schultz Zavin wrote:
<snip>
That was my original plan, but our graphics folks and those at our partner companies basically have me convinced that the common case is for userspace to stream data into memory, say copying an image into a texture, and never read from it or touch it again. The alternative will mean a lot of cache flushes for small memory regions, in and of itself this becomes a performance problem. I think we want to optimize for this case, rather than the much less likely case of read-modify-write to these buffers.
Another important use case is image/video capture (V4L2 API), where the device writes to memory using DMA, and applications read from the memory. Depending on how applications process captured data, an uncached mapping might impact performance negatively.
We also need to keep in mind that buffers might be passed between different cores (when compressing video data on a DSP for instance). Even if we map the memory uncached on the cores running Linux, other cores might still cache the data by default. This will need to be addressed.
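(For context, the user-space side of such a capture path looks roughly like the sketch below; whether the application's reads are then fast or painful depends entirely on the cache attributes the driver picked for these buffers.)

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/videodev2.h>

/* Request driver-allocated capture buffers and map one into the process. */
static void *map_capture_buffer(int fd, unsigned int index, size_t *len)
{
        struct v4l2_requestbuffers req;
        struct v4l2_buffer buf;

        memset(&req, 0, sizeof(req));
        req.count = 4;
        req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        req.memory = V4L2_MEMORY_MMAP;
        if (ioctl(fd, VIDIOC_REQBUFS, &req) < 0)
                return MAP_FAILED;

        memset(&buf, 0, sizeof(buf));
        buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index = index;
        if (ioctl(fd, VIDIOC_QUERYBUF, &buf) < 0)
                return MAP_FAILED;

        *len = buf.length;
        return mmap(NULL, buf.length, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, buf.m.offset);
}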
<snip>
That was my original plan, but our graphics folks and those at our partner companies basically have me convinced that the common case is for userspace to stream data into memory, say copying an image into a texture, and never read from it or touch it again. The alternative will mean a lot of cache flushes for small memory regions, in and of itself this becomes a performance problem. I think we want to optimize for this case, rather than the much less likely case of read-modify-write to these buffers.
[TC] I'm not sure I completely agree with this being a use case. From my understanding, the problem we're trying to solve here isn't a generic graphics memory manager but rather a memory manager to facilitate cross-device sharing of 2D image buffers. GPU drivers will still have their own allocators for textures which will probably be in a tiled or other proprietary format no other device can understand anyway. The use case where we (GPU driver vendors) might want uncached memory is for one-time texture upload. If we have a texture we know we're only going to write to once (with the CPU), there is no benefit in using cached memory. In fact, there's a potential performance drop if you used cached memory because the texture upload will cause useful cache lines to be evicted and replaced with useless lines for the texture. However, I don't see any use case where we'd then want to share that CPU-uploaded texture with another device, in which case we would use our own (uncached) allocator, not this "cross-device" allocator. There's also a school of thought (so I'm told) that for one-time texture upload you still want to use cached memory because more modern cores have smaller write buffers (so you want to use the cache for better combining of writes) and are good at detecting large sequential writes and thus don't use the whole cache for those anyway. So, other than one-time texture upload, are there other graphics use cases you know of where it might be more optimal to use uncached memory? What about video decoder use-cases?
The memory manager should be used for all internal GPU memory management as well if desired.
If we have any hope of ever getting open source ARM GPU drivers upstream, they can't all just go reinventing the wheel. They need to be based on a common layer.
Dave.
I've been buried all day, but here's my response to a bunch of the discussion points today:
I definitely don't think it makes sense to solve part of the problem and then leave the gpus to require their own allocator for their common case. Why solve the allocator problem twice? Also, we want to be able to support passing of these buffers between the camera, gpu, video decoder etc. It'd be much easier if they were from a common source.
The android team's graphics folks, and their counterparts at intel, imagination tech, nvidia, qualcomm and arm, have all told me that they need a method for mapping buffers uncached to userspace. The common case for this is to write vertices, textures etc to these buffers once and never touch them again. This may happen several (or even several tens or more) times per frame. My experience with cache flushes on ARM architectures matches Marek's. Typically write combine makes streaming writes really really fast, and on several SoCs we've found it cheaper to flush the whole cache than to flush by line. Clearly this impacts the rest of system performance, not to mention that with a couple of large textures you've totally blown your caches for the rest of the system.
Once we've solved the multiple mapping problem, it becomes quite easy to support both cached AND uncached accesses from the cpu, so I think that covers the cases where it's actually desirable to have a cached mapping -- software rendering, processing frames from a camera etc.
I think the issues of contiguous memory allocation and cache attributes are totally separate, except that in cases where large contiguous regions are necessary -- the problem Qualcomm's pmem and friends were written to solve -- you pretty much end up needing to put aside a pool of buffers at boot anyway in order to guarantee the availability of large order allocations. Once you've done that, the attributes problem goes away, since your memory is not in the direct map.
On Wed, Apr 20, 2011 at 1:34 PM, Dave Airlie airlied@gmail.com wrote:
<snip>
On Wednesday 20 April 2011 23:52:56 Rebecca Schultz Zavin wrote:
The android team's graphics folks, the their counterparts at intel, imagination tech, nvidia, qualcomm and arm have all told me that they need a method for mapping buffers uncached to userspace. The common case for this is to write vertexes, textures etc to these buffers once and never touch them again. This may happen several (or even several 10s or more) of times per frame. My experience with cache flushes on ARM architectures matches Marek's.
Ok, thanks for the confirmation on that, both of you.
Is the requirement only to have the uncached mapping in user space then while it does not need to be mapped into the kernel, or is there also a requirement to map the same buffers into the kernel address space?
We already allow very limited amounts of buffers to be mapped uncached into the kernel using dma_alloc_coherent(), but that is typically for stuff like descriptor tables where we have at most a few pages per device.
On coherent systems (arch_is_coherent()), we do return a cacheable mapping for dma_alloc_coherent, and I assume that you wouldn't want that, because it prevents mapping the same buffer into user space as uncached.
I suppose what we want here is a way to get a scatterlist for memory that has a linear mapping on a given device and is not mapped into the kernel address space at all. This is something that is not possible with the dma-mapping interface yet. In particular, using the streaming mapping API (dma_map_sg) currently requires that the page is mapped into the linear mapping, because the kernel wants to do the appropriate cache flushes that we are trying to avoid.
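Something along these lines is what I have in mind, as a sketch using today's calls (which is exactly where the limitation shows up, since dma_map_sg() wants to do its cache maintenance through the linear mapping):

#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Hand a page array to a device through the scatterlist interface;
 * returns the number of mapped entries, 0 on failure. */
static int map_pages_for_device(struct device *dev, struct page **pages,
                                unsigned int count, struct scatterlist *sgl)
{
        unsigned int i;

        sg_init_table(sgl, count);
        for (i = 0; i < count; i++)
                sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);

        return dma_map_sg(dev, sgl, count, DMA_TO_DEVICE);
}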
I think the issue of contiguous memory allocation and cache attributes are totally separate, except that in cases where large contiguous regions are necessary -- the problem qualcom's pmem and friends were written to solve -- you pretty much end up needing to put aside a pool of buffers at boot anyway in order to guarantee the availability of large order allocations. Once you've done that, the attributes problem goes away since your memory is not in the direct map.
I don't think we strictly have to completely hide that pool from other users, if we decide to use the highmem approach. All pages in highmem today should be movable, so we can free up contiguous space that is not mapped by paging out the data that is there.
Arnd
On Thu, 21 Apr 2011 09:30:10 +0200 Arnd Bergmann arnd@arndb.de wrote:
<snip>
Is the requirement only to have the uncached mapping in user space then while it does not need to be mapped into the kernel, or is there also a requirement to map the same buffers into the kernel address space?
We like to be able to write to buffers mapped into userspace, preferably with a write-combine attribute, for write-mostly data like vertices, textures, and other state buffers. On Intel at least, we may end up binding objects at a different address than when the client ran before, so we have a relocation pass that fixes up address handles passed in by userspace. That happens in kernel space, so we need a mapping there once userspace is done with the buffers and has dispatched the rendering command buffer.
But the main issue here is to avoid the overhead of repeated cache flushes. If the graphics stack is doing its job right, it shouldn't generally be reading gfx memory, so making sure the fast path just deals with already uncached pages is important. In some cases (e.g. software fallback or input detection on transformed objects) reading back from the render target is unavoidable, but those aren't the steady state, fast path cases we want to optimize for.
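(Roughly, the kernel-side fixup amounts to the sketch below -- simplified, with made-up parameters -- where all we need is a transient kernel mapping of the object's pages:)

#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <linux/types.h>

/* Patch one relocation entry inside a buffer backed by 'pages'. */
static void patch_reloc(struct page **pages, unsigned int npages,
                        unsigned long offset, u32 new_handle)
{
        /* A cached mapping is fine here as long as we flush (or the
         * buffer is already coherent) before the gpu reads it back. */
        u32 *vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);

        if (!vaddr)
                return;
        vaddr[offset / sizeof(u32)] = new_handle;
        vunmap(vaddr);
}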
At least with Mali, we use user-space sub-allocators. So we'll acquire hunks of memory from the kernel, map it into the GPU context's virtual address space and let the user-space driver divide it up into however many buffers it needs. So I guess we could still acquire the hunks of memory from whatever kernel-space allocator we want, but fundamentally, kernel-space will never see what individual buffers that memory is divided up into. I'm also struggling to understand how we could securely share a buffer after it has been allocated given we'd have to share entire pages. It's quite likely a buffer we want to share will have data in pages used by other buffers which we don't want to share? Fundamentally, we need to know at allocation time if a buffer is intended to be shared with another device or process so we can make sure the allocation gets its own pages.
Thinking about the uncached issue some more, I think I've convinced myself that it should be a requirement. We'll have to support older (pre-A9) cores for some time where using uncached memory provides better performance (which seems to be the consensus opinion here?).
Cheers,
Tom
From: rschultz@google.com [mailto:rschultz@google.com] On Behalf Of Rebecca Schultz Zavin Sent: 20 April 2011 22:53 To: Dave Airlie Cc: Tom Cooksey; Arnd Bergmann; linaro-mm-sig@lists.linaro.org Subject: Re: [Linaro-mm-sig] Memory region attribute bits and multiple mappings
I've been buried all day, but here's my response to a bunch of the discussion points today:
I definitely don't think it makes sense to solve part of the problem and then leave the gpu's to require their own allocator for their common case. Why solve the allocator problem twice? Also, we want to be able to support passing of these buffers between the camera, gpu, video decoder etc. It'd be much easier if they were from a common source.
The android team's graphics folks, the their counterparts at intel, imagination tech, nvidia, qualcomm and arm have all told me that they need a method for mapping buffers uncached to userspace. The common case for this is to write vertexes, textures etc to these buffers once and never touch them again. This may happen several (or even several 10s or more) of times per frame. My experience with cache flushes on ARM architectures matches Marek's. Typically write combine makes streaming writes really really fast, and on several SOC's we've found it cheaper to flush the whole cache than to flush by line. Clearly this impacts the rest of system performance, not to mention the fact that a couple of large textures and you've totally blown your caches for the rest of the system.
Once we've solved the multiple mapping problem, it becomes quite easy to support both cached AND uncached accesses from the cpu, so I think that covers the cases where it's actually desirable to have a cached mapping -- software rendering, processing frames from a camera etc.
I think the issue of contiguous memory allocation and cache attributes are totally separate, except that in cases where large contiguous regions are necessary -- the problem qualcom's pmem and friends were written to solve -- you pretty much end up needing to put aside a pool of buffers at boot anyway in order to guarantee the availability of large order allocations. Once you've done that, the attributes problem goes away since your memory is not in the direct map.
On Wed, Apr 20, 2011 at 1:34 PM, Dave Airlie <airlied@gmail.commailto:airlied@gmail.com> wrote:
[TC] I'm not sure I completely agree with this being a use case. From my understanding, the problem we're trying to solve here isn't a generic graphics memory manager but rather a memory manager to facilitate cross-device sharing of 2D image buffers. GPU drivers will still have their own allocators for textures which will probably be in a tiled or other proprietary format no other device can understand anyway. The use case where we (GPU driver vendors) might want uncached memory is for one-time texture upload. If we have a texture we know we're only going to write to once (with the CPU), there is no benefit in using cached memory. In fact, there's a potential performance drop if you used cached memory because the texture upload will cause useful cache lines to be evicted and replaced with useless lines for the texture. However, I don't see any use case where we'd then want to share that CPU-uploaded texture with another device, in which case we would use our own (uncached) allocator, not this "cross-device" allocator. There's also a school of thought (so I'm told) that for one-time texture upload you still want to use cached memory because more modern cores have smaller write buffers (so you want to use the cache for better combining of writes) and are good at detecting large sequential writes and thus don't use the whole cache for those anyway. So, other than one-time texture upload, are there other graphics use cases you know of where it might be more optimal to use uncached memory? What about video decoder use-cases?
The memory manager should be used for all internal GPU memory management as well, if desired.
If we have any hope of ever getting open source ARM GPU drivers upstream, they can't all just go reinventing the wheel. They need to be based on a common layer.
Dave.
On Thu, Apr 21, 2011 at 3:33 AM, Tom Cooksey Tom.Cooksey@arm.com wrote:
At least with Mali, we use user-space sub-allocators. So we’ll acquire hunks of memory from the kernel, map it into the GPU context’s virtual address space and let the user-space driver divide it up into however many buffers it needs. So I guess we could still acquire the hunks of memory from whatever kernel-space allocator we want, but fundamentally, kernel-space will never see what individual buffers that memory is divided up into. I’m also struggling to understand how we could securely share a buffer after it has been allocated given we’d have to share entire pages. It’s quite likely a buffer we want to share will have data in pages used by other buffers which we don’t want to share? Fundamentally, we need to know at allocation time if a buffer is intended to be shared with another device or process so we can make sure the allocation gets its own pages.
I know lots of gpu driver code suballocates; I expect folks will do exactly what you've suggested: allocate a chunk and then manage it elsewhere.
I don't think there's any way to manage security at less than a page boundary. Do you have an mmu for your gpu with pages smaller than that? I think if you do that kind of thing you are breaking the security model and there's no way around it. Theoretically we could manage security at less than allocation granularity -- I'm thinking you'd mmap at an offset. I implemented that for PMEM and it's a bit of a metadata management mess, but it's a possible extension. Still, it'd be at a page boundary.
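A rough sketch of the offset check that idea implies (names hypothetical): each buffer is page aligned, and an mmap request is only honoured if it stays inside the buffer it was made against, so nothing finer than a page ever leaks:

#include <linux/mm.h>

static int mymem_check_mmap(size_t buf_size, struct vm_area_struct *vma)
{
	unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
	unsigned long len = vma->vm_end - vma->vm_start;

	/* the requested window must lie entirely inside this buffer */
	if (off > buf_size || len > buf_size - off)
		return -EINVAL;

	return 0;
}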
Thinking about the uncached issue some more, I think I’ve convinced myself that it should be a requirement. We’ll have to support older (pre-A9) cores for some time where using uncached memory provides better performance (which seems to be the consensus opinion here?).
Agreed.
Cheers,
Tom
Hi,
On Wed, Apr 20, 2011 at 02:52:56PM -0700, Rebecca Schultz Zavin wrote:
The android team's graphics folks, and their counterparts at intel, imagination tech, nvidia, qualcomm and arm, have all told me that they need a method for mapping buffers uncached to userspace. The common case for this is to write vertices, textures etc to these buffers once and never touch them again. This may happen several (or even several tens of) times per frame. My experience with cache flushes on ARM architectures matches Marek's. Typically write combine makes streaming writes really, really fast, and on several SoCs we've found it cheaper to flush the whole cache than to flush by line. Clearly this impacts the rest of system performance, not to mention the fact that a couple of large textures and you've totally blown your caches for the rest of the system.
This was exactly our experience with OMAP hardware (the Nokia N900, running an OMAP3430); we made strong use of uncached regions for exactly this reason, and measured a fairly dramatic performance win in both artificial and realistic usecases.
(This was with a Cortex-A8, so happy days.)
Cheers, Daniel
On Tue, Apr 19, 2011 at 4:23 PM, Arnd Bergmann arnd@arndb.de wrote:
3- fix up the unity mapping so the attribute bits match those desired by the unified memory manger. This could be done by removing pages from the unity map. It's complicated by the fact that the unity map makes use of large pages, sections and supersections to reduce tlb pressure. I don't think this is impossible if we restrict the set of contexts from which it can happen, but I'm imagining that we will also need to maintain some kind of pool of memory we've moved from cached to uncached since the process is likely to be expensive. Quite likely we will have to iterate over processes and update all their top level page tables.
Would it get simpler if we only allow entire supersections to be moved into the uncached memory allocator?
I think so, although how do you migrate it back when it isn't needed for mm/gfx buffers? I guess if it is all for large (>=720p) buffers, maybe fragmentation is less of a problem?
BR, -R
On Tuesday 19 April 2011 23:23:12 Arnd Bergmann wrote:
On Tuesday 19 April 2011 22:06:50 Rebecca Schultz Zavin wrote:
Hey all,
While we are working out requirements, I was hoping to get some more information about another related issue that keeps coming up on mailing lists and in discussions.
Thanks for the summary and getting this started!
ARM has stated that if you have the same physical memory mapped with two different sets of attribute bits you get undefined behavior. I think it's going to be a requirement that some of the memory allocated via the unified memory manager is mapped uncached.
This may be a stupid question, but do we have an agreement that it is actually a requirement to have uncached mappings? With the streaming DMA mapping API, it should be possible to work around noncoherent DMA by flushing the caches at the right times, which probably results in better performance than simply doing noncached mappings. What is the specific requirement for noncached memory regions?
However, because all of memory is mapped cached into the unity map at boot, we already have two mappings with different attributes. I want to understand the mechanism of the problem, because none of the solutions I can come up with are particularly nice. I'd also like to know exactly which architectures are affected, since the fix may be costly in performance, memory or both. Can someone at ARM explain to me why this causes a problem. I have a theory, but it's mostly a guess. I especially want to understand if it's still a problem if we never access the memory via the mapping in the unity map. I know speculative prefetching is part of the issue, so I assume older architectures without that feature don't exhibit this behaviour
In general (not talking about ARM in particular), Linux does not support mapping RAM pages with conflicting cache attributes. E.g. on certain powerpc CPUs, you get a checkstop if you try to bypass the cache when there is already an active cache line for it.
This is a variant of the cache aliasing problem we see with virtually indexed caches: You may end up with multiple cache lines for the same physical address, with different contents. The results are unpredictable, so most CPU architectures explicitly forbid this.
A couple of users ran into that exact problem when using USB webcams on ARM. The kernel driver allocates memory using vmalloc_32(), which is then mapped to userspace by an mmap() call. The memory is written to by the CPU in kernel context, and read from in userspace. With VIVT or aliasing VIPT caches, this leads to cache coherency issues.
The only workaround I've been able to find is to munmap() the buffer before passing it to the kernel, and mmap() it back when the kernel is done with it. When talking about uncompressed HD video at 30 fps or more, that's quite expensive. We need a better solution.
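In userspace that workaround looks roughly like the sketch below (assuming a V4L2-style capture driver, which is what the USB webcam case would be; error handling omitted):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* hand the buffer back to the driver: unmap first so only the kernel's
 * mapping is live while the hardware fills it */
static int buf_requeue(int fd, struct v4l2_buffer *buf, void *addr)
{
	munmap(addr, buf->length);
	return ioctl(fd, VIDIOC_QBUF, buf);
}

/* once VIDIOC_DQBUF returns the buffer, map it again for the CPU */
static void *buf_map(int fd, const struct v4l2_buffer *buf)
{
	return mmap(NULL, buf->length, PROT_READ, MAP_SHARED,
		    fd, buf->m.offset);
}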
If we really need all mappings of physical memory to have the same cache attribute bits, I see three workarounds:
1- set aside memory at boot that never gets mapped by the kernel. The unified memory manager can then ensure there's only one mapping at a time. Obvious drawbacks here are that you have to statically partition your system into memory you want accessible to the unified memory manager and memory you don't. This may not be that big a deal, since most current solutions, pmem, cmem, et al basically do this. I can say that on Android devices running on a high resolution display (720p and above) we're easily talking about needing 256M of memory or more to dedicate to this.
Right, I believe this is what people generally do to avoid the problem, but I wouldn't call it a solution.
2- use highmem pages only for the unified memory manager. Highmem pages only get mapped on demand. This has some performance costs when the kernel allocates other metadata in highmem. Most embedded systems still don't have enough memory to need highmem, though I'm guessing that'll follow the current trend and shift in the next couple of years.
We are very close to needing highmem on a lot of systems, and in Linaro we generally assume that it's there. For instance, Acer has announced an Android tablet that has a full gigabyte of RAM, so they are most likely using highmem already.
There is a significant overhead in simply enabling highmem on a system where you don't need it, but it also makes it possible to use the memory for page cache that would otherwise be wasted when there is no active user of the reserved memory.
3- fix up the unity mapping so the attribute bits match those desired by the unified memory manger. This could be done by removing pages from the unity map. It's complicated by the fact that the unity map makes use of large pages, sections and supersections to reduce tlb pressure. I don't think this is impossible if we restrict the set of contexts from which it can happen, but I'm imagining that we will also need to maintain some kind of pool of memory we've moved from cached to uncached since the process is likely to be expensive. Quite likely we will have to iterate over processes and update all their top level page tables.
Would it get simpler if we only allow entire supersections to be moved into the uncached memory allocator?
Hello,
On Tuesday, April 19, 2011 11:23 PM Arnd Bergmann wrote:
This may be a stupid question, but do we have an agreement that it is actually a requirement to have uncached mappings? With the streaming DMA mapping API, it should be possible to work around noncoherent DMA by flushing the caches at the right times, which probably results in better performance than simply doing noncached mappings. What is the specific requirement for noncached memory regions?
Flushing the cache for large buffers also takes a significant time, especially if it is implemented by iterating over the whole buffer and calling the flush instruction for each line.
For most use cases the CPU write speed is not degraded on non-cached memory areas. ARM CPUs with the write-combining feature perform really well on uncached memory.
Non-cached buffers are also the only solution for buffers that need to be permanently mapped to userspace (like framebuffer). Non-cached mappings are also useful when one doesn't touch the memory with cpu at all (zero copy between 2 independent multimedia blocks).
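For comparison, the two options being weighed look roughly like this with the existing kernel APIs (dev, cpu_buf and size are placeholders; on ARM there is also a dma_alloc_writecombine() variant for an explicitly write-combined buffer):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* coherent/uncached: permanently mapped, no cache maintenance needed,
 * CPU writes go through the write buffer */
static void *alloc_coherent_buf(struct device *dev, size_t size,
				dma_addr_t *handle)
{
	return dma_alloc_coherent(dev, size, handle, GFP_KERNEL);
}

/* cached + streaming: fast CPU access, but every handoff to the device
 * pays for cache maintenance over the whole buffer */
static dma_addr_t handoff_to_device(struct device *dev, void *cpu_buf,
				    size_t size)
{
	return dma_map_single(dev, cpu_buf, size, DMA_TO_DEVICE);
}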
(snipped)
If we really need all mappings of physical memory to have the same cache attribute bits, I see three workarounds:
1- set aside memory at boot that never gets mapped by the kernel. The unified memory manager can then ensure there's only one mapping at a time. Obvious drawbacks here are that you have to statically partition your
system
into memory you want accessible to the unified memory manager and memory
you
don't. This may not be that big a deal, since most current solutions,
pmem,
cmem, et al basically do this. I can say that on Android devices running
on
a high resolution display (720p and above) we're easily talking about needing 256M of memory or more to dedicate to this.
Right, I believe this is what people generally do to avoid the problem, but I wouldn't call it a solution.
It is also a huge memory waste. Similar solutions have been proposed to overcome the problem of memory fragmentation. I don't think we can afford giving away almost half of the system memory just to have the possibility of processing a 720p movie.
That's why we came up with the idea of CMA (contiguous memory allocator), which can 'recycle' memory areas that are not used by multimedia hardware. CMA allows the system to allocate movable pages (like page cache, user process memory, etc.) from a defined CMA range and migrate them on an allocation request for contiguous memory. For more information, please refer to: https://lkml.org/lkml/2011/3/31/213
I want to merge this idea with changing the kernel linear low-mem mapping, so that 2-level page mapping will be done only for the defined CMA range, which should reduce TLB pressure. Once a contiguous block is allocated from the CMA range, the mapping in the low-mem area can be removed to fulfill the ARM specification.
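A pseudocode-level sketch of the allocation path that describes, not the actual API from the patch set; page_busy() and migrate_page_away() stand in for the real migration machinery:

#include <linux/mm.h>

bool page_busy(struct page *page);		/* hypothetical helper */
void migrate_page_away(struct page *page);	/* hypothetical helper */

struct page *cma_alloc_range(unsigned long base_pfn, unsigned long nr_pages)
{
	unsigned long pfn;

	for (pfn = base_pfn; pfn < base_pfn + nr_pages; pfn++) {
		struct page *page = pfn_to_page(pfn);

		/* only movable allocations were allowed in this range,
		 * so any current user can be migrated out of the way */
		if (page_busy(page))
			migrate_page_away(page);
	}

	/* the range is now free and physically contiguous */
	return pfn_to_page(base_pfn);
}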
3- fix up the unity mapping so the attribute bits match those desired by
the
unified memory manger. This could be done by removing pages from the
unity
map. It's complicated by the fact that the unity map makes use of large pages, sections and supersections to reduce tlb pressure. I don't think this is impossible if we restrict the set of contexts from which it can happen, but I'm imagining that we will also need to maintain some kind of pool of memory we've moved from cached to uncached since the process is likely to be expensive. Quite likely we will have to iterate over processes and update all their top level page tables.
Would it get simpler if we only allow entire supersections to be moved into the uncached memory allocator?
I'm not sure this will change anything. Updating the attributes of a supersection still requires iterating over all processes and their page tables. If we change the attributes of the supersection at system boot and only allow buffers to be allocated from it, we end up with solution #1 (allocation of dma buffers only from the reserved memory).
Best regards
On Wednesday 20 April 2011, Marek Szyprowski wrote:
Hello,
On Tuesday, April 19, 2011 11:23 PM Arnd Bergmann wrote:
This may be a stupid question, but do we have an agreement that it is actually a requirement to have uncached mappings? With the streaming DMA mapping API, it should be possible to work around noncoherent DMA by flushing the caches at the right times, which probably results in better performance than simply doing noncached mappings. What is the specific requirement for noncached memory regions?
Flushing the cache for large buffers also takes a significant time, especially if it is implemented by iterating over the whole buffer and calling the flush instruction for each line.
For most use cases the CPU write speed is not degraded on non-cached memory areas. ARM CPUs with the write-combining feature perform really well on uncached memory.
Ok, makes sense.
Non-cached buffers are also the only solution for buffers that need to be permanently mapped to userspace (like framebuffer).
Why? Are the cache flush operations privileged on ARM?
Non-cached mappings are also useful when one doesn't touch the memory with cpu at all (zero copy between 2 independent multimedia blocks).
I would think that if we don't want to touch the data, ideally it should not be mapped at all into the kernel address space.
Right, I believe this is what people generally do to avoid the problem, but I wouldn't call it a solution.
It is also a huge memory waste. Similar solutions have been proposed to overcome the problem of memory fragmentation. I don't think we can afford giving away almost half of the system memory just to have the possibility of processing a 720p movie.
Yes, that was my point. It's not a solution at all, just the easiest way to ignore the problem by creating a different one.
That's why we came up with the idea of CMA (contiguous memory allocator), which can 'recycle' memory areas that are not used by multimedia hardware. CMA allows the system to allocate movable pages (like page cache, user process memory, etc.) from a defined CMA range and migrate them on an allocation request for contiguous memory. For more information, please refer to: https://lkml.org/lkml/2011/3/31/213
I thought CMA was mostly about dealing with systems that don't have an IOMMU, which is a related problem, but is not the same as dealing with noncoherent DMA.
I want to merge this idea with changing the kernel linear low-mem mapping, so that 2-level page mapping will be done only for the defined CMA range, which should reduce TLB pressure. Once a contiguous block is allocated from the CMA range, the mapping in the low-mem area can be removed to fulfill the ARM specification.
I'm not convinced that trying to solve both issues at the same time is a good idea. For systems that have an IOMMU, we just want a bunch of pages and map them virtually contiguous into the bus address space. For other systems, we really need physically contiguous memory. Independent of that, you may or may not need to unmap them from the linear mapping and make them noncached, depending on whether there is prefetching and noncoherent DMA involved.
Arnd
Hello,
On Wednesday, April 20, 2011 5:13 PM Arnd Bergmann wrote:
I'm sorry for the slight delay in answering; I've missed some mail from linaro-mm.
On Wednesday 20 April 2011, Marek Szyprowski wrote:
Hello,
On Tuesday, April 19, 2011 11:23 PM Arnd Bergmann wrote:
This may be a stupid question, but do we have an agreement that it is actually a requirement to have uncached mappings? With the streaming DMA mapping API, it should be possible to work around noncoherent DMA by flushing the caches at the right times, which probably results in better performance than simply doing noncached mappings. What is the specific requirement for noncached memory regions?
Flushing the cache for large buffers also takes a significant time, especially if it is implemented by iterating over the whole buffer and calling the flush instruction for each line.
For most use cases the CPU write speed is not degraded on non-cached memory areas. ARM CPUs with the write-combining feature perform really well on uncached memory.
Ok, makes sense.
Non-cached buffers are also the only solution for buffers that need to be permanently mapped to userspace (like framebuffer).
Why? Are the cache flush operations privileged on ARM?
I simplified this too much. Non-cached buffers are the only solution for userspace APIs that assume coherent memory. With the framebuffer example I wanted to say that fb clients can write data at any time through the mmapped window, and the kernel has no way to know when data has been written, so it has no opportunity to flush/invalidate the cache.
Non-cached mappings are also useful when one doesn't touch the memory with cpu at all (zero copy between 2 independent multimedia blocks).
I would think that if we don't want to touch the data, ideally it should not be mapped at all into the kernel address space.
That's the other possibility; however, it is hardly possible now because of gaps in the kernel and userspace APIs for most of the subsystems.
(snipped)
That's why we came up with the idea of CMA (contiguous memory allocator), which can 'recycle' memory areas that are not used by multimedia hardware. CMA allows the system to allocate movable pages (like page cache, user process memory, etc.) from a defined CMA range and migrate them on an allocation request for contiguous memory. For more information, please refer to: https://lkml.org/lkml/2011/3/31/213
I thought CMA was mostly about dealing with systems that don't have an IOMMU, which is a related problem, but is not the same as dealing with noncoherent DMA.
Right, the main purpose of CMA is to make it possible to allocate a large chunk of contiguous memory. The reason I mentioned it is that some proposed solutions to the coherent mapping problem assume that the buffers for the coherent allocator will be allocated from a separate memory region that is not covered by the kernel's low-mem mapping, or not accessed by the kernel at all.
I want to merge this idea with changing the kernel linear low-mem mapping, so that 2-level page mapping will be done only for the defined CMA range, which should reduce TLB pressure. Once a contiguous block is allocated from the CMA range, the mapping in the low-mem area can be removed to fulfill the ARM specification.
I'm not convinced that trying to solve both issues at the same time is a good idea. For systems that have an IOMMU, we just want a bunch of pages and map them virtually contiguous into the bus address space. For other systems, we really need physically contiguous memory. Independent of that, you may or may not need to unmap them from the linear mapping and make them noncached, depending on whether there is prefetching and noncoherent DMA involved.
Right, the problems of allocating memory and keeping the correct page mappings are orthogonal to each other. I just wanted to show that some optimizations can be achieved if both problems are solved together.
Best regards
On 27 April 2011 09:16, Marek Szyprowski m.szyprowski@samsung.com wrote:
Hello,
On Wednesday, April 20, 2011 5:13 PM Arnd Bergmann wrote:
I'm sorry for the slight delay in answering; I've missed some mail from linaro-mm.
On Wednesday 20 April 2011, Marek Szyprowski wrote:
Hello,
On Tuesday, April 19, 2011 11:23 PM Arnd Bergmann wrote:
This may be a stupid question, but do we have an agreement that it is actually a requirement to have uncached mappings? With the streaming DMA mapping API, it should be possible to work around noncoherent DMA by flushing the caches at the right times, which probably results in better performance than simply doing noncached mappings. What is the specific requirement for noncached memory regions?
Flushing the cache for large buffers also takes a significant time, especially if it is implemented by iterating over the whole buffer and calling the flush instruction for each line.
For most use cases the CPU write speed is not degraded on non-cached memory areas. ARM CPUs with the write-combining feature perform really well on uncached memory.
Ok, makes sense.
Non-cached buffers are also the only solution for buffers that need to be permanently mapped to userspace (like framebuffer).
Why? Are the cache flush operations privileged on ARM?
I simplified this too much. Non-cached buffers are the only solution for userspace APIs that assume coherent memory. With the framebuffer example I wanted to say that fb clients can write data at any time through the mmapped window, and the kernel has no way to know when data has been written, so it has no opportunity to flush/invalidate the cache.
Non-cached mappings are also useful when one doesn't touch the memory with cpu at all (zero copy between 2 independent multimedia blocks).
I would think that if we don't want to touch the data, ideally it should not be mapped at all into the kernel address space.
That's the other possibility; however, it is hardly possible now because of gaps in the kernel and userspace APIs for most of the subsystems.
If we can sort out a rational approach to multiple mappers across architectures we could take care of this.
(snipped)
That's why we came up with the idea of CMA (contiguous memory allocator), which can 'recycle' memory areas that are not used by multimedia hardware. CMA allows the system to allocate movable pages (like page cache, user process memory, etc.) from a defined CMA range and migrate them on an allocation request for contiguous memory. For more information, please refer to: https://lkml.org/lkml/2011/3/31/213
I thought CMA was mostly about dealing with systems that don't have an IOMMU, which is a related problem, but is not the same as dealing with noncoherent DMA.
Right, the main purpose of CMA is to make it possible to allocate a large chunk of contiguous memory. The reason I mentioned it is that some proposed solutions to the coherent mapping problem assume that the buffers for the coherent allocator will be allocated from a separate memory region that is not covered by the kernel's low-mem mapping, or not accessed by the kernel at all.
I want to merge this idea with changing the kernel linear low-mem mapping, so that 2-level page mapping will be done only for the defined CMA range, which should reduce TLB pressure. Once a contiguous block is allocated from the CMA range, the mapping in the low-mem area can be removed to fulfill the ARM specification.
We could keep the mapping around and just mark it inactive. If we could handle the multiple mappers problem on-demand that could streamline things.
I'm not convinced that trying to solve both issues at the same time is a good idea. For systems that have an IOMMU, we just want a bunch of pages and map them virtually contiguous into the bus address space. For other systems, we really need physically contiguous memory. Independent of that, you may or may not need to unmap them from the linear mapping and make them noncached, depending on whether there is prefetching and noncoherent DMA involved.
Right, the problems of allocating memory and keeping the correct page mappings are orthogonal to each other. I just wanted to show that some optimizations can be achieved if both problems are solved together.
On Wednesday 27 April 2011 21:12:33 Zach Pfeffer wrote:
On 27 April 2011 09:16, Marek Szyprowski m.szyprowski@samsung.com wrote:
I want to merge this idea with changing the kernel linear low-mem mapping, so that 2-level page mapping will be done only for the defined CMA range, which should reduce TLB pressure. Once a contiguous block is allocated from the CMA range, the mapping in the low-mem area can be removed to fulfill the ARM specification.
We could keep the mapping around and just mark it inactive. If we could handle the multiple mappers problem on-demand that could streamline things.
According to what Catalin wrote today, even that is not necessary any more, as the ARM ARM is being updated to allow it. We only need to flush the cache for the linear mapping before we establish another one, AFAICT.
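On ARM, with the kernel-internal cache APIs of that era, the flush of the linear-map alias could look something like the sketch below (it mirrors what the arch/arm dma-mapping code does; treat it as an illustration rather than the agreed implementation):

#include <linux/mm.h>
#include <asm/cacheflush.h>
#include <asm/outercache.h>

static void flush_linear_alias(struct page *page, size_t size)
{
	void *virt = page_address(page);	/* lowmem linear-map address */
	phys_addr_t phys = page_to_phys(page);

	/* clean + invalidate the L1 lines through the cached alias ... */
	dmac_flush_range(virt, virt + size);
	/* ... and the outer (L2) cache by physical address */
	outer_flush_range(phys, phys + size);
}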
Arnd
On Tue, 19 Apr 2011, Rebecca Schultz Zavin wrote:
Hey all,
While we are working out requirements, I was hoping to get some more information about another related issue that keeps coming up on mailing lists and in discussions.
ARM has stated that if you have the same physical memory mapped with two different sets of attribute bits you get undefined behavior. I think it's going to be a requirement that some of the memory allocated via the unified memory manager is mapped uncached. However, because all of memory is mapped cached into the unity map at boot, we already have two mappings with different attributes. I want to understand the mechanism of the problem, because none of the solutions I can come up with are particularly nice. I'd also like to know exactly which architectures are affected, since the fix may be costly in performance, memory or both. Can someone at ARM explain to me why this causes a problem. I have a theory, but it's mostly a guess.
My own guess is that the cacheable attribute is tied to cache entries which are physically tagged. Access to one mapping could establish some attributes that the second mapping would inherit on cache hit.
I especially want to understand if it's still a problem if we never access the memory via the mapping in the unity map. I know speculative prefetching is part of the issue, so I assume older architectures without that feature don't exhibit this behaviour
Yes, speculative prefetching is what will cause spurious accesses through the kernel direct mapping (that's how it is called in Linux) even if you don't access it explicitly. Older architectures don't have speculative prefetching, and even older ones have VIVT caches which have no problem with multiple different mappings.
If we really need all mappings of physical memory to have the same cache attribute bits, I see three workarounds:
1- set aside memory at boot that never gets mapped by the kernel. The unified memory manager can then ensure there's only one mapping at a time. Obvious drawbacks here are that you have to statically partition your system into memory you want accessible to the unified memory manager and memory you don't. This may not be that big a deal, since most current solutions, pmem, cmem, et al basically do this. I can say that on Android devices running on a high resolution display (720p and above) we're easily talking about needing 256M of memory or more to dedicate to this.
This is obviously suboptimal.
2- use highmem pages only for the unified memory manager. Highmem pages only get mapped on demand. This has some performance costs when the kernel allocates other metadata in highmem. Most embedded systems still don't have enough memory to need highmem, though I'm guessing that'll follow the current trend and shift in the next couple of years.
The kernel tries not to allocate its own data in highmem. Instead, highmem pages are used for user space processes or the buffer cache which can be populated directly by DMA and be largely untouched by the kernel. The highmem pages are also fairly easily reclaimable making them an easy target when large physically contiguous allocations are required.
It is true that most systems might not have enough memory to require highmem, but they can make use of it nevertheless, simply by changing the direct mapped memory threshold.
While highmem is not free in terms of overhead, it is still quite lightweight compared to other memory partitioning schemes, and above all it is already supported across the whole kernel and relied upon by many people already.
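To illustrate the point: a highmem page handed out by the unified allocator has no permanent kernel mapping, so the kernel only creates a transient alias when it really needs CPU access, e.g.:

#include <linux/highmem.h>
#include <linux/string.h>

static void clear_allocated_page(struct page *page)
{
	void *vaddr = kmap(page);	/* temporary kernel mapping */

	memset(vaddr, 0, PAGE_SIZE);
	kunmap(page);			/* drop it; no lingering alias */
}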
3- fix up the unity mapping so the attribute bits match those desired by the unified memory manger. This could be done by removing pages from the unity map. It's complicated by the fact that the unity map makes use of large pages, sections and supersections to reduce tlb pressure. I don't think this is impossible if we restrict the set of contexts from which it can happen, but I'm imagining that we will also need to maintain some kind of pool of memory we've moved from cached to uncached since the process is likely to be expensive. Quite likely we will have to iterate over processes and update all their top level page tables.
The kernel direct mapping shares the same mapping entries across all processes. So if (part of) the kernel direct mapping uses second level page table entries, then the first level entries will share the same second level page table across all processes. Hence changing memory attributes for those pages covered by that second level table won't require any iteration over all processes. Obviously the drawback here is more TLB pressure, however if the memory put aside is not used by the kernel directly then the associated TLBs won't be involved.
These all have drawbacks, so I'd like to really understand the problem before pursuing them. Can the linaro folks find someone who can explain the problem in more detail?
I'm happy to discuss about the details when needed.
Nicolas
On Tue, Apr 19, 2011 at 2:49 PM, Nicolas Pitre nicolas.pitre@linaro.org wrote:
On Tue, 19 Apr 2011, Rebecca Schultz Zavin wrote:
Hey all,
While we are working out requirements, I was hoping to get some more information about another related issue that keeps coming up on mailing lists and in discussions.
ARM has stated that if you have the same physical memory mapped with two different sets of attribute bits you get undefined behavior. I think
it's
going to be a requirement that some of the memory allocated via the
unified
memory manager is mapped uncached. However, because all of memory is
mapped
cached into the unity map at boot, we already have two mappings with different attributes. I want to understand the mechanism of the problem, because none of the solutions I can come up with are particularly nice.
I'd
also like to know exactly which architectures are affected, since the fix may be costly in performance, memory or both. Can someone at ARM explain
to
me why this causes a problem. I have a theory, but it's mostly a guess.
My own guess is that the cacheable attribute is tied to cache entries which are physically tagged. Access to one mapping could establish some attributes that the second mapping would inherit on cache hit.
I especially want to understand if it's still a problem if we never access the memory via the mapping in the unity map. I know speculative prefetching is part of the issue, so I assume older architectures without that feature don't exhibit this behaviour
Yes, speculative prefetching is what will cause spurious accesses through the kernel direct mapping (that's how it is called in Linux) even if you don't access it explicitly. Older architectures don't have speculative prefetching, and even older ones have VIVT caches which have no problem with multiple different mappings.
My guess was that if a line was present in the L2, even accesses via an uncached mapping would get serviced from there. Presumably the prefetcher would have had to access the cached mapping, causing it to get populated into the L2. Then when the uncached mapping is accessed later, we see the old, possibly stale line. The only mechanism by which this makes sense to me is if the cache attributes are only checked when deciding to put a line into the L2 and not when retrieving one.
If we really need all mappings of physical memory to have the same cache attribute bits, I see three workarounds:
1- set aside memory at boot that never gets mapped by the kernel. The unified memory manager can then ensure there's only one mapping at a
time.
Obvious drawbacks here are that you have to statically partition your
system
into memory you want accessible to the unified memory manager and memory
you
don't. This may not be that big a deal, since most current solutions,
pmem,
cmem, et al basically do this. I can say that on Android devices running
on
a high resolution display (720p and above) we're easily talking about needing 256M of memory or more to dedicate to this.
This is obviously suboptimal.
2- use highmem pages only for the unified memory manager. Highmem pages only get mapped on demand. This has some performance costs when the kernel allocates other metadata
in
highmem. Most embedded systems still don't have enough memory to need highmem, though I'm guessing that'll follow the current trend and shift
in
the next couple of years.
The kernel tries not to allocate its own data in highmem. Instead, highmem pages are used for user space processes or the buffer cache which can be populated directly by DMA and be largely untouched by the kernel. The highmem pages are also fairly easily reclaimable making them an easy target when large physically contiguous allocations are required.
It is true that most systems might not have enough memory to require highmem, but they can make use of it nevertheless, simply by changing the direct mapped memory threshold.
While highmem is not free in terms of overhead, it is still quite lightweight compared to other memory partitioning schemes, and above all it is already supported across the whole kernel and relied upon by many people already.
This is actually my favorite solution -- it's easy and not too gross. To be honest we've been tweaking things so we get some highmem on some android platforms for a while, because the binder is a huge kernel address space hog. My plan for implementing a memory manager is to start with this and solve the harder problem only if I have to.
3- fix up the unity mapping so the attribute bits match those desired by
the
unified memory manger. This could be done by removing pages from the
unity
map. It's complicated by the fact that the unity map makes use of large pages, sections and supersections to reduce tlb pressure. I don't think this is impossible if we restrict the set of contexts from which it can happen, but I'm imagining that we will also need to maintain some kind of pool of memory we've moved from cached to uncached since the process is likely to be expensive. Quite likely we will have to iterate over processes and update all their top level page tables.
The kernel direct mapping shares the same mapping entries across all processes. So if (part of) the kernel direct mapping uses second level page table entries, then the first level entries will share the same second level page table across all processes. Hence changing memory attributes for those pages covered by that second level table won't require any iteration over all processes. Obviously the drawback here is more TLB pressure, however if the memory put aside is not used by the kernel directly then the associated TLBs won't be involved.
I think the whole kernel direct mapping might be in 1st level page tables, assuming it's an integer multiple of sections or supersections and aligned properly. I suppose we could hack things to make sure there was at least one supersection's worth of second level page table entries required. I came to the same conclusion about TLB pressure, that it didn't matter much if we don't actually ever use those TLBs anyway. We actually had some patches from folks at nvidia to solve part of this problem by just making the whole kernel direct map page based. We never measured the performance impact, just decided we were uncomfortable with the change.
These all have drawbacks, so I'd like to really understand the problem before pursuing them. Can the linaro folks find someone who can explain
the
problem in more detail?
I'm happy to discuss about the details when needed.
Nicolas
Rebecca
Hello,
On Tuesday, April 19, 2011 10:07 PM Rebecca Schultz Zavin wrote:
While we are working out requirements, I was hoping to get some more information about another related issue that keeps coming up on mailing lists and in discussions.
Thanks for starting the discussion!
(snipped)
3- fix up the unity mapping so the attribute bits match those desired by the unified memory manger. This could be done by removing pages from the unity map. It's complicated by the fact that the unity map makes use of large pages, sections and supersections to reduce tlb pressure. I don't think this is impossible if we restrict the set of contexts from which it can happen, but I'm imagining that we will also need to maintain some kind of pool of memory we've moved from cached to uncached since the process is likely to be expensive. Quite likely we will have to iterate over processes and update all their top level page tables.
There have been proposals to change the way the unity mapping is created on ARM. If we drop section mappings and use standard two-level mappings, changing the attributes of a particular page becomes much easier. Such a solution has been proposed in: http://thread.gmane.org/gmane.linux.ports.arm.kernel/86697/focus=86700
Currently I'm working on adapting it to the latest kernel and merging it with the latest CMA patches.
Best regards
Including Steve to talk about Qualcomm's chipset.
On 19 April 2011 15:06, Rebecca Schultz Zavin rebecca@android.com wrote: