The need to allocate pages for "write combining" access goes deeper than anything to do with DMA or IOMMUs. Please keep "write combine" distinct from "coherent" in the allocation/mapping APIs.
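To make that concrete, here's roughly what the distinction looks like from a driver. This is a sketch only -- the variable and function names are mine, dma_alloc_writecombine() is (as far as I know) an ARM-specific extension rather than part of the portable DMA API, and I've trimmed all the error handling:

    /* Sketch: "coherent" vs. "write combine" allocations from a driver. */
    #include <linux/dma-mapping.h>

    static void *desc_ring, *pixel_buf;
    static dma_addr_t desc_dma, pixel_dma;

    static int alloc_bufs(struct device *dev)
    {
            /* Uncached and unbuffered on ARM: the right thing for a
             * descriptor ring, where ordering matters more than speed. */
            desc_ring = dma_alloc_coherent(dev, PAGE_SIZE,
                                           &desc_dma, GFP_KERNEL);

            /* Uncached but bufferable: what the CPU needs in order to
             * stream pixel/audio data out at a tolerable rate. */
            pixel_buf = dma_alloc_writecombine(dev, PAGE_SIZE,
                                               &pixel_dma, GFP_KERNEL);

            return (desc_ring && pixel_buf) ? 0 : -ENOMEM;
    }

Collapsing the two cases into one "uncached" flavor would lose exactly the distinction that matters here.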
Write-combining is a special case because it's an end-to-end requirement, usually architecturally invisible, and getting it to happen requires a very specific combination of mappings and code. There's a good explanation of the requirements on some Intel implementations of the x86 architecture here: http://software.intel.com/en-us/articles/copying-accelerated-video-decode-fr... . As I understand it, similar considerations apply on at least some ARMv7 implementations, with NEON multi-register load/store operations taking the place of MOVNTDQ. (See http://www.arm.com/files/pdf/A8_Paper.pdf for instance; although I don't think it gives enough detail about the conditions under which "if the full cache line is written, the Level-2 line is simply marked dirty and no external memory requests are required.")
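For concreteness, the inner loop the Intel article describes looks like this with SSE2 intrinsics (_mm_stream_si128 compiles to MOVNTDQ). A userland sketch, under the assumptions that both pointers are 16-byte aligned and the length is a multiple of 64:

    #include <stddef.h>
    #include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */

    /* Copy with non-temporal stores (MOVNTDQ).  Writing four 16-byte
     * chunks back to back fills one 64-byte write-combining buffer, so
     * it can be flushed to memory in a single burst with no read. */
    static void copy_nt(void *dst, const void *src, size_t n)
    {
            __m128i *d = dst;
            const __m128i *s = src;
            size_t i;

            for (i = 0; i < n / 16; i += 4) {
                    _mm_stream_si128(&d[i + 0], _mm_load_si128(&s[i + 0]));
                    _mm_stream_si128(&d[i + 1], _mm_load_si128(&s[i + 1]));
                    _mm_stream_si128(&d[i + 2], _mm_load_si128(&s[i + 2]));
                    _mm_stream_si128(&d[i + 3], _mm_load_si128(&s[i + 3]));
            }
            _mm_sfence();  /* make the streaming stores globally visible */
    }

Of course, whether this actually produces write-combined bursts still depends on the page attributes of the mapping -- which is exactly the part that I can't get at from userland.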
As far as I can tell, there is not yet any way to get real cache-bypassing write-combining from userland in a mainline kernel, for x86/x86_64 or ARM. I have been able to do it from inside a driver on x86 -- including in an ISR, after some fixes to the kernel's FPU context save/restore code (patch attached, if you're curious) -- but otherwise I haven't seen write-combining in operation on Linux. The code that needs to bypass the cache is part of a SoC silicon erratum workaround supplied by Intel. It didn't work as delivered -- it oopsed the kernel -- but it is now shipping inside our product, and no problems have been reported from QA or the field, so I'm fairly sure that the changes I made are effective.
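In case it saves someone else the debugging session: SSE code in kernel context has to be bracketed by kernel_fpu_begin()/kernel_fpu_end(), and in the kernel I was on those weren't safe from interrupt context without the attached fixes. The driver-side shape was roughly this (hypothetical names, reusing copy_nt() from the sketch above):

    #include <linux/types.h>
    #include <linux/compiler.h>
    #include <asm/i387.h>   /* kernel_fpu_begin()/kernel_fpu_end() */

    /* Hypothetical driver-side wrapper around the MOVNTDQ loop above.
     * Without the begin/end bracket, the SSE registers of whatever
     * task was interrupted get silently corrupted. */
    static void erratum_wc_copy(void __iomem *dst, const void *src,
                                size_t n)
    {
            kernel_fpu_begin();     /* save the interrupted FPU/SSE state */
            copy_nt((void __force *)dst, src, n);
            kernel_fpu_end();       /* restore it */
    }

Newer kernels have irq_fpu_usable() to ask whether this is legal in the current (interrupt) context; worth checking before going the patch route.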
I am not an expert in this area; I was just forced to learn something about it in order to make a product work. My assertion that "there's no way to do it yet" is almost certainly wrong. I am hoping and expecting to be immediately contradicted, with a working code example and benchmarks showing that cache lines are not being fetched, clobbered, and stored again behind my back, with the latencies hidden inside the cache architecture. :-) (Seriously: there are four bits in the Cortex-A8's "L2 Cache Auxiliary Control Register" that control various aspects of this mechanism, and if you don't have a fairly good explanation of which bits do and don't affect your benchmark, then I contend that the job isn't done. I don't begin to understand the equivalent for the multi-core A9 I'm targeting next.)
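(For anyone who wants to check their own silicon: CP15 access is privileged, so this is kernel mode only. The encoding below is my reading of the Cortex-A8 TRM, and the helper name is mine:

    #include <linux/types.h>

    /* Read the Cortex-A8 L2 Cache Auxiliary Control Register, so a
     * benchmark report can at least say which bits were set. */
    static inline u32 read_l2auxcr(void)
    {
            u32 val;

            asm volatile("mrc p15, 1, %0, c9, c0, 2" : "=r" (val));
            return val;
    }

Dumping that alongside the benchmark numbers would go a long way toward the "explanation of which bits do and don't matter" I'm asking for.)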
If some kind person doesn't help me see the error of my ways, I'm going to have to figure it out for myself on ARM in the next couple of months, this time for performance reasons rather than to work around silicon errata. Unfortunately, I do not expect it to be particularly low-hanging fruit. I expect to switch to the hard-float ABI first (the only remaining obstacle being a couple of TI-supplied binary-only libraries). That might provide enough of a system-level performance win (by allowing the compiler to reorder fetches to NEON registers across function/method calls) to obviate the need.
Cheers, - Michael