On Mon, May 06, 2019 at 07:22:06PM -0700, Linus Torvalds wrote:
> We do *not* have very strict guarantees for D$-vs-I$ coherency on x86, but we *do* have very strict guarantees for D$-vs-D$ coherency. And so we could use the D$ coherency to give us atomicity guarantees for loading and storing the instruction offset for instruction emulation, in ways we can *not* use the D$-to-I$ guarantees and just executing it directly.
>
> So while we still need those nasty IPI's to guarantee the D$-vs-I$ coherency in the "big picture" model and to get the serialization with the actual 'int3' exception right, we *could* just do all the other parts of the instruction emulation using the D$ coherency.
>
> So we could do the actual "call offset" write with a single atomic 4-byte locked cycle (just use "xchg" to write - it's always locked). And similarly we could do the call offset *read* with a single locked cycle (cmpxchg with a 0 value, for example). It would be atomic even if it crosses a cacheline boundary.
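
For illustration only, the locked write and locked read described above could look roughly like the following userspace sketch. It is not kernel code: the names write_offset() and read_offset() are hypothetical, the GCC/Clang __atomic builtins stand in for hand-written xchg/cmpxchg, and the sketch assumes (and asserts) a naturally aligned 4-byte slot.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not from the kernel tree. They model the
 * 4-byte "call offset" as an int32_t and insist that it is naturally
 * aligned, so the locked access stays within one cache line.
 */

/* Store a new offset with an atomic exchange (xchg on x86, which is
 * implicitly locked). Returns the previous offset. */
static int32_t write_offset(int32_t *p, int32_t new_off)
{
	assert(((uintptr_t)p & 3) == 0);	/* assumption: aligned slot */
	return __atomic_exchange_n(p, new_off, __ATOMIC_SEQ_CST);
}

/* Load the offset with a compare-and-exchange against 0.
 * If *p is 0, the exchange "succeeds" and stores 0 again, so nothing
 * changes. Otherwise it fails and copies the current value of *p into
 * 'expected'. Either way 'expected' ends up holding the value read. */
static int32_t read_offset(int32_t *p)
{
	int32_t expected = 0;

	assert(((uintptr_t)p & 3) == 0);	/* assumption: aligned slot */
	__atomic_compare_exchange_n(p, &expected, 0, 0,
				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	return expected;
}
```

A caller would keep one int32_t per patched call site and go through these two helpers for every access; the alignment asserts are what keep the sketch away from the unaligned case.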
Very 'soon', x86 will start to #AC if you do unaligned LOCK prefixed instructions. The problem is that while aligned LOCK instructions can do the atomicity with the coherency protocol, unaligned ones (esp. those crossing a line or page boundary) need that bus-lock thing the SDM talks about.
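
As a rough way to see which accesses fall into the bad category, here is a small sketch that tests whether an access of a given size straddles a boundary. The helper names are made up for this example, and the 64-byte line and 4096-byte page sizes are assumptions about a typical x86 part, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_LINE_SIZE 64u	/* assumption: typical x86 cache line */
#define ASSUMED_PAGE_SIZE 4096u	/* assumption: 4 KiB base pages */

/* True if the bytes [addr, addr + len) land in more than one
 * 'boundary'-sized block, i.e. the first and last byte of the access
 * fall into different blocks. */
static bool crosses_boundary(uintptr_t addr, size_t len, uintptr_t boundary)
{
	assert(boundary != 0);
	return len != 0 && (addr / boundary) != ((addr + len - 1) / boundary);
}

/* A LOCKed access that crosses a cache line cannot be handled by the
 * coherency protocol alone; this is the case to avoid. */
static bool is_split_lock(uintptr_t addr, size_t len)
{
	return crosses_boundary(addr, len, ASSUMED_LINE_SIZE);
}
```

With these assumptions, a 4-byte access at 0x103c still fits in one line, while the same access at 0x103d spills into the next line and would count as a split lock.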
For giggles, write yourself a while(1) loop that XCHGs across a page-boundary and see what it does to the rest of the system.
So _please_, do not rely on unaligned atomic ops. We really want them to go the way of the Dodo.