On Thu, Aug 10, 2023 at 04:27:02PM -0300, Jason Gunthorpe wrote:
Do we need to worry about the ring wrap around? It is already the case that the VMM has to scan the ring and extract the invalidation commands, wouldn't it already just linearize them?
I haven't had the chance to send the latest vSMMU series yet, but in it I pass the raw user CMDQ down to the host to go through, as that makes it easier to stall the consumer index when a command in the middle fails.
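Roughly the host-side walk I have in mind, with made-up names (not the actual driver code): the kernel consumes the guest CMDQ between cons and prod and simply stops advancing cons at the first command that fails, so the guest sees which entry broke:

static int vcmdq_consume_user(struct vcmdq *q)
{
	int ret = 0;

	while (!ret && q->cons != q->prod) {
		/* wrap-aware lookup of the entry at cons */
		u64 *cmd = vcmdq_entry(q, q->cons);

		ret = vcmdq_issue_cmd(q, cmd);
		if (!ret)
			q->cons = vcmdq_inc(q, q->cons);
		/* on failure, cons stays on the bad command */
	}
	return ret;	/* cons is reported back to the guest as-is */
}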
Don't some commands have to be executed by the VMM?
Well, they do. The VMM would go through the queue and "execute" the non-invalidation commands, then defer the queue to the kernel to go through once more. So the flaw could be that some of the commands behind a failing TLB flush command have already been "executed", though in a real case most of the other commands would be "executed" standalone with a CMD_SYNC, i.e. not mixed with any invalidation command.
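Something like this on the VMM side, with made-up names, just to illustrate the two passes:

static void vsmmu_handle_cmdq(VSMMUState *s)
{
	uint32_t i;

	/* Pass 1: emulate the commands the VMM must handle itself */
	for (i = s->cmdq.cons; i != s->cmdq.prod; i = q_inc(&s->cmdq, i)) {
		Cmd *cmd = q_entry(&s->cmdq, i);

		if (!cmd_is_invalidation(cmd))
			vsmmu_emulate_cmd(s, cmd);
	}

	/*
	 * Pass 2: defer the same cons..prod range to the kernel, which
	 * issues the invalidations and CMD_SYNCs and may stop early.
	 */
	vsmmu_kernel_invalidate(s, s->cmdq.cons, s->cmdq.prod);
}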
Even so, it seems straightforward enough for the kernel to report the number of commands it executed, and then the VMM can adjust the virtual consumer index.
It is not that straightforward to convert an array index back to a consumer index, because they might not be mapped 1:1: in theory there could be other commands mixed in between, although that is unlikely to happen.
So another index-mapping array would be needed for this. And it wouldn't address the flaw that I mentioned above either. So I went with the former solution to reduce the complexity.
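For illustration only, the extra bookkeeping that the linearized approach would need might look like this (made-up structures, not real code):

#define BATCH_MAX	32

struct inv_batch {
	struct inv_cmd	cmds[BATCH_MAX];	/* linearized invalidations */
	uint32_t	cons_of[BATCH_MAX];	/* array index -> consumer index */
	uint32_t	num;
};

/* Translate a failing array slot back into a virtual consumer index */
static uint32_t inv_batch_failed_cons(struct inv_batch *b, uint32_t failed)
{
	return b->cons_of[failed];
}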
Is there a use case for invalidation-only SW emulated rings, and do we care about optimizing for the wrap around case?
Hmm, why a SW emulated ring?
That is what you are building. The VMM catches the write of the producer pointer and the VMM SW bundles it up to call into the kernel.
Still not fully getting it. Do you mean a ring that is prepared by the VMM? I think the only case where we need to handle a ring is what I did by forwarding the guest CMDQ (a ring) to the host directly. I'm not sure why the VMM would need another ring for those linearized invalidation commands. Or maybe I misunderstood...
Yes to the latter question. The SMMU kernel driver has helpers like Q_WRP, so it wasn't difficult to process the user CMDQ in the same raw form. But it does complicate the common code if we want to do it there.
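Paraphrasing (not copying) what those helpers do: the low bits of prod/cons index into the queue and one extra bit tracks the wrap, so empty and full can be told apart and the walk naturally crosses the end of the queue memory:

#define Q_IDX(q, p)	((p) & ((1 << (q)->log2size) - 1))
#define Q_WRP(q, p)	((p) & (1 << (q)->log2size))

static bool vcmdq_empty(struct vcmdq *q)
{
	return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
	       Q_WRP(q, q->prod) == Q_WRP(q, q->cons);
}

static bool vcmdq_full(struct vcmdq *q)
{
	return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
	       Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
}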
Optimizing for wrap around means that when the producer/consumer pointers pass the end of the queue memory we execute one ioctl toward the kernel, not two. That is possibly a very minor optimization; it depends on how big the queues are and how frequently multi-entry items will be present.
There could be other commands being issued by other VMs or even by the host between the two ioctls. So we'd probably need to handle the wrapping case if we go with a ring solution?
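Just to spell out what the two ioctls above would look like with a flat-range interface (made-up names on the VMM side):

static void vsmmu_forward_range(VSMMUQueue *q)
{
	uint32_t cons = Q_IDX(q, q->cons);
	uint32_t prod = Q_IDX(q, q->prod);

	if (q->cons == q->prod)
		return;				/* queue is empty */

	if (cons < prod) {
		/* no wrap: one linear span, one ioctl */
		kernel_invalidate(q, cons, prod - cons);
	} else {
		/* wrapped (or full): tail of the queue, then the head */
		kernel_invalidate(q, cons, q->nents - cons);
		if (prod)
			kernel_invalidate(q, 0, prod);
	}
}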
Thanks
Nicolin