Linux Loopback POC


Bertrand Marquis

23 Apr 2026

1:40 p.m.

Hi Everyone,

I have a first PoC showing virtio-msg working with a loopback system between Linux kernel and Qemu.

The patches, build and run instructions can be found here:

https://github.com/bertrand-marquis/virtio-msg-spec/tree/linux-poc/v0/linux-...

This is at a very early stage, but it shows a fully functional version with rng and block validated. I used ChatGPT to help fix issues and write part of the code (big parts of it for Qemu), and this is far from upstreamable, so please do not share it.

I will share a v1 with FF-A support in the coming weeks, but I still have some timing and DMA issues to solve.

Any comment on this is more than welcome !!

Cheers
Bertrand

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.


Viresh Kumar

27 Apr

10:40 a.m.

Hi Bertrand,


I tried to go through the kernel patches and it was a lot (~14k lines of code). After a while, I parsed it using chatgpt :)

There are a few basic questions I have:

- As I understand, this is a completely new and parallel implementation (with almost no similarity) to the one we already have (and sent as the first RFC). Is this interpretation correct ? I was hoping for patches on top of the work already done in this area :(

- I see a lot of complexity being added, like a networking style bridge between endpoints, etc. Why is this required ? What is it that the current design fails to address ? I am sure there must be reasons behind that, I am just not sure what doesn't work right now. Sorry about that.

- I am sorry that I wasn't able to do any deep reviews of the code for now, the code is really _big_ :)

Thanks.

-- viresh

Bertrand Marquis

11:21 a.m.

Hi Viresh,

...

On 27 Apr 2026, at 12:40, Viresh Kumar viresh.kumar@linaro.org wrote:


I tried to go through the kernel patches and it was a lot (~14k lines of code). After a while, I parsed it using chatgpt :)

There are a few basic questions I have:

As I understand, this is a completely new and parallel implementation (with almost no similarity) to the one we already have (and sent as the first RFC). Is this interpretation correct ? I was hoping for patches on top of the work already done in this area :(

I started from your patches, but after some time the changes I was making were large enough that it became easier to start from scratch, so I squashed my history to keep things simple. So yes, probably not much, if anything, remains from your patches.

...

I see a lot of complexity being added, like networking style bridge between endpoints, etc. Why is this required ? What is it that the current design fails to address ? I am sure there must be reasons behind that, I am not sure what doesn't work right now. Sorry about that.

As said during the meeting, I encountered a lot of issues related to:
- blocking in code that cannot sleep
- DMA handling issues
- timings and concurrency during probe or runtime

So the current design is trying to:
- leave the transport out of the timing and concurrency complexity
- have loopback really behave as a bus from the transport or bridge side
- have the FF-A bus work in the indirect and FIFO cases, which introduce a lot of deferred-work issues (I am trying to unify this right now to simplify a bit)

The code is huge, but my focus right now is on having it working rather than on having it simple. I still have issues to solve before I can be sure that I cover all corner cases and can start simplifying. This morning I found a new one: if the FIFO configure fails, the current code tries to unbind from a non-sleepable context, which ends in a kernel error because the FF-A driver tries to flush possible requests before accepting the unbind (I am redesigning that part in arm_ffa and the FF-A bus right now).

...

I am sorry that I wasn't able to do any deep reviews of the code for now, the code is really _big_ :)

I can definitely understand, and I do not expect you to go into any details at this stage.

If you want to review something right now, that would be the transport (ignore the bus and bridge), as it is the part where I feel I am mostly stable and complete (tm).

For the rest, I will try to reduce the code size a bit and simplify, but honestly the whole asynchronous part is very challenging (just look at what I had to do to get DMA working with loopback and the DMA helper....)

Cheers Bertrand



Viresh Kumar

11:34 a.m.

On 27-04-26, 11:21, Bertrand Marquis wrote:

...

As said during the meeting, I encountered a lot of issues related to:

I end up missing things during the meeting sometimes and hence try to get more discussion over email (so I can read again and again to understand clearly).

...

blocking in code that cannot sleep

We can have a spin-lock implementation for that ? How does the current code solve that ? Some sort of blocking needs to be done if the caller expects the response in the same thread.

I hope that can be done with a simple change over the current code.

...

dma handling issues

timings and concurrency during probe or runtime

Can we please discuss these in detail here ? I think we can make the current code work and solve all these issues easily. If not, I am okay with changing the design and adopting a new strategy. But starting with completely new code at this point doesn't look right. We have already invested so much time in the current code.

There are a lot of examples in the kernel where a similar (simple) design was adopted. One of them is Greybus (for Google's ARA modular phone, which I worked on earlier). We can adapt parts from that to solve our current problems if required.

Maybe let's take the problems one by one, with exact use cases, to see what we are lacking right now. I am still not able to see the full picture (in the sense of the problems we have).

-- viresh

Bertrand Marquis

noon

...

On 27 Apr 2026, at 13:34, Viresh Kumar viresh.kumar@linaro.org wrote:

On 27-04-26, 11:21, Bertrand Marquis wrote:

...
As said during the meeting, I encountered a lot of issues related to:

I end up missing things during the meeting sometimes and hence try to get more discussion over email (so I can read again and again to understand clearly).

...

blocking in code that cannot sleep

We can have a spin-lock implementation for that ? How does the current code solve that ? Some sort of blocking needs to be done if the caller expects the response in the same thread.

I hope that can be done with a simple change over the current code.

The thing is, you need to sleep to wait for an answer. Using a spinlock means we busy-wait in the kernel for an answer from another VM or from qemu, which is not possible. In some cases (event sending) you can solve that using deferred work, but in others (block driver during probe) I had to solve it with more complex systems:
- config register caches
- a DMA pool with a kind of retry answer and a deferred pool increase, so that the next try has enough space in the pool to continue without blocking
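To make the "retry answer with deferred pool increase" idea concrete, here is a minimal sketch. It is not taken from the PoC: `DmaPool`, `alloc_atomic` and `run_deferred` are hypothetical names, and sizes are plain integers standing in for shared DMA memory. The allocation path never blocks; when the pool is too small it records how much is missing and asks the caller to retry, and the growth happens later from a context that is allowed to sleep.

```python
from collections import deque


class DmaPool:
    """Toy model of a pre-shared buffer pool (hypothetical, for illustration)."""

    def __init__(self, size):
        self.size = size
        self.used = 0
        self.deferred = deque()  # pool increases to apply from a sleepable context

    def alloc_atomic(self, nbytes):
        """Called where sleeping is not allowed: never blocks."""
        if self.used + nbytes <= self.size:
            self.used += nbytes
            return True
        # Not enough room: queue a pool increase and tell the caller to retry.
        self.deferred.append(self.used + nbytes - self.size)
        return False

    def run_deferred(self):
        """Called from a context that may sleep (for example a worker)."""
        while self.deferred:
            self.size += self.deferred.popleft()


pool = DmaPool(size=4096)
first = pool.alloc_atomic(4096)    # fits in the initial pool
second = pool.alloc_atomic(8192)   # does not fit: caller must retry later
pool.run_deferred()                # pool is grown outside the atomic path
retry = pool.alloc_atomic(8192)    # the retry now finds enough space
```

In a real driver the growth step would also have to share the new memory with the other side, which is exactly the part that needs to sleep.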

...

...

dma handling issues

timings and concurrency during probe or runtime

Can we please discuss these in detail here ? I think we can make the current code work and solve all these issues easily. If not, I am okay with making a change in design and adapt a new strategy. But starting with completely new code at this point doesn't look right. We have already invested so much time with the current code.

Definitely OK with that, but right now, as said, I want to have something fully working to be able to:
- ensure the spec is implementable
- check whether some spec enhancements could simplify the implementation

The code is a PoC, not upstreamable, and I am relying heavily on ChatGPT to make progress, with the main consequence that the code is probably more complex than required.

...

There are a lot of examples in kernel where similar (simple) design was adapted, one of them is Greybus (For Google's ARA modular phone and I did work on that earlier). We can adapt parts from that to solve our current problems if required.

Agree.

...

Maybe lets start with the problems one by one, with exact use-case to see what we are lacking right now. I am still not able to see the full picture (in sense of the problems we have).

My goal right now is to have the following working in loopback and using FF-A between 2 VMs using qemu as vmm:
- entropy device
- block device

and be able to stress the system by creating a file from entropy output inside the disk.

I discovered that having entropy working is not that complex but disk is a lot more hacky.

Main issues I encountered so far:
- init chicken and egg:
  - device or driver coming first
  - driver needing to exchange messages or use DMA during probe
  - messages exchanged or DMA share creation during non sleepable context
- qemu/vmm memory handling
  - when do you unshare
  - how can you ensure a share is ready before first event avail
- all timeout and queuing issues
  - how to sleep waiting for an answer or waiting to be able to send
  - how to stack (events for example) or defer
  - who has to sleep and when
  - how to handle deferred work when exiting, removing a device or VM

Right now I already have several consequences I need to handle in the spec:
- we must have a pool, sharing on demand does not work for disk
- if we have a pool the sharer must say when to release, otherwise we have to reshare the pool content as the device has no idea that something is a pool
- we need some config caching in the transport, otherwise any config value request from interrupt context which has to sleep cannot be processed
- config generation, strict as it is, cannot easily be implemented without ending up in loops because the generation is changing while you refresh, or you have to refresh the whole config cache each time one value is modified

Having something working by simplifying the scope was easy, but having something working in a realistic case is far more complex. I managed to have something working fully between qemu and the kernel, which is what I shared, but with FF-A between VMs it still works reliably only in simple cases.
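The config generation point can be pictured with a small sketch. This is an illustration only, under the assumption that every config write bumps a generation counter and that a cache refresh is valid only if the counter did not move during the whole pass; `Device`, `refresh_cache` and the `on_read` hook are hypothetical names, not PoC or spec APIs.

```python
class Device:
    """Hypothetical device model: every config write bumps the generation."""

    def __init__(self, config):
        self.config = dict(config)
        self.generation = 0

    def write(self, key, value):
        self.config[key] = value
        self.generation += 1


def refresh_cache(device, keys, max_tries=3, on_read=None):
    """Re-read all keys until the generation is stable across one full pass."""
    for attempt in range(1, max_tries + 1):
        before = device.generation
        snapshot = {}
        for key in keys:
            snapshot[key] = device.config[key]
            if on_read:
                on_read(key)  # hook used below to let a writer race the refresh
        if device.generation == before:
            return snapshot, attempt
    raise TimeoutError("config kept changing during refresh")


dev = Device({"capacity": 100, "blk_size": 512})
raced = []


def one_racing_write(key):
    # Change one field in the middle of the first refresh pass only.
    if key == "capacity" and not raced:
        raced.append(key)
        dev.write("blk_size", 4096)


snapshot, tries = refresh_cache(dev, ["capacity", "blk_size"], on_read=one_racing_write)
```

A single concurrent write forces a second full pass over every cached field; a device that keeps changing its config would keep the loop spinning until the retry bound is hit, which is the looping behaviour described above.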

Bertrand

...

-- viresh

Viresh Kumar

28 Apr

6:41 a.m.

On 27-04-26, 12:00, Bertrand Marquis wrote:

...

...
On 27 Apr 2026, at 13:34, Viresh Kumar viresh.kumar@linaro.org wrote:

We can have a spin-lock implementation for that ? How does the current code solve that ? Some sort of blocking needs to be done if the caller expects the response in the same thread.

I hope that can be done with a simple change over the current code.

thing is you need to sleep to wait for an answer, using a spinlock means we block waiting for an answer from an other VM or qemu in the kernel, this is not possible.

Right.

...

In some cases (event sending) you can solve that using deferred work but in some others (block driver during probe) i had to solve it with more complex systems:
- config register caches
- dma pool and kind of retry answer with defer pool increase so that next try would have enough space in the pool to continue without blocking

I believe that can be done with current design too ?

...

Definitely ok with that but right now as said i want to have something working in full to be able to:

ensure the spec is implementable

check if there are some spec enhancements possible to simplify implementation

That's okay.

...

...
Maybe lets start with the problems one by one, with exact use-case to see what we are lacking right now. I am still not able to see the full picture (in sense of the problems we have).

My goal right now is to have the following working in loopback and using FF-A between 2 VMs using qemu as vmm:

entropy device

block device

and be able to stress the system by creating a file from entropy output inside the disk.

Nice.

...

I discovered that having entropy working is not that complex but disk is a lot more hacky.

Main issues i encountered so far:

init chicken and egg: - device or driver coming first

Not sure why that is an issue. The Linux driver model will probe only after the device is available.

...

    - driver needing to exchange messages or use dma during probe

This is quite normal and must be supported. Not sure what prevents that currently. I have tested I2C, GPIO, Vsock so far and they do basic message exchange at probe.

...

    - messages exchanged or DMA share creation during non sleepable context
- qemu/vmm memory handling
    - when do you unshare
    - how can you ensure a share is ready before first event avail
- all timeout and queuing issues
    - how to sleep waiting for an answer or waiting to be able to send
    - how to stack (events for example) or defer
    - who has to sleep and when
    - how to handle deferred work when exiting, removing a device or VM

Right, these all look valid concerns and I don't see why this can't work with the current implementation. Maybe we need to fix a few things here and there.

...

Right now i already have several consequences i need to handle in the spec

we must have a pool, sharing on demand does not work for disk

Isn't that a device specific issue ? Why is this a virtio-msg kernel or spec issue ?

...

if we have a pool the sharer must say when to release, otherwise we have to reshare the pool content as the device has no idea that something is a pool

we need some config caching in the transport,

I feel that may not be the right approach. The transport should provide the mechanism to make it work, sleep-able and non-sleep-able (busy loop). The driver can choose to do what it wants, but it may not be correct for the transport to manage that.

Though it may be better to get views on this at the time of upstreaming. Maintainers may have a say in that and may not agree with what I said :)

...

otherwise any config value request from interrupt context which has to sleep cannot be processed

But that can busy loop ?

...

config generation strict as it is cannot easily be implemented without ending up in loops because generation is changing while you refresh or you have to refresh the whole config cache each time one value is modified

Having something working by simplifying the scope was easy but having something working in a realistic case is far more complex. I managed to have something working fully between qemu and the kernel which is what i shared but with ffa between VMs it still works reliably only in simple cases.

Hopefully we won't require the complex design with bridges etc. here, and the simple one can be modified to sort this all out in the end. Let's see how it goes.

-- viresh

Bertrand Marquis

7:35 a.m.

Hi Viresh,

...

On 28 Apr 2026, at 08:41, Viresh Kumar viresh.kumar@linaro.org wrote:

On 27-04-26, 12:00, Bertrand Marquis wrote:

...
...
On 27 Apr 2026, at 13:34, Viresh Kumar viresh.kumar@linaro.org wrote:

We can have a spin-lock implementation for that ? How does the current code solve that ? Some sort of blocking needs to be done if the caller expects the response in the same thread.

I hope that can be done with a simple change over the current code.

thing is you need to sleep to wait for an answer, using a spinlock means we block waiting for an answer from an other VM or qemu in the kernel, this is not possible.

Right.

...
In some cases (event sending) you can solve that using deferred work but in some others (block driver during probe) i had to solve it with more complex systems:

config register caches

dma pool and kind of retry answer with defer pool increase so that next try

would have enough space in the pool to continue without blocking

I believe that can be done with current design too ?

Correct me if I'm wrong, but your current design just uses a statically shared area from DT; it does not use FF-A memory sharing. So it can be done, and this is what I did.

...

...
Definitely ok with that but right now as said i want to have something working in full to be able to:

ensure the spec is implementable

check if there are some spec enhancements possible to simplify implementation

That's okay.

...
...
Maybe lets start with the problems one by one, with exact use-case to see what we are lacking right now. I am still not able to see the full picture (in sense of the problems we have).

My goal right now is to have the following working in loopback and using FF-A between 2 VMs using qemu as vmm:

entropy device

block device

and be able to stress the system by creating a file from entropy output inside the disk.

Nice.

...
I discovered that having entropy working is not that complex but disk is a lot more hacky.

Main issues i encountered so far:

init chicken and egg: - device or driver coming first

Not sure why that is an issue. The Linux driver model will probe only after the device is available.

I agree that the Linux driver model does that, but in practice, when using indirect messages, you potentially have several drivers coming up at the same time and also making themselves visible to others, which adds some complexity. This is possible to do and is what I am trying to solve (it works on my current FF-A PoC), but it requires a bunch of loops, retries and waiting, and also a bit of ordering so that drivers are ready and working before they start receiving indirect messages from other VMs.

You seem to imply that this is simple. I thought so too, but the closer I get to a realistic case, the more I have to strengthen the implementation, which logically ends up in more complex code.

My initial bare metal implementation is way simpler because it does not have all the Linux complexity.

...

...
   - driver needing to exchange messages or use dma during probe
This is quite normal and must be supported. Not sure what prevents that currently. I have tested I2C, GPIO, Vsock so far and they do basic message exchange at probe.

Agree, but disk is doing DMA sharing and virtqueue configuration, and using them from a non-sleepable context, which is not that simple to handle without going against the scheduling model of Linux (things worked a lot better before I activated all the Linux RCU, lock and timing constraint debugging).

...

...
   - messages exchanged or DMA share creation during non sleepable context
- qemu/vmm memory handling
   - when do you unshare
   - how can you ensure a share is ready before first event avail
- all timeout and queuing issues
   - how to sleep waiting for an answer or waiting to be able to send
   - how to stack (events for example) or defer
   - who has to sleep and when
   - how to handle deferred work when exiting, removing a device or VM

Right, these all look valid concerns and I don't see why this can't work with the current implementation. Maybe we need to fix a few things here and there.

Agree, but "a few" ends up being a lot in the end. But yes, all of those are fixable.

...

...
Right now i already have several consequences i need to handle in the spec

we must have a pool, sharing on demand does not work for disk

Isn't that a device specific issue ? Why is this a virtio-msg kernel or spec issue ?

Because to properly use a pool you need to inform the other side that a shared area must not be relinquished once used, as it will be reused. So I need to modify the FF-A bus sharing protocol to tell the device side whether it should relinquish an area once it is no longer needed or wait for an explicit release from the driver side.

This is not major but it still needs some spec rework.
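As a rough illustration of that protocol change, the sketch below models the device side with a per-share flag. It is an assumption-laden toy, not the FF-A bus protocol: the `pooled` flag, the `Share` record and the `on_share`/`on_request_done`/`on_release` handlers are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Share:
    handle: int
    pooled: bool         # hypothetical flag: the driver will reuse this area
    mapped: bool = True


class DeviceSide:
    """Toy device-side bookkeeping for shared areas (illustration only)."""

    def __init__(self):
        self.shares = {}

    def on_share(self, handle, pooled):
        self.shares[handle] = Share(handle, pooled)

    def on_request_done(self, handle):
        """After using a buffer: relinquish only one-shot shares."""
        share = self.shares[handle]
        if not share.pooled:
            share.mapped = False

    def on_release(self, handle):
        """An explicit release from the driver side ends a pooled share."""
        self.shares[handle].mapped = False


dev = DeviceSide()
dev.on_share(1, pooled=False)   # shared on demand for a single request
dev.on_share(2, pooled=True)    # part of a pool, will be reused
dev.on_request_done(1)
dev.on_request_done(2)
```

After both requests complete, share 1 has been relinquished while share 2 stays mapped until the driver calls `on_release(2)`, which avoids re-sharing the pool content for every request.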

On top of that, the main issue I am facing in the kernel is that a memory sharing request from a non-sleepable context cannot sleep... and FF-A bus memory sharing relies on a request/response system. If you send the request without waiting for the response, the event avail can be sent before the other side can actually access the memory, because it did not process the share before the event avail, and you get errors. This requires either some ordering or a different way to share asynchronously to get some performance. I designed this in the early stage with sharing going through a table in shared memory so that mapping could be done on demand. Seeing how qemu works, this could work but would need some rework (we cannot say that shared memory bus addresses can be anywhere, otherwise the implementation becomes too complex).

So it is not a spec issue, but not modifying the spec makes the implementation very complex, and my current investigation shows that some changes could make the implementation simpler and gain a lot of performance. Finding those was the point of my PoC.

...

...

if we have a pool the sharer must say when to release, otherwise we have to reshare the pool content as the device has no idea that something is a pool

we need some config caching in the transport,

I feel that may not be the right approach. The transport should provide the mechanism to make it work, sleep-able and non-sleep-able (busy loop). The driver can choose to do what it wants, but it may not be correct for the transport to manage that.

Configuration space is a transport thing, and having the bus handle config caching would be very complex. The whole idea of the generation system from Bill was to allow this kind of thing, and practice seems to reveal that it is in fact needed. I am not quite sure how this could be solved at bus level, but I am definitely open to suggestions here.

...

Though it may be better to get views on this at the time of upstreaming. Maintainers may have a say in that and may not agree with what I said :)

Before upstreaming anything my goal is to have something working.

...

...
otherwise any config value request from interrupt context which has to sleep cannot be processed

But that can busy loop ?

You cannot busy loop waiting for an interrupt generated by another VM; that would go against our goal of making things asynchronous. You can try, and with all debug activated you will for sure end up in a kernel oops.

...

...

config generation strict as it is cannot easily be implemented without ending up in loops because generation is changing while you refresh or you have to refresh the whole config cache each time one value is modified

Having something working by simplifying the scope was easy but having something working in a realistic case is far more complex. I managed to have something working fully between qemu and the kernel which is what i shared but with ffa between VMs it still works reliably only in simple cases.

Hopefully we won't require the complex design with bridges etc here and the simple one can be modified to sort this all out in the end. Lets see how it goes.

I think we do. The goal of the bridge design is to have the qemu implementation independent of the bus implementation in the kernel. The PoC works with the same qemu for loopback or FF-A, which is nice, and having a clear layering with bridge - bus - transport was kind of the spec goal. Where to cut could be a question, and we might one day be able to move the implementation down to FIFO management and handling inside Qemu. But the layering is becoming blurry: in the FF-A case not everything can be transferred through the FIFO. The reset message, for example, cannot, because during reset the FIFO memory will be relinquished, so you would never get the answer back.

Cheers Bertrand

...

-- viresh

Edgar E. Iglesias

30 Apr

1:23 p.m.


Nice work Bertrand!

Most of the QEMU parts look good! And I think this code highlights some issues with the current implementation. Thanks.

The new RAMBlock is a good idea. It makes sense to me for these bridge-published DMA windows, since you want them to work through the normal DMA and address space paths in QEMU.

The thing I would watch carefully is lifetime tracking of host pointer use. Once a published range is exposed that way, every path that gets a direct host pointer really needs to give it back again. Otherwise DEL_REQ teardown can end up waiting on references that never drop.

That feels like the main thing worth double checking in this part of the code. Just thought I'd mention it.

On the new virtio-msg kernel interface, I was hoping we would not need to expose the rings to user-space. I do agree that the kernel probably needs an internal ring, queue or something. I would just like to understand better why we need to expose that part too. I made a few changes to Viresh's code at some point to internally use the AMP queues while still maintaining the read/write/select/poll interface towards user-space. A little similar to a UDP socket.

E.g. Driver to Device:
1. Driver puts msg-1 on internal queue.
2. Driver puts msg-2 on internal queue before device reads msg-1.
3. At some point device wakes up, reads msg-1 and msg-2 in a loop before read() would block.

Device to driver:
1. write() msg-1
2. Before driver reads msg-1, device write() msg-2.
3. write() blocks when the internal queue is full.

That was the direction I was hoping to go with Viresh code. With a suitable depth of queue.
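The queue semantics in the two lists above can be sketched with a bounded queue. This is only a user-space model in Python, not the kernel interface: `QUEUE_DEPTH`, `drain` and `try_write` are hypothetical names, and the depth of 2 is an arbitrary choice for the demonstration.

```python
import queue

# Hypothetical depth; the text only asks for "a suitable depth of queue".
QUEUE_DEPTH = 2


def drain(q):
    """Read every pending message, stopping where read() would block."""
    msgs = []
    while True:
        try:
            msgs.append(q.get_nowait())
        except queue.Empty:
            return msgs


def try_write(q, msg):
    """Return False at the point where a blocking write() would have to wait."""
    try:
        q.put_nowait(msg)
        return True
    except queue.Full:
        return False


# Driver to device: two messages are queued before the device wakes up.
driver_to_device = queue.Queue(maxsize=QUEUE_DEPTH)
driver_to_device.put_nowait("msg-1")
driver_to_device.put_nowait("msg-2")
received = drain(driver_to_device)

# Device to driver: the third write finds the internal queue full.
device_to_driver = queue.Queue(maxsize=QUEUE_DEPTH)
writes = [try_write(device_to_driver, m) for m in ("msg-1", "msg-2", "msg-3")]
```

The reader sees the messages in the order they were queued, and backpressure only appears once the queue depth is exhausted.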

Another problem I found in the AMP code when using Viresh's virtio-msg shows up when multiple contexts try to send/receive messages at the same time, e.g. interrupt context vs normal thread context. IIRC, I worked around it in the AMP code by having separate rx buffers. The problem was, for example, that the driver would send msg-1 and wait for a response to that msg. While it waits, it would process incoming msgs and look for a match to msg-1. During this wait, interrupt context sends msg-2 (in my case a notification), and it would overwrite the buffer holding msg-1, causing a hang.

I don't know what the best way to solve it is, but in the AMP code, multiple buffers solved the problem for us.
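A heavily simplified model of that failure and of the multiple-buffer workaround is sketched below. It does not reproduce the AMP code: `SingleBuffer`, `PerContextBuffers` and `exchange` are hypothetical names, and the two contexts are represented by plain string keys rather than real interrupt and thread contexts.

```python
class SingleBuffer:
    """One shared slot: a later message overwrites an earlier one."""

    def __init__(self):
        self.slot = None

    def store(self, context, msg):
        self.slot = msg

    def load(self, context):
        return self.slot


class PerContextBuffers:
    """One slot per context, in the spirit of the separate-buffers workaround."""

    def __init__(self):
        self.slots = {}

    def store(self, context, msg):
        self.slots[context] = msg

    def load(self, context):
        return self.slots.get(context)


def exchange(buffers):
    buffers.store("thread", "msg-1")  # thread context sends msg-1 and waits
    buffers.store("irq", "msg-2")     # interrupt context sends a notification
    return buffers.load("thread")     # what the waiting thread now matches against


shared_result = exchange(SingleBuffer())
separate_result = exchange(PerContextBuffers())
```

With the single slot the waiter ends up holding msg-2, so a response to msg-1 can never match and the wait never completes; with one slot per context the waiter still holds msg-1.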

It would be good to discuss this a bit more before going too far down a substantial redesign, and first see if the problems we hit there can be solved within Viresh's design.

A few nitpicks:

I've tried to call virtio-msg-bus implementations virtio-msg-bus-something. In this case, virtio-msg-bus-linux-bridge.c I guess.

include/hw/virtio/virtio-msg-prot.h has some general code to print messages and convert some parts to strings. It's missing stuff but it looks like you could reuse some of it or extend it for your needs.

A couple more things I think are worth a look (assisted by codex):

Commit 6899bd4d48 `hw/virtio: add virtio-msg linux bridge transport parent`
----------------------------------------------------------------------------

High: `virtio_msg_linux_bridge_transport_unrealize()` destroys the bridge before it unrealizes the `VirtIOMSGProxy`. The bridge unrealize path destroys the bridge DMA address space, but child virtio devices under the proxy can still hold and use that address space through the earlier latched transport caps / `vdev->dma_as`. That looks like a use-after-free risk.

Commit 466fbe293e `virtio-msg: latch transport capabilities during pre-plug`
------------------------------------------------------------------------------

Medium: this introduces a one-time cached transport capability snapshot on the proxy, but there is no invalidation path once `latched_caps_valid` becomes true. If the backend later tears down or recreates its DMA address space, `virtio_msg_get_dma_as()` may keep returning a stale pointer.

Best regards, Edgar

Bertrand Marquis

1:37 p.m.

Hi Edgar,

...

On 30 Apr 2026, at 15:23, Edgar E. Iglesias edgar.iglesias@amd.com wrote:


Nice work Bertrand!

Thanks

...

Most of the QEMU parts look good! And I think this code highlights some issues with the current implementation. Thanks.

The new RAMBlock is a good idea. It makes sense to me for these bridge-published DMA windows, since you want them to work through the normal DMA and address space paths in QEMU.

The thing I would watch carefully is lifetime tracking of host pointer use. Once a published range is exposed that way, every path that gets a direct host pointer really needs to give it back again. Otherwise DEL_REQ teardown can end up waiting on references that never drop.

That feels like the main thing worth double checking in this part of the code. Just thought I'd mention it.

Yes, I am having some trouble there and I am working on it. Finding when to release a shared area without too much performance impact is an issue right now. I am working on improving the interface so that Linux can signal to qemu to keep some shared areas until Linux asks for them back, to optimize the pool case.

...

On the new virtio-msg kernel interface, I was hoping we would not need to expose the rings to user-space. I do agree that the kernel probably needs an internal ring, queue or something. I would just like to understand better why we need to expose that part too. I made a few changes to Viresh code at some point to internally use the AMP queues but still maintaining the read/write/select/poll interface towards user-space. A little similar to a UDP socket.

E.g. Driver to Device:

Driver puts msg-1 on internal queue.

Driver puts msg-2 on internal queue before device reads msg-1.

At some point device wakes up, reads msg-1 and msg-2 in a loop before read() would block.

Device to driver:

write() msg-1

Before driver reads msg-1, device write() msg-2.

write() blocks when the internal queue is full.

That was the direction I was hoping to go with Viresh code. With a suitable depth of queue.

The idea of the ring was to reduce QEMU-to-kernel ping-pongs by just waiting on the fd and then reading messages inside the ring.

In fact I reworked the interface this week to also put the memory sharing/unsharing inside the ring, to keep ordering and avoid the dual ioctl/ring poll inside QEMU.

But using a blocking read is an idea I can definitely look at.

I will try to share an updated version with the reworked bridge interface and also the Linux kernel FF-A implementation. I was hoping to do that today but I still have some issues to work on.
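
For reference, the read()/poll() pattern described above (wake up once, then read queued messages in a loop until a read would block) can be sketched in user space. This is only an analogy using a datagram socket, in the spirit of the "similar to a UDP socket" remark; `drain_messages` and `MSG_MAX` are hypothetical names, and the actual virtio-msg character device interface is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define MSG_MAX 64   /* hypothetical fixed message size */

/*
 * Wait until fd is readable, then read every queued message without
 * blocking. Returns the number of messages stored in out[], or -1.
 */
static int drain_messages(int fd, char out[][MSG_MAX], int max)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = 0;

    assert(max > 0);
    if (poll(&pfd, 1, -1) < 0) {
        return -1;
    }
    while (n < max) {
        ssize_t len = recv(fd, out[n], MSG_MAX - 1, MSG_DONTWAIT);
        if (len < 0) {
            /* Queue empty: a blocking read would sleep at this point. */
            return (errno == EAGAIN || errno == EWOULDBLOCK) ? n : -1;
        }
        out[n][len] = '\0';
        n++;
    }
    return n;
}
```

With two messages queued before the reader wakes up, a single `poll()` wakeup is followed by two successful reads and one `EAGAIN`, so the number of wakeups does not grow with the number of messages.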

...

Another problem I found in the AMP code when using Viresh's virtio-msg was the problem if multiple contexts are trying to send/receive messages at the same time, e.g interrupt context vs normal thread context. IIRC, I worked around it in the AMP code by having separate rx buffers. The problem was for example that the driver would send a msg-1 and wait for a response to that msg. While it waits, it would process incoming msg's and look for a match to msg-1. During this wait, interrupt context sends msg-2 (in my case notification), and it would override the buffer holding msg-1 causing a hang.

My early PoC was doing something like that and I ran into trouble because I ended up with EVENT_AVAIL being processed before the VQUEUE_SET or DMA_SHARE messages, so I went back to a single queue to enforce ordering and prevent events from being processed before queue configuration or shared memory requests.

...

I don't know what the best way to solve it is, but in the AMP code, multiple buffers solved the problem for us.

Let's discuss that, because I have trouble with ordering that I needed to solve, mostly in the virtio block case, which does dma_map, queue config and queue kick from interrupt context.
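
To make the two concerns above concrete (messages overwriting each other in a shared buffer, and events overtaking configuration), here is a small sketch of a fixed-depth FIFO in which every in-flight message gets its own slot and messages are consumed strictly in producer order. It is an illustration only, not either design under discussion: the type and function names are hypothetical, the enum values merely borrow the message names mentioned in the thread, and a kernel version would need interrupt-safe locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_DEPTH 8          /* hypothetical depth, power of two */

enum msg_type { MSG_DMA_SHARE, MSG_VQUEUE_SET, MSG_EVENT_AVAIL };

struct msg {
    enum msg_type type;
    unsigned int id;
};

struct msg_ring {
    struct msg slot[RING_DEPTH];  /* one slot per in-flight message */
    unsigned int head;            /* next slot to pop */
    unsigned int tail;            /* next slot to fill */
};

/* Copy a message into its own slot; fails instead of overwriting. */
static bool ring_push(struct msg_ring *r, const struct msg *m)
{
    assert(r != NULL && m != NULL);
    if (r->tail - r->head == RING_DEPTH) {
        return false;             /* full: caller must wait or retry */
    }
    r->slot[r->tail % RING_DEPTH] = *m;
    r->tail++;
    return true;
}

/* Pop the oldest message, so the consumer sees the producer's order. */
static bool ring_pop(struct msg_ring *r, struct msg *out)
{
    assert(r != NULL && out != NULL);
    if (r->head == r->tail) {
        return false;             /* empty */
    }
    *out = r->slot[r->head % RING_DEPTH];
    r->head++;
    return true;
}
```

Because a push never reuses a slot that has not been popped, a notification queued from another context cannot clobber an earlier message, and a DMA_SHARE or VQUEUE_SET pushed before an EVENT_AVAIL is always popped before it.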

...

It would be good to discuss this a bit more before going too far down a substantial redesign, and first see if the problems we hit there can be solved within Viresh's design.

Ack

...

A few nitpicks:

I've tried to call virtio-msg-bus implementations virtio-msg-bus-something. In this case, virtio-msg-bus-linux-bridge.c I guess.

...

include/hw/virtio/virtio-msg-prot.h has some general code to print messages and convert some parts to strings. It's missing stuff but it looks like you could reuse some of it or extend it for your needs.

The debug prints added there were temporary debugging that I have since removed. I now report errors through GUEST_ERRORS and I am using tracing to follow the protocol.

...

A couple more things I think are worth a look (assisted by codex):

Commit 6899bd4d48 `hw/virtio: add virtio-msg linux bridge transport parent`

High: `virtio_msg_linux_bridge_transport_unrealize()` destroys the bridge before it unrealizes the `VirtIOMSGProxy`. The bridge unrealize path destroys the bridge DMA address space, but child virtio devices under the proxy can still hold and use that address space through the earlier latched transport caps / `vdev->dma_as`. That looks like a use-after-free risk.

Ack, will check that.
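
The ordering problem flagged above can be modelled in a few lines. This is a toy model under stated assumptions, not QEMU code: `bridge_model`, `proxy_model` and the helper functions are hypothetical stand-ins for the bridge, the `VirtIOMSGProxy` and the DMA address space they share.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the bridge, its DMA address space and the proxy. */
struct dma_space { bool alive; };
struct bridge_model { struct dma_space as; };
struct proxy_model {
    struct dma_space *dma_as;   /* borrowed from the bridge */
    bool realized;
};

static void bridge_model_destroy(struct bridge_model *b)
{
    b->as.alive = false;        /* models destroying the address space */
}

static void proxy_model_unrealize(struct proxy_model *p)
{
    p->realized = false;
    p->dma_as = NULL;           /* children can no longer reach it */
}

/* True if a still-realized proxy points at a destroyed address space. */
static bool proxy_model_has_stale_as(const struct proxy_model *p)
{
    return p->realized && p->dma_as != NULL && !p->dma_as->alive;
}

/* Tear down in reverse order of dependency: users first, provider last. */
static void transport_model_unrealize(struct bridge_model *b,
                                      struct proxy_model *p)
{
    proxy_model_unrealize(p);
    assert(!proxy_model_has_stale_as(p));
    bridge_model_destroy(b);
}
```

Calling `bridge_model_destroy()` before `proxy_model_unrealize()` leaves `proxy_model_has_stale_as()` returning true, which corresponds to the window in which a child device could still dereference the destroyed address space.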

...

Commit 466fbe293e `virtio-msg: latch transport capabilities during pre-plug`

Medium: this introduces a one-time cached transport capability snapshot on the proxy, but there is no invalidation path once `latched_caps_valid` becomes true. If the backend later tears down or recreates its DMA address space, `virtio_msg_get_dma_as()` may keep returning a stale pointer.

Nice, I will check that too.
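
One possible shape for the missing invalidation path, sketched under assumptions: the structure and function names below are hypothetical (only the `latched_caps_valid` flag name comes from the review), and the real proxy may need to notify child devices rather than just drop the cached pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical backend that owns the DMA address space. */
struct backend_model { void *dma_as; };

/* Hypothetical proxy-side snapshot of the transport capabilities. */
struct caps_cache {
    struct backend_model *be;
    void *latched_dma_as;
    bool latched_caps_valid;
};

/* Return the latched pointer, re-querying the backend if not valid. */
static void *cache_get_dma_as(struct caps_cache *c)
{
    assert(c != NULL && c->be != NULL);
    if (!c->latched_caps_valid) {
        c->latched_dma_as = c->be->dma_as;
        c->latched_caps_valid = true;
    }
    return c->latched_dma_as;
}

/* Drop the snapshot so the next lookup cannot return a stale pointer. */
static void cache_invalidate(struct caps_cache *c)
{
    c->latched_caps_valid = false;
    c->latched_dma_as = NULL;
}

/* Backend tears down or recreates its address space: invalidate first. */
static void backend_replace_dma_as(struct backend_model *be,
                                   struct caps_cache *c, void *new_as)
{
    cache_invalidate(c);
    be->dma_as = new_as;
}
```

Without the `cache_invalidate()` call in `backend_replace_dma_as()`, `cache_get_dma_as()` would keep returning the old pointer after the backend has replaced it.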

Thanks a lot for the review, it is really helpful.

Regards Bertrand

...

Best regards, Edgar

Bertrand Marquis

4:45 p.m.

Hi Everyone,

Quick follow-up on one blocking point we discussed.

Manos pointed me to the ccw transport code which does sleep: https://github.com/torvalds/linux/blob/e75a43c7cec459a07d91ed17de4de13ede2b7...

I checked and tested with a big print in the places where I thought I could not sleep and had to use a shadow config. In fact I can sleep. I checked my original code from when I started to introduce this, and the test was wrong at the time: I was checking whether I was in an interrupt context but not whether I could sleep, and my RCU problems might have been caused by other issues.

So we might not have to solve this problem at all, as get/set config are sleepable!

I will do some extra investigation and checks inside the kernel code, but this could make things way easier :-)

Thanks a lot Manos for the pointer.

Cheers Bertrand

Edgar E. Iglesias

4:51 p.m.

On Thu, Apr 30, 2026 at 04:45:05PM +0000, Bertrand Marquis wrote:

...

Hi Everyone,

Quick follow-up on one blocking point we discussed.

Manos pointed me to the ccw transport code which does sleep: https://github.com/torvalds/linux/blob/e75a43c7cec459a07d91ed17de4de13ede2b7...

I checked and tested with a big print in the places where I thought I could not sleep and had to use a shadow config. In fact I can sleep. I checked my original code from when I started to introduce this, and the test was wrong at the time: I was checking whether I was in an interrupt context but not whether I could sleep, and my RCU problems might have been caused by other issues.

So we might not have to solve this problem at all, as get/set config are sleepable!

I will do some extra investigation and checks inside the kernel code, but this could make things way easier :-)

Thanks a lot Manos for the pointer.

Fantastic!

Thanks Manos!

Cheers, Edgar

Viresh Kumar

5 May 5 May

2:55 a.m.

On 30-04-26, 16:45, Bertrand Marquis wrote:

...

Hi Everyone,

Quick follow-up on one blocking point we discussed.

Manos pointed me to the ccw transport code which does sleep: https://github.com/torvalds/linux/blob/e75a43c7cec459a07d91ed17de4de13ede2b7...

I checked and tested with a big print in the places where I thought I could not sleep and had to use a shadow config. In fact I can sleep. I checked my original code from when I started to introduce this, and the test was wrong at the time: I was checking whether I was in an interrupt context but not whether I could sleep, and my RCU problems might have been caused by other issues.

So we might not have to solve this problem at all, as get/set config are sleepable!

I will do some extra investigation and checks inside the kernel code, but this could make things way easier :-)

Very nice :)

-- viresh

virtio-msg@lists.linaro.org

participants (3)

Bertrand Marquis
Edgar E. Iglesias
Viresh Kumar