On Tue, Sep 10, 2024 at 01:17:03PM +0200, Takashi Iwai wrote:
On Mon, 09 Sep 2024 22:02:08 +0200, Elliott Mitchell wrote:
On Sat, Sep 07, 2024 at 11:38:50AM +0100, Andrew Cooper wrote:
Individual subsystems ought not to know or care about XENPV; it's a layering violation.
If the main APIs don't behave properly, then it probably means we've got a bug at a lower level (e.g. Xen SWIOTLB is a constant source of fun) which is probably affecting other subsystems too.
This is a big problem. Debian bug #988477 (https://bugs.debian.org/988477) showed up in May 2021. While some characteristics are quite different, the time when it was first reported is similar to the above and it is also likely a DMA bug with Xen.
Yes, it seems some incompatible behavior has been seen on Xen wrt DMA buffer handling. But note that, in the case above, it was triggered by a change on the sound driver side, hence we needed a quick workaround there. The result was to move back to the old method for Xen in the end.
As already mentioned in another mail, the whole code was changed for 6.12, and the revert isn't applicable anyway.
So I'm going to submit another patch to drop this Xen PV-specific workaround for 6.12. The new code should work without the workaround (famous last words). If the problem happens there, I'd rather leave it to Xen people ;)
I've seen that patch, but haven't seen any other activity related to this sound problem. I'm wondering whether the problem got fixed by something else, whether there is activity on other lists I don't see, or whether there will be no activity until Qubes OS discovers it is broken again.
An overview of the other bug which may or may not be the same as this sound card bug:
Both reproductions of the RAID1 bug have been on systems with AMD processors. This may indicate this is distinct, but could also mean only people who get AMD processors are wary enough of flash to bother with RAID1 on flash devices. Presently I suspect it is the latter, but not very many people are bothering with RAID1 with flash.
Only systems with IOMMUv2 (full IOMMU, not merely GART) are affected.
Samsung SATA devices are severely affected.
Crucial/Micron NVMe devices are mildly affected.
Crucial/Micron SATA devices are unaffected.
Specifications for Samsung SATA and Crucial/Micron SATA devices are fairly similar. Similar IOPS, similar bandwidth capabilities.
Crucial/Micron NVMe devices have massively superior specifications to the Samsung SATA devices. Yet the Crucial/Micron NVMe devices are less severely affected than the Samsung SATA devices.
This seems likely to be a latency issue. It could be that when commands are sent to the Samsung SATA devices, the devices are fast enough to start executing them before the IOMMU is ready.
This could match the sound driver issue. Since the sound hardware is able to execute its first command with minimal latency, that is when the problem occurs. If the first command gets through, the second command is likely executed with some delay, and by then the IOMMU is reliably ready.
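
To make the hypothesis concrete, here is a toy model of the suspected race, written in Python. It is only a sketch of the reasoning above, not a description of how Xen, the IOMMU, or any driver actually behaves. The function name, the parameters, and every number are made up for illustration; nothing here was measured on real hardware.

```python
def faulting_commands(submit_times_us, device_latency_us, mapping_ready_us):
    """Return the indices of commands whose DMA would start too early.

    Toy model, all names and units hypothetical:
      submit_times_us   -- when each command is handed to the device
      device_latency_us -- delay between submission and the device's first DMA access
      mapping_ready_us  -- when the IOMMU mapping is assumed to become usable
    A command "faults" if the device touches memory before the mapping is ready.
    """
    return [
        index
        for index, submitted in enumerate(submit_times_us)
        if submitted + device_latency_us < mapping_ready_us
    ]


# Hypothetical fast device: the first command starts before the mapping is
# ready, the second command arrives late enough to be safe.
fast_device = faulting_commands([0, 100], device_latency_us=5, mapping_ready_us=20)

# Hypothetical slower device with the same submissions: no command starts early.
slow_device = faulting_commands([0, 100], device_latency_us=50, mapping_ready_us=20)
```

In this model only the first command of the low-latency device lands before the mapping is ready, which is the pattern the hypothesis predicts. Whether real hardware behaves this way is exactly what remains unknown.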