On Sat, 07 Sep 2024 12:38:50 +0200, Andrew Cooper wrote:
On 07/09/2024 8:46 am, Takashi Iwai wrote:
On Fri, 06 Sep 2024 20:42:09 +0200, Ariadne Conill wrote:
This patch attempted to work around a DMA issue involving Xen, but causes subtle kernel memory corruption.
When I brought up this patch in the XenDevel matrix channel, I was told that it had been requested by the Qubes OS developers because they were trying to fix an issue where the sound stack would fail after a few hours of uptime. They wound up disabling SG buffering entirely instead as a workaround.
Accordingly, I propose that we should revert this workaround patch, since it causes kernel memory corruption and that the ALSA and Xen communities should collaborate on fixing the underlying problem in such a way that SG buffering works correctly under Xen.
This reverts commit 53466ebdec614f915c691809b0861acecb941e30.
Signed-off-by: Ariadne Conill ariadne@ariadne.space Cc: stable@vger.kernel.org Cc: xen-devel@lists.xenproject.org Cc: alsa-devel@alsa-project.org Cc: Takashi Iwai tiwai@suse.de
The relevant code has been largely rewritten for 6.12, so please check the behavior with sound.git tree for-next branch. I guess the same issue should happen as the Xen workaround was kept and applied there, too, but it has to be checked at first.
If the issue is persistent with there, the fix for 6.12 code would be rather much simpler like the blow.
thanks,
Takashi
--- a/sound/core/memalloc.c +++ b/sound/core/memalloc.c @@ -793,9 +793,6 @@ static void *snd_dma_sg_alloc(struct snd_dma_buffer *dmab, size_t size) int type = dmab->dev.type; void *p;
- if (cpu_feature_enabled(X86_FEATURE_XENPV))
return snd_dma_sg_fallback_alloc(dmab, size);
- /* try the standard DMA API allocation at first */ if (type == SNDRV_DMA_TYPE_DEV_WC_SG) dmab->dev.type = SNDRV_DMA_TYPE_DEV_WC;
Individual subsystems ought not to know or care about XENPV; it's a layering violation.
If the main APIs don't behave properly, then it probably means we've got a bug at a lower level (e.g. Xen SWIOTLB is a constant source of fun) which is probably affecting other subsystems too.
I think we need to re-analyse the original bug. Right now, the behaviour resulting from 53466ebde is worse than what it was trying to fix.
I agree with its hackishness and ugliness there; it was to fix the regression at that time. And, it definitely needs a cleaner solution.
But the fact is that this change was applied quite some time ago (over a year). It implies that it's no urgent regression that has to be addressed in the next few days for 6.11-final.
Meanwhile we have largish code changes for 6.12, hence the revert can't be applied. So I'm asking testing the latest code for 6.12, and work on there.
thanks,
Takashi