On Mon 11-12-23 09:28:40, Pavel Machek wrote:
So I've got back to this and the failure is a subtle interaction between iomap code and ext4 code. In particular that fact that commit 936e114a245b6 ("iomap: update ki_pos a little later in iomap_dio_complete") is not in stable causes that file position is not updated after direct IO write and thus we direct IO writes are ending in wrong locations effectively corrupting data. The subtle detail is that before this commit if ->end_io handler returns non-zero value (which the new ext4 ->end_io handler does), file pos doesn't get updated, after this commit it doesn't get updated only if the return value is < 0.
The commit got merged in 6.5-rc1 so all stable kernels that have 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO") before 6.5 are corrupting data - I've noticed at least 6.1 is still carrying the problematic commit. Greg, please take out the commit from all stable kernels before 6.5 as soon as possible, we'll figure out proper backport once user data are not being corrupted anymore. Thanks!
Thanks a lot for the update.
Turns out this is causing a regression in chromeos-6.1, and reverting the offending patch fixes the problem. I suspect anyone running v6.1.64+ may have a problem.
Jan, thanks for the report, and Guenter, thanks for letting me know as well. I'll go queue up the fix now and push out new -rc releases.
Would someone have a brief summary here? I see 6.1.66 is out but I don't see any "Fixes: 91562895f803" tags.
Plus, what is the severity of this? It is "data being corrupted when using O_SYNC|O_DIRECT" or does metadata somehow get corrupted, too?
It is pure data corruption happening for ext4 direct IO writes because they do not properly update current file position after the write.
Honza