[ Added Borislav and stable people ]
On Sun, Apr 30, 2023 at 9:31 PM syzbot syzbot+401145a9a237779feb26@syzkaller.appspotmail.com wrote:
syzbot suspects this issue was fixed by commit:
Indeed.
My initial reaction was "no, that didn't fix anything, it just cleaned stuff up", but it turns out that yes, it did in fact fix a real bug in the process.
The fix was not intentional, but the cleanup actually got rid of buggy code.
So here's the automatic marker for syzbot:
#syz fix: x86: don't use REP_GOOD or ERMS for user memory clearing
and the reason for the bug - in case people care - is that the old clear_user_rep_good (which no longer exists after that commit) had the exception entry pointing to the wrong instruction.
The buggy code did:
.Lrep_good_bytes: mov %edx, %ecx rep stosb
and the exception entry weas
_ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
so the exception entry pointed at the register move instruction, not at the actual "rep stosb" that does the user space store.
End result: if you had a situation where you *should* return -EFAULT, and you triggered that "last final bytes" case, instead of the exception handling dealing with it properly and fixing it up, you got that kernel oops.
The bug goes back to commit 0db7058e8e23 ("x86/clear_user: Make it faster") from about a year ago, which made it into v6.1.
It only affects old hardware that doesn't have the ERMS capability flag, which *probably* means that it's mostly only triggerable in virtualization (since pretty much any CPU from the last decade has ERMS, afaik).
Borislav - opinions? This needs fixing for v6.1..v6.3, and the options are:
(1) just fix up the exception entry. I think this is literally this one-liner, but somebody should double-check me. I did *not* actually test this:
--- a/arch/x86/lib/clear_page_64.S +++ b/arch/x86/lib/clear_page_64.S @@ -142,8 +142,8 @@ SYM_FUNC_START(clear_user_rep_good) and $7, %edx jz .Lrep_good_exit
-.Lrep_good_bytes: mov %edx, %ecx +.Lrep_good_bytes: rep stosb
.Lrep_good_exit:
because the only use of '.Lrep_good_bytes' is that exception table entry.
(2) backport just that one commit for clear_user
In this case we should probably do commit e046fe5a36a9 ("x86: set FSRS automatically on AMD CPUs that have FSRM") too, since that commit changes the decision to use 'rep stosb' to check FSRS.
(3) backport the entire series of commits:
git log --oneline v6.3..034ff37d3407
Or we could even revert that commit 0db7058e8e23, but it seems silly to revert when we have so many ways to fix it, including a one-line code movement.
Borislav / stable people? Opinions?
Linus