syzkaller discovered the following crash: (kernel BUG)
[ 44.607039] ------------[ cut here ]------------ [ 44.607422] kernel BUG at mm/userfaultfd.c:2067! [ 44.608148] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI [ 44.608814] CPU: 1 UID: 0 PID: 2475 Comm: reproducer Not tainted 6.16.0-rc6 #1 PREEMPT(none) [ 44.609635] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 44.610695] RIP: 0010:userfaultfd_release_all+0x3a8/0x460
<snip other registers, drop unreliable trace>
[ 44.617726] Call Trace: [ 44.617926] <TASK> [ 44.619284] userfaultfd_release+0xef/0x1b0 [ 44.620976] __fput+0x3f9/0xb60 [ 44.621240] fput_close_sync+0x110/0x210 [ 44.622222] __x64_sys_close+0x8f/0x120 [ 44.622530] do_syscall_64+0x5b/0x2f0 [ 44.622840] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 44.623244] RIP: 0033:0x7f365bb3f227
Kernel panics because it detects UFFD inconsistency during userfaultfd_release_all(). Specifically, a VMA which has a valid pointer to vma->vm_userfaultfd_ctx, but no UFFD flags in vma->vm_flags.
The inconsistency is caused in ksm_madvise(): when user calls madvise() with MADV_UNMEARGEABLE on a VMA that is registered for UFFD in MINOR mode, it accidentally clears all flags stored in the upper 32 bits of vma->vm_flags.
Assuming x86_64 kernel build, unsigned long is 64-bit and unsigned int and int are 32-bit wide. This setup causes the following mishap during the &= ~VM_MERGEABLE assignment.
VM_MERGEABLE is a 32-bit constant of type unsigned int, 0x8000'0000. After ~ is applied, it becomes 0x7fff'ffff unsigned int, which is then promoted to unsigned long before the & operation. This promotion fills upper 32 bits with leading 0s, as we're doing unsigned conversion (and even for a signed conversion, this wouldn't help as the leading bit is 0). & operation thus ends up AND-ing vm_flags with 0x0000'0000'7fff'ffff instead of intended 0xffff'ffff'7fff'ffff and hence accidentally clears the upper 32-bits of its value.
Fix it by casting `VM_MERGEABLE` constant to unsigned long to preserve the upper 32 bits, in case it's needed.
Note: other VM_* flags are not affected: This only happens to the VM_MERGEABLE flag, as the other VM_* flags are all constants of type int and after ~ operation, they end up with leading 1 and are thus converted to unsigned long with leading 1s.
Note 2: After commit 31defc3b01d9 ("userfaultfd: remove (VM_)BUG_ON()s"), this is no longer a kernel BUG, but a WARNING at the same place:
[ 45.595973] WARNING: CPU: 1 PID: 2474 at mm/userfaultfd.c:2067
but the root-cause (flag-drop) remains the same.
Fixes: 7677f7fd8be76 ("userfaultfd: add minor fault registration mode") Signed-off-by: Jakub Acs acsjakub@amazon.de Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: Xu Xin xu.xin16@zte.com.cn Cc: Chengming Zhou chengming.zhou@linux.dev Cc: Peter Xu peterx@redhat.com Cc: Axel Rasmussen axelrasmussen@google.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org ---
I looked around the kernel and found one more flag that might be causing similar issues: "IORESOURCE_BUSY" - as its inverted version is bit-anded to unsigned long fields. However, it seems those fields don't actually use any bits from upper 32-bits as flags (yet?).
I also considered changing the constant definition by adding ULL, but am not sure where else that could blow up, plus it would likely call to define all the related constants as ULL for consistency. If you'd prefer that fix, let me know.
mm/ksm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/ksm.c b/mm/ksm.c index 160787bb121c..c24137a1eeb7 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2871,7 +2871,7 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start, return err; }
- *vm_flags &= ~VM_MERGEABLE; + *vm_flags &= ~((unsigned long) VM_MERGEABLE); break; }