+cc Linus as I reference a commit of his below...
On Wed, Oct 23, 2024 at 09:19:03AM +0200, David Hildenbrand wrote:
On 23.10.24 08:24, Dmitry Vyukov wrote:
Hi Florian, Lorenzo,
This looks great!
Thanks!
What I am VERY interested in is if poisoned pages cause SIGSEGV even when the access happens in the kernel. Namely, the syscall still returns EFAULT, but also SIGSEGV is queued on return to user-space.
Yeah we don't in any way.
I think adding something like this would be a bit of its own project.
The fault handler for this is in handle_pte_marker() in mm/memory.c, where we do the following:
	/* Hitting a guard page is always a fatal condition. */
	if (marker & PTE_MARKER_GUARD)
		return VM_FAULT_SIGSEGV;
So basically we pass this back to whoever invoked the fault. For uaccess we end up in arch-specific code that eventually checks exception tables etc. and for x86-64 that's kernelmode_fixup_or_oops().
There used to be a sig_on_uaccess_err in the x86-specific thread_struct that let you propagate it but Linus pulled it out in commit 02b670c1f88e ("x86/mm: Remove broken vsyscall emulation code from the page fault code") where it was presumably used for vsyscall.
Of course we could just get something much higher up the stack to send the signal, but we'd need to be careful we weren't breaking anything doing it...
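To make the current behaviour concrete from the userspace side, here is a minimal sketch. It assumes a plain PROT_NONE mapping as a stand-in for a guard page (the uaccess path described above should end in the same exception-table fixup), and errno_for_inaccessible_buffer() is a hypothetical helper name, not anything from the series:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Set if a SIGSEGV is ever delivered while the helper runs. */
static volatile sig_atomic_t segv_seen;

static void on_segv(int sig)
{
	(void)sig;
	segv_seen = 1;
}

/*
 * Hypothetical helper: hand the kernel a buffer that lies in an inaccessible
 * page and report the errno from write(2). Returns 0 if write() unexpectedly
 * succeeded and -1 if the setup failed.
 */
static int errno_for_inaccessible_buffer(void)
{
	long page = sysconf(_SC_PAGESIZE);
	struct sigaction sa = { .sa_handler = on_segv };
	struct sigaction old;
	int pipefd[2];
	void *inaccessible;
	int err = -1;

	assert(page > 0);
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old))
		return -1;
	if (pipe(pipefd)) {
		sigaction(SIGSEGV, &old, NULL);
		return -1;
	}

	/* PROT_NONE stands in for a guard page here (assumption). */
	inaccessible = mmap(NULL, (size_t)page, PROT_NONE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (inaccessible != MAP_FAILED) {
		/* The kernel's copy from user memory faults on this page. */
		ssize_t n = write(pipefd[1], inaccessible, 16);

		err = n < 0 ? errno : 0;
		munmap(inaccessible, (size_t)page);
	}

	close(pipefd[0]);
	close(pipefd[1]);
	sigaction(SIGSEGV, &old, NULL);
	return err;
}
```

On a current kernel I'd expect the helper to return EFAULT with segv_seen left at 0, i.e. the fault is swallowed by the syscall and no signal is queued.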
I address GUP below.
Catching bad accesses in system calls is currently the weak spot for all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). It's almost possible with userfaultfd, but catching faults in the kernel requires admin capability, so not really an option for generic bug detection tools (plus the inconvenience of userfaultfd setup/handler). Intercepting all EFAULT from syscalls is not generally possible (w/o ptrace, usually not an option as well), and EFAULT does not always mean a bug.
Triggering SIGSEGV even in syscalls would be not just a performance optimization, but a new useful capability that would allow it to catch more bugs.
Right, we discussed that offline also as a possible extension to the userfaultfd SIGBUS mode.
I did not look into that yet, but I was wondering if there could be cases where a different process could trigger that SIGSEGV, and how to (and if to) handle that.
For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I think with userfaultfd() we will currently return -EFAULT, because we call get_user_page_vma_remote() that is not prepared for dropping the mmap lock. Possibly that is the right thing to do, but not sure :)
These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be able to distinguish them and perform different handling.
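As a rough sketch of what "different handling" could look like, here is a userspace model. Everything in it is hypothetical: the enum, the guard_fault_action() helper and the bit values are made up for illustration, and only the FAULT_FLAG_USER and FAULT_FLAG_REMOTE names come from the kernel:

```c
#include <assert.h>

/* Illustrative bit values only; the real FAULT_FLAG_* values are the kernel's. */
#define FAULT_FLAG_USER   (1u << 0)	/* fault was taken in user mode */
#define FAULT_FLAG_REMOTE (1u << 1)	/* fault on another process's mm */

/* Hypothetical outcomes for a guard-page hit; none of this exists today. */
enum guard_action {
	GUARD_SIGSEGV,			/* user-mode access: signal, as today */
	GUARD_EFAULT,			/* remote access: error only, no signal */
	GUARD_EFAULT_AND_SIGNAL,	/* local in-kernel access: the proposed extension */
};

static enum guard_action guard_fault_action(unsigned int fault_flags)
{
	/* Never signal a task for an access made on its behalf by someone else. */
	if (fault_flags & FAULT_FLAG_REMOTE)
		return GUARD_EFAULT;
	if (fault_flags & FAULT_FLAG_USER)
		return GUARD_SIGSEGV;
	return GUARD_EFAULT_AND_SIGNAL;
}

/* Usage example: the three cases discussed in this thread. */
static void guard_fault_action_examples(void)
{
	assert(guard_fault_action(FAULT_FLAG_USER) == GUARD_SIGSEGV);
	assert(guard_fault_action(FAULT_FLAG_REMOTE) == GUARD_EFAULT);
	assert(guard_fault_action(0) == GUARD_EFAULT_AND_SIGNAL);
}
```

Checking the remote flag first is deliberate in this sketch, so that a ptrace-style access can never queue a signal on the target.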
So all GUP will return -EFAULT when hitting guard pages unless we change something.
In GUP we handle this in faultin_page():
	if (ret & VM_FAULT_ERROR) {
		int err = vm_fault_to_errno(ret, flags);

		if (err)
			return err;
		BUG();
	}
And vm_fault_to_errno() is:
static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
{
	if (vm_fault & VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
		return -EFAULT;
	return 0;
}
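To spell out what that means for a guard page hit, here is a standalone userspace model of the function above. The flag bit values and the model_ prefix are placeholders chosen for illustration (the real definitions live in the kernel headers); only the mapping logic is taken from the quoted code:

```c
#include <assert.h>
#include <errno.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* value on most Linux architectures; fallback only */
#endif

typedef unsigned int vm_fault_t;

/* Illustrative bit values only, not claimed to match the kernel's. */
enum {
	VM_FAULT_OOM            = 1u << 0,
	VM_FAULT_SIGBUS         = 1u << 1,
	VM_FAULT_HWPOISON       = 1u << 2,
	VM_FAULT_HWPOISON_LARGE = 1u << 3,
	VM_FAULT_SIGSEGV        = 1u << 4,
};

#define FOLL_HWPOISON (1u << 0)	/* illustrative value */

/* Same mapping as the kernel function quoted above. */
static int model_vm_fault_to_errno(vm_fault_t vm_fault, unsigned int foll_flags)
{
	if (vm_fault & VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
		return -EFAULT;
	return 0;
}

/* Usage example: a guard-page hit reaches GUP as VM_FAULT_SIGSEGV. */
static void model_guard_fault_examples(void)
{
	/* Whatever the FOLL flags, the caller only ever sees -EFAULT. */
	assert(model_vm_fault_to_errno(VM_FAULT_SIGSEGV, 0) == -EFAULT);
	assert(model_vm_fault_to_errno(VM_FAULT_SIGSEGV, FOLL_HWPOISON) == -EFAULT);
}
```

So the "SIGSEGV" in VM_FAULT_SIGSEGV is flattened into a plain error code at this point, and nothing in this path queues a signal.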
Again, I think if we wanted special handling here we'd probably need to propagate that fault from higher up. For one, we'd definitely need to not do so if it's remote, but I worry about other cases too.
-- Cheers,
David / dhildenb
Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow.
So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know.
But I definitely don't think an 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundamentally change how user access page faults are handled within the kernel :)