Hello Dave,
Thanks for your email.
On Thu, May 18, 2023 at 8:38 AM Dave Hansen dave.hansen@intel.com wrote:
On 5/17/23 16:48, Jeff Xu wrote:
However, there are a few challenges I have not yet worked through. First, the code needs to track when the first signaling entry occurs (saving the PKRU register to the thread struct) and when it is last returned (restoring the PKRU register from the thread struct).
Would tracking signal "depth" work in the face of things like siglongjmp?
Thank you for your question! I am eager to learn more about this area and I worry about blind spots. I will investigate and get back to you.
Taking a step back...
Here's my concern about this whole thing: it's headed down a rabbit hole which is *highly* specialized both in the apps that will use it and the attacks it will mitigate. It probably *requires* turning off a bunch of syscalls (like io_uring) that folks kinda like in general.
ChromeOS currently disabled io_uring, but it is not required to do so. io_uring supports the IORING_OP_MADVICE operation, which calls the do_madvise() function. This means that io_uring will have the same pkey checks as the madvice() system call. From that perspective, we will fully support io_uring for this feature.
We're balancing that highly specialized mitigation with a feature that add new ABI, touches core memory management code and signal handling.
The ABI change uses the existing flag field in pkey_alloc() which is reserved. The implementation is backward compatible with all existing pkey usages in both kernel and user space. Or do you have other concerns about ABI in mind ?
Yes, you are right about the risk of touching core mm code. To minimize the risk, I try to control the scope of the change (it is about 3 lines in mprotect, more in munmap but really just 3 effective lines from syscall entry). I added new self-tests in mm to make sure it doesn't regress in api behavior. I run those tests before and after my kernel code change to make sure the behavior remains the same, I tested it on 5.15 and 6.1 and 6.4-rc1. Actually, the testing discovered a behavior change for mprotect() between 6.1 and 6.4 (not from this patch, there are refactoring works going on in mm) see this thread [1] I hope those steps will help to mitigate the risk.
Agreed on signaling handling is a tough part: what do you think about the approach (modifying PKRU from saved stack after XSAVE), is there a blocker ?
On the x86 side, PKRU is a painfully special snowflake. It's exposed in the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel. This would be making it an even more special snowflake because it would
I admit I'm quite ignorant on XSAVE to understand the above statement, and how that is related. Could you explain it to me please ? And what is in your mind that might improve the situation ?
need new altstack ABI and handling.
I thought adding protected memory support to signaling handling is an independent project with its own weight. As Jann Horn points out in [2]: "we could prevent the attacker from corrupting the signal context if we can protect the signal stack with a pkey." However, the kernel will send SIGSEGV when the stack is protected by PKEY, so there is a benefit to make this work. (Maybe Jann can share some more thoughts on the benefits)
And I believe we could do this in a way with minimum ABI change, as below: - allocate PKEY with a new flag (PKEY_ALTSTACK) - at sigaltstack() call, detect the memory is PKEY_ALTSTACK protected, (similar as what mprotect does in this patch) and save it along with stack address/size. - at signaling handling, use the saved info to fill in PKRU. The ABI change is similar to PKEY_ENFORCE_API, and there is no backward compatibility issue.
Will these mentioned help our case ? What do you think ?
(Stephan has more info on gains, as far as I know, V8 engineers have worked/thought really hard to come to a suitable solution to make chrome browser safer)
[1] https://lore.kernel.org/linux-mm/20230516165754.pocx4kaagn3yyw3r@revolver/T/ [2] https://docs.google.com/document/d/1OlnJbR5TMoaOAJsf4hHOc-FdTmYK2aDUI7d2hfCZ...
Thanks! Best regards, -Jeff