On Thu, May 18, 2023 at 11:04 PM Dave Hansen dave.hansen@intel.com wrote:
On 5/18/23 13:20, Jeff Xu wrote:>> Here's my concern about this whole thing: it's headed down a rabbit hole
which is *highly* specialized both in the apps that will use it and the attacks it will mitigate. It probably *requires* turning off a bunch of syscalls (like io_uring) that folks kinda like in general.
ChromeOS currently disabled io_uring, but it is not required to do so. io_uring supports the IORING_OP_MADVICE operation, which calls the do_madvise() function. This means that io_uring will have the same pkey checks as the madvice() system call. From that perspective, we will fully support io_uring for this feature.
io_uring fundamentally doesn't have the same checks. The kernel side work can be done from an asynchronous kernel thread. That kernel thread doesn't have a meaningful PKRU value. The register has a value, but it's not really related to the userspace threads that are sending it requests.
We're balancing that highly specialized mitigation with a feature that add new ABI, touches core memory management code and signal handling.
The ABI change uses the existing flag field in pkey_alloc() which is reserved. The implementation is backward compatible with all existing pkey usages in both kernel and user space. Or do you have other concerns about ABI in mind ?
I'm not worried about the past, I'm worried any time we add a new ABI since we need to support it forever.
Yes, you are right about the risk of touching core mm code. To minimize the risk, I try to control the scope of the change (it is about 3 lines in mprotect, more in munmap but really just 3 effective lines from syscall entry). I added new self-tests in mm to make sure it doesn't regress in api behavior. I run those tests before and after my kernel code change to make sure the behavior remains the same, I tested it on 5.15 and 6.1 and 6.4-rc1. Actually, the testing discovered a behavior change for mprotect() between 6.1 and 6.4 (not from this patch, there are refactoring works going on in mm) see this thread [1] I hope those steps will help to mitigate the risk.
Agreed on signaling handling is a tough part: what do you think about the approach (modifying PKRU from saved stack after XSAVE), is there a blocker ?
Yes, signal entry and sigreturn are not necessarily symmetric so you can't really have a stack.
On the x86 side, PKRU is a painfully special snowflake. It's exposed in the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel. This would be making it an even more special snowflake because it would
I admit I'm quite ignorant on XSAVE to understand the above statement, and how that is related. Could you explain it to me please ? And what is in your mind that might improve the situation ?
In a nutshell: XSAVE components are classified as either user or supervisor. User components can be modified from userspace and supervisor ones only from the kernel. In general, user components don't affect the kernel; the kernel doesn't care what is in ZMM11 (an XSAVE-managed register). That lets us do fun stuff like be lazy about when ZMM11 is saved/restored. Being lazy is good because it give us things like faster context switches and KVM VMEXIT handling.
PKRU is a user component, but it affects the kernel when the kernel does copy_to/from_user() and friends. That means that the kernel can't do any "fun stuff" with PKRU. As soon as userspace provides a new value, the kernel needs to start respecting it. That makes PKRU a very special snowflake.
So, even though PKRU can be managed by XSAVE, it isn't. It isn't kept in the kernel XSAVE buffer. But it *IS* in the signal stack XSAVE buffer. You *can* save/restore it with the other XSAVE components with ptrace(). The user<->kernel ABI pretends that PKRU is XSAVE managed even though it is not.
All of this is special-cased. There's a ton of code to handle this mess. It's _complicated_. I haven't even started talking about how this interacts with KVM and guests.
How could we improve it? A time machine would help to either change the architecture or have Linux ignore the fact that XSAVE knows anything about PKRU.
So, the bar is pretty high for things that want to further muck with PKRU. Add signal and sigaltstack in particular into the fray, and we've got a recipe for disaster. sigaltstack and XSAVE don't really get along very well. https://lwn.net/Articles/862541/
need new altstack ABI and handling.
I thought adding protected memory support to signaling handling is an independent project with its own weight. As Jann Horn points out in [2]: "we could prevent the attacker from corrupting the signal context if we can protect the signal stack with a pkey." However, the kernel will send SIGSEGV when the stack is protected by PKEY, so there is a benefit to make this work. (Maybe Jann can share some more thoughts on the benefits)
And I believe we could do this in a way with minimum ABI change, as below:
- allocate PKEY with a new flag (PKEY_ALTSTACK)
- at sigaltstack() call, detect the memory is PKEY_ALTSTACK protected,
(similar as what mprotect does in this patch) and save it along with stack address/size.
- at signaling handling, use the saved info to fill in PKRU.
The ABI change is similar to PKEY_ENFORCE_API, and there is no backward compatibility issue.
Will these mentioned help our case ? What do you think ?
To be honest, no.
What you've laid out here is the tip of the complexity iceberg. There are a lot of pieces of the kernel that are not yet factored in.
Let's also remember: protection keys is *NOT* a security feature. It's arguable that pkeys is a square peg trying to go into a round security hole.
While they're not a security feature, they're pretty close to providing us with exactly what we need: per-thread memory permissions that we can use for in-process isolation. We've spent quite some effort up front thinking about potential attacks and we're confident we can build something that will pose a meaningful boundary.
On the x86 side, PKRU is a painfully special snowflake. It's exposed in the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel. This would be making it an even more special snowflake because it would need new altstack ABI and handling.
Most of the complexity in the signal handling proposal seems to come from the saving/restoring pkru before/after the signal handler execution. However, this is just nice to have. We just need the kernel to allow us to register pkey-tagged memory as a sigaltstack, i.e. it shouldn't crash when trying to write the register state to the stack. Everything else, we can do in userland.
It probably *requires* turning off a bunch of syscalls (like io_uring) that folks kinda like in general.
Kind of. This approach only works in combination with an effort in userland to restrict the syscalls. Though that doesn't mean you have to turn them off, there's also the option of adding validation before it. The same applies to the memory management syscalls in this patchset. We can add validation for these in userland, but we're hoping to do it in kernel instead for the reasons I mentioned before (e.g. they're very common and it's much easier to validate in the kernel). Also subjectively it seems like a nice property if the pkey protections would not just apply to the memory contents, but also apply to the metadata.