On Tue, 2024-02-20 at 18:54 -0500, dalias@libc.org wrote:
On Tue, Feb 20, 2024 at 11:30:22PM +0000, Edgecombe, Rick P wrote:
On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote:
On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote:
Hmm, could the shadow stack underflow onto the real stack then? Not sure how bad that is. INCSSP (incrementing the SSP register on x86) loops are not rare so it seems like something that could happen.
Shadow stack underflow should fault on attempt to access non-shadow-stack memory as shadow-stack, no?
Maybe I'm misunderstanding. I thought the proposal included allowing shadow stack access to convert normal address ranges to shadow stack, and normal writes to convert shadow stack to normal.
As I understood the original discussion of the proposal on IRC, it was only one-way (from shadow to normal). Unless I'm missing something, making it one-way is necessary to catch situations where the shadow stack would become compromised.
The original post here: https://lore.kernel.org/lkml/22a53b78-10d7-4a5a-a01e-b2f3a8c22e94@app.fastma...
...has: "Shadow stack faults on non-shadow stack pages, if flexible shadow stack handling is in effect, cause the affected page to become a shadow stack page. When this happens, the page filled with invalid address tokens."
...and: "Faults from non-shadow-stack accesses to a shadow-stack page which was created by the previous paragraph will cause the page to revert to non- shadow-stack usage, with or without clearing."
I see Stefan has clarified in another response. So I'll go try to figure it out.
Shadow stacks currently have automatic guard gaps to try to prevent one thread from overflowing onto another thread's shadow stack. This would somewhat open that up, since for normal stacks the guard gaps are usually maintained by userspace for new threads. Whether the gaps could still be enforced with checks at additional spots would have to be thought through.
I would think the existing guard pages would already do that if a thread's shadow stack is contiguous with its own data stack.
The difference is that today the kernel provides the guard gaps, whereas this would rely on userspace to do it. It's not a showstopper either way.
I think my biggest question on this is how does it change the capability for two threads to share a shadow stack. It might require some special rules around the syscall that writes restore tokens. So I'm not sure. It probably needs a POC.
Why would they be sharing a shadow stack?
The guard gap was introduced originally based on a suggestion that overflowing a shadow stack onto an adjacent shadow stack could cause corruption an attacker could use to work around the protection. There was no concrete demonstrated attack, nor any suggestion that the protection was moot without the gap.
But when we talk about the capability to convert memory to shadow stack with simple memory accesses, plus syscalls that can write restore tokens to shadow stacks, it's not immediately clear to me that it wouldn't open up something like that. Like if two restore tokens were written to a shadow stack, or two shadow stacks were adjacent with normal memory between them that later got converted to shadow stack. Those sorts of scenarios, though I won't lean on those specific examples. Sorry for being hand-wavy; it's just where I'm at, at this point.
From the musl side, I have always looked at the entirety of the shadow stack stuff with very heavy skepticism, and anything that breaks existing interface contracts, introduces places where apps can get auto-killed because a late resource allocation fails, or requires applications to code around the existence of something that should be an implementation detail, is a non-starter. To even consider shadow stack support, it must truly be fully non-breaking.
The manual assembly stack switching and JIT code in the apps needs to be updated. I don't think there is a way around it.
Indeed, I'm not talking about programs with JIT/manual stack- switching asm, just anything using existing APIs for control of stack -- pthread_setstack, makecontext, sigaltstack, etc.
Then I think WRSS might fit your requirements better than what glibc did. It was considered a reduced-security mode that made libc's job much easier and had better compatibility, but the last discussion was to try to do it without WRSS.
I agree though that the late allocation failures are not great. Mark is working on clone3 support, which should allow moving the shadow stack allocation into userspace alongside the normal stack. Even for RISC-V, though, doesn't stack switching need to update a new register?
If clone is called with signals masked, it's probably not necessary for the kernel to set the shadow stack register as part of clone3.
So you would want a mode of clone3 that basically leaves the shadow stack bits alone? Mark was driving that effort, but it doesn't seem horrible to me on first impression, if it would open up the possibility of musl support.
BTW, x86 shadow stack has a mode where the shadow stack is writable with a special instruction (WRSS). It enables the SSP to be set arbitrarily by writing restore tokens. We discussed this as an option to make the existing longjmp() and signal stuff work more transparently for glibc.
BTW, when I talk about "not supporting" I don't mean the app should crash. I mean it should instead run normally, just without shadow stack enabled. Not sure if that was clear. Since shadow stack is not essential for an application to function, it is only security hardening on top.
Although determining if an application supports shadow stack has turned out to be difficult in practice. Handling dlopen() is especially hard.
One reasonable thing to do, that might be preferable to overengineered solutions, is to disable shadow-stack process-wide if an interface incompatible with it is used (sigaltstack, pthread_create with an attribute set up using pthread_attr_setstack, makecontext, etc.), as well as if an incompatible library is dlopened.
I think it would be an interesting approach to determining compatibility. On x86 there have been cases of binaries getting mismarked as supporting shadow stack, so an automated way of filtering some of those out would be very useful, I think. I guess the dynamic linker could determine this based on some list of functions?
The dlopen() bit gets complicated though. You need to disable shadow stack for all threads, which presumably the kernel could be coaxed into doing. But those threads might be using shadow stack instructions (INCSSP, RSTORSSP, etc). These are a collection of instructions that allow limited control of the SSP. When shadow stack gets disabled, these suddenly turn into #UD generating instructions. So any other threads executing those instructions when shadow stack got disabled would be in for a nasty surprise.
Glibc's permissive mode (which disables shadow stack when dlopen()ing a DSO that doesn't support shadow stack) is quite limited because of this. There was a POC for working around it, but I'll stop there for now to avoid spamming you with the details. I'm not sure of the arm and risc-v details in this specific corner; the above is for x86.
This is much more acceptable than continuing to run with shadow stacks managed sloppily by the kernel and async killing the process on OOM, and is probably *more compatible* with apps than changing the minimum stack size requirements out from under them.
Yep.
The place where it's really needed to be able to allocate the shadow stack synchronously under userspace control, in order to harden normal applications that aren't doing funny things, is in pthread_create without a caller-provided stack.
Yeah, most apps don't do anything too tricky. Mostly shadow stack "just works". But that's no excuse to just crash for the others.