Does this fix it?
I think moving the explicit 'struct fpu' out of task_struct took away the compiler's knowledge of how to keep the XSAVE buffer aligned (XSAVE needs a 64-byte-aligned buffer). Once that happened, we ended up with unaligned XSAVE operations and bad things happened.
Also, open-coding "task + sizeof(*task)" in three different places seems suboptimal.
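
To illustrate both points, here is a minimal userspace sketch. It is not the kernel code: the struct layouts, the XSAVE_ALIGN name and task_fpu() are all made up for the example, and it assumes the task allocation itself starts on a 64-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XSAVE_ALIGN 64	/* XSAVE needs a 64-byte-aligned save area */

/* Hypothetical stand-in for the FPU state, not the real 'struct fpu'. */
struct fpu_sketch {
	unsigned char xsave_area[128];
} __attribute__((aligned(XSAVE_ALIGN)));

/* Embedded: the compiler sees the member and pads in front of it. */
struct task_embedded {
	long state;
	struct fpu_sketch fpu;
};

/* Split: the FPU state lives past the end, invisible to the compiler. */
struct task_split {
	long state;
	char comm[20];
};

/* The compiler guarantees this only for the embedded layout. */
static_assert(offsetof(struct task_embedded, fpu) % XSAVE_ALIGN == 0,
	      "embedded fpu member is aligned by the compiler");

/*
 * One helper instead of open-coding "task + sizeof(*task)" at each
 * call site. It rounds the end of the task up to the next 64-byte
 * boundary, which the naive pointer math does not do.
 */
static inline struct fpu_sketch *task_fpu(struct task_split *task)
{
	uintptr_t end = (uintptr_t)task + sizeof(*task);

	end = (end + XSAVE_ALIGN - 1) & ~(uintptr_t)(XSAVE_ALIGN - 1);
	return (struct fpu_sketch *)end;
}
```

With the split layout, sizeof(struct task_split) is not a multiple of 64, so the open-coded version lands on an unaligned address even when the task itself is aligned. Keeping the rounding in a single helper means there is only one place to get it right.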