2025-04-23T17:00:29-07:00, Deepak Gupta debug@rivosinc.com:
On Thu, Apr 10, 2025 at 01:04:39PM +0200, Radim Krčmář wrote:
2025-03-14T14:39:24-07:00, Deepak Gupta debug@rivosinc.com:
diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h @@ -62,6 +62,9 @@ struct thread_info { long user_sp; /* User stack pointer */ int cpu; unsigned long syscall_work; /* SYSCALL_WORK_ flags */ +#ifdef CONFIG_RISCV_USER_CFI
- struct cfi_status user_cfi_state;
+#endif
I don't think it makes sense to put all the data in thread_info. kernel_ssp and user_ssp is more than enough and the rest can comfortably live elsewhere in task_struct.
thread_info is supposed to be as small as possible -- just spanning multiple cache-lines could be noticeable.
I can change it to only include only `user_ssp`, base and size.
No need for base and size either -- we don't touch that in the common exception code.
But before we go there, see below:
$ pahole -C thread_info kbuild/vmlinux struct thread_info { long unsigned int flags; /* 0 8 */ int preempt_count; /* 8 4 */
/* XXX 4 bytes hole, try to pack */ long int kernel_sp; /* 16 8 */ long int user_sp; /* 24 8 */ int cpu; /* 32 4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int syscall_work; /* 40 8 */ struct cfi_status user_cfi_state; /* 48 32 */ /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */ long unsigned int a0; /* 80 8 */ long unsigned int a1; /* 88 8 */ long unsigned int a2; /* 96 8 */ /* size: 104, cachelines: 2, members: 10 */ /* sum members: 96, holes: 2, sum holes: 8 */ /* last cacheline: 40 bytes */
};
If we were to remove entire `cfi_status`, it would still be 72 bytes (88 bytes if shadow call stack were enabled) and already spans across two cachelines.
It has only 64 bytes of data without shadow call stack, but it wasted 8 bytes on the holes. a2 is somewhat an outlier that is not used most exception paths and excluding it makes everything fit nicely even now.
if shadow call stack were enabled) and already spans across two cachelines. I did see the comment above that it should fit inside a cacheline. Although I assumed its stale comment given that it already spans across cacheline and I didn't see any special mention in commit messages of changes which grew this structure above one cacheline. So I assumed this was a stale comment.
On the other hand, whenever enable/lock bits are checked, there is a high likelyhood that user_ssp and other fields are going to be accessed and thus it actually might be helpful to have it all in one cacheline during runtime.
Yes, although accessing enable/lock bits will be relatively rare. It seems better to have the overhead during thread setup, rather than on every trap.
So I am not sure if its helpful sticking to the comment which already is stale.
We could fix the holes and also use sp instead of a0 in the new_vmalloc_check, so everything would fit better.
We are really close to fitting into a single cache-line, so I'd prefer if shadow stack only filled thread_info with data that is used very often in the exception handling code.
I think we could do without user_sp in thread_info as well, so there are other packing options.
Btw. could ssp be added to pt_regs?
Thanks.