v5: automated test for !defined(GENERIC_ENTRY) failed, fix fs/proc
    use ifdef for GENERIC_ENTRY || TIF_SYSCALL_USER_DISPATCH
    note: syscall user dispatch is not presently supported for
          non-generic entry, but could be implemented. question is
          whether the TIF_ define should be carved out now or then

v4: Whitespace
    s/CHECKPOINT_RESTART/CHECKPOINT_RESUME
    check test_syscall_work(SYSCALL_USER_DISPATCH) to determine if
    it's turned on or not in fs/proc/array and getter interface

v3: Kernel test robot static function fix
    Whitespace nitpicks

v2: Implements the getter/setter interface in ptrace rather than prctl
Syscall user dispatch makes it possible to cleanly intercept system calls from user-land. However, most transparent checkpoint software presently leverages some combination of ptrace and system call injection to place software in a ready-to-checkpoint state.
If Syscall User Dispatch is enabled at the time of being quiesced, injected system calls will subsequently be interposed upon and dispatched to the task's signal handler.
This patch set implements 3 features to enable software such as CRIU to cleanly interpose upon software leveraging syscall user dispatch.
- Implement PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH, akin to a similar feature for SECCOMP. This allows a ptracer to temporarily disable syscall user dispatch, making syscall injection possible.
- Implement an fs/proc extension that reports whether Syscall User Dispatch is being used in proc/status. A similar value is present for SECCOMP, and is used to determine whether special logic is needed during checkpoint/resume.
- Implement a getter interface for Syscall User Dispatch config info. To resume successfully, the checkpoint/resume software has to save and restore this information. Presently this configuration is write-only, with no way for C/R software to save it.
This was done in ptrace because syscall user dispatch is not part of uapi. The syscall_user_dispatch_config structure was added to the ptrace exports.
Gregory Price (3):
  ptrace,syscall_user_dispatch: Implement Syscall User Dispatch Suspension
  fs/proc/array: Add Syscall User Dispatch to proc status
  ptrace,syscall_user_dispatch: add a getter/setter for sud configuration

 .../admin-guide/syscall-user-dispatch.rst     |  5 +-
 fs/proc/array.c                               | 10 ++++
 include/linux/ptrace.h                        |  2 +
 include/linux/syscall_user_dispatch.h         | 19 +++++++
 include/uapi/linux/ptrace.h                   | 16 +++++-
 kernel/entry/syscall_user_dispatch.c          | 51 +++++++++++++++++++
 kernel/ptrace.c                               | 13 +++++
 7 files changed, 114 insertions(+), 2 deletions(-)
Add PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH to the ptrace options, and modify Syscall User Dispatch to suspend interception when it is set.
This is modeled after the SUSPEND_SECCOMP feature, which suspends SECCOMP interposition. Without it, system calls that software like CRIU injects into a process are intercepted by Syscall User Dispatch, either causing a crash (due to blocked signals) or the delivery of those signals to a ptracer (not the intended behavior).
Since Syscall User Dispatch is not a privileged feature, a check for permissions is not required. However, attempting to set this option when CONFIG_CHECKPOINT_RESTORE is not enabled should be disallowed, as its intended use is checkpoint/resume.
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/ptrace.h               | 2 ++
 include/uapi/linux/ptrace.h          | 6 +++++-
 kernel/entry/syscall_user_dispatch.c | 5 +++++
 kernel/ptrace.c                      | 4 ++++
 4 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index eaaef3ffec22..461ae5c99d57 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -45,6 +45,8 @@ extern int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,

 #define PT_EXITKILL		(PTRACE_O_EXITKILL << PT_OPT_FLAG_SHIFT)
 #define PT_SUSPEND_SECCOMP	(PTRACE_O_SUSPEND_SECCOMP << PT_OPT_FLAG_SHIFT)
+#define PT_SUSPEND_SYSCALL_USER_DISPATCH \
+	(PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH << PT_OPT_FLAG_SHIFT)

 extern long arch_ptrace(struct task_struct *child, long request,
 			unsigned long addr, unsigned long data);
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index 195ae64a8c87..ba9e3f19a22c 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -146,9 +146,13 @@ struct ptrace_rseq_configuration {
 /* eventless options */
 #define PTRACE_O_EXITKILL		(1 << 20)
 #define PTRACE_O_SUSPEND_SECCOMP	(1 << 21)
+#define PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH	(1 << 22)

 #define PTRACE_O_MASK		(\
-	0x000000ff | PTRACE_O_EXITKILL | PTRACE_O_SUSPEND_SECCOMP)
+	0x000000ff | \
+	PTRACE_O_EXITKILL | \
+	PTRACE_O_SUSPEND_SECCOMP | \
+	PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH)

 #include <asm/ptrace.h>

diff --git a/kernel/entry/syscall_user_dispatch.c b/kernel/entry/syscall_user_dispatch.c
index 0b6379adff6b..b5ec75164805 100644
--- a/kernel/entry/syscall_user_dispatch.c
+++ b/kernel/entry/syscall_user_dispatch.c
@@ -8,6 +8,7 @@
 #include <linux/uaccess.h>
 #include <linux/signal.h>
 #include <linux/elf.h>
+#include <linux/ptrace.h>

 #include <linux/sched/signal.h>
 #include <linux/sched/task_stack.h>
@@ -36,6 +37,10 @@ bool syscall_user_dispatch(struct pt_regs *regs)
 	struct syscall_user_dispatch *sd = &current->syscall_dispatch;
 	char state;

+	if (IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) &&
+			unlikely(current->ptrace & PT_SUSPEND_SYSCALL_USER_DISPATCH))
+		return false;
+
 	if (likely(instruction_pointer(regs) - sd->offset < sd->len))
 		return false;

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 54482193e1ed..a348b68d07a2 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -370,6 +370,10 @@ static int check_ptrace_options(unsigned long data)
 	if (data & ~(unsigned long)PTRACE_O_MASK)
 		return -EINVAL;

+	if (unlikely(data & PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH) &&
+	    (!IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)))
+		return -EINVAL;
+
 	if (unlikely(data & PTRACE_O_SUSPEND_SECCOMP)) {
 		if (!IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
 		    !IS_ENABLED(CONFIG_SECCOMP))
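For readers who want to experiment with the logic of patch 1/3 outside the kernel, here is a minimal user-space sketch of the two checks it adds. The function names, the `SUD_O_*` macro names, and the `cr_enabled` and `suspended` parameters are hypothetical stand-ins (for `IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)` and the `PT_SUSPEND_SYSCALL_USER_DISPATCH` bit in `current->ptrace`); only the bit values are taken from the uapi hunk. The model stops before the selector byte, which the real `syscall_user_dispatch()` consults next.

```c
#include <errno.h>
#include <stdbool.h>

/* Option bits as they appear in the uapi hunk of this patch. */
#define SUD_O_EXITKILL        (1UL << 20)
#define SUD_O_SUSPEND_SECCOMP (1UL << 21)
#define SUD_O_SUSPEND_SUD     (1UL << 22)
#define SUD_O_MASK (0x000000ffUL | SUD_O_EXITKILL | \
                    SUD_O_SUSPEND_SECCOMP | SUD_O_SUSPEND_SUD)

/* Hypothetical model of the mask test plus the new test in
 * check_ptrace_options(); the existing seccomp test is left out. */
int sud_check_options(unsigned long data, bool cr_enabled)
{
	if (data & ~SUD_O_MASK)
		return -EINVAL;
	if ((data & SUD_O_SUSPEND_SUD) && !cr_enabled)
		return -EINVAL;
	return 0;
}

/* Hypothetical model of the early exits in syscall_user_dispatch().
 * Returns true when neither early exit applies, i.e. the syscall is still
 * a candidate for interception (the selector check is not modeled). */
bool sud_passes_early_exits(unsigned long ip, unsigned long offset,
			    unsigned long len, bool suspended)
{
	if (suspended)
		return false;	/* tracer requested suspension */
	if (ip - offset < len)	/* unsigned wrap also rejects ip < offset */
		return false;	/* inside the always-allowed region */
	return true;
}
```

The single unsigned comparison `ip - offset < len` covers both bounds: when `ip` is below `offset` the subtraction wraps to a very large value and the test fails.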
Report the value of test_syscall_work(SYSCALL_USER_DISPATCH) in proc/status if GENERIC_ENTRY is enabled or the arch has implemented it.
This provides an indicator to userland checkpoint/restore software that it must manage special signal conditions (similar to SECCOMP).
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 fs/proc/array.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 49283b8103c7..d4e4ee2409c6 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -428,6 +428,15 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
 	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }

+static inline void task_syscall_user_dispatch(struct seq_file *m,
+						struct task_struct *p)
+{
+#if defined(CONFIG_GENERIC_ENTRY) || defined(TIF_SYSCALL_USER_DISPATCH)
+	seq_put_decimal_ull(m, "\nSyscall_user_dispatch:\t",
+			    test_task_syscall_work(p, SYSCALL_USER_DISPATCH));
+#endif
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
@@ -451,6 +460,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_cpus_allowed(m, task);
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
+	task_syscall_user_dispatch(m, task);
 	return 0;
 }
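To illustrate how userland might consume the field added by patch 2/3, the sketch below scans a buffer holding the contents of /proc/<pid>/status for the `Syscall_user_dispatch:` key. The function name `sud_status_field` is hypothetical, and the sketch assumes the field name and tab-separated decimal value exactly as printed by the hunk above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the value of the "Syscall_user_dispatch:" field in a status
 * buffer, or -1 if the field is absent (for example on a kernel that does
 * not carry this patch). */
int sud_status_field(const char *status)
{
	static const char key[] = "Syscall_user_dispatch:";
	const char *line = status;

	assert(status != NULL);
	while (line && *line) {
		/* Only match the key at the start of a line. */
		if (strncmp(line, key, sizeof(key) - 1) == 0)
			return (int)strtol(line + sizeof(key) - 1, NULL, 10);
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline */
	}
	return -1;
}
```

Returning -1 for a missing field lets a caller distinguish "not in use" from "kernel does not report it".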
Implement ptrace getter/setter interface for syscall user dispatch.
Presently, these settings are write-only via prctl, making it impossible to implement transparent checkpoint (coordination with the software is required).
This is modeled after a similar interface for SECCOMP, which can have its configuration dumped by ptrace for software like CRIU.
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 .../admin-guide/syscall-user-dispatch.rst     |  5 +-
 include/linux/syscall_user_dispatch.h         | 19 ++++++++
 include/uapi/linux/ptrace.h                   | 10 ++++
 kernel/entry/syscall_user_dispatch.c          | 46 +++++++++++++++++++
 kernel/ptrace.c                               |  9 ++++
 5 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
index 60314953c728..a23ae21a1d5b 100644
--- a/Documentation/admin-guide/syscall-user-dispatch.rst
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -43,7 +43,10 @@ doesn't rely on any of the syscall ABI to make the filtering.  It uses
 only the syscall dispatcher address and the userspace key.

 As the ABI of these intercepted syscalls is unknown to Linux, these
-syscalls are not instrumentable via ptrace or the syscall tracepoints.
+syscalls are not instrumentable via ptrace or the syscall tracepoints,
+however an interfaces to suspend, checkpoint, and restore syscall user
+dispatch configuration has been added to ptrace to assist userland
+checkpoint/restart software.

 Interface
 ---------
diff --git a/include/linux/syscall_user_dispatch.h b/include/linux/syscall_user_dispatch.h
index a0ae443fb7df..9e1bd0d87c1e 100644
--- a/include/linux/syscall_user_dispatch.h
+++ b/include/linux/syscall_user_dispatch.h
@@ -22,6 +22,13 @@ int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
 #define clear_syscall_work_syscall_user_dispatch(tsk) \
 	clear_task_syscall_work(tsk, SYSCALL_USER_DISPATCH)

+int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
+	void __user *data);
+
+int syscall_user_dispatch_set_config(struct task_struct *task, unsigned long size,
+	void __user *data);
+
+
 #else
 struct syscall_user_dispatch {};

@@ -35,6 +42,18 @@ static inline void clear_syscall_work_syscall_user_dispatch(struct task_struct *
 {
 }

+static inline int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
+	void __user *data)
+{
+	return -EINVAL;
+}
+
+static inline int syscall_user_dispatch_set_config(struct task_struct *task, unsigned long size,
+	void __user *data)
+{
+	return -EINVAL;
+}
+
 #endif /* CONFIG_GENERIC_ENTRY */

 #endif /* _SYSCALL_USER_DISPATCH_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index ba9e3f19a22c..8b93c78189b5 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -112,6 +112,16 @@ struct ptrace_rseq_configuration {
 	__u32 pad;
 };

+#define PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG	0x4210
+#define PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG	0x4211
+struct syscall_user_dispatch_config {
+	__u64 mode;
+	__s8 *selector;
+	__u64 offset;
+	__u64 len;
+	__u8 on_dispatch;
+};
+
 /*
  * These values are stored in task->ptrace_message
  * by ptrace_stop to describe the current syscall-stop.
diff --git a/kernel/entry/syscall_user_dispatch.c b/kernel/entry/syscall_user_dispatch.c
index b5ec75164805..a303c8de59af 100644
--- a/kernel/entry/syscall_user_dispatch.c
+++ b/kernel/entry/syscall_user_dispatch.c
@@ -111,3 +111,49 @@ int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,

 	return 0;
 }
+
+int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
+		void __user *data)
+{
+	struct syscall_user_dispatch *sd = &task->syscall_dispatch;
+	struct syscall_user_dispatch_config config;
+
+	if (size != sizeof(struct syscall_user_dispatch_config))
+		return -EINVAL;
+
+	if (test_syscall_work(SYSCALL_USER_DISPATCH))
+		config.mode = PR_SYS_DISPATCH_ON;
+	else
+		config.mode = PR_SYS_DISPATCH_OFF;
+
+	config.offset = sd->offset;
+	config.len = sd->len;
+	config.selector = sd->selector;
+	config.on_dispatch = sd->on_dispatch;
+
+	if (copy_to_user(data, &config, sizeof(config)))
+		return -EFAULT;
+
+	return 0;
+}
+
+int syscall_user_dispatch_set_config(struct task_struct *task, unsigned long size,
+		void __user *data)
+{
+	struct syscall_user_dispatch_config config;
+	int ret;
+
+	if (size != sizeof(struct syscall_user_dispatch_config))
+		return -EINVAL;
+
+	if (copy_from_user(&config, data, sizeof(config)))
+		return -EFAULT;
+
+	ret = set_syscall_user_dispatch(config.mode, config.offset, config.len,
+			config.selector);
+	if (ret)
+		return ret;
+
+	task->syscall_dispatch.on_dispatch = config.on_dispatch;
+	return 0;
+}
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index a348b68d07a2..76de46e080e2 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -32,6 +32,7 @@
 #include <linux/compat.h>
 #include <linux/sched/signal.h>
 #include <linux/minmax.h>
+#include <linux/syscall_user_dispatch.h>

 #include <asm/syscall.h>	/* for syscall_get_* */

@@ -1263,6 +1264,14 @@ int ptrace_request(struct task_struct *child, long request,
 		break;
 #endif

+	case PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG:
+		ret = syscall_user_dispatch_set_config(child, addr, datavp);
+		break;
+
+	case PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG:
+		ret = syscall_user_dispatch_get_config(child, addr, datavp);
+		break;
+
 	default:
 		break;
 	}
On 01/22, Gregory Price wrote:
> +int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
> +		void __user *data)
> +{
> +	struct syscall_user_dispatch *sd = &task->syscall_dispatch;
> +	struct syscall_user_dispatch_config config;
> +
> +	if (size != sizeof(struct syscall_user_dispatch_config))
> +		return -EINVAL;
> +
> +	if (test_syscall_work(SYSCALL_USER_DISPATCH))
> +		config.mode = PR_SYS_DISPATCH_ON;
> +	else
> +		config.mode = PR_SYS_DISPATCH_OFF;
Stupid question...
Why do we need 2/3 (which reports SYSCALL_USER_DISPATCH in proc/pid/status) then?
Oleg.
On Mon, Jan 23, 2023 at 04:41:02PM +0100, Oleg Nesterov wrote:
> On 01/22, Gregory Price wrote:
> > +int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
> > +		void __user *data)
> > +{
> > +	struct syscall_user_dispatch *sd = &task->syscall_dispatch;
> > +	struct syscall_user_dispatch_config config;
> > +
> > +	if (size != sizeof(struct syscall_user_dispatch_config))
> > +		return -EINVAL;
> > +
> > +	if (test_syscall_work(SYSCALL_USER_DISPATCH))
> > +		config.mode = PR_SYS_DISPATCH_ON;
> > +	else
> > +		config.mode = PR_SYS_DISPATCH_OFF;
>
> Stupid question...
>
> Why do we need 2/3 (which reports SYSCALL_USER_DISPATCH in proc/pid/status) then?
>
> Oleg.
Actually a good question.
My original thought was: CRIU uses proc/status to determine whether to use seccomp dumping, so i may as well implement the same thing.
On further thought, I think you're right. We can just always read and set these settings regardless of the original state because SUD is not seccomp.
1. if GENERIC_ENTRY is not compiled, and TIF_SYSCALL_USER_DISPATCH is not available, these settings get ignored anyway.
2. if disabled, offset/len/selector is guaranteed to be off
3. if you try to set something other than the above then this will fail anyway (see: set_syscall_user_dispatch)

ergo

4. It's always safe to read/write these settings. As with anything else you can certainly cause the user program to crash by setting garbage but that's to be expected.
So i think dropping 2/3 in the list is good. If you concur i'll do that.
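The argument in points 2 and 3 rests on the validation done in set_syscall_user_dispatch(). A minimal user-space sketch of those rules follows, assuming the kernel checks are roughly as modeled here; the `sud_validate` name and the `SUD_MODE_*` macros are hypothetical (the values mirror PR_SYS_DISPATCH_OFF and PR_SYS_DISPATCH_ON), and the kernel's access check on the selector pointer is not modeled.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SUD_MODE_OFF 0	/* value of PR_SYS_DISPATCH_OFF */
#define SUD_MODE_ON  1	/* value of PR_SYS_DISPATCH_ON */

/* Hypothetical model of the argument checks: returns 0 when the
 * combination would be accepted, -EINVAL otherwise. */
int sud_validate(uint64_t mode, uint64_t offset, uint64_t len,
		 const void *selector)
{
	switch (mode) {
	case SUD_MODE_OFF:
		/* Disabled means everything else must be zero. */
		if (offset || len || selector)
			return -EINVAL;
		return 0;
	case SUD_MODE_ON:
		/* A non-zero offset needs a non-empty range that does not
		 * wrap around the address space. */
		if (offset && offset + len <= offset)
			return -EINVAL;
		return 0;
	default:
		return -EINVAL;
	}
}
```

Under these rules a configuration read from a task with dispatch disabled is all zeroes, and writing it back is accepted, which is why an unconditional save and restore is harmless.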
On 01/23, Gregory Price wrote:
> So i think dropping 2/3 in the list is good. If you concur i'll do that.
Well I obviously think that 2/3 should be dropped ;)
As for 1/3 and 3/3, feel free to add my reviewed-by.
Oleg.
On Mon, Jan 23, 2023 at 08:52:29PM +0100, Oleg Nesterov wrote:
> On 01/23, Gregory Price wrote:
> > So i think dropping 2/3 in the list is good. If you concur i'll do that.
>
> Well I obviously think that 2/3 should be dropped ;)
>
> As for 1/3 and 3/3, feel free to add my reviewed-by.
>
> Oleg.
I'm actually going to walk my agreement back.
After one more review, the need for the proc/status entry is not to decide whether to dump SUD settings, but for use in deciding whether to set the SUSPEND_SYSCALL_DISPATCH option from patch 1/3.
For SECCOMP, CRIU's `compel` does the following:
1. ptrace attach / halt
2. examine proc/status for seccomp usage
3. if seccomp in use, set PTRACE_O_SUSPEND_SECCOMP
4. proceed with further operations
The same pattern would be used for syscall dispatch.
Technically I think setting the flag unconditionally would be safe, but it would lead to unclear system state (i.e. did i actually suspend something? was the process actually using it?)
To me it seems better to leave it explicit and keep the second commit.
Thoughts?
(cc: @avagin if you happen to have any input on this particular pattern)
~Gregory
I won't really argue, but...
On 01/24, Gregory Price wrote:
> On Mon, Jan 23, 2023 at 08:52:29PM +0100, Oleg Nesterov wrote:
> > On 01/23, Gregory Price wrote:
> > > So i think dropping 2/3 in the list is good. If you concur i'll do that.
> >
> > Well I obviously think that 2/3 should be dropped ;)
> >
> > As for 1/3 and 3/3, feel free to add my reviewed-by.
> >
> > Oleg.
>
> I'm actually going to walk my agreement back.
>
> After one more review, the need for the proc/status entry is not to decide whether to dump SUD settings, but for use in deciding whether to set the SUSPEND_SYSCALL_DISPATCH option from patch 1/3.
Rather than read /proc/pid/status, CRIU can just do PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG unconditionally and check syscall_user_dispatch_config.mode ?
Why do you want to expose SYSCALL_USER_DISPATCH in /proc/status? If this task is not stopped you can't trust this value anyway. If it is stopped, I don't think ptrace(PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG) is slower than reading /proc.
but perhaps I missed something?
Oleg.
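The alternative suggested above (query the configuration over ptrace and look at the mode, instead of parsing /proc) reduces to a small decision once the getter has been called. A sketch, assuming the option bit from patch 1/3 and the mode value from PR_SYS_DISPATCH_ON; `sud_pick_options`, `SUD_DISPATCH_ON` and `SUD_OPT_SUSPEND` are hypothetical names.

```c
#include <stdint.h>

#define SUD_DISPATCH_ON 1		/* value of PR_SYS_DISPATCH_ON */
#define SUD_OPT_SUSPEND (1UL << 22)	/* option bit from patch 1/3 */

/* Given the return value of the GET_CONFIG request and the reported mode,
 * return the extra ptrace option bits the tracer should set. */
unsigned long sud_pick_options(int get_ret, uint64_t mode)
{
	if (get_ret != 0)
		return 0;	/* request failed: nothing known to suspend */
	return mode == SUD_DISPATCH_ON ? SUD_OPT_SUSPEND : 0;
}
```

This keeps the tracer's state explicit: the suspend option is only requested when the stopped task actually reports dispatch as enabled.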
On Tue, Jan 24, 2023 at 05:43:47PM +0100, Oleg Nesterov wrote:
> I won't really argue, but...
>
> On 01/24, Gregory Price wrote:
> > On Mon, Jan 23, 2023 at 08:52:29PM +0100, Oleg Nesterov wrote:
> > > On 01/23, Gregory Price wrote:
> > > > So i think dropping 2/3 in the list is good. If you concur i'll do that.
> > >
> > > Well I obviously think that 2/3 should be dropped ;)
> > >
> > > As for 1/3 and 3/3, feel free to add my reviewed-by.
> > >
> > > Oleg.
> >
> > I'm actually going to walk my agreement back.
> >
> > After one more review, the need for the proc/status entry is not to decide whether to dump SUD settings, but for use in deciding whether to set the SUSPEND_SYSCALL_DISPATCH option from patch 1/3.
>
> Rather than read /proc/pid/status, CRIU can just do PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG unconditionally and check syscall_user_dispatch_config.mode ?
>
> Why do you want to expose SYSCALL_USER_DISPATCH in /proc/status? If this task is not stopped you can't trust this value anyway. If it is stopped, I don't think ptrace(PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG) is slower than reading /proc.
>
> but perhaps I missed something?
>
> Oleg.
*facepalm* good point, i'm wondering if there's a reason CRIU doesn't do the same for SECCOMP.
either way, going to drop it
On Tue, Jan 24, 2023 at 8:54 AM Gregory Price <gregory.price@memverge.com> wrote:
> On Tue, Jan 24, 2023 at 05:43:47PM +0100, Oleg Nesterov wrote:
> > I won't really argue, but...
> >
> > On 01/24, Gregory Price wrote:
> > > On Mon, Jan 23, 2023 at 08:52:29PM +0100, Oleg Nesterov wrote:
> > > > On 01/23, Gregory Price wrote:
> > > > > So i think dropping 2/3 in the list is good. If you concur i'll do that.
> > > >
> > > > Well I obviously think that 2/3 should be dropped ;)
> > > >
> > > > As for 1/3 and 3/3, feel free to add my reviewed-by.
> > > >
> > > > Oleg.
> > >
> > > I'm actually going to walk my agreement back.
> > >
> > > After one more review, the need for the proc/status entry is not to decide whether to dump SUD settings, but for use in deciding whether to set the SUSPEND_SYSCALL_DISPATCH option from patch 1/3.
> >
> > Rather than read /proc/pid/status, CRIU can just do PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG unconditionally and check syscall_user_dispatch_config.mode ?
> >
> > Why do you want to expose SYSCALL_USER_DISPATCH in /proc/status? If this task is not stopped you can't trust this value anyway. If it is stopped, I don't think ptrace(PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG) is slower than reading /proc.
> >
> > but perhaps I missed something?
> >
> > Oleg.
>
> *facepalm* good point, i'm wondering if there's a reason CRIU doesn't do the same for SECCOMP.
Because information about seccomp was in /proc/pid/status forever and we started using it before the ptrace interface was merged. I am not sure that this is the only reason, but it is definitely one of them.
> either way, going to drop it
On Tue, Jan 24, 2023 at 09:58:02AM -0800, Andrei Vagin wrote:
> > *facepalm* good point, i'm wondering if there's a reason CRIU doesn't do the same for SECCOMP.
>
> Because information about seccomp was in /proc/pid/status forever and we started using it before the ptrace interface was merged. I am not sure that this is the only reason, but it is definitely one of them.
Even better reason to drop it. I'll send out (hopefully) the final configuration here shortly.
Glad this simplified down as much as it did.
On Sun, Jan 22, 2023 at 8:22 PM Gregory Price <gourry.memverge@gmail.com> wrote:
<snip>
> +#define PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG 0x4210
> +#define PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG 0x4211
> +struct syscall_user_dispatch_config {
> +	__u64 mode;
> +	__s8 *selector;
> +	__u64 offset;
> +	__u64 len;
> +	__u8 on_dispatch;
Sorry, I didn't notice this in the previous version. on_dispatch looks like an internal property and I don't see how we can stop a process with ptrace when on_dispatch is set to a non-zero value. I am not sure that we need to expose it to user-space.
Other than that, the patch looks good to me.
Thanks, Andrei
On Mon, Jan 23, 2023 at 06:51:07PM -0800, Andrei Vagin wrote:
> On Sun, Jan 22, 2023 at 8:22 PM Gregory Price <gourry.memverge@gmail.com> wrote:
> <snip>
> > +#define PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG 0x4210
> > +#define PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG 0x4211
> > +struct syscall_user_dispatch_config {
> > +	__u64 mode;
> > +	__s8 *selector;
> > +	__u64 offset;
> > +	__u64 len;
> > +	__u8 on_dispatch;
>
> Sorry, I didn't notice this in the previous version. on_dispatch looks like an internal property and I don't see how we can stop a process with ptrace when on_dispatch is set to a non-zero value. I am not sure that we need to expose it to user-space.
>
> Other than that, the patch looks good to me.
>
> Thanks, Andrei
I tried tracing down the exit routes, but wasn't sure if there was a no-return somewhere in the stack i hadn't accounted for, so i left it in just in case.
I'll take one more look then i'll drop it before shipping out a v6.
May I add your Reviewed-by?
Thanks ~Gregory