From: Peter Zijlstra <peterz(a)infradead.org>
commit 9e298e8604088a600d8100a111a532a9d342af09 upstream.
Nicolai Stange discovered[1] that if live kernel patching is enabled, and
the function tracer started tracing the same function that was patched,
the conversion of the fentry call site from calling the live kernel patch
trampoline to calling the iterator trampoline would have a slight window
where it didn't call anything.
As live kernel patching depends on ftrace always calling its code (to
redirect the traced function and prevent the original from being called),
this small window would allow the old buggy function to be called, and
this can cause undesirable results.
Nicolai submitted new patches[2], but these were controversial, as this is
similar to the static call emulation issues that came up a while ago[3].
After some debate[4][5], adding a gap in the stack when entering the
breakpoint handler allows the return address to be pushed onto the stack
to easily emulate a call.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.154171145…
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=…
[5] http://lkml.kernel.org/r/CAHk-=wjvQxY4DvPrJ6haPgAa6b906h=MwZXO6G8OtiTGe=N7_…
[
Live kernel patching is not implemented on x86_32, thus the emulate
calls are only for x86_64.
]
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Changed to only implement emulated calls for x86_64 ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kernel/ftrace.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
#include <asm/kprobes.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -228,6 +229,7 @@ int ftrace_modify_call(struct dyn_ftrace
}
static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new)
{
@@ -256,6 +258,8 @@ int ftrace_update_ftrace_func(ftrace_fun
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -291,13 +295,28 @@ int ftrace_int3_handler(struct pt_regs *
if (WARN_ON_ONCE(!regs))
return 0;
- ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
+ ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1;
+#ifdef CONFIG_X86_64
+ if (ftrace_location(ip)) {
+ int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ if (!ftrace_update_func_call) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ }
+#else
+ if (ftrace_location(ip) || is_ftrace_caller(ip)) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+#endif
- return 1;
+ return 0;
}
static int ftrace_write(unsigned long ip, const char *val, int size)
@@ -868,6 +887,8 @@ void arch_ftrace_update_trampoline(struc
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -964,6 +985,7 @@ static int ftrace_mod_jmp(unsigned long
{
unsigned char *new;
+ ftrace_update_func_call = 0UL;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
From: Peter Zijlstra <peterz(a)infradead.org>
commit 4b33dadf37666c0860b88f9e52a16d07bf6d0b03 upstream.
In order to allow breakpoints to emulate call instructions, they need to push
the return address onto the stack. The x86_64 int3 handler adds a small gap
to allow the stack to grow some. Use this gap to add the return address to
be able to emulate a call instruction at the breakpoint location.
These helper functions are added:
int3_emulate_jmp(): changes the location of the regs->ip to return there.
(The next two are only for x86_64)
int3_emulate_push(): to push the address onto the gap in the stack
int3_emulate_call(): push the return address and change regs->ip
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Modified to only work for x86_64 and added comment to int3_emulate_push() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/text-patching.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,32 @@ extern int poke_int3_handler(struct pt_r
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
extern int after_bootmem;
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+ regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+#ifdef CONFIG_X86_64
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+ /*
+ * The int3 handler in entry_64.S adds a gap between the
+ * stack where the break point happened, and the saving of
+ * pt_regs. We can extend the original stack because of
+ * this gap. See the idtentry macro's create_gap option.
+ */
+ regs->sp -= sizeof(unsigned long);
+ *(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ int3_emulate_jmp(regs, func);
+}
+#endif
+
#endif /* _ASM_X86_TEXT_PATCHING_H */
From: Peter Zijlstra <peterz(a)infradead.org>
commit 4b33dadf37666c0860b88f9e52a16d07bf6d0b03 upstream.
In order to allow breakpoints to emulate call instructions, they need to push
the return address onto the stack. The x86_64 int3 handler adds a small gap
to allow the stack to grow some. Use this gap to add the return address to
be able to emulate a call instruction at the breakpoint location.
These helper functions are added:
int3_emulate_jmp(): changes the location of the regs->ip to return there.
(The next two are only for x86_64)
int3_emulate_push(): to push the address onto the gap in the stack
int3_emulate_call(): push the return address and change regs->ip
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Modified to only work for x86_64 and added comment to int3_emulate_push() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/text-patching.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -38,4 +38,32 @@ extern void *text_poke(void *addr, const
extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+ regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+#ifdef CONFIG_X86_64
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+ /*
+ * The int3 handler in entry_64.S adds a gap between the
+ * stack where the break point happened, and the saving of
+ * pt_regs. We can extend the original stack because of
+ * this gap. See the idtentry macro's create_gap option.
+ */
+ regs->sp -= sizeof(unsigned long);
+ *(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ int3_emulate_jmp(regs, func);
+}
+#endif
+
#endif /* _ASM_X86_TEXT_PATCHING_H */
These patches try to test the fix made in commit e2f7fc0ac695 ("bpf:
fix undefined behavior in narrow load handling"). The problem existed
in the generated BPF bytecode that was doing a 32bit narrow read of a
64bit field, so to test it the code would need to be executed.
Currently the only such field exists in BPF_PROG_TYPE_PERF_EVENT,
which is not supported by bpf_prog_test_run().
The first commit adds the test, but since I can't run it, I'm not sure
if it is even valid (because of endianness, for example). The second
commit adds a message to test_verifier for when it couldn't run the
program because of a lack of permissions or the program type not being
supported. The third commit fixes a possible problem with errno.
With these patches, I get the following output on my test:
/kernel/tools/testing/selftests/bpf$ sudo ./test_verifier 920 920
#920/p 32bit loads of a 64bit field (both least and most significant words) Did not run the program (not supported) OK
Summary: 1 PASSED, 0 SKIPPED, 0 FAILED
So it looks like I need to pick a different approach.
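For reference, the access pattern the new test exercises can be pictured
in C roughly as below. This is only an illustrative sketch of the idea
(the real test in verifier/var_off.c is written as raw verifier
instructions, and the SEC() definition, section name and field choice
here are assumptions made for the example): a BPF_PROG_TYPE_PERF_EVENT
program doing 32-bit narrow reads of the 64-bit sample_period context
field.

#include <linux/types.h>
#include <linux/bpf_perf_event.h>

/* Minimal stand-in for the usual selftest helper macro. */
#define SEC(name) __attribute__((section(name), used))

/*
 * Illustration only: read the least and most significant 32-bit words
 * (on a little-endian machine) of a 64-bit context field, the pattern
 * whose narrow-load handling commit e2f7fc0ac695 fixed.
 */
SEC("perf_event")
int narrow_reads(struct bpf_perf_event_data *ctx)
{
	__u32 lo = *(__u32 *)&ctx->sample_period;
	__u32 hi = *((__u32 *)&ctx->sample_period + 1);

	return lo + hi;
}

char _license[] SEC("license") = "GPL";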
Krzesimir Nowak (3):
selftests/bpf: Test correctness of narrow 32bit read on 64bit field
selftests/bpf: Print a message when tester could not run a program
selftests/bpf: Avoid a clobbering of errno
tools/testing/selftests/bpf/test_verifier.c | 19 +++++++++++++++----
.../testing/selftests/bpf/verifier/var_off.c | 15 +++++++++++++++
2 files changed, 30 insertions(+), 4 deletions(-)
--
2.20.1
This patch series enables the KVM selftests for s390x. As a first
test, the sync_regs from x86 has been adapted to s390x.
Please note that the ucall() interface is not used yet - since
s390x neither has PIO nor MMIO, this needs some more work first
before it becomes usable (we likely should use a DIAG hypercall
here, which is what the sync_reg test is currently using, too...).
Thomas Huth (4):
KVM: selftests: Guard struct kvm_vcpu_events with
__KVM_HAVE_VCPU_EVENTS
KVM: selftests: Align memory region addresses to 1M on s390x
KVM: selftests: Add processor code for s390x
KVM: selftests: Add the sync_regs test for s390x
MAINTAINERS | 2 +
tools/testing/selftests/kvm/Makefile | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 2 +
.../selftests/kvm/include/s390x/processor.h | 22 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 24 +-
.../selftests/kvm/lib/s390x/processor.c | 277 ++++++++++++++++++
.../selftests/kvm/s390x/sync_regs_test.c | 151 ++++++++++
7 files changed, 476 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/s390x/processor.h
create mode 100644 tools/testing/selftests/kvm/lib/s390x/processor.c
create mode 100644 tools/testing/selftests/kvm/s390x/sync_regs_test.c
--
2.21.0
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9e298e8604088a600d8100a111a532a9d342af09 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 1 May 2019 15:11:17 +0200
Subject: [PATCH] ftrace/x86_64: Emulate call function while updating in
breakpoint handler
Nicolai Stange discovered[1] that if live kernel patching is enabled, and
the function tracer started tracing the same function that was patched,
the conversion of the fentry call site from calling the live kernel patch
trampoline to calling the iterator trampoline would have a slight window
where it didn't call anything.
As live kernel patching depends on ftrace always calling its code (to
redirect the traced function and prevent the original from being called),
this small window would allow the old buggy function to be called, and
this can cause undesirable results.
Nicolai submitted new patches[2], but these were controversial, as this is
similar to the static call emulation issues that came up a while ago[3].
After some debate[4][5], adding a gap in the stack when entering the
breakpoint handler allows the return address to be pushed onto the stack
to easily emulate a call.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.154171145…
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=…
[5] http://lkml.kernel.org/r/CAHk-=wjvQxY4DvPrJ6haPgAa6b906h=MwZXO6G8OtiTGe=N7_…
[
Live kernel patching is not implemented on x86_32, thus the emulate
calls are only for x86_64.
]
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Changed to only implement emulated calls for x86_64 ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..bd553b3af22e 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
#include <asm/kprobes.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
}
static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new)
{
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -294,13 +298,28 @@ int ftrace_int3_handler(struct pt_regs *regs)
if (WARN_ON_ONCE(!regs))
return 0;
- ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
+ ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1;
+#ifdef CONFIG_X86_64
+ if (ftrace_location(ip)) {
+ int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ if (!ftrace_update_func_call) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ }
+#else
+ if (ftrace_location(ip) || is_ftrace_caller(ip)) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+#endif
- return 1;
+ return 0;
}
NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +878,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -960,6 +981,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
{
unsigned char *new;
+ ftrace_update_func_call = 0UL;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9e298e8604088a600d8100a111a532a9d342af09 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 1 May 2019 15:11:17 +0200
Subject: [PATCH] ftrace/x86_64: Emulate call function while updating in
breakpoint handler
Nicolai Stange discovered[1] that if live kernel patching is enabled, and
the function tracer started tracing the same function that was patched,
the conversion of the fentry call site from calling the live kernel patch
trampoline to calling the iterator trampoline would have a slight window
where it didn't call anything.
As live kernel patching depends on ftrace always calling its code (to
redirect the traced function and prevent the original from being called),
this small window would allow the old buggy function to be called, and
this can cause undesirable results.
Nicolai submitted new patches[2], but these were controversial, as this is
similar to the static call emulation issues that came up a while ago[3].
After some debate[4][5], adding a gap in the stack when entering the
breakpoint handler allows the return address to be pushed onto the stack
to easily emulate a call.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.154171145…
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=…
[5] http://lkml.kernel.org/r/CAHk-=wjvQxY4DvPrJ6haPgAa6b906h=MwZXO6G8OtiTGe=N7_…
[
Live kernel patching is not implemented on x86_32, thus the emulate
calls are only for x86_64.
]
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Changed to only implement emulated calls for x86_64 ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..bd553b3af22e 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
#include <asm/kprobes.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
}
static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new)
{
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -294,13 +298,28 @@ int ftrace_int3_handler(struct pt_regs *regs)
if (WARN_ON_ONCE(!regs))
return 0;
- ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
+ ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1;
+#ifdef CONFIG_X86_64
+ if (ftrace_location(ip)) {
+ int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ if (!ftrace_update_func_call) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ }
+#else
+ if (ftrace_location(ip) || is_ftrace_caller(ip)) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+#endif
- return 1;
+ return 0;
}
NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +878,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -960,6 +981,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
{
unsigned char *new;
+ ftrace_update_func_call = 0UL;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4b33dadf37666c0860b88f9e52a16d07bf6d0b03 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 1 May 2019 15:11:17 +0200
Subject: [PATCH] x86_64: Allow breakpoints to emulate call instructions
In order to allow breakpoints to emulate call instructions, they need to push
the return address onto the stack. The x86_64 int3 handler adds a small gap
to allow the stack to grow some. Use this gap to add the return address to
be able to emulate a call instruction at the breakpoint location.
These helper functions are added:
int3_emulate_jmp(): changes the location of the regs->ip to return there.
(The next two are only for x86_64)
int3_emulate_push(): to push the address onto the gap in the stack
int3_emulate_call(): push the return address and change regs->ip
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Modified to only work for x86_64 and added comment to int3_emulate_push() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..05861cc08787 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,32 @@ extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
extern int after_bootmem;
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+ regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+#ifdef CONFIG_X86_64
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+ /*
+ * The int3 handler in entry_64.S adds a gap between the
+ * stack where the break point happened, and the saving of
+ * pt_regs. We can extend the original stack because of
+ * this gap. See the idtentry macro's create_gap option.
+ */
+ regs->sp -= sizeof(unsigned long);
+ *(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ int3_emulate_jmp(regs, func);
+}
+#endif
+
#endif /* _ASM_X86_TEXT_PATCHING_H */
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4b33dadf37666c0860b88f9e52a16d07bf6d0b03 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 1 May 2019 15:11:17 +0200
Subject: [PATCH] x86_64: Allow breakpoints to emulate call instructions
In order to allow breakpoints to emulate call instructions, they need to push
the return address onto the stack. The x86_64 int3 handler adds a small gap
to allow the stack to grow some. Use this gap to add the return address to
be able to emulate a call instruction at the breakpoint location.
These helper functions are added:
int3_emulate_jmp(): changes the location of the regs->ip to return there.
(The next two are only for x86_64)
int3_emulate_push(): to push the address onto the gap in the stack
int3_emulate_call(): push the return address and change regs->ip
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Modified to only work for x86_64 and added comment to int3_emulate_push() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..05861cc08787 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,32 @@ extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
extern int after_bootmem;
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+ regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+#ifdef CONFIG_X86_64
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+ /*
+ * The int3 handler in entry_64.S adds a gap between the
+ * stack where the break point happened, and the saving of
+ * pt_regs. We can extend the original stack because of
+ * this gap. See the idtentry macro's create_gap option.
+ */
+ regs->sp -= sizeof(unsigned long);
+ *(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ int3_emulate_jmp(regs, func);
+}
+#endif
+
#endif /* _ASM_X86_TEXT_PATCHING_H */
## TLDR
I rebased the last patchset on 5.1-rc7 in hopes that we can get this in
5.2.
Shuah, I think you, Greg KH, and I talked off thread, and we agreed
we would merge through your tree when the time came? Am I remembering
correctly?
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
and does not require tests to be written in userspace running on a host
kernel. Additionally, KUnit is fast: From invocation to completion KUnit
can run several dozen tests in under a second. Currently, the entire
KUnit test suite for KUnit runs in under a second from the initial
invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
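To give a flavour of the API, a KUnit test looks roughly like the sketch
below. The struct and macro names are defined by the patches in this
series, so treat the exact identifiers here as illustrative rather than
authoritative:

#include <kunit/test.h>

/* A trivial test case; each case receives a struct kunit context. */
static void example_add_test(struct kunit *test)
{
	KUNIT_EXPECT_EQ(test, 2 + 2, 4);
}

/* Related cases are grouped into a suite and registered with KUnit. */
static struct kunit_case example_test_cases[] = {
	KUNIT_CASE(example_add_test),
	{}
};

static struct kunit_suite example_test_suite = {
	.name = "example",
	.test_cases = example_test_cases,
};
kunit_test_suite(example_test_suite);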
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitude faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily, solving the classic problem
of difficulty in exercising error handling code.
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here:
https://google.github.io/kunit-docs/third_party/kernel/docs/
Additionally for convenience, I have applied these patches to a branch:
https://kunit.googlesource.com/linux/+/kunit/rfc/v5.1-rc7/v1
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.1-rc7/v1 branch.
## Changes Since Last Version
None. I just rebased the last patchset on v5.1-rc7.
--
2.21.0.593.g511ec345e18-goog
From: Yonghong Song <yhs(a)fb.com>
[ Upstream commit 6cea33701eb024bc6c920ab83940ee22afd29139 ]
Test test_libbpf.sh failed on my development server with failure
-bash-4.4$ sudo ./test_libbpf.sh
[0] libbpf: Error in bpf_object__probe_name():Operation not permitted(1).
Couldn't load basic 'r0 = 0' BPF program.
test_libbpf: failed at file test_l4lb.o
selftests: test_libbpf [FAILED]
-bash-4.4$
The reason is that my machine has a 64KB locked memory limit by default,
which is not enough for this program to get locked memory.
Similar to other bpf selftests, let us increase RLIMIT_MEMLOCK
to infinity, which fixed the issue.
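The fix below simply includes the selftests' "bpf_rlimit.h" helper; its
effect is roughly the following sketch (an illustration of the mechanism,
not a verbatim copy of the header): a constructor raises RLIMIT_MEMLOCK
before main() runs, so the test can charge its programs and maps against
locked memory.

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Sketch: lift the locked-memory limit to infinity before main(). */
static __attribute__((constructor)) void raise_memlock_rlimit(void)
{
	struct rlimit rlim = {
		.rlim_cur = RLIM_INFINITY,
		.rlim_max = RLIM_INFINITY,
	};

	if (setrlimit(RLIMIT_MEMLOCK, &rlim)) {
		perror("setrlimit(RLIMIT_MEMLOCK)");
		exit(1);
	}
}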
Signed-off-by: Yonghong Song <yhs(a)fb.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/test_libbpf_open.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_libbpf_open.c b/tools/testing/selftests/bpf/test_libbpf_open.c
index 8fcd1c076add0..cbd55f5f8d598 100644
--- a/tools/testing/selftests/bpf/test_libbpf_open.c
+++ b/tools/testing/selftests/bpf/test_libbpf_open.c
@@ -11,6 +11,8 @@ static const char *__doc__ =
#include <bpf/libbpf.h>
#include <getopt.h>
+#include "bpf_rlimit.h"
+
static const struct option long_options[] = {
{"help", no_argument, NULL, 'h' },
{"debug", no_argument, NULL, 'D' },
--
2.20.1
From: Yonghong Song <yhs(a)fb.com>
[ Upstream commit 6cea33701eb024bc6c920ab83940ee22afd29139 ]
Test test_libbpf.sh failed on my development server with failure
-bash-4.4$ sudo ./test_libbpf.sh
[0] libbpf: Error in bpf_object__probe_name():Operation not permitted(1).
Couldn't load basic 'r0 = 0' BPF program.
test_libbpf: failed at file test_l4lb.o
selftests: test_libbpf [FAILED]
-bash-4.4$
The reason is that my machine has a 64KB locked memory limit by default,
which is not enough for this program to get locked memory.
Similar to other bpf selftests, let us increase RLIMIT_MEMLOCK
to infinity, which fixed the issue.
Signed-off-by: Yonghong Song <yhs(a)fb.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/test_libbpf_open.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_libbpf_open.c b/tools/testing/selftests/bpf/test_libbpf_open.c
index 65cbd30704b5a..9e9db202d218a 100644
--- a/tools/testing/selftests/bpf/test_libbpf_open.c
+++ b/tools/testing/selftests/bpf/test_libbpf_open.c
@@ -11,6 +11,8 @@ static const char *__doc__ =
#include <bpf/libbpf.h>
#include <getopt.h>
+#include "bpf_rlimit.h"
+
static const struct option long_options[] = {
{"help", no_argument, NULL, 'h' },
{"debug", no_argument, NULL, 'D' },
--
2.20.1
clock_getres in the vDSO library has to preserve the same behaviour
as posix_get_hrtimer_res().
In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on whether the high resolution timers
are enabled, which can happen either at compile time or at run time.
A possible fix is to change the vdso implementation of clock_getres,
keeping a copy of hrtimer_resolution in vdso data and using that
directly [1].
This patchset implements the proposed fix for arm64, powerpc, s390,
nds32 and adds a test to verify that the syscall and the vdso library
implementation of clock_getres return the same values.
Even if these patches are unified by the same topic, there is no
dependency between them, hence they can be merged singularly by each
arch maintainer.
Note: arm64 and nds32 respective fixes have been merged in 5.2-rc1,
hence they have been removed from this series.
[1] https://marc.info/?l=linux-arm-kernel&m=155110381930196&w=2
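The idea behind the new selftest can be sketched as follows. This is only
an illustration (the real test in
tools/testing/selftests/vDSO/vdso_clock_getres.c looks up the vDSO symbol
directly, since whether libc routes clock_getres() through the vDSO
depends on the architecture and the libc version): compare the resolution
reported by the normal library path with the one returned by a forced
syscall.

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Sketch: clock_getres() may be serviced by the vDSO, while
 * syscall(__NR_clock_getres, ...) always enters the kernel.
 * After the fix both paths must report the same resolution.
 */
int main(void)
{
	struct timespec vdso_res, sys_res;

	if (clock_getres(CLOCK_MONOTONIC, &vdso_res))
		return 1;
	if (syscall(__NR_clock_getres, CLOCK_MONOTONIC, &sys_res))
		return 1;

	printf("library: %ld ns, syscall: %ld ns\n",
	       vdso_res.tv_nsec, sys_res.tv_nsec);

	return !(vdso_res.tv_sec == sys_res.tv_sec &&
		 vdso_res.tv_nsec == sys_res.tv_nsec);
}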
Changes:
--------
v3:
- Rebased on 5.2-rc1.
- Addressed review comments.
v2:
- Rebased on 5.1-rc5.
- Addressed review comments.
Cc: Christophe Leroy <christophe.leroy(a)c-s.fr>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Vincenzo Frascino (3):
powerpc: Fix vDSO clock_getres()
s390: Fix vDSO clock_getres()
kselftest: Extend vDSO selftest to clock_getres
arch/powerpc/include/asm/vdso_datapage.h | 2 +
arch/powerpc/kernel/asm-offsets.c | 2 +-
arch/powerpc/kernel/time.c | 1 +
arch/powerpc/kernel/vdso32/gettimeofday.S | 7 +-
arch/powerpc/kernel/vdso64/gettimeofday.S | 7 +-
arch/s390/include/asm/vdso.h | 1 +
arch/s390/kernel/asm-offsets.c | 2 +-
arch/s390/kernel/time.c | 1 +
arch/s390/kernel/vdso32/clock_getres.S | 12 +-
arch/s390/kernel/vdso64/clock_getres.S | 10 +-
tools/testing/selftests/vDSO/Makefile | 2 +
.../selftests/vDSO/vdso_clock_getres.c | 137 ++++++++++++++++++
12 files changed, 168 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c
--
2.21.0
A test for the basic NAT functionality uses the ip command, which
needs a veth device. There is a condition where kernel support for
veth is not compiled into the kernel and the test script breaks.
This patch adds a reasonable error message and makes the script exit
with the correct code.
Signed-off-by: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
---
tools/testing/selftests/netfilter/nft_nat.sh | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/netfilter/nft_nat.sh b/tools/testing/selftests/netfilter/nft_nat.sh
index 8ec76681605c..f25f72a75cf3 100755
--- a/tools/testing/selftests/netfilter/nft_nat.sh
+++ b/tools/testing/selftests/netfilter/nft_nat.sh
@@ -23,7 +23,11 @@ ip netns add ns0
ip netns add ns1
ip netns add ns2
-ip link add veth0 netns ns0 type veth peer name eth0 netns ns1
+ip link add veth0 netns ns0 type veth peer name eth0 netns ns1 > /dev/null 2>&1
+if [ $? -ne 0 ];then
+ echo "SKIP: No virtual ethernet pair device support in kernel"
+ exit $ksft_skip
+fi
ip link add veth1 netns ns0 type veth peer name eth0 netns ns2
ip -net ns0 link set lo up
--
2.20.1
clock_getres in the vDSO library has to preserve the same behaviour
as posix_get_hrtimer_res().
In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on whether the high resolution timers
are enabled, which can happen either at compile time or at run time.
A possible fix is to change the vdso implementation of clock_getres,
keeping a copy of hrtimer_resolution in vdso data and using that
directly [1].
This patchset implements the proposed fix for arm64, powerpc, s390,
nds32 and adds a test to verify that the syscall and the vdso library
implementation of clock_getres return the same values.
Even if these patches are unified by the same topic, there is no
dependency between them, hence they can be merged singularly by each
arch maintainer.
[1] https://marc.info/?l=linux-arm-kernel&m=155110381930196&w=2
Changes:
--------
v2:
- Rebased on 5.1-rc5.
- Addressed review comments.
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Greentime Hu <green.hu(a)gmail.com>
Cc: Vincent Chen <deanbo422(a)gmail.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Vincenzo Frascino (5):
arm64: Fix vDSO clock_getres()
powerpc: Fix vDSO clock_getres()
s390: Fix vDSO clock_getres()
nds32: Fix vDSO clock_getres()
kselftest: Extend vDSO selftest to clock_getres
arch/arm64/include/asm/vdso_datapage.h | 1 +
arch/arm64/kernel/asm-offsets.c | 2 +-
arch/arm64/kernel/vdso.c | 2 +
arch/arm64/kernel/vdso/gettimeofday.S | 22 ++--
arch/nds32/include/asm/vdso_datapage.h | 1 +
arch/nds32/kernel/vdso.c | 1 +
arch/nds32/kernel/vdso/gettimeofday.c | 4 +-
arch/powerpc/include/asm/vdso_datapage.h | 2 +
arch/powerpc/kernel/asm-offsets.c | 2 +-
arch/powerpc/kernel/time.c | 1 +
arch/powerpc/kernel/vdso32/gettimeofday.S | 7 +-
arch/powerpc/kernel/vdso64/gettimeofday.S | 7 +-
arch/s390/include/asm/vdso.h | 1 +
arch/s390/kernel/asm-offsets.c | 2 +-
arch/s390/kernel/time.c | 1 +
arch/s390/kernel/vdso32/clock_getres.S | 12 +-
arch/s390/kernel/vdso64/clock_getres.S | 10 +-
tools/testing/selftests/vDSO/Makefile | 2 +
.../selftests/vDSO/vdso_clock_getres.c | 108 ++++++++++++++++++
19 files changed, 159 insertions(+), 29 deletions(-)
create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c
--
2.21.0
This adds the close_range() syscall. It allows efficiently closing a range
of file descriptors, up to all file descriptors of the calling task.
The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.
First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):
/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve(....);
The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. daemons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating
file descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)
Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
- Python (cf. [2])
- Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX, or to resort
to RLIMIT_NOFILE/OPEN_MAX trickery (cf. comment [8] on Rust).
The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:
static int close_all_fds(void)
{
	DIR *dir;
	struct dirent *direntp;

	dir = opendir("/proc/self/fd");
	if (!dir)
		return -1;

	while ((direntp = readdir(dir))) {
		int fd;

		if (strcmp(direntp->d_name, ".") == 0)
			continue;
		if (strcmp(direntp->d_name, "..") == 0)
			continue;

		fd = atoi(direntp->d_name);
		if (fd == 0 || fd == 1 || fd == 2)
			continue;

		close(fd);
	}

	closedir(dir); /* cannot fail */
	return 0;
}
to close_range() yields:
1. closing 4 open files:
- close_all_fds(): ~280 us
- close_range(): ~24 us
2. closing 1000 open files:
- close_all_fds(): ~5000 us
- close_range(): ~800 us
close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extensions. One can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.
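As a usage illustration, a minimal sketch follows. It is not part of this
series and makes two assumptions: the three-argument fd/max_fd/flags form
discussed above, and the x86_64 syscall number 435 added by the syscall
table hunks below (there is no libc wrapper yet, so it goes through
syscall(2)).

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Hypothetical raw wrapper: 435 is the x86_64 number proposed by this
 * series; other architectures use the numbers from their own tables.
 */
static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
				  unsigned int flags)
{
	return syscall(435, fd, max_fd, flags);
}

int main(void)
{
	/* Keep stdin/stdout/stderr, close everything else before exec. */
	sys_close_range(3, ~0U, 0);
	execlp("true", "true", (char *)NULL);
	return 1;
}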
From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
My reasoning behind this is based on the nature of how __close_fd() needs
to release an fd. But maybe I misunderstood specifics:
We take the files_lock and rcu-dereference the fdtable of the calling
task, we find the entry in the fdtable, get the file and need to release
files_lock before calling filp_close().
In the meantime the fdtable might have been altered so we can't just
retake the spinlock and keep the old rcu-reference of the fdtable
around. Instead we need to grab a fresh reference to the fdtable.
If my reasoning is correct then there's really no point in fancifying
__close_range(): We just need to rcu-dereference the fdtable of the
calling task once to cap the max_fd value correctly and then go on
calling __close_fd() in a loop.
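Concretely, the loop described above can be sketched roughly as follows.
This is a sketch of the described approach under the assumptions stated
in this cover letter, not the actual fs/file.c change:

#include <linux/fdtable.h>
#include <linux/kernel.h>

/*
 * Sketch only: cap max_fd against the current fdtable size once, then
 * release each fd via __close_fd(), ignoring per-fd errors.
 */
static int __close_range(struct files_struct *files, unsigned int fd,
			 unsigned int max_fd)
{
	unsigned int cur_max;

	if (fd > max_fd)
		return -EINVAL;

	rcu_read_lock();
	cur_max = files_fdtable(files)->max_fds - 1;
	rcu_read_unlock();

	max_fd = min(max_fd, cur_max);

	for (; fd <= max_fd; fd++)
		__close_fd(files, fd);

	return 0;
}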
/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d8…
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa…
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/sr…
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/gr…
Note that this is an internal implementation that is not exported.
Currently, libc seems to not provide an exported version of this
because of missing kernel support to do this.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5…
Rust's solution is slightly different but is equally unperformant.
Rust calls getdtablesize() which is a glibc library function that
simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
goes on to call close() on each fd. That's obviously overkill for most
tasks. Rarely, tasks - especially non-daemons - hit RLIMIT_NOFILE or
OPEN_MAX.
Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
to 1024. Even in this case, there's a very high chance that in the
common case Rust is calling the close() syscall 1021 times pointlessly
if the task just has 0, 1, and 2 open.
Suggested-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Dmitry V. Levin <ldv(a)altlinux.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Florian Weimer <fweimer(a)redhat.com>
Cc: linux-api(a)vger.kernel.org
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 ++
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/file.c | 30 +++++++++++++++++++++
fs/open.c | 20 ++++++++++++++
include/linux/fdtable.h | 2 ++
include/linux/syscalls.h | 2 ++
include/uapi/asm-generic/unistd.h | 4 ++-
22 files changed, 75 insertions(+), 1 deletion(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..b55d93af8096 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
541 common fsconfig sys_fsconfig
542 common fsmount sys_fsmount
543 common fspick sys_fspick
+545 common close_range sys_close_range
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..0125c97c75dd 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..9a3270d29b42 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..1a90b464e96f 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..2dee2050f9ef 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..923ef69e5a76 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..967ed9de51cd 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
431 n32 fsconfig sys_fsconfig
432 n32 fsmount sys_fsmount
433 n32 fspick sys_fspick
+435 n32 close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 5eebfa0d155c..71de731102b1 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -348,3 +348,4 @@
431 n64 fsconfig sys_fsconfig
432 n64 fsmount sys_fsmount
433 n64 fspick sys_fspick
+435 n64 close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 3cc1374e02d0..5a325ab29f88 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -421,3 +421,4 @@
431 o32 fsconfig sys_fsconfig
432 o32 fsmount sys_fsmount
433 o32 fspick sys_fspick
+435 o32 close_range sys_close_range
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..dcc0a0879139 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..ba2c1f078cbd 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..d7c9043d2902 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig sys_fsconfig
432 common fsmount sys_fsmount sys_fsmount
433 common fspick sys_fspick sys_fspick
+435 common close_range sys_close_range sys_close_range
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..9b5e6bf0ce32 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..8c674a1e0072 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..7f7a89a96707 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
432 i386 fsmount sys_fsmount __ia32_sys_fsmount
433 i386 fspick sys_fspick __ia32_sys_fspick
+435 i386 close_range sys_close_range __ia32_sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..0f7d47ae921c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
431 common fsconfig __x64_sys_fsconfig
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
+435 common close_range __x64_sys_close_range
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..b489532265d0 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/fs/file.c b/fs/file.c
index 3da91a112bab..3680977a685a 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -641,6 +641,36 @@ int __close_fd(struct files_struct *files, unsigned fd)
}
EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+{
+ unsigned int cur_max;
+
+ if (fd > max_fd)
+ return -EINVAL;
+
+ rcu_read_lock();
+ cur_max = files_fdtable(files)->max_fds;
+ rcu_read_unlock();
+
+ /* cap to last valid index into fdtable */
+ if (max_fd >= cur_max)
+ max_fd = cur_max - 1;
+
+ while (fd <= max_fd)
+ __close_fd(files, fd++);
+
+ return 0;
+}
+
/*
* variant of __close_fd that gets a ref on the file for later fput
*/
diff --git a/fs/open.c b/fs/open.c
index 9c7d724a6f67..c7baaee7aa47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1174,6 +1174,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
return retval;
}
+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags: reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors from closing a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+ unsigned int, flags)
+{
+ if (flags)
+ return -EINVAL;
+
+ return __close_range(current->files, fd, max_fd);
+}
+
/*
* This routine simulates a hangup on the tty, to arrange that users
* are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..fcd07181a365 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
unsigned int fd, struct file *file);
extern int __close_fd(struct files_struct *files,
unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+ unsigned int max_fd);
extern int __close_fd_get_file(unsigned int fd, struct file **res);
extern struct kmem_cache *files_cachep;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..c0189e223255 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -441,6 +441,8 @@ asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
umode_t mode);
asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+ unsigned int flags);
asmlinkage long sys_vhangup(void);
/* fs/pipe.c */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..3f36c8745d24 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
#undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 436
/*
* 32 bit systems traditionally used different
--
2.21.0
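For reference (not part of the patch itself), the new close_range() syscall can
be exercised from userspace roughly as follows. This is a minimal sketch that
invokes the raw syscall number directly, since no libc wrapper exists at this
point; it assumes __NR_close_range is provided by updated kernel headers with
whatever number the tables above assign.

/* Minimal sketch: close every inherited fd from 3 upwards, e.g. before exec().
 * Assumes __NR_close_range comes from updated kernel headers; the flags
 * argument must be 0 for now, matching the patch above.
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long ret = syscall(__NR_close_range, 3U, ~0U, 0U);

	if (ret < 0) {
		perror("close_range");
		return 1;
	}
	return 0;
}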
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:
int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
With the introduction of pidfds through CLONE_PIDFD it is possible to
create pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller currently cannot create a pollable pidfd. This is a problem for
Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this API. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes, we enable them to adopt this API.
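As a rough illustration of that use case (a sketch, not part of this patch), a
non-parent could obtain a pidfd for an existing PID and poll it for exit
notification; the raw syscall number is used because no libc wrapper exists yet:

/* Sketch: open a pidfd for an existing thread-group leader and wait for it
 * to exit. Assumes __NR_pidfd_open carries the number assigned in this
 * series and that pidfd polling (POLLIN on exit) is available.
 */
#include <poll.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = 1234;	/* hypothetical existing process */
	int pidfd = syscall(__NR_pidfd_open, pid, 0);
	struct pollfd pfd;

	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	pfd.fd = pidfd;
	pfd.events = POLLIN;
	if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
		printf("process %d has exited\n", (int)pid);

	close(pidfd);
	return 0;
}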
In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Andy Lutomirsky <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-api(a)vger.kernel.org
---
v1:
- kbuild test robot <lkp(a)intel.com>:
- add missing entry for pidfd_open to arch/arm/tools/syscall.tbl
- Oleg Nesterov <oleg(a)redhat.com>:
- use simpler thread-group leader check
v2:
- Oleg Nesterov <oleg(a)redhat.com>:
- avoid using additional variable
- remove unneeded comment
- Arnd Bergmann <arnd(a)arndb.de>:
- switch from 428 to 434 since the new mount api has taken it
- bump syscall numbers in arch/arm64/include/asm/unistd.h
- Joel Fernandes (Google) <joel(a)joelfernandes.org>:
- switch from ESRCH to EINVAL when the passed-in pid does not refer to a
thread-group leader
- Christian Brauner <christian(a)brauner.io>:
- rebase on v5.2-rc1
- adapt syscall number to account for new mount api syscalls
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/pid.h | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 2 +-
kernel/pid.c | 43 +++++++++++++++++++++
21 files changed, 66 insertions(+), 3 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..1db9bbcfb84e 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
541 common fsconfig sys_fsconfig
542 common fsmount sys_fsmount
543 common fspick sys_fspick
+544 common pidfd_open sys_pidfd_open
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..81e6e1817c45 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 70e6882853c0..e8f7d95a1481 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -44,7 +44,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 434
+#define __NR_compat_syscalls 435
#endif
#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..7a3158ccd68e 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_pidfd_open 434
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..ecc44926737b 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..9a3eb2558568 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..ad706f83c755 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..97035e19ad03 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
431 n32 fsconfig sys_fsconfig
432 n32 fsmount sys_fsmount
433 n32 fspick sys_fspick
+434 n32 pidfd_open sys_pidfd_open
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..5022b9e179c2 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..f2c3bda2d39f 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..6ebacfeaf853 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig sys_fsconfig
432 common fsmount sys_fsmount sys_fsmount
433 common fspick sys_fspick sys_fspick
+434 common pidfd_open sys_pidfd_open sys_pidfd_open
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..834c9c7d79fa 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..c58e71f21129 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..43e4429a5272 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
432 i386 fsmount sys_fsmount __ia32_sys_fsmount
433 i386 fspick sys_fspick __ia32_sys_fspick
+434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..1bee0a77fdd3 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
431 common fsconfig __x64_sys_fsconfig
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
+434 common pidfd_open __x64_sys_pidfd_open
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..782b81945ccc 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..c938a92eab99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -67,6 +67,7 @@ struct pid
extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
+extern int pidfd_create(struct pid *pid);
static inline struct pid *get_pid(struct pid *pid)
{
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..989055e0b501 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
struct old_timex32 __user *tx);
asmlinkage long sys_syncfs(int fd);
asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
unsigned int vlen, unsigned flags);
asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..e5684a4512c0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_pidfd_open 434
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 435
/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index b4cba953040a..c3df226f47a1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1724,7 +1724,7 @@ const struct file_operations pidfd_fops = {
* Return: On success, a cloexec pidfd is returned.
* On error, a negative errno number will be returned.
*/
-static int pidfd_create(struct pid *pid)
+int pidfd_create(struct pid *pid)
{
int fd;
diff --git a/kernel/pid.c b/kernel/pid.c
index 89548d35eefb..8fc9d94f6ac1 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -37,6 +37,7 @@
#include <linux/syscalls.h>
#include <linux/proc_ns.h>
#include <linux/proc_fs.h>
+#include <linux/sched/signal.h>
#include <linux/sched/task.h>
#include <linux/idr.h>
@@ -450,6 +451,48 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
return idr_get_next(&ns->idr, &nr);
}
+/**
+ * pidfd_open() - Open new pid file descriptor.
+ *
+ * @pid: pid for which to retrieve a pidfd
+ * @flags: flags to pass
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set for
+ * the process identified by @pid. Currently, the process identified by
+ * @pid must be a thread-group leader. This restriction currently exists
+ * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
+ * be used with CLONE_THREAD) and pidfd polling (only supports thread group
+ * leaders).
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
+{
+ int fd, ret;
+ struct pid *p;
+
+ if (flags)
+ return -EINVAL;
+
+ if (pid <= 0)
+ return -EINVAL;
+
+ p = find_get_pid(pid);
+ if (!p)
+ return -ESRCH;
+
+ ret = 0;
+ rcu_read_lock();
+ if (!pid_task(p, PIDTYPE_TGID))
+ ret = -EINVAL;
+ rcu_read_unlock();
+
+ fd = ret ?: pidfd_create(p);
+ put_pid(p);
+ return fd;
+}
+
void __init pid_idr_init(void)
{
/* Verify no one has done anything silly: */
--
2.21.0
The check for entry->index == 0 is done twice. One time should
be sufficient.
Suggested-by: Vitaly Kuznetsov <vkuznets(a)redhat.com>
Signed-off-by: Thomas Huth <thuth(a)redhat.com>
---
Vitaly already noticed this in his review of the "Fix a condition
in test_hv_cpuid()" patch a couple of days ago, but so far I haven't
seen a patch on the list that fixes it ... if I missed one, please
simply ignore this patch.
tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c b/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
index 9a21e912097c..8bdf1e7da6cc 100644
--- a/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
+++ b/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
@@ -52,9 +52,6 @@ static void test_hv_cpuid(struct kvm_cpuid2 *hv_cpuid_entries,
TEST_ASSERT(entry->index == 0,
".index field should be zero");
- TEST_ASSERT(entry->index == 0,
- ".index field should be zero");
-
TEST_ASSERT(entry->flags == 0,
".flags field should be zero");
--
2.21.0
The code is trying to check that all the padding is zeroed out and it
does this:
entry->padding[0] == entry->padding[1] == entry->padding[2] == 0
Assume everything is zeroed correctly: the first comparison is true,
the next comparison is false, and false is equal to zero, so the
overall condition is true. This bug doesn't affect run time very
badly, but the code should simply check that all three padding bytes
are zero individually.
Also, the error message was copied and pasted from an earlier assert
and wasn't correct.
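A tiny standalone C example (not from the patch) makes the failure mode
concrete; with a non-zero padding byte the chained form still evaluates to true:

/* Sketch: C's == associates left to right, so the chained check collapses. */
#include <stdio.h>

int main(void)
{
	unsigned char padding[3] = { 0, 0, 7 };	/* padding[2] is NOT zero */

	/* Parses as ((padding[0] == padding[1]) == padding[2]) == 0:
	 * 0 == 0 -> 1, then 1 == 7 -> 0, then 0 == 0 -> 1, so "chained"
	 * claims the padding is clean even though it is not.
	 */
	int chained = padding[0] == padding[1] == padding[2] == 0;
	int correct = !padding[0] && !padding[1] && !padding[2];

	printf("chained=%d correct=%d\n", chained, correct);	/* 1 vs 0 */
	return 0;
}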
Fixes: 7edcb7343327 ("KVM: selftests: Add hyperv_cpuid test")
Signed-off-by: Dan Carpenter <dan.carpenter(a)oracle.com>
---
tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c b/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
index 9a21e912097c..63b9fc3fdfbe 100644
--- a/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
+++ b/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
@@ -58,9 +58,8 @@ static void test_hv_cpuid(struct kvm_cpuid2 *hv_cpuid_entries,
TEST_ASSERT(entry->flags == 0,
".flags field should be zero");
- TEST_ASSERT(entry->padding[0] == entry->padding[1]
- == entry->padding[2] == 0,
- ".index field should be zero");
+ TEST_ASSERT(!entry->padding[0] && !entry->padding[1] &&
+ !entry->padding[2], "padding should be zero");
/*
* If needed for debug:
--
2.20.1
> Hi Alex!
>
> Sorry for the late reply, somehow I missed your e-mails.
>
> The patchset looks good to me, except a typo in subjects/commit logs:
> you probably meant "unexpected failures".
>
> Please, fix and feel free to use:
> Reviewed-by: Roman Gushchin <guro(a)fb.com>
> for the series.
>
Thanks Roman!
BTW, do you know of other test cases for cgroup2 functionality or performance?
Thanks
Alex
test_core will skip the
test_cgcore_no_internal_process_constraint_on_threads test case if the
'cpu' controller is missing from the root's subtree_control. In fact we
need to enable 'cpu' in subtree_control to make the test meaningful.
./test_core
...
ok 4 # skip test_cgcore_no_internal_process_constraint_on_threads
...
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Claudio Zumbo <claudioz(a)fb.com>
Cc: Claudio <claudiozumbo(a)gmail.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
---
tools/testing/selftests/cgroup/test_core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index b1a570d7c557..2ffc3e07d98d 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -198,7 +198,7 @@ static int test_cgcore_no_internal_process_constraint_on_threads(const char *roo
char *parent = NULL, *child = NULL;
if (cg_read_strstr(root, "cgroup.controllers", "cpu") ||
- cg_read_strstr(root, "cgroup.subtree_control", "cpu")) {
+ cg_write(root, "cgroup.subtree_control", "+cpu")) {
ret = KSFT_SKIP;
goto cleanup;
}
--
2.19.1.856.g8858448bb
The cgroup tests rely on the root cgroup's subtree_control setting.
If the 'memory' controller isn't enabled there, some test cases fail
as follows:
$sudo ./test_core
not ok 1 test_cgcore_internal_process_constraint
ok 2 test_cgcore_top_down_constraint_enable
not ok 3 test_cgcore_top_down_constraint_disable
...
To correct these unexpected failures, this patch writes 'memory' to
the root's subtree_control to get the right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Claudio Zumbo <claudioz(a)fb.com>
Cc: Claudio <claudiozumbo(a)gmail.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
---
tools/testing/selftests/cgroup/test_core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index be59f9c34ea2..b1a570d7c557 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -376,6 +376,11 @@ int main(int argc, char *argv[])
if (cg_find_unified_root(root, sizeof(root)))
ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set root memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.19.1.856.g8858448bb
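Outside of the selftest helpers, enabling a controller for the root's children
is just a write to the cgroup2 control file. The following minimal sketch
(assuming cgroup2 is mounted at /sys/fs/cgroup) shows what writing 'memory' to
subtree_control in the patch above amounts to; '+cpu' for the previous patch
works the same way:

/* Sketch: enable the memory controller for the root cgroup's children by
 * writing "+memory" to cgroup.subtree_control. Assumes cgroup2 is mounted
 * at /sys/fs/cgroup; real code should check cgroup.controllers first, as
 * the selftest does via cg_read_strstr().
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/cgroup.subtree_control";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return 1;
	}
	if (write(fd, "+memory", strlen("+memory")) < 0) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}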
> Hi Alex!
>
> Sorry for the late reply, somehow I missed your e-mails.
>
> The patchset looks good to me, except a typo in subjects/commit logs:
> you probably meant "unexpected failures".
>
> Please, fix and feel free to use:
> Reviewed-by: Roman Gushchin <guro(a)fb.com>
> for the series.
Hi Roman,
Thanks for the typo correction and the review. I will resend them with your suggestion.
BTW, how do you test cgroup2 besides kselftest? Are there any other functional/performance test cases?
Thanks a lot!
Alex
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:
int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
With the introduction of pidfds through CLONE_PIDFD it is possible to
create pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller currently cannot create a pollable pidfd. This is a huge problem
for Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this API. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes, we enable them to adopt this API.
In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Acked-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Andy Lutomirsky <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-api(a)vger.kernel.org
---
v1:
- kbuild test robot <lkp(a)intel.com>:
- add missing entry for pidfd_open to arch/arm/tools/syscall.tbl
- Oleg Nesterov <oleg(a)redhat.com>:
- use simpler thread-group leader check
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/pid.h | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 2 +-
kernel/pid.c | 50 +++++++++++++++++++++
20 files changed, 72 insertions(+), 2 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 165f268beafc..ddc3c93ad7a7 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -467,3 +467,4 @@
535 common io_uring_setup sys_io_uring_setup
536 common io_uring_enter sys_io_uring_enter
537 common io_uring_register sys_io_uring_register
+538 common pidfd_open sys_pidfd_open
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 0393917eaa57..fc41fb34a636 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -441,3 +441,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 23f1a44acada..350e2049b4a9 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -874,6 +874,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
#define __NR_io_uring_register 427
__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 56e3d0b685e1..7115f6dd347a 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -348,3 +348,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index df4ec3ec71d1..44bf12b16ffe 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -427,3 +427,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 4964947732af..0d32e5152dc0 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 9392dfe33f97..726e107b3c9f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -366,3 +366,4 @@
425 n32 io_uring_setup sys_io_uring_setup
426 n32 io_uring_enter sys_io_uring_enter
427 n32 io_uring_register sys_io_uring_register
+428 n32 pidfd_open sys_pidfd_open
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index fe8ca623add8..83b46b568d51 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -424,3 +424,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 00f5a63c8d9a..5294d04d7fa5 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -509,3 +509,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 061418f787c3..dcdb838adf49 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
425 common io_uring_setup sys_io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open sys_pidfd_open
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 480b057556ee..8e66edfbc521 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index a1dd24307b00..d6f3bc686939 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..1af6b469160a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup
426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter
427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register
+428 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..c18e6ebe3387 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
425 common io_uring_setup __x64_sys_io_uring_setup
426 common io_uring_enter __x64_sys_io_uring_enter
427 common io_uring_register __x64_sys_io_uring_register
+428 common pidfd_open __x64_sys_pidfd_open
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 30084eaf8422..21ee795f3003 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -398,3 +398,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..c938a92eab99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -67,6 +67,7 @@ struct pid
extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
+extern int pidfd_create(struct pid *pid);
static inline struct pid *get_pid(struct pid *pid)
{
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..989055e0b501 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
struct old_timex32 __user *tx);
asmlinkage long sys_syncfs(int fd);
asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
unsigned int vlen, unsigned flags);
asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dee7292e1df6..94a257a93d20 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
#define __NR_io_uring_register 427
__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#undef __NR_syscalls
-#define __NR_syscalls 428
+#define __NR_syscalls 429
/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index 737db1828437..980cc1d2b8d4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1714,7 +1714,7 @@ const struct file_operations pidfd_fops = {
* Return: On success, a cloexec pidfd is returned.
* On error, a negative errno number will be returned.
*/
-static int pidfd_create(struct pid *pid)
+int pidfd_create(struct pid *pid)
{
int fd;
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..4afca3d6dcb8 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -38,6 +38,7 @@
#include <linux/syscalls.h>
#include <linux/proc_ns.h>
#include <linux/proc_fs.h>
+#include <linux/sched/signal.h>
#include <linux/sched/task.h>
#include <linux/idr.h>
@@ -451,6 +452,55 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
return idr_get_next(&ns->idr, &nr);
}
+/**
+ * pidfd_open() - Open new pid file descriptor.
+ *
+ * @pid: pid for which to retrieve a pidfd
+ * @flags: flags to pass
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set for
+ * the process identified by @pid. Currently, the process identified by
+ * @pid must be a thread-group leader. This restriction currently exists
+ * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
+ * be used with CLONE_THREAD) and pidfd polling (only supports thread group
+ * leaders).
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
+{
+ int fd, ret;
+ struct pid *p;
+ struct task_struct *tsk;
+
+ if (flags)
+ return -EINVAL;
+
+ if (pid <= 0)
+ return -EINVAL;
+
+ p = find_get_pid(pid);
+ if (!p)
+ return -ESRCH;
+
+ ret = 0;
+ rcu_read_lock();
+ /*
+ * If this returns non-NULL the pid was used as a thread-group
+ * leader. Note, we race with exec here: If it changes the
+ * thread-group leader we might return the old leader.
+ */
+ tsk = pid_task(p, PIDTYPE_TGID);
+ if (!tsk)
+ ret = -ESRCH;
+ rcu_read_unlock();
+
+ fd = ret ?: pidfd_create(p);
+ put_pid(p);
+ return fd;
+}
+
void __init pid_idr_init(void)
{
/* Verify no one has done anything silly: */
--
2.21.0
Atom-based CPUs trigger a stack fault when the 32-bit SYSENTER instruction
is invoked with invalid register values, so we also need SIGBUS handling in
this case. The following is the assembly at the point where the fault happens.
(gdb) disassemble $eip
Dump of assembler code for function __kernel_vsyscall:
0xf7fd8fe0 <+0>: push %ecx
0xf7fd8fe1 <+1>: push %edx
0xf7fd8fe2 <+2>: push %ebp
0xf7fd8fe3 <+3>: mov %esp,%ebp
0xf7fd8fe5 <+5>: sysenter
0xf7fd8fe7 <+7>: int $0x80
=> 0xf7fd8fe9 <+9>: pop %ebp
0xf7fd8fea <+10>: pop %edx
0xf7fd8feb <+11>: pop %ecx
0xf7fd8fec <+12>: ret
End of assembler dump.
According to the Intel SDM, this can also be a Stack Segment Fault (#SS, vector 12)
rather than only a normal Page Fault (#PF, vector 14). In particular, section 6.9
of Vol. 3A places both stack and page faults in the 10th (lowest priority) class,
and as it says, "exceptions within each class are implementation-dependent and
may vary from processor to processor". So it is expected that processors like
Intel Atom trigger a stack fault (SIGBUS), while common Core processors deliver
a page fault (SIGSEGV).
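In plain sigaction() terms (a sketch, independent of the selftest's sethandler()
helper), registering one handler for both signals looks like this; it assumes an
alternate signal stack has already been set up with sigaltstack(), as the test does:

/* Sketch: one handler for both SIGSEGV and SIGBUS, delivered on the
 * alternate stack. Assumes sigaltstack() was configured beforehand.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void segv_or_bus(int sig, siginfo_t *info, void *ctx_void)
{
	static const char msg[] = "caught SIGSEGV or SIGBUS\n";

	(void)sig;
	(void)info;
	(void)ctx_void;
	/* The real test inspects the ucontext and siglongjmp()s back;
	 * here we just note the fault and bail out (async-signal-safe).
	 */
	write(STDERR_FILENO, msg, sizeof(msg) - 1);
	_exit(0);
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_or_bus;
	sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
	sigemptyset(&sa.sa_mask);

	if (sigaction(SIGSEGV, &sa, NULL) || sigaction(SIGBUS, &sa, NULL)) {
		perror("sigaction");
		return 1;
	}

	/* ... trigger the faulting SYSENTER path here ... */
	return 0;
}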
Signed-off-by: Tong Bo <bo.tong(a)intel.com>
Acked-by: Andy Lutomirski <luto(a)kernel.org>
---
tools/testing/selftests/x86/syscall_arg_fault.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/x86/syscall_arg_fault.c b/tools/testing/selftests/x86/syscall_arg_fault.c
index 7db4fc9..d2548401 100644
--- a/tools/testing/selftests/x86/syscall_arg_fault.c
+++ b/tools/testing/selftests/x86/syscall_arg_fault.c
@@ -43,7 +43,7 @@ static sigjmp_buf jmpbuf;
static volatile sig_atomic_t n_errs;
-static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
+static void sigsegv_or_sigbus(int sig, siginfo_t *info, void *ctx_void)
{
ucontext_t *ctx = (ucontext_t*)ctx_void;
@@ -73,7 +73,13 @@ int main()
if (sigaltstack(&stack, NULL) != 0)
err(1, "sigaltstack");
- sethandler(SIGSEGV, sigsegv, SA_ONSTACK);
+ sethandler(SIGSEGV, sigsegv_or_sigbus, SA_ONSTACK);
+ /*
+ * The actual exception can vary. On Atom CPUs, we get #SS
+ * instead of #PF when the vDSO fails to access the stack when
+ * ESP is too close to 2^32, and #SS causes SIGBUS.
+ */
+ sethandler(SIGBUS, sigsegv_or_sigbus, SA_ONSTACK);
sethandler(SIGILL, sigill, SA_ONSTACK);
/*
--
2.7.4
Hi Linus,
Please pull the following Kselftest update for Linux 5.2-rc1
This second kselftest update for Linux 5.2-rc1 consists of
Kselftest framework fixes from Shuah Khan
- kselftest framework bpf build/test workflow regression fix
- Fix to kselftest install to use default install path
- Fix to kselftest KBUILD_OUTPUT builds to not clutter main
KBUILD_OUTPUT directory with selftest objects
- .gitignore fixes from Kelsey Skunberg
- rseq selftests updates from Mathieu Desnoyers and Martin Schwidefsky:
They change the per-architecture pre-abort signatures to ensure those
are valid trap instructions.
The way exit points are presented to debuggers is enhanced, ensuring
all exit points are present, so debuggers don't have to disassemble
rseq critical section to properly skip over them.
Discussions with the glibc community are reaching a consensus on
exposing a __rseq_handled symbol from glibc to coexist with rseq early
adopters.
Update the rseq selftest code to expose and use this symbol.
Support for compiling asm goto with clang is added with the
"-no-integrated-as" compiler switch, similarly to the top level kernel
Makefile.
- kselftest Makefile test run output refactoring and making test
output TAP13 compliant from Kees Cook:
This re-factors the selftest Makefiles to extract the test running
logic to be reused between "run_tests" and "emit_tests", while also
fixing up the test output to be TAP version 13 compliant:
- added "plan" line
- fixed result line syntax
- moved all test output to be "# "-prefixed as TAP "diagnostic"
lines
The prefixing code includes a fallback mode for limited execution
environments.
Additionally, the plan lines are fixed for all callers of kselftest.h.
diff is attached.
Also, there is a merge conflict with the following commit, which went
through the s390 maintainer tree and is now in your master; it will
require a minor fix-up.
commit e24e4712efad737ca09ff299276737331cd021d9
Author: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
rseq/selftests: s390: use trap4 for RSEQ_SIG
thanks,
-- Shuah
---------------------------------------------------------------
The following changes since commit d917fb876f6eaeeea8a2b620d2a266ce26372f4d:
selftests: build and run gpio when output directory is the src dir
(2019-04-22 17:02:26 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-5.2-rc1-2
for you to fetch changes up to 61c2018c0743fe0c9ca68e308b5727b8a7c3d544:
selftests: avoid KBUILD_OUTPUT dir cluttering with selftest objects
(2019-05-14 17:37:41 -0600)
----------------------------------------------------------------
linux-kselftest-5.2-rc1-2
This second kselftest update for Linux 5.2-rc1 consists of
Kselftest framework fixes from Shuah Khan
- kselftest framework bpf build/test workflow regression fix
- Fix to kselftest install to use default install path
- Fix to kselftest KBUILD_OUTPUT builds to not clutter main
KBUILD_OUTPUT directory with selftest objects
- .gitignore fixes from Kelsey Skunberg
- rseq selftests updates from Mathieu Desnoyers and Martin Schwidefsky:
They change the per-architecture pre-abort signatures to ensure those
are valid trap instructions.
The way exit points are presented to debuggers is enhanced, ensuring
all exit points are present, so debuggers don't have to disassemble
rseq critical section to properly skip over them.
Discussions with the glibc community are reaching a consensus on exposing
a __rseq_handled symbol from glibc to coexist with rseq early adopters.
Update the rseq selftest code to expose and use this symbol.
Support for compiling asm goto with clang is added with the
"-no-integrated-as" compiler switch, similarly to the top level kernel
Makefile.
- kselftest Makefile test run output refactoring and making test
output TAP13 compliant from Kees Cook:
This re-factors the selftest Makefiles to extract the test running logic
to be reused between "run_tests" and "emit_tests", while also fixing
up the test output to be TAP version 13 compliant:
- added "plan" line
- fixed result line syntax
- moved all test output to be "# "-prefixed as TAP "diagnostic"
lines
The prefixing code includes a fallback mode for limited execution
environments.
Additionally, the plan lines are fixed for all callers of kselftest.h.
----------------------------------------------------------------
Kees Cook (8):
selftests: Extract single-test shell logic from lib.mk
selftests: Use runner.sh for emit targets
selftests: Extract logic for multiple test runs
selftests: Add plan line and fix result line syntax
selftests: Distinguish between missing and non-executable
selftests: Move test output to diagnostic lines
selftests: Remove KSFT_TAP_LEVEL
selftests: Add test plan API to kselftest.h and adjust callers
Kelsey Skunberg (2):
selftests: pidfd: Create .gitignore to include pidfd_test
selftests: drivers: Create .gitignore to include /dma-buf/udmabuf
Martin Schwidefsky (1):
rseq/selftests: s390: use trap4 for RSEQ_SIG
Mathieu Desnoyers (11):
rseq/selftests: x86: Work-around bogus gcc-8 optimisation
rseq/selftests: Add __rseq_exit_point_array section for debuggers
rseq/selftests: Introduce __rseq_cs_ptr_array, rename
__rseq_table to __rseq_cs
rseq/selftests: Use __rseq_handled symbol to coexist with glibc
rseq/selftests: s390: use jg instruction for jumps outside of the asm
rseq/selftests: x86: use ud1 instruction as RSEQ_SIG opcode
rseq/selftests: arm: use udf instruction for RSEQ_SIG
rseq/selftests: aarch64 code signature: handle big-endian environment
rseq/selftests: powerpc code signature: generate valid instructions
rseq/selftests: mips: use break instruction for RSEQ_SIG
rseq/selftests: add -no-integrated-as for clang
Shuah Khan (3):
selftests: fix install target to use default install path
selftests: fix bpf build/test workflow regression when
KBUILD_OUTPUT is set
selftests: avoid KBUILD_OUTPUT dir cluttering with selftest objects
tools/testing/selftests/.gitignore | 1 -
tools/testing/selftests/Makefile | 39 +--
.../selftests/breakpoints/breakpoint_test.c | 15 +-
.../selftests/breakpoints/breakpoint_test_arm64.c | 3 +-
.../breakpoints/step_after_suspend_test.c | 8 +
tools/testing/selftests/capabilities/test_execve.c | 6 +-
tools/testing/selftests/drivers/.gitignore | 1 +
.../selftests/futex/functional/futex_requeue_pi.c | 1 +
.../functional/futex_requeue_pi_mismatched_ops.c | 1 +
.../functional/futex_requeue_pi_signal_restart.c | 1 +
.../functional/futex_wait_private_mapped_file.c | 1 +
.../futex/functional/futex_wait_timeout.c | 1 +
.../functional/futex_wait_uninitialized_heap.c | 1 +
.../futex/functional/futex_wait_wouldblock.c | 1 +
tools/testing/selftests/kselftest.h | 17 +-
tools/testing/selftests/kselftest/prefix.pl | 23 ++
tools/testing/selftests/kselftest/runner.sh | 86 +++++++
tools/testing/selftests/lib.mk | 76 ++----
.../testing/selftests/membarrier/membarrier_test.c | 1 +
tools/testing/selftests/pidfd/.gitignore | 1 +
tools/testing/selftests/pidfd/pidfd_test.c | 1 +
tools/testing/selftests/rseq/Makefile | 8 +-
tools/testing/selftests/rseq/rseq-arm.h | 132 +++++++++--
tools/testing/selftests/rseq/rseq-arm64.h | 74 +++++-
tools/testing/selftests/rseq/rseq-mips.h | 115 ++++++++-
tools/testing/selftests/rseq/rseq-ppc.h | 90 ++++++-
tools/testing/selftests/rseq/rseq-s390.h | 78 +++++-
tools/testing/selftests/rseq/rseq-x86.h | 264 ++++++++++++++
tools/testing/selftests/rseq/rseq.c | 55 ++++-
tools/testing/selftests/rseq/rseq.h | 1 +
tools/testing/selftests/sigaltstack/sas.c | 1 +
tools/testing/selftests/sync/sync_test.c | 1 +
32 files changed, 883 insertions(+), 221 deletions(-)
create mode 100644 tools/testing/selftests/drivers/.gitignore
create mode 100755 tools/testing/selftests/kselftest/prefix.pl
create mode 100644 tools/testing/selftests/kselftest/runner.sh
create mode 100644 tools/testing/selftests/pidfd/.gitignore
---------------------------------------------------------------
This series of patches moves the commonly used bpf_printk macro to
bpf_helpers.h, which is already included in all of the BPF programs
that defined that macro on their own.
v1->v2:
- If HBM_DEBUG is not defined in hbm sample, undefine bpf_printk and set
an empty macro for it.
Michal Rostecki (2):
selftests: bpf: Move bpf_printk to bpf_helpers.h
samples: bpf: Do not define bpf_printk macro
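For context, the macro being consolidated is conventionally defined along these
lines (a sketch from memory of the selftests' bpf_helpers.h, not the literal
hunk); it copies the format string onto the BPF stack and hands it to the
bpf_trace_printk() helper:

/* Sketch of the bpf_printk() convenience macro used in BPF C programs;
 * the exact definition lives in tools/testing/selftests/bpf/bpf_helpers.h.
 */
#define bpf_printk(fmt, ...)					\
({								\
	char ____fmt[] = fmt;					\
	bpf_trace_printk(____fmt, sizeof(____fmt),		\
			 ##__VA_ARGS__);			\
})

A BPF program then simply calls bpf_printk("pid %d\n", pid); and the output can
be read from /sys/kernel/debug/tracing/trace_pipe.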
samples/bpf/hbm_kern.h | 11 ++---------
samples/bpf/tcp_basertt_kern.c | 7 -------
samples/bpf/tcp_bufs_kern.c | 7 -------
samples/bpf/tcp_clamp_kern.c | 7 -------
samples/bpf/tcp_cong_kern.c | 7 -------
samples/bpf/tcp_iw_kern.c | 7 -------
samples/bpf/tcp_rwnd_kern.c | 7 -------
samples/bpf/tcp_synrto_kern.c | 7 -------
samples/bpf/tcp_tos_reflect_kern.c | 7 -------
samples/bpf/xdp_sample_pkts_kern.c | 7 -------
tools/testing/selftests/bpf/bpf_helpers.h | 8 ++++++++
.../testing/selftests/bpf/progs/sockmap_parse_prog.c | 7 -------
.../selftests/bpf/progs/sockmap_tcp_msg_prog.c | 7 -------
.../selftests/bpf/progs/sockmap_verdict_prog.c | 7 -------
.../testing/selftests/bpf/progs/test_lwt_seg6local.c | 7 -------
tools/testing/selftests/bpf/progs/test_xdp_noinline.c | 7 -------
tools/testing/selftests/bpf/test_sockmap_kern.h | 7 -------
17 files changed, 10 insertions(+), 114 deletions(-)
--
2.21.0
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:
int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
With the introduction of pidfds through CLONE_PIDFD it is possible to
create pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller currently cannot create a pollable pidfd. This is a huge problem
for Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this API. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes, we enable them to adopt this API.
In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Andy Lutomirsky <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-api(a)vger.kernel.org
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/pid.h | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 2 +-
kernel/pid.c | 48 +++++++++++++++++++++
19 files changed, 69 insertions(+), 2 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 165f268beafc..ddc3c93ad7a7 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -467,3 +467,4 @@
535 common io_uring_setup sys_io_uring_setup
536 common io_uring_enter sys_io_uring_enter
537 common io_uring_register sys_io_uring_register
+538 common pidfd_open sys_pidfd_open
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 23f1a44acada..350e2049b4a9 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -874,6 +874,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
#define __NR_io_uring_register 427
__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 56e3d0b685e1..7115f6dd347a 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -348,3 +348,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index df4ec3ec71d1..44bf12b16ffe 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -427,3 +427,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 4964947732af..0d32e5152dc0 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 9392dfe33f97..726e107b3c9f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -366,3 +366,4 @@
425 n32 io_uring_setup sys_io_uring_setup
426 n32 io_uring_enter sys_io_uring_enter
427 n32 io_uring_register sys_io_uring_register
+428 n32 pidfd_open sys_pidfd_open
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index fe8ca623add8..83b46b568d51 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -424,3 +424,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 00f5a63c8d9a..5294d04d7fa5 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -509,3 +509,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 061418f787c3..dcdb838adf49 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
425 common io_uring_setup sys_io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open sys_pidfd_open
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 480b057556ee..8e66edfbc521 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index a1dd24307b00..d6f3bc686939 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..1af6b469160a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup
426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter
427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register
+428 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..c18e6ebe3387 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
425 common io_uring_setup __x64_sys_io_uring_setup
426 common io_uring_enter __x64_sys_io_uring_enter
427 common io_uring_register __x64_sys_io_uring_register
+428 common pidfd_open __x64_sys_pidfd_open
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 30084eaf8422..21ee795f3003 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -398,3 +398,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..c938a92eab99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -67,6 +67,7 @@ struct pid
extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
+extern int pidfd_create(struct pid *pid);
static inline struct pid *get_pid(struct pid *pid)
{
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..989055e0b501 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
struct old_timex32 __user *tx);
asmlinkage long sys_syncfs(int fd);
asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
unsigned int vlen, unsigned flags);
asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dee7292e1df6..94a257a93d20 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
#define __NR_io_uring_register 427
__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#undef __NR_syscalls
-#define __NR_syscalls 428
+#define __NR_syscalls 429
/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index 737db1828437..980cc1d2b8d4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1714,7 +1714,7 @@ const struct file_operations pidfd_fops = {
* Return: On success, a cloexec pidfd is returned.
* On error, a negative errno number will be returned.
*/
-static int pidfd_create(struct pid *pid)
+int pidfd_create(struct pid *pid)
{
int fd;
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..237d18d6ecb8 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -38,6 +38,7 @@
#include <linux/syscalls.h>
#include <linux/proc_ns.h>
#include <linux/proc_fs.h>
+#include <linux/sched/signal.h>
#include <linux/sched/task.h>
#include <linux/idr.h>
@@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
return idr_get_next(&ns->idr, &nr);
}
+/**
+ * pidfd_open() - Open new pid file descriptor.
+ *
+ * @pid: pid for which to retrieve a pidfd
+ * @flags: flags to pass
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set for
+ * the process identified by @pid. Currently, the process identified by
+ * @pid must be a thread-group leader. This restriction currently exists
+ * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
+ * be used with CLONE_THREAD) and pidfd polling (only supports thread group
+ * leaders).
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
+{
+ int fd, ret;
+ struct pid *p;
+ struct task_struct *tsk;
+
+ if (flags)
+ return -EINVAL;
+
+ if (pid <= 0)
+ return -EINVAL;
+
+ p = find_get_pid(pid);
+ if (!p)
+ return -ESRCH;
+
+ rcu_read_lock();
+ tsk = pid_task(p, PIDTYPE_PID);
+ if (!tsk)
+ ret = -ESRCH;
+ else if (unlikely(!thread_group_leader(tsk)))
+ ret = -EINVAL;
+ else
+ ret = 0;
+ rcu_read_unlock();
+
+ fd = ret ?: pidfd_create(p);
+ put_pid(p);
+ return fd;
+}
+
void __init pid_idr_init(void)
{
/* Verify no one has done anything silly: */
--
2.21.0
From: Paolo Bonzini <pbonzini(a)redhat.com>
[ Upstream commit 76d58e0f07ec203bbdfcaabd9a9fc10a5a3ed5ea ]
If a memory slot's size is not a multiple of 64 pages (256K), then
the KVM_CLEAR_DIRTY_LOG API is unusable: clearing the final chunk of fewer than 64 pages
either requires the requested page range to go beyond memslot->npages,
or requires log->num_pages to be unaligned, and kvm_clear_dirty_log_protect
requires log->num_pages to be both in range and aligned.
To allow this case, allow log->num_pages not to be a multiple of 64 if
it ends exactly on the last page of the slot.
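As an illustration, a userspace sketch of clearing the unaligned tail of a
slot (the clear_tail() helper and the 1000-page slot geometry are made up;
the struct and ioctl names are the existing KVM UAPI):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Hypothetical slot of 1000 pages: the final range 960..999 is only
 * 40 pages, which is now accepted because it ends on the last page of
 * the slot, even though 40 is not a multiple of 64.
 */
static int clear_tail(int vm_fd, __u32 slot, void *bitmap)
{
	struct kvm_clear_dirty_log log = {
		.slot = slot,
		.first_page = 960,	/* must stay 64-aligned */
		.num_pages = 40,
		.dirty_bitmap = bitmap,
	};

	return ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &log);
}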
Reported-by: Peter Xu <peterx(a)redhat.com>
Fixes: 98938aa8edd6 ("KVM: validate userspace input in kvm_clear_dirty_log_protect()", 2019-01-02)
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
Documentation/virtual/kvm/api.txt | 5 +++--
tools/testing/selftests/kvm/dirty_log_test.c | 9 ++++++---
virt/kvm/kvm_main.c | 7 ++++---
3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ba8927c0d45c4..a1b8e6d92298c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3790,8 +3790,9 @@ The ioctl clears the dirty status of pages in a memory slot, according to
the bitmap that is passed in struct kvm_clear_dirty_log's dirty_bitmap
field. Bit 0 of the bitmap corresponds to page "first_page" in the
memory slot, and num_pages is the size in bits of the input bitmap.
-Both first_page and num_pages must be a multiple of 64. For each bit
-that is set in the input bitmap, the corresponding page is marked "clean"
+first_page must be a multiple of 64; num_pages must also be a multiple of
+64 unless first_page + num_pages is the size of the memory slot. For each
+bit that is set in the input bitmap, the corresponding page is marked "clean"
in KVM's dirty bitmap, and dirty tracking is re-enabled for that page
(for example via write-protection, or by clearing the dirty bit in
a page table entry).
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 4715cfba20dce..93f99c6b7d79e 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -288,8 +288,11 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
#endif
max_gfn = (1ul << (guest_pa_bits - guest_page_shift)) - 1;
guest_page_size = (1ul << guest_page_shift);
- /* 1G of guest page sized pages */
- guest_num_pages = (1ul << (30 - guest_page_shift));
+ /*
+ * A little more than 1G of guest page sized pages. Cover the
+ * case where the size is not aligned to 64 pages.
+ */
+ guest_num_pages = (1ul << (30 - guest_page_shift)) + 3;
host_page_size = getpagesize();
host_num_pages = (guest_num_pages * guest_page_size) / host_page_size +
!!((guest_num_pages * guest_page_size) % host_page_size);
@@ -359,7 +362,7 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
kvm_vm_get_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap);
#ifdef USE_CLEAR_DIRTY_LOG
kvm_vm_clear_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap, 0,
- DIV_ROUND_UP(host_num_pages, 64) * 64);
+ host_num_pages);
#endif
vm_dirty_log_verify(bmap);
iteration++;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b4f2d892a1d36..c9b102e65a3a7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1241,7 +1241,7 @@ int kvm_clear_dirty_log_protect(struct kvm *kvm,
if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
return -EINVAL;
- if ((log->first_page & 63) || (log->num_pages & 63))
+ if (log->first_page & 63)
return -EINVAL;
slots = __kvm_memslots(kvm, as_id);
@@ -1254,8 +1254,9 @@ int kvm_clear_dirty_log_protect(struct kvm *kvm,
n = kvm_dirty_bitmap_bytes(memslot);
if (log->first_page > memslot->npages ||
- log->num_pages > memslot->npages - log->first_page)
- return -EINVAL;
+ log->num_pages > memslot->npages - log->first_page ||
+ (log->num_pages < memslot->npages - log->first_page && (log->num_pages & 63)))
+ return -EINVAL;
*flush = false;
dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
--
2.20.1
Linus, Greg, Masahiro,
Here are some simple fixes for the kheaders feature. Please consider these
patches for an rc release. They are based on Linus's master branch. The only
difference between the last series [1] and this one is that I squashed 1/3
and 3/3 and rebased. Thanks!
[1] https://patchwork.kernel.org/cover/10939557/
Joel Fernandes (Google) (2):
kheaders: Move from proc to sysfs
kheaders: Do not regenerate archive if config is not changed
init/Kconfig | 17 +++++----
kernel/Makefile | 4 +--
kernel/{gen_ikh_data.sh => gen_kheaders.sh} | 17 ++++++---
kernel/kheaders.c | 40 +++++++++------------
4 files changed, 38 insertions(+), 40 deletions(-)
rename kernel/{gen_ikh_data.sh => gen_kheaders.sh} (82%)
--
2.21.0.1020.gf2820cf01a-goog
## TLDR
I mostly wanted to incorporate feedback I got over the last week and a
half.
Biggest things to look out for:
- KUnit core now outputs results in TAP14.
- Heavily reworked tools/testing/kunit/kunit.py
- Changed how parsing works.
- Added testing.
- Greg, Logan, you might want to re-review this.
- Added documentation on how to use KUnit on non-UML kernels. You can
see the docs rendered here[1].
There is still some discussion going on on the [PATCH v2 00/17] thread,
but I wanted to get some of these updates out before they got too stale
(and too difficult for me to keep track of). I hope no one minds.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in under a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
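As a rough sketch of what a test looks like (the example_add() function and
test names are invented for illustration, and the exact macro spellings may
differ slightly in this revision of the series):

#include <kunit/test.h>

/* The code under test: a trivial helper, purely for illustration. */
static int example_add(int a, int b)
{
	return a + b;
}

static void example_add_test(struct kunit *test)
{
	KUNIT_EXPECT_EQ(test, 3, example_add(1, 2));
	KUNIT_EXPECT_EQ(test, 0, example_add(-1, 1));
}

static struct kunit_case example_test_cases[] = {
	KUNIT_CASE(example_add_test),
	{}
};

static struct kunit_suite example_test_suite = {
	.name = "example",
	.test_cases = example_test_cases,
};
kunit_test_suite(example_test_suite);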
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitude faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily, solving the classic problem
of difficulty in exercising error handling code.
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[2].
Additionally for convenience, I have applied these patches to a
branch[3].
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.1/v3 branch.
## Changes Since Last Version
- Converted KUnit core to print test results in TAP14 format as
suggested by Greg and Frank.
- Heavily reworked tools/testing/kunit/kunit.py
- Changed how parsing works.
- Added testing.
- Added documentation on how to use KUnit on non-UML kernels. You can
see the docs rendered here[1].
- Added a new set of EXPECTs and ASSERTs for pointer comparison.
- Removed more function indirection as suggested by Logan.
- Added a new patch that adds `kunit_try_catch_throw` to objtool's
noreturn list.
- Fixed a number of minorish issues pointed out by Shuah, Masahiro, and
kbuild bot.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#kuni…
[2] https://google.github.io/kunit-docs/third_party/kernel/docs/
[3] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.1/v3
--
2.21.0.1020.gf2820cf01a-goog
The cgroup testing relies on the root cgroup's subtree_control setting.
If the 'memory' controller isn't set there, all test cases fail
as follows:
$ sudo ./test_memcontrol
not ok 1 test_memcg_subtree_control
not ok 2 test_memcg_current
ok 3 # skip test_memcg_min
not ok 4 test_memcg_low
not ok 5 test_memcg_high
not ok 6 test_memcg_max
not ok 7 test_memcg_oom_events
ok 8 # skip test_memcg_swap_max
not ok 9 test_memcg_sock
not ok 10 test_memcg_oom_group_leaf_events
not ok 11 test_memcg_oom_group_parent_events
not ok 12 test_memcg_oom_group_score_events
To correct this unexpected failure, this patch writes 'memory' to the
root cgroup's subtree_control to get the right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Jay Kamat <jgkamat(a)fb.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
---
tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 6f339882a6ca..73612d604a2a 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1205,6 +1205,10 @@ int main(int argc, char **argv)
if (cg_read_strstr(root, "cgroup.controllers", "memory"))
ksft_exit_skip("memory controller isn't available\n");
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set root memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.19.1.856.g8858448bb
Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid, which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and when it checks for the existence
of the PID, the wrong process may end up being checked for existence.
This patch adds polling support to pidfd. Using the polling support, LMK
will be able to get notified when a process exits in a race-free and fast
way, and it allows the LMK to do other things (such as polling on other
fds) while waiting for the process being killed to die.
For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when
the tasks waiting on a poll of pidfd are also awakened in this patch.
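For illustration, a minimal userspace sketch of waiting on a pidfd (the
wait_for_exit() helper is invented; the pidfd itself is assumed to come
from the CLONE_PIDFD series this patch is based on):

#include <poll.h>

/* Block until the whole thread group behind @pidfd has exited. */
static int wait_for_exit(int pidfd)
{
	struct pollfd pfd = {
		.fd = pidfd,
		.events = POLLIN,
	};

	/*
	 * poll() only reports the fd readable once the entire thread
	 * group is gone; until then it blocks, like the wait(2) family.
	 */
	return poll(&pfd, 1, -1);
}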
We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
it in task_struct would not be an option in this case because the task can
be reaped and destroyed before the poll returns.
2. Including the waitqueue in struct pid means that during de_thread(),
the new thread group leader automatically gets the new waitqueue/pid
even though its task_struct is different.
Appropriate test cases are added in the second patch to provide coverage
of all the cases the patch is handling.
Andy had a similar patch [1] in the past which was a good reference;
however, this patch tries to properly handle different situations related
to thread group existence and how/where it notifies. It also solves
other bugs (waitqueue lifetime). Daniel had a similar patch [2]
recently which this patch supersedes.
[1] https://lore.kernel.org/patchwork/patch/345098/
[2] https://lore.kernel.org/lkml/20181029175322.189042-1-dancol@google.com/
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Daniel Colascione <dancol(a)google.com>
Cc: Christian Brauner <christian(a)brauner.io>
Cc: Jann Horn <jannh(a)google.com>
Cc: Tim Murray <timmurray(a)google.com>
Cc: Jonathan Kowalski <bl0pbl33p(a)gmail.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: kernel-team(a)android.com
(Oleg improved the code by showing how to avoid tasklist_lock)
Suggested-by: Oleg Nesterov <oleg(a)redhat.com>
Co-developed-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
v1 -> v2:
* Restructure poll code to avoid tasklist_lock (Oleg)
* use task_pid instead of get_pid_task in notify_pidfd (Oleg)
* Added comments to code, commit message nits (Christian)
* Test case nits/improvements (Christian)
RFC -> v1:
* Based on CLONE_PIDFD patches: https://lwn.net/Articles/786244/
* Updated selftests.
* Renamed poll wake function to do_notify_pidfd.
* Removed depending on EXIT flags
* Removed POLLERR flag since semantics are controversial and
we don't have usecases for it right now (later we can add if there's
a need for it).
include/linux/pid.h | 3 +++
kernel/fork.c | 29 +++++++++++++++++++++++++++++
kernel/pid.c | 2 ++
kernel/signal.c | 11 +++++++++++
4 files changed, 45 insertions(+)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..1484db6ca8d1 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -3,6 +3,7 @@
#define _LINUX_PID_H
#include <linux/rculist.h>
+#include <linux/wait.h>
enum pid_type
{
@@ -60,6 +61,8 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+ /* wait queue for pidfd notifications */
+ wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
};
diff --git a/kernel/fork.c b/kernel/fork.c
index 5525837ed80e..721f8c9d2921 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1685,8 +1685,37 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
}
#endif
+/*
+ * Poll support for process exit notification.
+ */
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+ struct task_struct *task;
+ struct pid *pid = file->private_data;
+ int poll_flags = 0;
+
+ poll_wait(file, &pid->wait_pidfd, pts);
+
+ rcu_read_lock();
+ task = pid_task(pid, PIDTYPE_PID);
+ WARN_ON_ONCE(task && !thread_group_leader(task));
+
+ /*
+ * Inform pollers only when the whole thread group exits, if thread
+ * group leader exits before all other threads in the group, then
+ * poll(2) should block, similar to the wait(2) family.
+ */
+ if (!task || (task->exit_state && thread_group_empty(task)))
+ poll_flags = POLLIN | POLLRDNORM;
+ rcu_read_unlock();
+
+ return poll_flags;
+}
+
+
const struct file_operations pidfd_fops = {
.release = pidfd_release,
+ .poll = pidfd_poll,
#ifdef CONFIG_PROC_FS
.show_fdinfo = pidfd_show_fdinfo,
#endif
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..5c90c239242f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
+ init_waitqueue_head(&pid->wait_pidfd);
+
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->pid_allocated & PIDNS_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index 1581140f2d99..a17fff073c3d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1800,6 +1800,14 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
return ret;
}
+static void do_notify_pidfd(struct task_struct *task)
+{
+ struct pid *pid;
+
+ pid = task_pid(task);
+ wake_up_all(&pid->wait_pidfd);
+}
+
/*
* Let a parent know about the death of a child.
* For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1823,6 +1831,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
BUG_ON(!tsk->ptrace &&
(tsk->group_leader != tsk || !thread_group_empty(tsk)));
+ /* Wake up all pidfd waiters */
+ do_notify_pidfd(tsk);
+
if (sig != SIGCHLD) {
/*
* This is only possible if parent == real_parent.
--
2.21.0.593.g511ec345e18-goog
On Wed, May 08, 2019 at 05:39:07PM +0200, Peter Zijlstra wrote:
> On Wed, May 08, 2019 at 07:42:48AM -0500, Josh Poimboeuf wrote:
> > On Wed, May 08, 2019 at 02:04:16PM +0200, Peter Zijlstra wrote:
>
> > > Do the x86_64 variants also want some ORC annotation?
> >
> > Maybe so. Though it looks like regs->ip isn't saved. The saved
> > registers might need to be tweaked. I'll need to look into it.
>
> What all these sites do (and maybe we should look at unifying them
> somehow) is turn a CALL frame (aka RET-IP) into an exception frame (aka
> pt_regs).
>
> So regs->ip will be the return address (which is fixed up to be the CALL
> address in the handler).
But from what I can tell, trampoline_handler() hard-codes regs->ip to
point to kretprobe_trampoline(), and the original return address is
placed in regs->sp.
Masami, is there a reason why regs->ip doesn't have the original return
address and regs->sp doesn't have the original SP? I think that would
help the unwinder understand things.
--
Josh
The following files are generated after building /selftests/bpf/ and
should be added to .gitignore:
- libbpf.pc
- libbpf.so.*
Signed-off-by: Kelsey Skunberg <skunberg.kelsey(a)gmail.com>
---
Changes since v1:
- Add libbpf.so.* in place of libbpf.so.0 and
libbpf.so.0.0.3
- Update commit log to reflect file change
tools/testing/selftests/bpf/.gitignore | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 41e8a689aa77..a877803e4ba8 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -32,3 +32,5 @@ test_tcpnotify_user
test_libbpf
test_tcp_check_syncookie_user
alu32
+libbpf.pc
+libbpf.so.*
--
2.20.1