Historically, kretprobe has always produced unusable stack traces
(kretprobe_trampoline is the only entry in most cases, because of the
funky stack pointer overwriting). This has caused quite a few annoyances
when using tracing to debug problems[1] -- since return values are only
available with kretprobes but stack traces were only usable for kprobes,
users had to probe both and then manually associate them.
This patch series stores the stack trace within kretprobe_instance on
the kprobe entry used to set up the kretprobe. This allows for
DTrace-style stack aggregation between function entry and exit with
tools like BPFtrace -- which would not really be doable if the stack
unwinder understood kretprobe_trampoline.
We also revert commit 76094a2cf46e ("ftrace: distinguish kretprobe'd
functions in trace logs") and any follow-up changes because that code is
no longer necessary now that stack traces are sane. *However* this patch
might be a bit contentious since the original usecase (that ftrace
returns shouldn't show kretprobe_trampoline) is arguably still an
issue. Feel free to drop it if you think it is wrong.
Patch changelog:
v2:
* documentation: mention kretprobe stack-stashing
* ftrace: add self-test for fixed kretprobe stacktraces
* ftrace: remove [unknown/kretprobe'd] handling
* kprobe: remove needless EXPORT statements
* kprobe: minor corrections to current_kretprobe_instance (switch
away from hlist_for_each_entry_safe)
* kprobe: make maximum stack size 127, which is the ftrace default
(I forgot to Cc the BPF folks in v1, I've added them now.)
Aleksa Sarai (2):
kretprobe: produce sane stack traces
trace: remove kretprobed checks
Documentation/kprobes.txt | 6 +-
include/linux/kprobes.h | 15 +++
kernel/events/callchain.c | 8 +-
kernel/kprobes.c | 101 +++++++++++++++++-
kernel/trace/trace.c | 11 +-
kernel/trace/trace_output.c | 34 +-----
.../test.d/kprobe/kretprobe_stacktrace.tc | 25 +++++
7 files changed, 165 insertions(+), 35 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc
--
2.19.1
Discussions around time virtualization are there for a long time.
The first attempt to implement time namespace was in 2006 by Jeff Dike.
>From that time, the topic appears on and off in various discussions.
There are two main use cases for time namespaces:
1. change date and time inside a container;
2. adjust clocks for a container restored from a checkpoint.
“It seems like this might be one of the last major obstacles keeping
migration from being used in production systems, given that not all
containers and connections can be migrated as long as a time dependency
is capable of messing it up.” (by github.com/dav-ell)
The kernel provides access to several clocks: CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
start points for them are not defined and are different for each running
system. When a container is migrated from one node to another, all
clocks have to be restored into consistent states; in other words, they
have to continue running from the same points where they have been
dumped.
The main idea behind this patch set is adding per-namespace offsets for
system clocks. When a process in a non-root time namespace requests
time of a clock, a namespace offset is added to the current value of
this clock on a host and the sum is returned.
All offsets are placed on a separate page, this allows up to map it as
part of vvar into user processes and use offsets from vdso calls.
Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
clocks.
Questions to discuss:
* Clone flags exhaustion. Currently there is only one unused clone flag
bit left, and it may be worth to use it to extend arguments of the clone
system call.
* Realtime clock implementation details:
Is having a simple offset enough?
What to do when date and time is changed on the host?
Is there a need to adjust vfs modification and creation times?
Implementation for adjtime() syscall.
Cc: Dmitry Safonov <0x7f454c46(a)gmail.com>
Cc: Adrian Reber <adrian(a)lisas.de>
Cc: Andrei Vagin <avagin(a)openvz.org>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Christian Brauner <christian.brauner(a)ubuntu.com>
Cc: Cyrill Gorcunov <gorcunov(a)openvz.org>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jeff Dike <jdike(a)addtoit.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Pavel Emelyanov <xemul(a)virtuozzo.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: containers(a)lists.linux-foundation.org
Cc: criu(a)openvz.org
Cc: linux-api(a)vger.kernel.org
Cc: x86(a)kernel.org
Andrei Vagin (12):
ns: Introduce Time Namespace
timens: Add timens_offsets
timens: Introduce CLOCK_MONOTONIC offsets
timens: Introduce CLOCK_BOOTTIME offset
timerfd/timens: Take into account ns clock offsets
kernel: Take into account timens clock offsets in clock_nanosleep
x86/vdso/timens: Add offsets page in vvar
x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
posix-timers/timens: Take into account clock offsets
selftest/timens: Add test for timerfd
selftest/timens: Add test for clock_nanosleep
timens/selftest: Add timer offsets test
Dmitry Safonov (8):
timens: Shift /proc/uptime
x86/vdso: Restrict splitting vvar vma
x86/vdso: Purge timens page on setns()/unshare()/clone()
x86/vdso: Look for vvar vma to purge timens page
timens: Add align for timens_offsets
timens: Optimize zero-offsets
selftest: Add Time Namespace test for supported clocks
timens/selftest: Add procfs selftest
arch/Kconfig | 5 +
arch/x86/Kconfig | 1 +
arch/x86/entry/vdso/vclock_gettime.c | 52 +++++
arch/x86/entry/vdso/vdso-layout.lds.S | 9 +-
arch/x86/entry/vdso/vdso2c.c | 3 +
arch/x86/entry/vdso/vma.c | 67 +++++++
arch/x86/include/asm/vdso.h | 2 +
fs/proc/namespaces.c | 3 +
fs/proc/uptime.c | 3 +
fs/timerfd.c | 16 +-
include/linux/nsproxy.h | 1 +
include/linux/proc_ns.h | 1 +
include/linux/time_namespace.h | 72 +++++++
include/linux/timens_offsets.h | 25 +++
include/linux/user_namespace.h | 1 +
include/uapi/linux/sched.h | 1 +
init/Kconfig | 8 +
kernel/Makefile | 1 +
kernel/fork.c | 3 +-
kernel/nsproxy.c | 19 +-
kernel/time/hrtimer.c | 8 +
kernel/time/posix-timers.c | 89 ++++++++-
kernel/time/posix-timers.h | 2 +
kernel/time_namespace.c | 230 +++++++++++++++++++++++
tools/testing/selftests/timens/.gitignore | 5 +
tools/testing/selftests/timens/Makefile | 6 +
tools/testing/selftests/timens/clock_nanosleep.c | 98 ++++++++++
tools/testing/selftests/timens/config | 1 +
tools/testing/selftests/timens/log.h | 21 +++
tools/testing/selftests/timens/procfs.c | 145 ++++++++++++++
tools/testing/selftests/timens/timens.c | 196 +++++++++++++++++++
tools/testing/selftests/timens/timer.c | 95 ++++++++++
tools/testing/selftests/timens/timerfd.c | 96 ++++++++++
33 files changed, 1272 insertions(+), 13 deletions(-)
create mode 100644 include/linux/time_namespace.h
create mode 100644 include/linux/timens_offsets.h
create mode 100644 kernel/time_namespace.c
create mode 100644 tools/testing/selftests/timens/.gitignore
create mode 100644 tools/testing/selftests/timens/Makefile
create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
create mode 100644 tools/testing/selftests/timens/config
create mode 100644 tools/testing/selftests/timens/log.h
create mode 100644 tools/testing/selftests/timens/procfs.c
create mode 100644 tools/testing/selftests/timens/timens.c
create mode 100644 tools/testing/selftests/timens/timer.c
create mode 100644 tools/testing/selftests/timens/timerfd.c
--
2.13.6
This series fixes issues I encountered building and running the
selftests on a Ubuntu Cosmic ppc64le system.
Joel Stanley (6):
selftests: powerpc/ptrace: Make tests build
selftests: powerpc/ptrace: Remove clean rule
selftests: powerpc/ptrace: Fix linking against pthread
selftests: powerpc/signal: Make tests build
selftests: powerpc/signal: Fix signal_tm CFLAGS
selftests: powerpc/pmu: Link ebb tests with -no-pie
tools/testing/selftests/powerpc/pmu/ebb/Makefile | 3 +++
tools/testing/selftests/powerpc/ptrace/Makefile | 11 ++++-------
tools/testing/selftests/powerpc/signal/Makefile | 9 +++------
3 files changed, 10 insertions(+), 13 deletions(-)
--
2.19.1
Android uses ashmem for sharing memory regions. We are looking forward
to migrating all usecases of ashmem to memfd so that we can possibly
remove the ashmem driver in the future from staging while also
benefiting from using memfd and contributing to it. Note staging drivers
are also not ABI and generally can be removed at anytime.
One of the main usecases Android has is the ability to create a region
and mmap it as writeable, then add protection against making any
"future" writes while keeping the existing already mmap'ed
writeable-region active. This allows us to implement a usecase where
receivers of the shared memory buffer can get a read-only view, while
the sender continues to write to the buffer.
See CursorWindow documentation in Android for more details:
https://developer.android.com/reference/android/database/CursorWindow
This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
which prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active. The following program shows the seal
working in action:
#include <stdio.h>
#include <errno.h>
#include <sys/mman.h>
#include <linux/memfd.h>
#include <linux/fcntl.h>
#include <asm/unistd.h>
#include <unistd.h>
#define F_SEAL_FUTURE_WRITE 0x0010
#define REGION_SIZE (5 * 1024 * 1024)
int memfd_create_region(const char *name, size_t size)
{
int ret;
int fd = syscall(__NR_memfd_create, name, MFD_ALLOW_SEALING);
if (fd < 0) return fd;
ret = ftruncate(fd, size);
if (ret < 0) { close(fd); return ret; }
return fd;
}
int main() {
int ret, fd;
void *addr, *addr2, *addr3, *addr1;
ret = memfd_create_region("test_region", REGION_SIZE);
printf("ret=%d\n", ret);
fd = ret;
// Create map
addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr == MAP_FAILED)
printf("map 0 failed\n");
else
printf("map 0 passed\n");
if ((ret = write(fd, "test", 4)) != 4)
printf("write failed even though no future-write seal "
"(ret=%d errno =%d)\n", ret, errno);
else
printf("write passed\n");
addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr1 == MAP_FAILED)
perror("map 1 prot-write failed even though no seal\n");
else
printf("map 1 prot-write passed as expected\n");
ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE |
F_SEAL_GROW |
F_SEAL_SHRINK);
if (ret == -1)
printf("fcntl failed, errno: %d\n", errno);
else
printf("future-write seal now active\n");
if ((ret = write(fd, "test", 4)) != 4)
printf("write failed as expected due to future-write seal\n");
else
printf("write passed (unexpected)\n");
addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr2 == MAP_FAILED)
perror("map 2 prot-write failed as expected due to seal\n");
else
printf("map 2 passed\n");
addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0);
if (addr3 == MAP_FAILED)
perror("map 3 failed\n");
else
printf("map 3 prot-read passed as expected\n");
}
The output of running this program is as follows:
ret=3
map 0 passed
write passed
map 1 prot-write passed as expected
future-write seal now active
write failed as expected due to future-write seal
map 2 prot-write failed as expected due to seal
: Permission denied
map 3 prot-read passed as expected
Cc: jreck(a)google.com
Cc: john.stultz(a)linaro.org
Cc: tkjos(a)google.com
Cc: gregkh(a)linuxfoundation.org
Cc: hch(a)infradead.org
Reviewed-by: John Stultz <john.stultz(a)linaro.org>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
v1->v2: No change, just added selftests to the series. manpages are
ready and I'll submit them once the patches are accepted.
v2->v3: Updated commit message to have more support code (John Stultz)
Renamed seal from F_SEAL_FS_WRITE to F_SEAL_FUTURE_WRITE
(Christoph Hellwig)
Allow for this seal only if grow/shrink seals are also
either previous set, or are requested along with this seal.
(Christoph Hellwig)
Added locking to synchronize access to file->f_mode.
(Christoph Hellwig)
include/uapi/linux/fcntl.h | 1 +
mm/memfd.c | 22 +++++++++++++++++++++-
2 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..a2f8658f1c55 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -41,6 +41,7 @@
#define F_SEAL_SHRINK 0x0002 /* prevent file from shrinking */
#define F_SEAL_GROW 0x0004 /* prevent file from growing */
#define F_SEAL_WRITE 0x0008 /* prevent writes */
+#define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
/* (1U << 31) is reserved for signed error codes */
/*
diff --git a/mm/memfd.c b/mm/memfd.c
index 2bb5e257080e..5ba9804e9515 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
#define F_ALL_SEALS (F_SEAL_SEAL | \
F_SEAL_SHRINK | \
F_SEAL_GROW | \
- F_SEAL_WRITE)
+ F_SEAL_WRITE | \
+ F_SEAL_FUTURE_WRITE)
static int memfd_add_seals(struct file *file, unsigned int seals)
{
@@ -219,6 +220,25 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
}
}
+ if ((seals & F_SEAL_FUTURE_WRITE) &&
+ !(*file_seals & F_SEAL_FUTURE_WRITE)) {
+ /*
+ * The FUTURE_WRITE seal also prevents growing and shrinking
+ * so we need them to be already set, or requested now.
+ */
+ int test_seals = (seals | *file_seals) &
+ (F_SEAL_GROW | F_SEAL_SHRINK);
+
+ if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
+ error = -EINVAL;
+ goto unlock;
+ }
+
+ spin_lock(&file->f_lock);
+ file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
+ spin_unlock(&file->f_lock);
+ }
+
*file_seals |= seals;
error = 0;
--
2.19.1.331.ge82ca0e54c-goog
arm64 has a feature called Top Byte Ignore, which allows to embed pointer
tags into the top byte of each pointer. Userspace programs (such as
HWASan, a memory debugging tool [1]) might use this feature and pass
tagged user pointers to the kernel through syscalls or other interfaces.
Right now the kernel is already able to handle user faults with tagged
pointers, due to these patches:
1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
tagged pointer")
2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
pointers")
3. 276e9327 ("arm64: entry: improve data abort handling of tagged
pointers")
When passing tagged pointers to syscalls, there's a special case of such a
pointer being passed to one of the memory syscalls (mmap, mprotect, etc.).
These syscalls don't do memory accesses but rather deal with memory
ranges, hence an untagged pointer is better suited.
This patchset extends tagged pointer support to non-memory syscalls. This
is done by reusing the untagged_addr macro to untag user pointers when the
kernel performs pointer checking to find out whether the pointer comes
from userspace (most notably in access_ok).
The following testing approaches has been taken to find potential issues
with user pointer untagging:
1. Static testing (with sparse [2] and separately with a custom static
analyzer based on Clang) to track casts of __user pointers to integer
types to find places where untagging needs to be done.
2. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
a modified syzkaller version that passes tagged pointers to the kernel.
Based on the results of the testing the requried patches have been added
to the patchset.
This patchset is a prerequisite for ARM's memory tagging hardware feature
support [3].
Thanks!
[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060…
[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architectur…
Changes in v7:
- Rebased onto 17b57b18 (4.19-rc6).
- Dropped the "arm64: untag user address in __do_user_fault" patch, since
the existing patches already handle user faults properly.
- Dropped the "usb, arm64: untag user addresses in devio" patch, since the
passed pointer must come from a vma and therefore be untagged.
- Dropped the "arm64: annotate user pointers casts detected by sparse"
patch (see the discussion to the replies of the v6 of this patchset).
- Added more context to the cover letter.
- Updated Documentation/arm64/tagged-pointers.txt.
Changes in v6:
- Added annotations for user pointer casts found by sparse.
- Rebased onto 050cdc6c (4.19-rc1+).
Changes in v5:
- Added 3 new patches that add untagging to places found with static
analysis.
- Rebased onto 44c929e1 (4.18-rc8).
Changes in v4:
- Added a selftest for checking that passing tagged pointers to the
kernel succeeds.
- Rebased onto 81e97f013 (4.18-rc1+).
Changes in v3:
- Rebased onto e5c51f30 (4.17-rc6+).
- Added linux-arch@ to the list of recipients.
Changes in v2:
- Rebased onto 2d618bdf (4.17-rc3+).
- Removed excessive untagging in gup.c.
- Removed untagging pointers returned from __uaccess_mask_ptr.
Changes in v1:
- Rebased onto 4.17-rc1.
Changes in RFC v2:
- Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of
defining it for each arch individually.
- Updated Documentation/arm64/tagged-pointers.txt.
- Dropped "mm, arm64: untag user addresses in memory syscalls".
- Rebased onto 3eb2ce82 (4.16-rc7).
Andrey Konovalov (8):
arm64: add type casts to untagged_addr macro
uaccess: add untagged_addr definition for other arches
arm64: untag user addresses in access_ok and __uaccess_mask_ptr
mm, arm64: untag user addresses in mm/gup.c
lib, arm64: untag addrs passed to strncpy_from_user and strnlen_user
fs, arm64: untag user address in copy_mount_options
arm64: update Documentation/arm64/tagged-pointers.txt
selftests, arm64: add a selftest for passing tagged pointers to kernel
Documentation/arm64/tagged-pointers.txt | 24 +++++++++++--------
arch/arm64/include/asm/uaccess.h | 14 +++++++----
fs/namespace.c | 2 +-
include/linux/uaccess.h | 4 ++++
lib/strncpy_from_user.c | 2 ++
lib/strnlen_user.c | 2 ++
mm/gup.c | 4 ++++
tools/testing/selftests/arm64/.gitignore | 1 +
tools/testing/selftests/arm64/Makefile | 11 +++++++++
.../testing/selftests/arm64/run_tags_test.sh | 12 ++++++++++
tools/testing/selftests/arm64/tags_test.c | 19 +++++++++++++++
11 files changed, 79 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/arm64/.gitignore
create mode 100644 tools/testing/selftests/arm64/Makefile
create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
create mode 100644 tools/testing/selftests/arm64/tags_test.c
--
2.19.0.605.g01d371f741-goog