This patch is being developed here (with snapshots of each series
version being stashed in separate branches with names of the form
"resolveat/vX-summary"):
<https://github.com/cyphar/linux/tree/resolveat/master>
Patch changelog:
v10:
* Ensure that unlazy_walk() will fail if we are in a scoped walk and
the caller has zeroed nd->root (this happens in a few places, I'm
not sure why because unlazy_walk() does legitimize_path()
already). In this case we need to go through path_init() again to
reset it (otherwise we will have a breakout because set_root()
will breakout).
* Also add a WARN_ON (and return -ENOTRECOVERABLE) if
LOOKUP_IN_ROOT is set and we are in set_root() -- which should
never happen and will cause a breakout.
* Make changes suggested by Al Viro:
* Remove nd->{opath_mask,acc_mode} by moving all of the magic-link
permission logic be done after trailing_symlink() (with
trailing_magiclink()) only within path_openat().
* Introduce LOOKUP_MAGICLINK_JUMPED to be able to detect
magic-link jumps done with nd_jump_link() (so we don't end up
blocking other LOOKUP_JUMPED cases).
* Simplify all of the path_init() changes to make the code far
less confusing. dirfd_path_init() turns out to be un-necessary.
* Make openat2(2) also -EINVAL on unknown how->flags.
[Dmitry V. Levin]
* Clean up bad definitions of O_EMPTYPATH on architectures where O_*
flags are subtly different to <asm-generic/fcntl.h>.
* Switch away from passing a struct to build_open_flags() and
instead just copy the one field we need to temporarily modify
(how->flags). Also fix a bug in OPENHOW_MODE. [Rasmus Villemoes]
* Fix syscall linkages and switch to 437. [Arnd Bergmann]
* Clean up text in commit messages and the cover-letter.
[Rolf Eike Beer]
* Fix openat2 selftest makefile. [Michael Ellerman]
The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags. However, instead of
being an openat(2) flag it is provided through a new syscall openat2(2)
which provides an alternative way to get an O_PATH file descriptor (the
reasoning for doing this is included in the patch description). The
following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do
not trigger this.
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[6] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
And further, several semantics of file descriptor "re-opening" are now
changed to prevent attacks like CVE-2019-5736 by restricting how
magic-links can be resolved (based on their mode). This required some
other changes to the semantics of the modes of O_PATH file descriptor's
associated /proc/self/fd magic-links. openat2(2) has the ability to
further restrict re-opening of its own O_PATH fds, so that users can
make even better use of this feature.
Finally, O_EMPTYPATH was added so that users can do /proc/self/fd-style
re-opening without depending on procfs. The new restricted semantics for
magic-links are applied here too.
In order to make all of the above more usable, I'm working on
libpathrs[7] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm(a)xmission.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Christian Brauner <christian(a)brauner.io>
Cc: David Drysdale <drysdale(a)google.com>
Cc: Tycho Andersen <tycho(a)tycho.ws>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: <containers(a)lists.linux-foundation.org>
Cc: <linux-fsdevel(a)vger.kernel.org>
Cc: <linux-api(a)vger.kernel.org>
[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin
[7]: https://github.com/openSUSE/libpathrs
Aleksa Sarai (9):
namei: obey trailing magic-link DAC permissions
procfs: switch magic-link modes to be more sane
open: O_EMPTYPATH: procfs-less file descriptor re-opening
namei: O_BENEATH-style path resolution flags
namei: LOOKUP_IN_ROOT: chroot-like path resolution
namei: aggressively check for nd->root escape on ".." resolution
open: openat2(2) syscall
kselftest: save-and-restore errno to allow for %m formatting
selftests: add openat2(2) selftests
Documentation/filesystems/path-lookup.rst | 12 +-
arch/alpha/include/uapi/asm/fcntl.h | 1 +
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/include/uapi/asm/fcntl.h | 39 +-
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/include/uapi/asm/fcntl.h | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/fcntl.c | 2 +-
fs/internal.h | 1 +
fs/namei.c | 270 ++++++++++--
fs/open.c | 112 ++++-
fs/proc/base.c | 20 +-
fs/proc/fd.c | 23 +-
fs/proc/namespaces.c | 2 +-
include/linux/fcntl.h | 17 +-
include/linux/fs.h | 8 +-
include/linux/namei.h | 9 +
include/linux/syscalls.h | 17 +-
include/uapi/asm-generic/fcntl.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/fcntl.h | 42 ++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kselftest.h | 15 +
tools/testing/selftests/memfd/memfd_test.c | 7 +-
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 162 +++++++
tools/testing/selftests/openat2/helpers.h | 114 +++++
.../testing/selftests/openat2/linkmode_test.c | 326 ++++++++++++++
.../selftests/openat2/rename_attack_test.c | 124 ++++++
.../testing/selftests/openat2/resolve_test.c | 397 ++++++++++++++++++
46 files changed, 1652 insertions(+), 107 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/linkmode_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
--
2.22.0
Patch changelog:
v9:
* Replace resolveat(2) with openat2(2). [Linus]
* Output a warning to dmesg if may_open_magiclink() is violated.
* Add an openat2(O_CREAT) testcase.
v8:
* Default to O_CLOEXEC to match other new fd-creation syscalls
(users can always disable O_CLOEXEC afterwards). [Christian]
* Implement magic-link restrictions based on their mode. This is
done through a series of masks and is designed to avoid breaking
users -- most users don't have chained O_PATH fd re-opens.
* Add O_EMPTYPATH which allows for fd re-opening without needing
procfs. This would help some users of fd re-opening, and with the
changes to magic-link permissions we now have the right semantics
for such a flag.
* Add selftests for resolveat(2), O_EMPTYPATH, and the magic-link
mode semantics.
v7:
* Remove execveat(2) support for these flags since it might
result in some pretty hairy security issues with setuid binaries.
There are other avenues we can go down to solve the issues with
CVE-2019-5736. [Jann]
* Reserve an additional bit in resolveat(2) for the eXecute access
mode if we end up implementing it.
v6:
* Drop O_* flags API to the new LOOKUP_ path scoping bits and
instead introduce resolveat(2) as an alternative method of
obtaining an O_PATH. The justification for this is included in
patch 6 (though switching back to O_* flags is trivial).
v5:
* In response to CVE-2019-5736 (one of the vectors showed that
open(2)+fexec(3) cannot be used to scope binfmt_script's implicit
open_exec()), AT_* flags have been re-added and are now piped
through to binfmt_script (and other binfmt_* that use open_exec)
but are only supported for execveat(2) for now.
v4:
* Remove AT_* flag reservations, as they require more discussion.
* Switch to path_is_under() over __d_path() for breakout checking.
* Make O_XDEV no longer block openat("/tmp", "/", O_XDEV) -- dirfd
is now ignored for absolute paths to match other flags.
* Improve the dirfd_path_init() refactor and move it to a separate
commit.
* Remove reference to Linux-capsicum.
* Switch "proclink" name to magic-link.
v3: [resend]
v2:
* Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with
some semi-aggressive __d_path checking (see patch 3).
* Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
hopes they can be re-enabled once safe.
* Removed the selftests as they will be reimplemented as xfstests.
* Removed stat(2) support, since you can already get it through
O_PATH and fstatat(2).
The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags. However, instead of
being an openat(2) flag it is provided through a new syscall openat2(2)
which provides an alternative way to get an O_PATH file descriptor (the
reasoning for doing this is included in patch 6). The following new
LOOKUP_ flags are added:
* LOOKUP_XDEV blocks all mountpoint crossings (upwards, downwards, or
through absolute links). Absolute pathnames alone in openat(2) do
not trigger this.
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[6] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having O_THISROOT (such as
CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
And further, several semantics of file descriptor "re-opening" are now
changed to prevent attacks like CVE-2019-5736 by restricting how
magic-links can be resolved (based on their mode). This required some
other changes to the semantics of the modes of O_PATH file descriptor's
associated /proc/self/fd magic-links. openat2(2) has the ability to
further restrict re-opening of its own O_PATH fds, so that users can
make even better use of this feature.
Finally, O_EMPTYPATH was added so that users can do /proc/self/fd-style
re-opening without depending on procfs. The new restricted semantics for
magic-links are applied here too.
In order to make all of the above more usable, I'm working on
libpathrs[7] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm(a)xmission.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Christian Brauner <christian(a)brauner.io>
Cc: David Drysdale <drysdale(a)google.com>
Cc: Tycho Andersen <tycho(a)tycho.ws>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: <containers(a)lists.linux-foundation.org>
Cc: <linux-fsdevel(a)vger.kernel.org>
Cc: <linux-api(a)vger.kernel.org>
[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin
[7]: https://github.com/openSUSE/libpathrs
Aleksa Sarai (10):
namei: obey trailing magic-link DAC permissions
procfs: switch magic-link modes to be more sane
open: O_EMPTYPATH: procfs-less file descriptor re-opening
namei: split out nd->dfd handling to dirfd_path_init
namei: O_BENEATH-style path resolution flags
namei: LOOKUP_IN_ROOT: chroot-like path resolution
namei: aggressively check for nd->root escape on ".." resolution
open: openat2(2) syscall
kselftest: save-and-restore errno to allow for %m formatting
selftests: add openat2(2) selftests
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/fcntl.c | 2 +-
fs/internal.h | 1 +
fs/namei.c | 333 ++++++++++++---
fs/open.c | 140 +++++--
fs/proc/base.c | 20 +-
fs/proc/fd.c | 23 +-
fs/proc/namespaces.c | 2 +-
include/linux/fcntl.h | 17 +-
include/linux/fs.h | 8 +-
include/linux/namei.h | 8 +
include/linux/syscalls.h | 14 +-
include/uapi/asm-generic/fcntl.h | 5 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 38 ++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kselftest.h | 15 +
tools/testing/selftests/memfd/memfd_test.c | 7 +-
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 12 +
tools/testing/selftests/openat2/helpers.c | 162 +++++++
tools/testing/selftests/openat2/helpers.h | 114 +++++
.../testing/selftests/openat2/linkmode_test.c | 325 ++++++++++++++
.../selftests/openat2/rename_attack_test.c | 124 ++++++
.../testing/selftests/openat2/resolve_test.c | 395 ++++++++++++++++++
42 files changed, 1667 insertions(+), 125 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/linkmode_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
--
2.22.0
From: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
[ Upstream commit ee8a84c60bcc1f1615bd9cb3edfe501e26cdc85b ]
Using ".arm .inst" for the arm signature introduces build issues for
programs compiled in Thumb mode because the assembler stays in the
arm mode for the rest of the inline assembly. Revert to using a ".word"
to express the signature as data instead.
The choice of signature is a valid trap instruction on arm32 little
endian, where both code and data are little endian.
ARMv6+ big endian (BE8) generates mixed endianness code vs data:
little-endian code and big-endian data. The data value of the signature
needs to have its byte order reversed to generate the trap instruction.
Prior to ARMv6, -mbig-endian generates big-endian code and data
(which match), so the endianness of the data representation of the
signature should not be reversed. However, the choice between BE32
and BE8 is done by the linker, so we cannot know whether code and
data endianness will be mixed before the linker is invoked. So rather
than try to play tricks with the linker, the rseq signature is simply
data (not a trap instruction) prior to ARMv6 on big endian. This is
why the signature is expressed as data (.word) rather than as
instruction (.inst) in assembler.
Because a ".word" is used to emit the signature, it will be interpreted
as a literal pool by a disassembler, not as an actual instruction.
Considering that the signature is not meant to be executed except in
scenarios where the program execution is completely bogus, this should
not be an issue.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Acked-by: Will Deacon <will.deacon(a)arm.com>
CC: Peter Zijlstra <peterz(a)infradead.org>
CC: Thomas Gleixner <tglx(a)linutronix.de>
CC: Joel Fernandes <joelaf(a)google.com>
CC: Catalin Marinas <catalin.marinas(a)arm.com>
CC: Dave Watson <davejwatson(a)fb.com>
CC: Will Deacon <will.deacon(a)arm.com>
CC: Shuah Khan <shuah(a)kernel.org>
CC: Andi Kleen <andi(a)firstfloor.org>
CC: linux-kselftest(a)vger.kernel.org
CC: "H . Peter Anvin" <hpa(a)zytor.com>
CC: Chris Lameter <cl(a)linux.com>
CC: Russell King <linux(a)arm.linux.org.uk>
CC: Michael Kerrisk <mtk.manpages(a)gmail.com>
CC: "Paul E . McKenney" <paulmck(a)linux.vnet.ibm.com>
CC: Paul Turner <pjt(a)google.com>
CC: Boqun Feng <boqun.feng(a)gmail.com>
CC: Josh Triplett <josh(a)joshtriplett.org>
CC: Steven Rostedt <rostedt(a)goodmis.org>
CC: Ben Maurer <bmaurer(a)fb.com>
CC: linux-api(a)vger.kernel.org
CC: Andy Lutomirski <luto(a)amacapital.net>
CC: Andrew Morton <akpm(a)linux-foundation.org>
CC: Linus Torvalds <torvalds(a)linux-foundation.org>
CC: Carlos O'Donell <carlos(a)redhat.com>
CC: Florian Weimer <fweimer(a)redhat.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/rseq/rseq-arm.h | 61 +++++++++++++------------
1 file changed, 33 insertions(+), 28 deletions(-)
diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
index 84f28f147fb6..5943c816c07c 100644
--- a/tools/testing/selftests/rseq/rseq-arm.h
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -6,6 +6,8 @@
*/
/*
+ * - ARM little endian
+ *
* RSEQ_SIG uses the udf A32 instruction with an uncommon immediate operand
* value 0x5de3. This traps if user-space reaches this instruction by mistake,
* and the uncommon operand ensures the kernel does not move the instruction
@@ -22,36 +24,40 @@
* def3 udf #243 ; 0xf3
* e7f5 b.n <7f5>
*
- * pre-ARMv6 big endian code:
- * e7f5 b.n <7f5>
- * def3 udf #243 ; 0xf3
+ * - ARMv6+ big endian (BE8):
*
* ARMv6+ -mbig-endian generates mixed endianness code vs data: little-endian
- * code and big-endian data. Ensure the RSEQ_SIG data signature matches code
- * endianness. Prior to ARMv6, -mbig-endian generates big-endian code and data
- * (which match), so there is no need to reverse the endianness of the data
- * representation of the signature. However, the choice between BE32 and BE8
- * is done by the linker, so we cannot know whether code and data endianness
- * will be mixed before the linker is invoked.
+ * code and big-endian data. The data value of the signature needs to have its
+ * byte order reversed to generate the trap instruction:
+ *
+ * Data: 0xf3def5e7
+ *
+ * Translates to this A32 instruction pattern:
+ *
+ * e7f5def3 udf #24035 ; 0x5de3
+ *
+ * Translates to this T16 instruction pattern:
+ *
+ * def3 udf #243 ; 0xf3
+ * e7f5 b.n <7f5>
+ *
+ * - Prior to ARMv6 big endian (BE32):
+ *
+ * Prior to ARMv6, -mbig-endian generates big-endian code and data
+ * (which match), so the endianness of the data representation of the
+ * signature should not be reversed. However, the choice between BE32
+ * and BE8 is done by the linker, so we cannot know whether code and
+ * data endianness will be mixed before the linker is invoked. So rather
+ * than try to play tricks with the linker, the rseq signature is simply
+ * data (not a trap instruction) prior to ARMv6 on big endian. This is
+ * why the signature is expressed as data (.word) rather than as
+ * instruction (.inst) in assembler.
*/
-#define RSEQ_SIG_CODE 0xe7f5def3
-
-#ifndef __ASSEMBLER__
-
-#define RSEQ_SIG_DATA \
- ({ \
- int sig; \
- asm volatile ("b 2f\n\t" \
- "1: .inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
- "2:\n\t" \
- "ldr %[sig], 1b\n\t" \
- : [sig] "=r" (sig)); \
- sig; \
- })
-
-#define RSEQ_SIG RSEQ_SIG_DATA
-
+#ifdef __ARMEB__
+#define RSEQ_SIG 0xf3def5e7 /* udf #24035 ; 0x5de3 (ARMv6+) */
+#else
+#define RSEQ_SIG 0xe7f5def3 /* udf #24035 ; 0x5de3 */
#endif
#define rseq_smp_mb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
@@ -125,8 +131,7 @@ do { \
__rseq_str(table_label) ":\n\t" \
".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
- ".arm\n\t" \
- ".inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
+ ".word " __rseq_str(RSEQ_SIG) "\n\t" \
__rseq_str(label) ":\n\t" \
teardown \
"b %l[" __rseq_str(abort_label) "]\n\t"
--
2.20.1
The code in vmx.c does not use "program_invocation_name", so there
is no need to "#define _GNU_SOURCE" here.
Signed-off-by: Thomas Huth <thuth(a)redhat.com>
---
tools/testing/selftests/kvm/lib/x86_64/vmx.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
index fe56d159d65f..204f847bd065 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
@@ -5,8 +5,6 @@
* Copyright (C) 2018, Google LLC.
*/
-#define _GNU_SOURCE /* for program_invocation_name */
-
#include "test_util.h"
#include "kvm_util.h"
#include "processor.h"
--
2.21.0
The kvm_create_max_vcpus is generic enough so that it works out of the
box on POWER, too. We just have to provide some stubs for linking the
code from kvm_util.c.
Note that you also might have to do "ulimit -n 2500" before running the
test, to avoid that it runs out of file handles for the vCPUs.
Signed-off-by: Thomas Huth <thuth(a)redhat.com>
---
RFC since the stubs are a little bit ugly (does someone here like
to implement them?), and since it's a little bit annoying that
you have to raise the ulimit for this test in case the kernel provides
more vCPUs than the default ulimit...
tools/testing/selftests/kvm/Makefile | 6 +++
.../selftests/kvm/lib/powerpc/processor.c | 37 +++++++++++++++++++
2 files changed, 43 insertions(+)
create mode 100644 tools/testing/selftests/kvm/lib/powerpc/processor.c
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index ba7849751989..c92dc78ff74b 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -11,6 +11,8 @@ LIBKVM = lib/assert.c lib/elf.c lib/io.c lib/kvm_util.c lib/ucall.c lib/sparsebi
LIBKVM_x86_64 = lib/x86_64/processor.c lib/x86_64/vmx.c
LIBKVM_aarch64 = lib/aarch64/processor.c
LIBKVM_s390x = lib/s390x/processor.c
+LIBKVM_ppc64 = lib/powerpc/processor.c
+LIBKVM_ppc64le = $(LIBKVM_ppc64)
TEST_GEN_PROGS_x86_64 = x86_64/cr4_cpuid_sync_test
TEST_GEN_PROGS_x86_64 += x86_64/evmcs_test
@@ -35,6 +37,10 @@ TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
TEST_GEN_PROGS_s390x += s390x/sync_regs_test
TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
+TEST_GEN_PROGS_ppc64 += kvm_create_max_vcpus
+
+TEST_GEN_PROGS_ppc64le = $(TEST_GEN_PROGS_ppc64)
+
TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(UNAME_M))
LIBKVM += $(LIBKVM_$(UNAME_M))
diff --git a/tools/testing/selftests/kvm/lib/powerpc/processor.c b/tools/testing/selftests/kvm/lib/powerpc/processor.c
new file mode 100644
index 000000000000..c0b7f06e206e
--- /dev/null
+++ b/tools/testing/selftests/kvm/lib/powerpc/processor.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KVM selftest s390x library code - CPU-related functions
+ */
+
+#define _GNU_SOURCE
+
+#include "kvm_util.h"
+#include "../kvm_util_internal.h"
+
+void virt_pgd_alloc(struct kvm_vm *vm, uint32_t memslot)
+{
+ abort(); /* TODO: implement this */
+}
+
+void virt_pg_map(struct kvm_vm *vm, uint64_t gva, uint64_t gpa,
+ uint32_t memslot)
+{
+ abort(); /* TODO: implement this */
+}
+
+vm_paddr_t addr_gva2gpa(struct kvm_vm *vm, vm_vaddr_t gva)
+{
+ abort(); /* TODO: implement this */
+
+ return -1;
+}
+
+void virt_dump(FILE *stream, struct kvm_vm *vm, uint8_t indent)
+{
+ abort(); /* TODO: implement this */
+}
+
+void vcpu_dump(FILE *stream, struct kvm_vm *vm, uint32_t vcpuid, uint8_t indent)
+{
+ abort(); /* TODO: implement this */
+}
--
2.21.0
Add a skip() message function that stops the test, logs an explanation,
and sets the "skip" return code (4).
Before loading a livepatch self-test kernel module, first verify that
we've built and installed it by running a 'modprobe --dry-run'. This
should catch a few environment issues, including !CONFIG_LIVEPATCH and
!CONFIG_TEST_LIVEPATCH. In these cases, exit gracefully with the new
skip() function.
Reported-by: Jiri Benc <jbenc(a)redhat.com>
Suggested-by: Shuah Khan <shuah(a)kernel.org>
Signed-off-by: Joe Lawrence <joe.lawrence(a)redhat.com>
---
v2: move assert_mod() call into load_mod() and load_lp_nowait(), before
they check whether the module is a livepatch or not (a test-failing
assertion).
.../testing/selftests/livepatch/functions.sh | 20 +++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/tools/testing/selftests/livepatch/functions.sh b/tools/testing/selftests/livepatch/functions.sh
index 30195449c63c..de5a504ffdbc 100644
--- a/tools/testing/selftests/livepatch/functions.sh
+++ b/tools/testing/selftests/livepatch/functions.sh
@@ -13,6 +13,14 @@ function log() {
echo "$1" > /dev/kmsg
}
+# skip(msg) - testing can't proceed
+# msg - explanation
+function skip() {
+ log "SKIP: $1"
+ echo "SKIP: $1" >&2
+ exit 4
+}
+
# die(msg) - game over, man
# msg - dying words
function die() {
@@ -43,6 +51,12 @@ function loop_until() {
done
}
+function assert_mod() {
+ local mod="$1"
+
+ modprobe --dry-run "$mod" &>/dev/null
+}
+
function is_livepatch_mod() {
local mod="$1"
@@ -75,6 +89,9 @@ function __load_mod() {
function load_mod() {
local mod="$1"; shift
+ assert_mod "$mod" ||
+ skip "Failed modprobe --dry-run of module: $mod"
+
is_livepatch_mod "$mod" &&
die "use load_lp() to load the livepatch module $mod"
@@ -88,6 +105,9 @@ function load_mod() {
function load_lp_nowait() {
local mod="$1"; shift
+ assert_mod "$mod" ||
+ skip "Failed modprobe --dry-run of module: $mod"
+
is_livepatch_mod "$mod" ||
die "module $mod is not a livepatch"
--
2.21.0
Hi,
These patches make it possible to attach BPF programs directly to tracepoints
using ftrace (/sys/kernel/debug/tracing) without needing the process doing the
attach to be alive. This has the following benefits:
1. Simplified Security: In Android, we have finer-grained security controls to
specific ftrace trace events using SELinux labels. We control precisely who is
allowed to enable an ftrace event already. By adding a node to ftrace for
attaching BPF programs, we can use the same mechanism to further control who is
allowed to attach to a trace event.
2. Process lifetime: In Android we are adding usecases where a tracing program
needs to be attached all the time to a tracepoint, for the full life time of
the system. Such as to gather statistics where there no need for a detach for
the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this
means keeping a process alive all the time. However, in Android our BPF loader
currently (for hardeneded security) involves just starting a process at boot
time, doing the BPF program loading, and then pinning them to /sys/fs/bpf. We
don't keep this process alive all the time. It is more suitable to do a
one-shot attach of the program using ftrace and not need to have a process
alive all the time anymore for this. Such process also needs elevated
privileges since tracepoint program loading currently requires CAP_SYS_ADMIN
anyway so by design Android's bpfloader runs once at init and exits.
This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf
The following commands can be written into it:
attach:<fd> Attaches BPF prog fd to tracepoint
detach:<fd> Detaches BPF prog fd to tracepoint
Reading the bpf file will show all the attached programs to the tracepoint.
Joel Fernandes (Google) (4):
Move bpf_raw_tracepoint functionality into bpf_trace.c
trace/bpf: Add support for attach/detach of ftrace events to BPF
lib/bpf: Add support for ftrace event attach and detach
selftests/bpf: Add test for ftrace-based BPF attach/detach
include/linux/bpf_trace.h | 16 ++
include/linux/trace_events.h | 1 +
kernel/bpf/syscall.c | 69 +-----
kernel/trace/bpf_trace.c | 225 ++++++++++++++++++
kernel/trace/trace.h | 1 +
kernel/trace/trace_events.c | 8 +
tools/lib/bpf/bpf.c | 53 +++++
tools/lib/bpf/bpf.h | 4 +
tools/lib/bpf/libbpf.map | 2 +
.../raw_tp_writable_test_ftrace_run.c | 89 +++++++
10 files changed, 410 insertions(+), 58 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c
--
2.22.0.410.gd8fdbe21b5-goog
## TL;DR
This patchset addresses comments from Stephen Boyd. No changes affect
the API, and all changes are specific to patches 02, 03, and 04;
however, there were some significant changes to how string_stream and
kunit_stream work under the hood.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want[1]) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in about a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
### What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitudes faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily solving the classic problem
of difficulty in exercising error handling code.
### Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
### More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[2].
Additionally for convenience, I have applied these patches to a
branch[3]. The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.2/v11 branch.
## Changes Since Last Version
- Went back to using spinlock in `struct string_stream`. Needed for so
that it is compatible with different GFP flags to address comment from
Stephen.
- Added string_stream_append function to string_stream API. - suggested
by Stephen.
- Made all string fragments and other allocations internal to
string_stream and kunit_stream managed by the KUnit resource
management API.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#kuni…
[2] https://google.github.io/kunit-docs/third_party/kernel/docs/
[3] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.2/v11
--
2.22.0.510.g264f2c817a-goog
## TL;DR
This patchset addresses comments from Stephen Boyd. Most changes are
pretty minor, but this does fix a couple of bugs pointed out by Stephen.
I imagine that Stephen will probably have some more comments, but I
wanted to get this out for him to look at as soon as possible.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want[1]) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in about a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
### What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitudes faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily solving the classic problem
of difficulty in exercising error handling code.
### Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
### More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[2].
Additionally for convenience, I have applied these patches to a
branch[3]. The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.2/v10 branch.
## Changes Since Last Version
- Went back to using spinlock in `struct kunit`. Needed for resource
management API. Thanks to Stephen for this change.
- Fixed bug where an init failure may not be recorded as a failure in
patch 01/18.
- Added append method to string_stream as suggested by Stephen.
- Mostly pretty minor changes after that, which mostly pertain to
string_stream and kunit_stream.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#kuni…
[2] https://google.github.io/kunit-docs/third_party/kernel/docs/
[3] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.2/v10
--
2.22.0.510.g264f2c817a-goog
From: Hechao Li <hechaol(a)fb.com>
[ Upstream commit 89cceaa939171fafa153d4bf637b39e396bbd785 ]
An error "implicit declaration of function 'reallocarray'" can be thrown
with the following steps:
$ cd tools/testing/selftests/bpf
$ make clean && make CC=<Path to GCC 4.8.5>
$ make clean && make CC=<Path to GCC 7.x>
The cause is that the feature folder generated by GCC 4.8.5 is not
removed, leaving feature-reallocarray being 1, which causes reallocarray
not defined when re-compliing with GCC 7.x. This diff adds feature
folder to EXTRA_CLEAN to avoid this problem.
v2: Rephrase the commit message.
Signed-off-by: Hechao Li <hechaol(a)fb.com>
Acked-by: Andrii Nakryiko <andriin(a)fb.com>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/Makefile | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e36356e2377e..1c9511262947 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -275,4 +275,5 @@ $(OUTPUT)/verifier/tests.h: $(VERIFIER_TESTS_DIR) $(VERIFIER_TEST_FILES)
) > $(VERIFIER_TESTS_H))
EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(ALU32_BUILD_DIR) \
- $(VERIFIER_TESTS_H) $(PROG_TESTS_H) $(MAP_TESTS_H)
+ $(VERIFIER_TESTS_H) $(PROG_TESTS_H) $(MAP_TESTS_H) \
+ feature
--
2.20.1
From: Alexei Starovoitov <ast(a)kernel.org>
[ Upstream commit 7c0c6095d48dcd0e67c917aa73cdbb2715aafc36 ]
Adjust scale tests to check for new jmp sequence limit.
BPF_JGT had to be changed to BPF_JEQ because the verifier was
too smart. It tracked the known safe range of R0 values
and pruned the search earlier before hitting exact 8192 limit.
bpf_semi_rand_get() was too (un)?lucky.
k = 0; was missing in bpf_fill_scale2.
It was testing a bit shorter sequence of jumps than intended.
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Acked-by: Andrii Nakryiko <andriin(a)fb.com>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/test_verifier.c | 31 +++++++++++----------
1 file changed, 17 insertions(+), 14 deletions(-)
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 288cb740e005..6438d4dc8ae1 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -207,33 +207,35 @@ static void bpf_fill_rand_ld_dw(struct bpf_test *self)
self->retval = (uint32_t)res;
}
-/* test the sequence of 1k jumps */
+#define MAX_JMP_SEQ 8192
+
+/* test the sequence of 8k jumps */
static void bpf_fill_scale1(struct bpf_test *self)
{
struct bpf_insn *insn = self->fill_insns;
int i = 0, k = 0;
insn[i++] = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
- /* test to check that the sequence of 1024 jumps is acceptable */
- while (k++ < 1024) {
+ /* test to check that the long sequence of jumps is acceptable */
+ while (k++ < MAX_JMP_SEQ) {
insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
BPF_FUNC_get_prandom_u32);
- insn[i++] = BPF_JMP_IMM(BPF_JGT, BPF_REG_0, bpf_semi_rand_get(), 2);
+ insn[i++] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, bpf_semi_rand_get(), 2);
insn[i++] = BPF_MOV64_REG(BPF_REG_1, BPF_REG_10);
insn[i++] = BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6,
-8 * (k % 64 + 1));
}
- /* every jump adds 1024 steps to insn_processed, so to stay exactly
- * within 1m limit add MAX_TEST_INSNS - 1025 MOVs and 1 EXIT
+ /* every jump adds 1 step to insn_processed, so to stay exactly
+ * within 1m limit add MAX_TEST_INSNS - MAX_JMP_SEQ - 1 MOVs and 1 EXIT
*/
- while (i < MAX_TEST_INSNS - 1025)
+ while (i < MAX_TEST_INSNS - MAX_JMP_SEQ - 1)
insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42);
insn[i] = BPF_EXIT_INSN();
self->prog_len = i + 1;
self->retval = 42;
}
-/* test the sequence of 1k jumps in inner most function (function depth 8)*/
+/* test the sequence of 8k jumps in inner most function (function depth 8)*/
static void bpf_fill_scale2(struct bpf_test *self)
{
struct bpf_insn *insn = self->fill_insns;
@@ -245,19 +247,20 @@ static void bpf_fill_scale2(struct bpf_test *self)
insn[i++] = BPF_EXIT_INSN();
}
insn[i++] = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
- /* test to check that the sequence of 1024 jumps is acceptable */
- while (k++ < 1024) {
+ /* test to check that the long sequence of jumps is acceptable */
+ k = 0;
+ while (k++ < MAX_JMP_SEQ) {
insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
BPF_FUNC_get_prandom_u32);
- insn[i++] = BPF_JMP_IMM(BPF_JGT, BPF_REG_0, bpf_semi_rand_get(), 2);
+ insn[i++] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, bpf_semi_rand_get(), 2);
insn[i++] = BPF_MOV64_REG(BPF_REG_1, BPF_REG_10);
insn[i++] = BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6,
-8 * (k % (64 - 4 * FUNC_NEST) + 1));
}
- /* every jump adds 1024 steps to insn_processed, so to stay exactly
- * within 1m limit add MAX_TEST_INSNS - 1025 MOVs and 1 EXIT
+ /* every jump adds 1 step to insn_processed, so to stay exactly
+ * within 1m limit add MAX_TEST_INSNS - MAX_JMP_SEQ - 1 MOVs and 1 EXIT
*/
- while (i < MAX_TEST_INSNS - 1025)
+ while (i < MAX_TEST_INSNS - MAX_JMP_SEQ - 1)
insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42);
insn[i] = BPF_EXIT_INSN();
self->prog_len = i + 1;
--
2.20.1
From: Alexei Starovoitov <ast(a)kernel.org>
[ Upstream commit 7c0c6095d48dcd0e67c917aa73cdbb2715aafc36 ]
Adjust scale tests to check for new jmp sequence limit.
BPF_JGT had to be changed to BPF_JEQ because the verifier was
too smart. It tracked the known safe range of R0 values
and pruned the search earlier before hitting exact 8192 limit.
bpf_semi_rand_get() was too (un)?lucky.
k = 0; was missing in bpf_fill_scale2.
It was testing a bit shorter sequence of jumps than intended.
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Acked-by: Andrii Nakryiko <andriin(a)fb.com>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/test_verifier.c | 31 +++++++++++----------
1 file changed, 17 insertions(+), 14 deletions(-)
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 288cb740e005..6438d4dc8ae1 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -207,33 +207,35 @@ static void bpf_fill_rand_ld_dw(struct bpf_test *self)
self->retval = (uint32_t)res;
}
-/* test the sequence of 1k jumps */
+#define MAX_JMP_SEQ 8192
+
+/* test the sequence of 8k jumps */
static void bpf_fill_scale1(struct bpf_test *self)
{
struct bpf_insn *insn = self->fill_insns;
int i = 0, k = 0;
insn[i++] = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
- /* test to check that the sequence of 1024 jumps is acceptable */
- while (k++ < 1024) {
+ /* test to check that the long sequence of jumps is acceptable */
+ while (k++ < MAX_JMP_SEQ) {
insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
BPF_FUNC_get_prandom_u32);
- insn[i++] = BPF_JMP_IMM(BPF_JGT, BPF_REG_0, bpf_semi_rand_get(), 2);
+ insn[i++] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, bpf_semi_rand_get(), 2);
insn[i++] = BPF_MOV64_REG(BPF_REG_1, BPF_REG_10);
insn[i++] = BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6,
-8 * (k % 64 + 1));
}
- /* every jump adds 1024 steps to insn_processed, so to stay exactly
- * within 1m limit add MAX_TEST_INSNS - 1025 MOVs and 1 EXIT
+ /* every jump adds 1 step to insn_processed, so to stay exactly
+ * within 1m limit add MAX_TEST_INSNS - MAX_JMP_SEQ - 1 MOVs and 1 EXIT
*/
- while (i < MAX_TEST_INSNS - 1025)
+ while (i < MAX_TEST_INSNS - MAX_JMP_SEQ - 1)
insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42);
insn[i] = BPF_EXIT_INSN();
self->prog_len = i + 1;
self->retval = 42;
}
-/* test the sequence of 1k jumps in inner most function (function depth 8)*/
+/* test the sequence of 8k jumps in inner most function (function depth 8)*/
static void bpf_fill_scale2(struct bpf_test *self)
{
struct bpf_insn *insn = self->fill_insns;
@@ -245,19 +247,20 @@ static void bpf_fill_scale2(struct bpf_test *self)
insn[i++] = BPF_EXIT_INSN();
}
insn[i++] = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
- /* test to check that the sequence of 1024 jumps is acceptable */
- while (k++ < 1024) {
+ /* test to check that the long sequence of jumps is acceptable */
+ k = 0;
+ while (k++ < MAX_JMP_SEQ) {
insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
BPF_FUNC_get_prandom_u32);
- insn[i++] = BPF_JMP_IMM(BPF_JGT, BPF_REG_0, bpf_semi_rand_get(), 2);
+ insn[i++] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, bpf_semi_rand_get(), 2);
insn[i++] = BPF_MOV64_REG(BPF_REG_1, BPF_REG_10);
insn[i++] = BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6,
-8 * (k % (64 - 4 * FUNC_NEST) + 1));
}
- /* every jump adds 1024 steps to insn_processed, so to stay exactly
- * within 1m limit add MAX_TEST_INSNS - 1025 MOVs and 1 EXIT
+ /* every jump adds 1 step to insn_processed, so to stay exactly
+ * within 1m limit add MAX_TEST_INSNS - MAX_JMP_SEQ - 1 MOVs and 1 EXIT
*/
- while (i < MAX_TEST_INSNS - 1025)
+ while (i < MAX_TEST_INSNS - MAX_JMP_SEQ - 1)
insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42);
insn[i] = BPF_EXIT_INSN();
self->prog_len = i + 1;
--
2.20.1
Hi Linus,
Please pull the following Kselftest update for Linux 5.3-rc1.
This Kselftest update for Linux 5.3-rc1 consists of build failure
fixes and minor code cleaning patch to remove duplicate headers.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 4b972a01a7da614b4796475f933094751a295a2f:
Linux 5.2-rc6 (2019-06-22 16:01:36 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-5.3-rc1
for you to fetch changes up to ee8a84c60bcc1f1615bd9cb3edfe501e26cdc85b:
rseq/selftests: Fix Thumb mode build failure on arm32 (2019-07-08
13:00:41 -0600)
----------------------------------------------------------------
linux-kselftest-5.3-rc1
This Kselftest update for Linux 5.3-rc1 consists of build failure
fixes and minor code cleaning patch to remove duplicate headers.
----------------------------------------------------------------
Mathieu Desnoyers (1):
rseq/selftests: Fix Thumb mode build failure on arm32
Naresh Kamboju (1):
selftests: dma-buf: Adding kernel config fragment CONFIG_UDMABUF=y
Shuah Khan (1):
selftests: timestamping: Fix SIOCGSTAMP undeclared build failure
YueHaibing (1):
kselftests: cgroup: remove duplicated include from test_freezer.c
tools/testing/selftests/cgroup/test_freezer.c | 1 -
tools/testing/selftests/drivers/dma-buf/config | 1 +
.../networking/timestamping/timestamping.c | 9 +---
tools/testing/selftests/rseq/rseq-arm.h | 61
++++++++++++----------
4 files changed, 35 insertions(+), 37 deletions(-)
create mode 100644 tools/testing/selftests/drivers/dma-buf/config
----------------------------------------------------------------
From: Paolo Pisati <paolo.pisati(a)canonical.com>
After applying patch 0001, all checksum implementations i could test (x86-64, arm64 and
arm), now agree on the return value.
Patch 0002 fix the expected return value for test #13: i did the calculation manually,
and it correspond.
Unfortunately, after applying patch 0001, other test cases now fail in
test_verifier:
$ sudo ./tools/testing/selftests/bpf/test_verifier
...
#417/p helper access to variable memory: size = 0 allowed on NULL (ARG_PTR_TO_MEM_OR_NULL) FAIL retval 65535 != 0
#419/p helper access to variable memory: size = 0 allowed on != NULL stack pointer (ARG_PTR_TO_MEM_OR_NULL) FAIL retval 65535 != 0
#423/p helper access to variable memory: size possible = 0 allowed on != NULL packet pointer (ARG_PTR_TO_MEM_OR_NULL) FAIL retval 65535 != 0
...
Summary: 1500 PASSED, 0 SKIPPED, 3 FAILED
And there are probably other fallouts in other selftests - someone familiar
should take a look before applying these patches.
Paolo Pisati (2):
bpf: bpf_csum_diff: fold the checksum before returning the
value
bpf, selftest: fix checksum value for test #13
net/core/filter.c | 2 +-
tools/testing/selftests/bpf/verifier/array_access.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
--
2.17.1
From: Paolo Pisati <paolo.pisati(a)canonical.com>
After applying patch 0001, all checksum implementations i could test (x86-64, arm64 and
arm), now agree on the return value.
Patch 0002 fix the expected return value for test #13: i did the calculation manually,
and it correspond.
Unfortunately, after applying patch 0001, other test cases now fail in
test_verifier:
$ sudo ./tools/testing/selftests/bpf/test_verifier
...
#417/p helper access to variable memory: size = 0 allowed on NULL (ARG_PTR_TO_MEM_OR_NULL) FAIL retval 65535 != 0
#419/p helper access to variable memory: size = 0 allowed on != NULL stack pointer (ARG_PTR_TO_MEM_OR_NULL) FAIL retval 65535 != 0
#423/p helper access to variable memory: size possible = 0 allowed on != NULL packet pointer (ARG_PTR_TO_MEM_OR_NULL) FAIL retval 65535 != 0
...
Summary: 1500 PASSED, 0 SKIPPED, 3 FAILED
And there are probably other fallouts in other selftests - someone familiar
should take a look before applying these patches.
Paolo Pisati (2):
bpf: bpf_csum_diff: fold the checksum before returning the
value
bpf, selftest: fix checksum value for test #13
net/core/filter.c | 2 +-
tools/testing/selftests/bpf/verifier/array_access.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
--
2.17.1
On 32-bit x86 when building with clang-9, the loop gets turned back into
an inefficient division that causes a link error:
kernel/time/vsyscall.o: In function `update_vsyscall':
vsyscall.c:(.text+0xe3): undefined reference to `__udivdi3'
Use the provided __iter_div_u64_rem() function that is meant to address
the same case in other files.
Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() implementation")
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
---
kernel/time/vsyscall.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/kernel/time/vsyscall.c b/kernel/time/vsyscall.c
index a80893180826..8cf3596a4ce6 100644
--- a/kernel/time/vsyscall.c
+++ b/kernel/time/vsyscall.c
@@ -104,11 +104,7 @@ void update_vsyscall(struct timekeeper *tk)
vdso_ts->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
nsec = nsec + tk->wall_to_monotonic.tv_nsec;
- while (nsec >= NSEC_PER_SEC) {
- nsec = nsec - NSEC_PER_SEC;
- vdso_ts->sec++;
- }
- vdso_ts->nsec = nsec;
+ vdso_ts->sec += __iter_div_u64_rem(nsec, NSEC_PER_SEC, &vdso_ts->nsec);
if (__arch_use_vsyscall(vdata))
update_vdso_data(vdata, tk);
--
2.20.0
By undefined run_test(), compile of breakpoint_test_arm64.c fails.
This changes arun_test to run_test.
----
reakpoint_test_arm64.c: In function 'main':
breakpoint_test_arm64.c:217:14: warning: implicit declaration of function
'run_test'; did you mean 'arun_test'? [-Wimplicit-function-declaration]
result = run_test(size, MIN(size, 8), wr, wp);
^~~~~~~~
arun_test
----
Fixes: 5821ba969511 ("selftests: Add test plan API to kselftest.h and adjust callers")
Signed-off-by: Nobuhiro Iwamatsu <iwamatsu(a)nigauri.org>
---
tools/testing/selftests/breakpoints/breakpoint_test_arm64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/breakpoints/breakpoint_test_arm64.c b/tools/testing/selftests/breakpoints/breakpoint_test_arm64.c
index 58ed5eeab709..ad41ea69001b 100644
--- a/tools/testing/selftests/breakpoints/breakpoint_test_arm64.c
+++ b/tools/testing/selftests/breakpoints/breakpoint_test_arm64.c
@@ -109,7 +109,7 @@ static bool set_watchpoint(pid_t pid, int size, int wp)
return false;
}
-static bool arun_test(int wr_size, int wp_size, int wr, int wp)
+static bool run_test(int wr_size, int wp_size, int wr, int wp)
{
int status;
siginfo_t siginfo;
--
2.20.1
## TL;DR
This is a pretty straightforward follow-up to Luis' comments on PATCH
v6: There is nothing that changes any functionality or usage.
As for our current status, we only need reviews/acks on the following
patches:
- [PATCH v7 06/18] kbuild: enable building KUnit
- Need a review or ack from Masahiro Yamada or Michal Marek
- [PATCH v7 08/18] objtool: add kunit_try_catch_throw to the noreturn
list
- Need a review or ack from Josh Poimboeuf or Peter Zijlstra
Other than that, I think we should be good to go.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want[1]) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in about a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
### What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitudes faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily solving the classic problem
of difficulty in exercising error handling code.
### Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
### More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[2].
Additionally for convenience, I have applied these patches to a
branch[3]. The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.2/v7 branch.
## Changes Since Last Version
Aside from renaming `struct kunit_module` to `struct kunit_suite`, there
isn't really anything in here that changes any functionality:
- Rebased on v5.2
- Added Iurii as a maintainer for PROC SYSCTL, as suggested by Luis.
- Removed some references to spinlock that I failed to remove in the
previous version, as pointed out by Luis.
- Cleaned up some comments, as suggested by Luis.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#kuni…
[2] https://google.github.io/kunit-docs/third_party/kernel/docs/
[3] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.2/v7
--
2.22.0.410.gd8fdbe21b5-goog
## TL;DR
This new patch set only contains a very minor change suggested by
Masahiro to [PATCH v7 06/18] and is otherwise identical to PATCH v7.
Also, with Josh's ack on the preceding patch set, I think we now have
all necessary reviews and acks from all interested parties.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want[1]) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in about a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
### What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitudes faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily solving the classic problem
of difficulty in exercising error handling code.
### Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
### More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[2].
Additionally for convenience, I have applied these patches to a
branch[3]. The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.2/v8 branch.
## Changes Since Last Version
Like I said in the TL;DR, there is only one minor change since the
previous revision. That change only affects patch 06/18; it makes it so
that make doesn't attempt to scan the kunit/ directory when CONFIG_KUNIT
is not set as suggested by Masahiro.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#kuni…
[2] https://google.github.io/kunit-docs/third_party/kernel/docs/
[3] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.2/v8
--
2.22.0.410.gd8fdbe21b5-goog
> On Mon, Jul 8, 2019 at 4:16 PM Brendan Higgins <brendanhiggins(a)google.com> wrote:
>>
>> CC'ing Iurii Zaikin
>>
>> On Fri, Jul 5, 2019 at 1:48 PM Luis Chamberlain <mcgrof(a)kernel.org> wrote:
>> >
>> > On Wed, Jul 03, 2019 at 05:36:15PM -0700, Brendan Higgins wrote:
>> > > Add entry for the new proc sysctl KUnit test to the PROC SYSCTL section.
>> > >
>> > > Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
>> > > Reviewed-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
>> > > Reviewed-by: Logan Gunthorpe <logang(a)deltatee.com>
>> > > Acked-by: Luis Chamberlain <mcgrof(a)kernel.org>
>> >
>> > Come to think of it, I'd welcome Iurii to be added as a maintainer,
>> > with the hope Iurii would be up to review only the kunit changes. Of
>> > course if Iurii would be up to also help review future proc changes,
>> > even better. 3 pair of eyeballs is better than 2 pairs.
>>
>> What do you think, Iurii?
I'm in.
## TL;DR
This is a pretty straightforward follow-up to Stephen and Luis' comments
on PATCH v5: There is nothing that really changes any functionality or
usage with the minor exception that I renamed `struct kunit_module` to
`struct kunit_suite`. Nevertheless, a good deal of clean ups and fixes.
See "Changes Since Last Version" section below.
As for our current status, right now we got Reviewed-bys on all patches
except:
- [PATCH v6 08/18] objtool: add kunit_try_catch_throw to the noreturn
list
However, it would probably be good to get reviews/acks from the
subsystem maintainers on:
- [PATCH v6 06/18] kbuild: enable building KUnit
- [PATCH v6 08/18] objtool: add kunit_try_catch_throw to the noreturn
list
- [PATCH v6 15/18] Documentation: kunit: add documentation for KUnit
- [PATCH v6 17/18] kernel/sysctl-test: Add null pointer test for
sysctl.c:proc_dointvec()
Other than that, I think we should be good to go.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want[1]) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in about a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
### What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitudes faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily solving the classic problem
of difficulty in exercising error handling code.
### Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
### More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[2].
Additionally for convenience, I have applied these patches to a
branch[3]. The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.2-rc7/v6 branch.
## Changes Since Last Version
Aside from renaming `struct kunit_module` to `struct kunit_suite`, there
isn't really anything in here that changes any functionality:
- Rebased on v5.2-rc7
- Got rid of spinlocks.
- Now update success field using WRITE_ONCE. - Suggested by Stephen
- Other instances replaced by mutex. - Suggested by Stephen and Luis
- Renamed `struct kunit_module` to `struct kunit_suite`. - Inspired by
Luis.
- Fixed a broken example of how to use kunit_tool. - Pointed out by Ted
- Added descriptions to unit tests. - Suggested by Luis
- Labeled unit tests which tested the API. - Suggested by Luis
- Made a number of minor cleanups. - Suggested by Luis and Stephen.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#kuni…
[2] https://google.github.io/kunit-docs/third_party/kernel/docs/
[3] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.2-rc7/v6
--
2.22.0.410.gd8fdbe21b5-goog
Using ".arm .inst" for the arm signature introduces build issues for
programs compiled in Thumb mode because the assembler stays in the
arm mode for the rest of the inline assembly. Revert to using a ".word"
to express the signature as data instead.
The choice of signature is a valid trap instruction on arm32 little
endian, where both code and data are little endian.
ARMv6+ big endian (BE8) generates mixed endianness code vs data:
little-endian code and big-endian data. The data value of the signature
needs to have its byte order reversed to generate the trap instruction.
Prior to ARMv6, -mbig-endian generates big-endian code and data
(which match), so the endianness of the data representation of the
signature should not be reversed. However, the choice between BE32
and BE8 is done by the linker, so we cannot know whether code and
data endianness will be mixed before the linker is invoked. So rather
than try to play tricks with the linker, the rseq signature is simply
data (not a trap instruction) prior to ARMv6 on big endian. This is
why the signature is expressed as data (.word) rather than as
instruction (.inst) in assembler.
Because a ".word" is used to emit the signature, it will be interpreted
as a literal pool by a disassembler, not as an actual instruction.
Considering that the signature is not meant to be executed except in
scenarios where the program execution is completely bogus, this should
not be an issue.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Acked-by: Will Deacon <will.deacon(a)arm.com>
CC: Peter Zijlstra <peterz(a)infradead.org>
CC: Thomas Gleixner <tglx(a)linutronix.de>
CC: Joel Fernandes <joelaf(a)google.com>
CC: Catalin Marinas <catalin.marinas(a)arm.com>
CC: Dave Watson <davejwatson(a)fb.com>
CC: Will Deacon <will.deacon(a)arm.com>
CC: Shuah Khan <shuah(a)kernel.org>
CC: Andi Kleen <andi(a)firstfloor.org>
CC: linux-kselftest(a)vger.kernel.org
CC: "H . Peter Anvin" <hpa(a)zytor.com>
CC: Chris Lameter <cl(a)linux.com>
CC: Russell King <linux(a)arm.linux.org.uk>
CC: Michael Kerrisk <mtk.manpages(a)gmail.com>
CC: "Paul E . McKenney" <paulmck(a)linux.vnet.ibm.com>
CC: Paul Turner <pjt(a)google.com>
CC: Boqun Feng <boqun.feng(a)gmail.com>
CC: Josh Triplett <josh(a)joshtriplett.org>
CC: Steven Rostedt <rostedt(a)goodmis.org>
CC: Ben Maurer <bmaurer(a)fb.com>
CC: linux-api(a)vger.kernel.org
CC: Andy Lutomirski <luto(a)amacapital.net>
CC: Andrew Morton <akpm(a)linux-foundation.org>
CC: Linus Torvalds <torvalds(a)linux-foundation.org>
CC: Carlos O'Donell <carlos(a)redhat.com>
CC: Florian Weimer <fweimer(a)redhat.com>
---
tools/testing/selftests/rseq/rseq-arm.h | 61 ++++++++++++++++++---------------
1 file changed, 33 insertions(+), 28 deletions(-)
diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
index 84f28f147fb6..5943c816c07c 100644
--- a/tools/testing/selftests/rseq/rseq-arm.h
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -6,6 +6,8 @@
*/
/*
+ * - ARM little endian
+ *
* RSEQ_SIG uses the udf A32 instruction with an uncommon immediate operand
* value 0x5de3. This traps if user-space reaches this instruction by mistake,
* and the uncommon operand ensures the kernel does not move the instruction
@@ -22,36 +24,40 @@
* def3 udf #243 ; 0xf3
* e7f5 b.n <7f5>
*
- * pre-ARMv6 big endian code:
- * e7f5 b.n <7f5>
- * def3 udf #243 ; 0xf3
+ * - ARMv6+ big endian (BE8):
*
* ARMv6+ -mbig-endian generates mixed endianness code vs data: little-endian
- * code and big-endian data. Ensure the RSEQ_SIG data signature matches code
- * endianness. Prior to ARMv6, -mbig-endian generates big-endian code and data
- * (which match), so there is no need to reverse the endianness of the data
- * representation of the signature. However, the choice between BE32 and BE8
- * is done by the linker, so we cannot know whether code and data endianness
- * will be mixed before the linker is invoked.
+ * code and big-endian data. The data value of the signature needs to have its
+ * byte order reversed to generate the trap instruction:
+ *
+ * Data: 0xf3def5e7
+ *
+ * Translates to this A32 instruction pattern:
+ *
+ * e7f5def3 udf #24035 ; 0x5de3
+ *
+ * Translates to this T16 instruction pattern:
+ *
+ * def3 udf #243 ; 0xf3
+ * e7f5 b.n <7f5>
+ *
+ * - Prior to ARMv6 big endian (BE32):
+ *
+ * Prior to ARMv6, -mbig-endian generates big-endian code and data
+ * (which match), so the endianness of the data representation of the
+ * signature should not be reversed. However, the choice between BE32
+ * and BE8 is done by the linker, so we cannot know whether code and
+ * data endianness will be mixed before the linker is invoked. So rather
+ * than try to play tricks with the linker, the rseq signature is simply
+ * data (not a trap instruction) prior to ARMv6 on big endian. This is
+ * why the signature is expressed as data (.word) rather than as
+ * instruction (.inst) in assembler.
*/
-#define RSEQ_SIG_CODE 0xe7f5def3
-
-#ifndef __ASSEMBLER__
-
-#define RSEQ_SIG_DATA \
- ({ \
- int sig; \
- asm volatile ("b 2f\n\t" \
- "1: .inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
- "2:\n\t" \
- "ldr %[sig], 1b\n\t" \
- : [sig] "=r" (sig)); \
- sig; \
- })
-
-#define RSEQ_SIG RSEQ_SIG_DATA
-
+#ifdef __ARMEB__
+#define RSEQ_SIG 0xf3def5e7 /* udf #24035 ; 0x5de3 (ARMv6+) */
+#else
+#define RSEQ_SIG 0xe7f5def3 /* udf #24035 ; 0x5de3 */
#endif
#define rseq_smp_mb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
@@ -125,8 +131,7 @@ do { \
__rseq_str(table_label) ":\n\t" \
".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
- ".arm\n\t" \
- ".inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
+ ".word " __rseq_str(RSEQ_SIG) "\n\t" \
__rseq_str(label) ":\n\t" \
teardown \
"b %l[" __rseq_str(abort_label) "]\n\t"
--
2.11.0
The ftrace test will need to have CONFIG_FTRACE enabled to make the
ftrace directory available.
Add an additional check to skip this test if the CONFIG_FTRACE was not
enabled.
This will be helpful to avoid a false-positive test result when testing
it directly with the following commad against a kernel that does not
have CONFIG_FTRACE enabled:
make -C tools/testing/selftests TARGETS=ftrace run_tests
The test result on an Ubuntu KVM kernel will be changed from:
selftests: ftrace: ftracetest
========================================
Error: No ftrace directory found
not ok 1..1 selftests: ftrace: ftracetest [FAIL]
To:
selftests: ftrace: ftracetest
========================================
CONFIG_FTRACE was not enabled, test skipped.
not ok 1..1 selftests: ftrace: ftracetest [SKIP]
Signed-off-by: Po-Hsu Lin <po-hsu.lin(a)canonical.com>
---
tools/testing/selftests/ftrace/ftracetest | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ftrace/ftracetest b/tools/testing/selftests/ftrace/ftracetest
index 6d5e9e8..6c8322e 100755
--- a/tools/testing/selftests/ftrace/ftracetest
+++ b/tools/testing/selftests/ftrace/ftracetest
@@ -7,6 +7,9 @@
# Written by Masami Hiramatsu <masami.hiramatsu.pt(a)hitachi.com>
#
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
usage() { # errno [message]
[ ! -z "$2" ] && echo $2
echo "Usage: ftracetest [options] [testcase(s)] [testcase-directory(s)]"
@@ -139,7 +142,13 @@ parse_opts $*
# Verify parameters
if [ -z "$TRACING_DIR" -o ! -d "$TRACING_DIR" ]; then
- errexit "No ftrace directory found"
+ ftrace_enabled=`grep "^CONFIG_FTRACE=y" /lib/modules/$(uname -r)/build/.config`
+ if [ -z "$ftrace_enabled" ]; then
+ echo "CONFIG_FTRACE was not enabled, test skipped."
+ exit $ksft_skip
+ else
+ errexit "No ftrace directory found"
+ fi
fi
# Preparing logs
--
2.7.4
## TL;DR
A not so quick follow-up to Stephen's suggestions on PATCH v4. Nothing
that really changes any functionality or usage with the minor exception
of a couple public functions that Stephen asked me to rename.
Nevertheless, a good deal of clean up and fixes. See changes below.
As for our current status, right now we got Reviewed-bys on all patches
except:
- [PATCH v5 08/18] objtool: add kunit_try_catch_throw to the noreturn
list
However, it would probably be good to get reviews/acks from the
subsystem maintainers on:
- [PATCH v5 06/18] kbuild: enable building KUnit
- [PATCH v5 08/18] objtool: add kunit_try_catch_throw to the noreturn
list
- [PATCH v5 15/18] Documentation: kunit: add documentation for KUnit
- [PATCH v5 17/18] kernel/sysctl-test: Add null pointer test for
sysctl.c:proc_dointvec()
- [PATCH v5 18/18] MAINTAINERS: add proc sysctl KUnit test to PROC
SYSCTL section
Other than that, I think we should be good to go.
One last thing, I updated the background to include my thoughts on KUnit
vs. in kernel testing with kselftest in the background sections as
suggested by Frank in the discussion on PATCH v2.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want[1]) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in under a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
### But wait! Doesn't kselftest support in kernel testing?!
In a previous version of this patchset Frank pointed out that kselftest
already supports writing a test that resides in the kernel using the
test module feature[2]. LWN did a really great summary on this
discussion here[3].
Kselftest has a feature that allows a test module to be loaded into a
kernel using the kselftest framework; this does allow someone to write
tests against kernel code not directly exposed to userland; however, it
does not provide much of a framework around how to structure the tests.
The kselftest test module feature just provides a header which has a
standardized way of reporting test failures, and then provides
infrastructure to load and run the tests using the kselftest test
harness.
The kselftest test module does not seem to be opinionated at all in
regards to how tests are structured, how they check for failures, how
tests are organized. Even in the method it provides for reporting
failures is pretty simple; it doesn't have any more advanced failure
reporting or logging features. Given what's there, I think it is fair to
say that it is not actually a framework, but a feature that makes it
possible for someone to do some checks in kernel space.
Furthermore, kselftest test module has very few users. I checked for all
the tests that use it using the following grep command:
grep -Hrn -e 'kselftest_module\.h'
and only got three results: lib/test_strscpy.c, lib/test_printf.c, and
lib/test_bitmap.c.
So despite kselftest test module's existence, there really is no feature
overlap between kselftest and KUnit, save one: that you can use either
to write an in-kernel test, but this is a very small feature in
comparison to everything that KUnit allows you to do. KUnit is a full
x-unit style unit testing framework, whereas kselftest looks a lot more
like an end-to-end/functional testing framework, with a feature that
makes it possible to write in-kernel tests.
### What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitudes faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily solving the classic problem
of difficulty in exercising error handling code.
### Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
### More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[4].
Additionally for convenience, I have applied these patches to a
branch[5]. The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.2-rc4/v5 branch.
## Changes Since Last Version
Aside from a couple public function renames, there isn't really anything
in here that changes any functionality.
- Went through and fixed a couple of anti-patterns suggested by Stephen
Boyd. Things like:
- Dropping an else clause at the end of a function.
- Dropping the comma on the closing sentinel, `{}`, of a list.
- Inlines a bunch of functions in the test case running logic in patch
01/18 to make it more readable as suggested by Stephen Boyd
- Found and fixed bug in resource deallocation logic in patch 02/18. Bug
was discovered as a result of making a change suggested by Stephen
Boyd. This does not substantially change how any of the code works
conceptually.
- Renamed new_string_stream() to alloc_string_stream() as suggested by
Stephen Boyd.
- Made string-stream a KUnit managed object - based on a suggestion made
by Stephen Boyd.
- Renamed kunit_new_stream() to alloc_kunit_stream() as suggested by
Stephen Boyd.
- Removed the ability to set log level after allocating a kunit_stream,
as suggested by Stephen Boyd.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#kuni…
[2] https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html#test-module
[3] https://lwn.net/Articles/790235/
[4] https://google.github.io/kunit-docs/third_party/kernel/docs/
[5] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.2-rc4/v5
--
2.22.0.410.gd8fdbe21b5-goog
Hi
this patchset aims to add the initial arch-specific arm64 support to
kselftest starting with signals-related test-cases.
A common internal test-case layout is proposed which then it is
anyway wired-up to the toplevel kselftest Makefile, so that it
should be possible at the end to run it on an arm64 target in the
usual way with KSFT.
~/linux# make TARGETS=arm64 kselftest
New KSFT arm64 testcases live inside tools/testing/selftests/arm64 grouped by
family inside subdirectories: arm64/signal is the first family proposed with
this series. arm64/signal tests can be run via KSFT or standalone.
Thanks
Cristian
Notes:
-----
- further details in the included READMEs
- more tests still to be written (current strategy is going through the related
Kernel signal-handling code and write a test for each possible and sensible code-path)
- a bit of overlap around KSFT arm64/ Makefiles is expected with this recently
proposed patch-series:
http://lists.infradead.org/pipermail/linux-arm-kernel/2019-June/659432.html
Changes:
--------
v1-->v2:
- rebased on 5.2-rc7
- various makefile's cleanups
- mixed READMEs fixes
- fixed test_arm64_signals.sh runner script
- cleaned up assembly code in signal.S
- improved get_current_context() logic
- fixed SAFE_WRITE()
- common support code splitted into more chunks, each one introduced when
needed by some new testcases
- fixed some headers validation routines in testcases.c
- removed some still broken/immature tests:
+ fake_sigreturn_misaligned
+ fake_sigreturn_overflow_reserved
+ mangle_pc_invalid
+ mangle_sp_misaligned
- fixed some other testcases:
+ mangle_pstate_ssbs_regs: better checks of SSBS bit when feature unsupported
+ mangle_pstate_invalid_compat_toggle: name fix
+ mangle_pstate_invalid_mode_el[1-3]: precautionary zeroing PSTATE.MODE
+ fake_sigreturn_bad_magic, fake_sigreturn_bad_size,
fake_sigreturn_bad_size_for_magic0:
- accounting for available space...dropping extra when needed
- keeping alignent
- new testcases on FPSMID context:
+ fake_sigreturn_missing_fpsimd
+ fake_sigreturn_duplicated_fpsimd
Cristian Marussi (10):
kselftest: arm64: introduce new boilerplate code
kselftest: arm64: adds first test and common utils
kselftest: arm64: mangle_pstate_invalid_daif_bits
kselftest: arm64: mangle_pstate_invalid_mode_el
kselftest: arm64: mangle_pstate_ssbs_regs
kselftest: arm64: fake_sigreturn_bad_magic
kselftest: arm64: fake_sigreturn_bad_size_for_magic0
kselftest: arm64: fake_sigreturn_missing_fpsimd
kselftest: arm64: fake_sigreturn_duplicated_fpsimd
kselftest: arm64: fake_sigreturn_bad_size
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/arm64/Makefile | 51 +++
tools/testing/selftests/arm64/README | 43 +++
.../testing/selftests/arm64/signal/.gitignore | 5 +
tools/testing/selftests/arm64/signal/Makefile | 86 +++++
tools/testing/selftests/arm64/signal/README | 59 ++++
.../testing/selftests/arm64/signal/signals.S | 64 ++++
.../arm64/signal/test_arm64_signals.sh | 55 +++
.../selftests/arm64/signal/test_signals.c | 26 ++
.../selftests/arm64/signal/test_signals.h | 141 ++++++++
.../arm64/signal/test_signals_utils.c | 331 ++++++++++++++++++
.../arm64/signal/test_signals_utils.h | 31 ++
.../arm64/signal/testcases/.gitignore | 11 +
.../testcases/fake_sigreturn_bad_magic.c | 63 ++++
.../testcases/fake_sigreturn_bad_size.c | 85 +++++
.../fake_sigreturn_bad_size_for_magic0.c | 57 +++
.../fake_sigreturn_duplicated_fpsimd.c | 62 ++++
.../testcases/fake_sigreturn_missing_fpsimd.c | 44 +++
.../mangle_pstate_invalid_compat_toggle.c | 25 ++
.../mangle_pstate_invalid_daif_bits.c | 28 ++
.../mangle_pstate_invalid_mode_el1.c | 29 ++
.../mangle_pstate_invalid_mode_el2.c | 29 ++
.../mangle_pstate_invalid_mode_el3.c | 29 ++
.../testcases/mangle_pstate_ssbs_regs.c | 56 +++
.../arm64/signal/testcases/testcases.c | 150 ++++++++
.../arm64/signal/testcases/testcases.h | 83 +++++
26 files changed, 1644 insertions(+)
create mode 100644 tools/testing/selftests/arm64/Makefile
create mode 100644 tools/testing/selftests/arm64/README
create mode 100644 tools/testing/selftests/arm64/signal/.gitignore
create mode 100644 tools/testing/selftests/arm64/signal/Makefile
create mode 100644 tools/testing/selftests/arm64/signal/README
create mode 100644 tools/testing/selftests/arm64/signal/signals.S
create mode 100755 tools/testing/selftests/arm64/signal/test_arm64_signals.sh
create mode 100644 tools/testing/selftests/arm64/signal/test_signals.c
create mode 100644 tools/testing/selftests/arm64/signal/test_signals.h
create mode 100644 tools/testing/selftests/arm64/signal/test_signals_utils.c
create mode 100644 tools/testing/selftests/arm64/signal/test_signals_utils.h
create mode 100644 tools/testing/selftests/arm64/signal/testcases/.gitignore
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_bad_magic.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_bad_size.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_bad_size_for_magic0.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_duplicated_fpsimd.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_missing_fpsimd.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_compat_toggle.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_daif_bits.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el1.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el2.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el3.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_ssbs_regs.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/testcases.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/testcases.h
--
2.17.1
This patch series moves Hyper-V clock/timer code to a separate Hyper-V
clocksource driver. Previously, Hyper-V clock/timer code and data
structures were mixed in with other Hyper-V code in the ISA independent
drivers/hv code as well as in ISA dependent code. The new Hyper-V
clocksource driver is ISA agnostic, with a just few dependencies on
ISA specific functions. The patch series does not change any behavior
or functionality -- it only reorganizes the existing code and fixes up
the linkages. A few places outside of Hyper-V code are fixed up to use
the new #include file structure.
This restructuring is in response to Marc Zyngier's review comments
on supporting Hyper-V running on ARM64, and is a good idea in general.
It increases the amount of code shared between the x86 and ARM64
architectures, and reduces the size of the new code for supporting
Hyper-V on ARM64. A new version of the Hyper-V on ARM64 patches will
follow once this clocksource restructuring is accepted.
The code is diff'ed against the upstream tip tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/vdso
Changes in v5:
* Revised commit summaries [Thomas Gleixner]
* Removed call to clockevents_unbind_device() [Thomas Gleixner]
* Restructured hv_init_clocksource() [Thomas Gleixner]
* Various other small code cleanups [Thomas Gleixner]
Changes in v4:
* Revised commit messages
* Rebased to upstream tip tree
Changes in v3:
* Removed boolean argument to hv_init_clocksource(). Always call
sched_clock_register, which is needed on ARM64 but a no-op on x86.
* Removed separate cpuhp setup in hv_stimer_alloc() and instead
directly call hv_stimer_init() and hv_stimer_cleanup() from
corresponding VMbus functions. This more closely matches original
code and avoids clocksource stop/restart problems on ARM64 when
VMbus code denies CPU offlining request.
Changes in v2:
* Revised commit short descriptions so the distinction between
the first and second patches is clearer [GregKH]
* Renamed new clocksource driver files and functions to use
existing "timer" and "stimer" names instead of introducing
"syntimer". [Vitaly Kuznetsov]
* Introduced CONFIG_HYPER_TIMER to fix build problem when
CONFIG_HYPERV=m [Vitaly Kuznetsov]
* Added "Suggested-by: Marc Zyngier"
Michael Kelley (2):
clocksource/drivers: Make Hyper-V clocksource ISA agnostic
clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
MAINTAINERS | 2 +
arch/x86/entry/vdso/vma.c | 2 +-
arch/x86/hyperv/hv_init.c | 91 +--------
arch/x86/include/asm/hyperv-tlfs.h | 6 +
arch/x86/include/asm/mshyperv.h | 81 +-------
arch/x86/include/asm/vdso/gettimeofday.h | 2 +-
arch/x86/kernel/cpu/mshyperv.c | 4 +-
arch/x86/kvm/x86.c | 1 +
drivers/clocksource/Makefile | 1 +
drivers/clocksource/hyperv_timer.c | 339 +++++++++++++++++++++++++++++++
drivers/hv/Kconfig | 3 +
drivers/hv/hv.c | 156 +-------------
drivers/hv/hv_util.c | 1 +
drivers/hv/hyperv_vmbus.h | 3 -
drivers/hv/vmbus_drv.c | 42 ++--
include/clocksource/hyperv_timer.h | 105 ++++++++++
16 files changed, 503 insertions(+), 336 deletions(-)
create mode 100644 drivers/clocksource/hyperv_timer.c
create mode 100644 include/clocksource/hyperv_timer.h
--
1.8.3.1
The psock_tpacket test will need to access /proc/kallsyms, this would
require the kernel config CONFIG_KALLSYMS to be enabled first.
Apart from adding CONFIG_KALLSYMS to the net/config file here, check the
file existence to determine if we can run this test will be helpful to
avoid a false-positive test result when testing it directly with the
following commad against a kernel that have CONFIG_KALLSYMS disabled:
make -C tools/testing/selftests TARGETS=net run_tests
Signed-off-by: Po-Hsu Lin <po-hsu.lin(a)canonical.com>
---
tools/testing/selftests/net/config | 1 +
tools/testing/selftests/net/run_afpackettests | 14 +++++++++-----
2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 4740404..3dea2cb 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -25,3 +25,4 @@ CONFIG_NF_TABLES_IPV6=y
CONFIG_NF_TABLES_IPV4=y
CONFIG_NFT_CHAIN_NAT_IPV6=m
CONFIG_NFT_CHAIN_NAT_IPV4=m
+CONFIG_KALLSYMS=y
diff --git a/tools/testing/selftests/net/run_afpackettests b/tools/testing/selftests/net/run_afpackettests
index ea5938e..8b42e8b 100755
--- a/tools/testing/selftests/net/run_afpackettests
+++ b/tools/testing/selftests/net/run_afpackettests
@@ -21,12 +21,16 @@ fi
echo "--------------------"
echo "running psock_tpacket test"
echo "--------------------"
-./in_netns.sh ./psock_tpacket
-if [ $? -ne 0 ]; then
- echo "[FAIL]"
- ret=1
+if [ -f /proc/kallsyms ]; then
+ ./in_netns.sh ./psock_tpacket
+ if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ ret=1
+ else
+ echo "[PASS]"
+ fi
else
- echo "[PASS]"
+ echo "[SKIP] CONFIG_KALLSYMS not enabled"
fi
echo "--------------------"
--
2.7.4
From: Colin Ian King <colin.king(a)canonical.com>
There is an spelling mistake in an a test error message. Fix it.
Signed-off-by: Colin Ian King <colin.king(a)canonical.com>
---
tools/testing/selftests/x86/test_vsyscall.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/x86/test_vsyscall.c b/tools/testing/selftests/x86/test_vsyscall.c
index 4602326b8f5b..a4f4d4cf22c3 100644
--- a/tools/testing/selftests/x86/test_vsyscall.c
+++ b/tools/testing/selftests/x86/test_vsyscall.c
@@ -451,7 +451,7 @@ static int test_vsys_x(void)
printf("[OK]\tExecuting the vsyscall page failed: #PF(0x%lx)\n",
segv_err);
} else {
- printf("[FAILT]\tExecution failed with the wrong error: #PF(0x%lx)\n",
+ printf("[FAIL]\tExecution failed with the wrong error: #PF(0x%lx)\n",
segv_err);
return 1;
}
--
2.20.1
Commit a745f7af3cbd ("selftests/harness: Add 30 second timeout per
test") solves that binary tests doesn't hang forever. However, scripts
can still hang forever, this adds an timeout to each test script run. This
assumes that an individual test doesn't take longer than 30 seconds.
Signed-off-by: Anders Roxell <anders.roxell(a)linaro.org>
---
tools/testing/selftests/kselftest/runner.sh | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kselftest/runner.sh b/tools/testing/selftests/kselftest/runner.sh
index 00c9020bdda8..cff7d2d83648 100644
--- a/tools/testing/selftests/kselftest/runner.sh
+++ b/tools/testing/selftests/kselftest/runner.sh
@@ -5,6 +5,7 @@
export skip_rc=4
export logfile=/dev/stdout
export per_test_logging=
+export TEST_TIMEOUT_DEFAULT=30
# There isn't a shell-agnostic way to find the path of a sourced file,
# so we must rely on BASE_DIR being set to find other tools.
@@ -24,6 +25,14 @@ tap_prefix()
fi
}
+tap_timeout()
+{
+ if [ -x /usr/bin/timeout ] && [ -x "$BASENAME_TEST" ] \
+ && file $BASENAME_TEST |grep -q "shell script"; then
+ echo -n "timeout $TEST_TIMEOUT_DEFAULT"
+ fi
+}
+
run_one()
{
DIR="$1"
@@ -44,7 +53,7 @@ run_one()
echo "not ok $test_num $TEST_HDR_MSG"
else
cd `dirname $TEST` > /dev/null
- (((((./$BASENAME_TEST 2>&1; echo $? >&3) |
+ ((((( tap_timeout ./$BASENAME_TEST 2>&1; echo $? >&3) |
tap_prefix >&4) 3>&1) |
(read xs; exit $xs)) 4>>"$logfile" &&
echo "ok $test_num $TEST_HDR_MSG") ||
--
2.11.0
The t->rcu_read_unlock_special union's need_qs bit can be set by the
scheduler tick (in rcu_flavor_sched_clock_irq) to indicate that help is
needed from the rcu_read_unlock path. When this help arrives however, we
can do better to speed up the quiescent state reporting which if
rcu_read_unlock_special::need_qs is set might be quite urgent. Make use
of this information in deciding when to do heavy-weight softirq raising
where possible.
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
kernel/rcu/tree_plugin.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index c588ef98efd3..bff6410fac06 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -622,7 +622,8 @@ static void rcu_read_unlock_special(struct task_struct *t)
t->rcu_read_unlock_special.b.exp_hint = false;
exp = (t->rcu_blocked_node && t->rcu_blocked_node->exp_tasks) ||
(rdp->grpmask & rnp->expmask) ||
- tick_nohz_full_cpu(rdp->cpu);
+ tick_nohz_full_cpu(rdp->cpu) ||
+ t->rcu_read_unlock_special.b.need_qs;
// Need to defer quiescent state until everything is enabled.
if (irqs_were_disabled && use_softirq &&
(in_interrupt() ||
--
2.22.0.410.gd8fdbe21b5-goog
Hello,
The proposed introduction of a relaxed ARM64 ABI [1] will allow tagged memory
addresses to be passed through the user-kernel syscall ABI boundary. Tagged
memory addresses are those which contain a non-zero top byte (the hardware
has always ignored this top byte due to TCR_EL1.TBI0) and may be useful
for features such as HWASan.
To permit this relaxation a proposed patchset [2] strips the top byte (tag)
from user provided memory addresses prior to use in kernel functions which
require untagged addresses (for example comparasion/arithmetic of addresses).
The author of this patchset relied on a variety of techniques [2] (such as
grep, BUG_ON, sparse etc) to identify as many instances of possible where
tags need to be stipped.
To support this effort and to catch future regressions (e.g. in new syscalls
or ioctls), I've devised an additional approach for detecting the use of
tagged addresses in functions that do not want them. This approach makes
use of Smatch [3] and is outlined in this RFC. Due to the ability of Smatch
to do flow analysis I believe we can annotate the kernel in fewer places
than a similar approach in sparse.
I'm keen for feedback on the likely usefulness of this approach.
We first add some new annotations that are exclusively consumed by Smatch:
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -19,6 +19,7 @@
# define __cond_lock(x,c) ((c) ? ({ __acquire(x); 1; }) : 0)
# define __percpu __attribute__((noderef, address_space(3)))
# define __rcu __attribute__((noderef, address_space(4)))
+# define __untagged __attribute__((address_space(5)))
# define __private __attribute__((noderef))
extern void __chk_user_ptr(const volatile void __user *);
extern void __chk_io_ptr(const volatile void __iomem *);
@@ -45,6 +46,7 @@ extern void __chk_io_ptr(const volatile void __iomem *);
...
The purpose of this annotation is to indicate in function prototypes that a
given argument must not be a user tagged memory address. (The address space
number isn't significant here and could be replaced with any other annotation
that we get Smatch to understand).
An example of how we use this annotation is as follows:
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2224,7 +2224,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
EXPORT_SYMBOL(get_unmapped_area);
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long __untagged addr)
{
struct rb_node *rb_node;
struct vm_area_struct *vma;
Thus our intent here is to flag up whenever addr is tagged. Specifically
modifications I've made to Smatch to test this will flag an issue where all
the following conditions are met:
1. the parameter is used in the function
2. the data in the parameter has originated or been derived from userspace
(this relies on existing Smatch functionality to detect where data has
come from)
3. the data's top byte is non-zero (via flow analysis to determine the
range of values that it may be given the known call-tree).
Due to the use of smatch and its flow-analysis we don't need to propogate the
__untagged annotation up the call chain to the callers and their callers - thus
allowing us to only annotate functions that actually does something with the address
and it's only necessary if the function has the potential to receive user data. I.e.
we only need to tag find_vma to catch an issue with mmap_region (because it calls
count_vma_pages_range which calls find_vma_intersection which calls find_vma).
Due to condition 3 above, the use of the existing untagged_addr macro (or anything
that does something similar) will prevent smatch from producing a warning.
Using a vanilla (v5.2-rc2) kernel, and a single find_vma annotation, smatch will
produce the following warnings:
mm/mmap.c:2227 find_vma() warn: Variable addr looks like a tagged address - it is not allowed here
mm/mmap.c:2227 find_vma() warn: Variable addr looks like a tagged address - it is not allowed here
mm/mmap.c:2227 find_vma() warn: Variable addr looks like a tagged address - it is not allowed here
...
The warning is printed for each call site that calls find_vma with a tagged
address from userspace. After 6 runs of smatch, 24 warnings are produced.
The warnings are helpful in detecting issues, but not useful in providing enough
information to debug the issue and find the offending functions higher up the
call stack that call find_vma. Smatch is good at determining the ranges of values
that can be passed to a function, but it doesn't keep track of how it determined
those ranges - this makes it difficult to determine the offending function.
However even this level of limited functionality is helpful - as once the kernel
is initially sanitized of tagged addresses, then the use of smatch here can
spot regressions and offending code identified via git aiaiai/bisect.
Smatch builds a database which includes a table of functions, who they are
called by and with what range of parameters. Smatch also provides a bunch
of perl and python scripts which can be used to extract helpful information
for example to produce a call tree for a given function. I've adapted these
scripts such that for a given function (e.g. find_vma) it will show you the
call tree where callers pass user data and where that data is tagged addresses.
The output looks something like this:
find_vma() - 0-u64max (1)
kvm_arch_prepare_memory_region() - 0-u64max (1)
__kvm_set_memory_region() - 0-u64max (1)
kvm_set_memory_region() - 0-u64max (1)
kvm_vm_ioctl_set_memory_region() - 0-u64max (1)
hugetlb_get_unmapped_area() - 0-u64max (1)
shm_get_unmapped_area() - 0-u64max (1)
shm_get_unmapped_area() - 32785,40977,98321,106513,2097151-u64max[c] (1)
...
In summary the following are found, note this currently unhelpfuly includes
functions inbetween find_vma and the leaf functions:
$ cat find_vma_tree_orig | sed -e 's/^[ \t]*//' | cut -d ' ' -f 1 | sort | uniq
call_mmap()
check_and_migrate_cma_pages()
compat_ipv6_getsockopt()
compat_sock_common_getsockopt()
compat_tcp_getsockopt()
count_vma_pages_range()
__do_compat_sys_get_mempolicy()
do_get_mempolicy()
do_ioctl()
do_mincore()
do_mlock()
do_mmap()
do_mmap_pgoff()
...
As you can see, this gives a good point in the right direction for hunting
down callers of find_vma with tagged addresses.
This can be further improved - the problem here is that for a given function,
e.g. find_vma we look for callers where *any* of the parameters
passed to find_vma are tagged addresses from userspace - i.e. not *just*
the annotated parameter. This is also true for find_vma's callers' callers'.
This results in the call tree having false positives.
It *is* possible to track parameters (e.g. find_vma arg 1 comes from arg 3 of
do_pages_stat_array etc), but this is limited as if functions modify the
data then the tracking is stopped (however this can be fixed).
After apply the patchset ([PATCH v16 00/16] arm64: untag user pointers passed
to the kernel)[2] which untags user addresses, smatch indicates 11 issues. The
call tree is reduced. After grep'ing the call tree output, there are some valid
instances where untagging is needed, e.g:
gntdev_ioctl_get_offset_for_vaddr()
kvm_vm_ioctl_set_memory_region()
privcmd_ioctl_mmap_batch()
privcmd_ioctl_mmap_resource()
__se_sys_brk()
__se_sys_mremap()
__se_sys_munmap()
__se_sys_remap_file_pages()
__se_sys_shmat()
__se_sys_shmdt()
__vm_munmap()
...
An example of a false positve is do_mlock. We untag the address and pass that
to apply_vma_lock_flags - however we also pass a length - because the length
came from userspace and could have the top bits set - it's flagged. However
with improved parameter tracking we can remove this false positive and similar.
Prior to smatch I attempted a similar approach with sparse - however it seemed
necessary to propogate the __untagged annotation in every function up the call tree,
and resulted in adding the __untagged annotation to functions that would never
get near user provided data. This leads to a littering of __untagged all over the
kernel which doesn't seem appealing. Smatch is more capable, however it almost
certainly won't pick up 100% of issues due to the difficulity of making flow
analysis understand everything a compiler can.
Is it likely to be acceptable to use the __untagged annotation in user-path
functions that require untagged addresses across the kernel?
Thanks,
Andrew Murray
[1] https://lkml.org/lkml/2019/6/13/534
[2] https://patchwork.kernel.org/cover/10989517/
[3] http://smatch.sourceforge.net/
vDSO (virtual dynamic shared object) is a mechanism that the Linux
kernel provides as an alternative to system calls to reduce where
possible the costs in terms of cycles.
This is possible because certain syscalls like gettimeofday() do
not write any data and return one or more values that are stored
in the kernel, which makes relatively safe calling them directly
as a library function.
Even if the mechanism is pretty much standard, every architecture
in the last few years ended up implementing their own vDSO library
in the architectural code.
The purpose of this patch-set is to identify the commonalities in
between the architectures and try to consolidate the common code
paths, starting with gettimeofday().
This implementation contains the following design choices:
* Every architecture defines the arch specific code in an header in
"asm/vdso/".
* The generic implementation includes the arch specific one and lives
in "lib/vdso".
* The arch specific code for gettimeofday lives in
"<arch path>/vdso/gettimeofday.c" and includes the generic code only.
* The generic implementation of update_vsyscall and update_vsyscall_tz
lives in kernel/vdso and provide the bindings that can be implemented
by each architecture.
* Each architecture provides its implementation of the bindings in
"asm/vdso/vsyscall.h".
* This approach allows to consolidate the common code in a single place
with the benefit of avoiding code duplication.
This implementation contains the portings to the common library for: arm64,
compat mode for arm64, arm, mips, x86_64, x32, compat mode for x86_64 and
i386.
The mips porting has been tested on qemu for mips32el. A configuration to
repeat the tests can be found at [4].
The x86_64 porting has been tested on an Intel Xeon 5120T based machine
running Ubuntu 18.04 and using the Ubuntu provided defconfig.
The i386 porting has been tested on qemu using the i386_defconfig
configuration.
Last but not least from this porting arm64, compat arm64, arm and mips gain
the support for:
* CLOCK_BOOTTIME that can be useful in certain scenarios since it keeps
track of the time during sleep as well.
* CLOCK_TAI that is like CLOCK_REALTIME, but uses the International
Atomic Time (TAI) reference instead of UTC to avoid jumping on leap
second updates.
for both clock_gettime and clock_getres.
The porting has been validated using the vdsotest test-suite [1] extended
to cover all the clock ids [2].
A new test has been added to the linux kselftest in order to validate the
newly added library.
The porting has been benchmarked and the performance results are
provided as part of this cover letter.
To simplify the testing, a copy of the patchset on top of a recent linux
tree can be found at [3] and [4].
[1] https://github.com/nathanlynch/vdsotest
[2] https://github.com/fvincenzo/vdsotest
[3] git://linux-arm.org/linux-vf.git vdso/v6
[4] git://linux-arm.org/linux-vf.git vdso-mips/v6
Changes:
--------
v6:
- Rebased on 5.2-rc2.
- Added performance numbers.
- Removed vdso_types.h.
- Unified update_vsyscall and update_vsyscall_tz.
- Reworked the kselftest included in this patchset.
- Addressed review comments.
v5:
- Rebased on 5.0-rc7.
- Added x86_64, compat mode for x86_64 and i386 portings.
- Extended vDSO kselftest.
- Addressed review comments.
v4:
- Rebased on 5.0-rc2.
- Addressed review comments.
- Disabled compat vdso on arm64 when the kernel is compiled with
clang.
v3:
- Ported the latest fixes and optimizations done on the x86
architecture to the generic library.
- Addressed review comments.
- Improved the documentation of the interfaces.
- Changed the HAVE_ARCH_TIMER config option to a more generic
HAVE_HW_COUNTER.
v2:
- Added -ffixed-x18 to arm64
- Repleced occurrences of timeval and timespec
- Modified datapage.h to be compliant with y2038 on all the architectures
- Removed __u_vdso type
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Russell King <linux(a)armlinux.org.uk>
Cc: Ralf Baechle <ralf(a)linux-mips.org>
Cc: Paul Burton <paul.burton(a)mips.com>
Cc: Daniel Lezcano <daniel.lezcano(a)linaro.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Mark Salyzyn <salyzyn(a)android.com>
Cc: Peter Collingbourne <pcc(a)google.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Dmitry Safonov <0x7f454c46(a)gmail.com>
Cc: Rasmus Villemoes <linux(a)rasmusvillemoes.dk>
Cc: Huw Davies <huw(a)codeweavers.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Performance Numbers: Linux 5.2.0-rc2 - Xeon Gold 5120T
======================================================
Unified vDSO:
-------------
clock-gettime-monotonic: syscall: 342 nsec/call
clock-gettime-monotonic: libc: 25 nsec/call
clock-gettime-monotonic: vdso: 24 nsec/call
clock-getres-monotonic: syscall: 296 nsec/call
clock-getres-monotonic: libc: 296 nsec/call
clock-getres-monotonic: vdso: 3 nsec/call
clock-gettime-monotonic-coarse: syscall: 294 nsec/call
clock-gettime-monotonic-coarse: libc: 5 nsec/call
clock-gettime-monotonic-coarse: vdso: 5 nsec/call
clock-getres-monotonic-coarse: syscall: 295 nsec/call
clock-getres-monotonic-coarse: libc: 292 nsec/call
clock-getres-monotonic-coarse: vdso: 5 nsec/call
clock-gettime-monotonic-raw: syscall: 343 nsec/call
clock-gettime-monotonic-raw: libc: 25 nsec/call
clock-gettime-monotonic-raw: vdso: 23 nsec/call
clock-getres-monotonic-raw: syscall: 290 nsec/call
clock-getres-monotonic-raw: libc: 290 nsec/call
clock-getres-monotonic-raw: vdso: 4 nsec/call
clock-gettime-tai: syscall: 332 nsec/call
clock-gettime-tai: libc: 24 nsec/call
clock-gettime-tai: vdso: 23 nsec/call
clock-getres-tai: syscall: 288 nsec/call
clock-getres-tai: libc: 288 nsec/call
clock-getres-tai: vdso: 3 nsec/call
clock-gettime-boottime: syscall: 342 nsec/call
clock-gettime-boottime: libc: 24 nsec/call
clock-gettime-boottime: vdso: 23 nsec/call
clock-getres-boottime: syscall: 284 nsec/call
clock-getres-boottime: libc: 291 nsec/call
clock-getres-boottime: vdso: 3 nsec/call
clock-gettime-realtime: syscall: 337 nsec/call
clock-gettime-realtime: libc: 24 nsec/call
clock-gettime-realtime: vdso: 23 nsec/call
clock-getres-realtime: syscall: 287 nsec/call
clock-getres-realtime: libc: 284 nsec/call
clock-getres-realtime: vdso: 3 nsec/call
clock-gettime-realtime-coarse: syscall: 307 nsec/call
clock-gettime-realtime-coarse: libc: 4 nsec/call
clock-gettime-realtime-coarse: vdso: 4 nsec/call
clock-getres-realtime-coarse: syscall: 294 nsec/call
clock-getres-realtime-coarse: libc: 291 nsec/call
clock-getres-realtime-coarse: vdso: 4 nsec/call
getcpu: syscall: 246 nsec/call
getcpu: libc: 14 nsec/call
getcpu: vdso: 11 nsec/call
gettimeofday: syscall: 293 nsec/call
gettimeofday: libc: 26 nsec/call
gettimeofday: vdso: 25 nsec/call
Stock Kernel:
-------------
clock-gettime-monotonic: syscall: 338 nsec/call
clock-gettime-monotonic: libc: 24 nsec/call
clock-gettime-monotonic: vdso: 23 nsec/call
clock-getres-monotonic: syscall: 291 nsec/call
clock-getres-monotonic: libc: 304 nsec/call
clock-getres-monotonic: vdso: not tested
Note: vDSO version of clock_getres not found
clock-gettime-monotonic-coarse: syscall: 297 nsec/call
clock-gettime-monotonic-coarse: libc: 5 nsec/call
clock-gettime-monotonic-coarse: vdso: 4 nsec/call
clock-getres-monotonic-coarse: syscall: 281 nsec/call
clock-getres-monotonic-coarse: libc: 286 nsec/call
clock-getres-monotonic-coarse: vdso: not tested
Note: vDSO version of clock_getres not found
clock-gettime-monotonic-raw: syscall: 336 nsec/call
clock-gettime-monotonic-raw: libc: 340 nsec/call
clock-gettime-monotonic-raw: vdso: 346 nsec/call
clock-getres-monotonic-raw: syscall: 297 nsec/call
clock-getres-monotonic-raw: libc: 301 nsec/call
clock-getres-monotonic-raw: vdso: not tested
Note: vDSO version of clock_getres not found
clock-gettime-tai: syscall: 351 nsec/call
clock-gettime-tai: libc: 24 nsec/call
clock-gettime-tai: vdso: 23 nsec/call
clock-getres-tai: syscall: 298 nsec/call
clock-getres-tai: libc: 290 nsec/call
clock-getres-tai: vdso: not tested
Note: vDSO version of clock_getres not found
clock-gettime-boottime: syscall: 342 nsec/call
clock-gettime-boottime: libc: 347 nsec/call
clock-gettime-boottime: vdso: 355 nsec/call
clock-getres-boottime: syscall: 296 nsec/call
clock-getres-boottime: libc: 295 nsec/call
clock-getres-boottime: vdso: not tested
Note: vDSO version of clock_getres not found
clock-gettime-realtime: syscall: 346 nsec/call
clock-gettime-realtime: libc: 24 nsec/call
clock-gettime-realtime: vdso: 22 nsec/call
clock-getres-realtime: syscall: 295 nsec/call
clock-getres-realtime: libc: 291 nsec/call
clock-getres-realtime: vdso: not tested
Note: vDSO version of clock_getres not found
clock-gettime-realtime-coarse: syscall: 292 nsec/call
clock-gettime-realtime-coarse: libc: 5 nsec/call
clock-gettime-realtime-coarse: vdso: 4 nsec/call
clock-getres-realtime-coarse: syscall: 300 nsec/call
clock-getres-realtime-coarse: libc: 301 nsec/call
clock-getres-realtime-coarse: vdso: not tested
Note: vDSO version of clock_getres not found
getcpu: syscall: 252 nsec/call
getcpu: libc: 14 nsec/call
getcpu: vdso: 11 nsec/call
gettimeofday: syscall: 293 nsec/call
gettimeofday: libc: 24 nsec/call
gettimeofday: vdso: 25 nsec/call
Peter Collingbourne (1):
arm64: Build vDSO with -ffixed-x18
Vincenzo Frascino (18):
kernel: Standardize vdso_datapage
kernel: Define gettimeofday vdso common code
kernel: Unify update_vsyscall implementation
arm64: Substitute gettimeofday with C implementation
arm64: compat: Add missing syscall numbers
arm64: compat: Expose signal related structures
arm64: compat: Generate asm offsets for signals
lib: vdso: Add compat support
arm64: compat: Add vDSO
arm64: Refactor vDSO code
arm64: compat: vDSO setup for compat layer
arm64: elf: vDSO code page discovery
arm64: compat: Get sigreturn trampolines from vDSO
arm64: Add vDSO compat support
arm: Add support for generic vDSO
mips: Add support for generic vDSO
x86: Add support for generic vDSO
kselftest: Extend vDSO selftest
arch/arm/Kconfig | 3 +
arch/arm/include/asm/vdso/gettimeofday.h | 96 +++++
arch/arm/include/asm/vdso/vsyscall.h | 71 ++++
arch/arm/include/asm/vdso_datapage.h | 29 +-
arch/arm/kernel/vdso.c | 87 +----
arch/arm/vdso/Makefile | 13 +-
arch/arm/vdso/note.c | 15 +
arch/arm/vdso/vdso.lds.S | 2 +
arch/arm/vdso/vgettimeofday.c | 268 +------------
arch/arm64/Kconfig | 3 +
arch/arm64/Makefile | 23 +-
arch/arm64/include/asm/elf.h | 14 +
arch/arm64/include/asm/signal32.h | 46 +++
arch/arm64/include/asm/unistd.h | 5 +
arch/arm64/include/asm/vdso.h | 3 +
arch/arm64/include/asm/vdso/compat_barrier.h | 51 +++
.../include/asm/vdso/compat_gettimeofday.h | 108 ++++++
arch/arm64/include/asm/vdso/gettimeofday.h | 84 +++++
arch/arm64/include/asm/vdso/vsyscall.h | 53 +++
arch/arm64/include/asm/vdso_datapage.h | 48 ---
arch/arm64/kernel/Makefile | 6 +-
arch/arm64/kernel/asm-offsets.c | 39 +-
arch/arm64/kernel/signal32.c | 72 ++--
arch/arm64/kernel/vdso.c | 356 ++++++++++++------
arch/arm64/kernel/vdso/Makefile | 34 +-
arch/arm64/kernel/vdso/gettimeofday.S | 334 ----------------
arch/arm64/kernel/vdso/vgettimeofday.c | 28 ++
arch/arm64/kernel/vdso32/.gitignore | 2 +
arch/arm64/kernel/vdso32/Makefile | 184 +++++++++
arch/arm64/kernel/vdso32/note.c | 15 +
arch/arm64/kernel/vdso32/sigreturn.S | 62 +++
arch/arm64/kernel/vdso32/vdso.S | 19 +
arch/arm64/kernel/vdso32/vdso.lds.S | 82 ++++
arch/arm64/kernel/vdso32/vgettimeofday.c | 59 +++
arch/mips/Kconfig | 2 +
arch/mips/include/asm/vdso.h | 78 +---
arch/mips/include/asm/vdso/gettimeofday.h | 175 +++++++++
arch/mips/{ => include/asm}/vdso/vdso.h | 6 +-
arch/mips/include/asm/vdso/vsyscall.h | 43 +++
arch/mips/kernel/vdso.c | 37 +-
arch/mips/vdso/Makefile | 25 +-
arch/mips/vdso/elf.S | 2 +-
arch/mips/vdso/gettimeofday.c | 273 --------------
arch/mips/vdso/sigreturn.S | 2 +-
arch/mips/vdso/vdso.lds.S | 4 +
arch/mips/vdso/vgettimeofday.c | 57 +++
arch/x86/Kconfig | 3 +
arch/x86/entry/vdso/Makefile | 9 +
arch/x86/entry/vdso/vclock_gettime.c | 251 +++---------
arch/x86/entry/vdso/vdso.lds.S | 2 +
arch/x86/entry/vdso/vdso32/vdso32.lds.S | 2 +
arch/x86/entry/vdso/vdsox32.lds.S | 1 +
arch/x86/entry/vsyscall/Makefile | 2 -
arch/x86/entry/vsyscall/vsyscall_gtod.c | 83 ----
arch/x86/include/asm/mshyperv-tsc.h | 76 ++++
arch/x86/include/asm/mshyperv.h | 70 +---
arch/x86/include/asm/pvclock.h | 2 +-
arch/x86/include/asm/vdso/gettimeofday.h | 203 ++++++++++
arch/x86/include/asm/vdso/vsyscall.h | 44 +++
arch/x86/include/asm/vgtod.h | 75 +---
arch/x86/include/asm/vvar.h | 7 +-
arch/x86/kernel/pvclock.c | 1 +
include/asm-generic/vdso/vsyscall.h | 56 +++
include/linux/hrtimer.h | 15 +-
include/linux/hrtimer_defs.h | 25 ++
include/linux/timekeeper_internal.h | 9 +
include/vdso/datapage.h | 91 +++++
include/vdso/helpers.h | 56 +++
include/vdso/vsyscall.h | 11 +
kernel/Makefile | 1 +
kernel/vdso/Makefile | 2 +
kernel/vdso/vsyscall.c | 139 +++++++
lib/Kconfig | 5 +
lib/vdso/Kconfig | 36 ++
lib/vdso/Makefile | 22 ++
lib/vdso/gettimeofday.c | 229 +++++++++++
tools/testing/selftests/vDSO/Makefile | 2 +
tools/testing/selftests/vDSO/vdso_full_test.c | 261 +++++++++++++
78 files changed, 3042 insertions(+), 1767 deletions(-)
create mode 100644 arch/arm/include/asm/vdso/gettimeofday.h
create mode 100644 arch/arm/include/asm/vdso/vsyscall.h
create mode 100644 arch/arm/vdso/note.c
create mode 100644 arch/arm64/include/asm/vdso/compat_barrier.h
create mode 100644 arch/arm64/include/asm/vdso/compat_gettimeofday.h
create mode 100644 arch/arm64/include/asm/vdso/gettimeofday.h
create mode 100644 arch/arm64/include/asm/vdso/vsyscall.h
delete mode 100644 arch/arm64/include/asm/vdso_datapage.h
delete mode 100644 arch/arm64/kernel/vdso/gettimeofday.S
create mode 100644 arch/arm64/kernel/vdso/vgettimeofday.c
create mode 100644 arch/arm64/kernel/vdso32/.gitignore
create mode 100644 arch/arm64/kernel/vdso32/Makefile
create mode 100644 arch/arm64/kernel/vdso32/note.c
create mode 100644 arch/arm64/kernel/vdso32/sigreturn.S
create mode 100644 arch/arm64/kernel/vdso32/vdso.S
create mode 100644 arch/arm64/kernel/vdso32/vdso.lds.S
create mode 100644 arch/arm64/kernel/vdso32/vgettimeofday.c
create mode 100644 arch/mips/include/asm/vdso/gettimeofday.h
rename arch/mips/{ => include/asm}/vdso/vdso.h (90%)
create mode 100644 arch/mips/include/asm/vdso/vsyscall.h
delete mode 100644 arch/mips/vdso/gettimeofday.c
create mode 100644 arch/mips/vdso/vgettimeofday.c
delete mode 100644 arch/x86/entry/vsyscall/vsyscall_gtod.c
create mode 100644 arch/x86/include/asm/mshyperv-tsc.h
create mode 100644 arch/x86/include/asm/vdso/gettimeofday.h
create mode 100644 arch/x86/include/asm/vdso/vsyscall.h
create mode 100644 include/asm-generic/vdso/vsyscall.h
create mode 100644 include/linux/hrtimer_defs.h
create mode 100644 include/vdso/datapage.h
create mode 100644 include/vdso/helpers.h
create mode 100644 include/vdso/vsyscall.h
create mode 100644 kernel/vdso/Makefile
create mode 100644 kernel/vdso/vsyscall.c
create mode 100644 lib/vdso/Kconfig
create mode 100644 lib/vdso/Makefile
create mode 100644 lib/vdso/gettimeofday.c
create mode 100644 tools/testing/selftests/vDSO/vdso_full_test.c
--
2.21.0
This patch series moves Hyper-V clock/timer code to a separate Hyper-V
clocksource driver. Previously, Hyper-V clock/timer code and data
structures were mixed in with other Hyper-V code in the ISA independent
drivers/hv code as well as in ISA dependent code. The new Hyper-V
clocksource driver is ISA independent, with a just few dependencies on
ISA specific functions. The patch series does not change any behavior
or functionality -- it only reorganizes the existing code and fixes up
the linkages. A few places outside of Hyper-V code are fixed up to use
the new #include file structure.
This restructuring is in response to Marc Zyngier's review comments
on supporting Hyper-V running on ARM64, and is a good idea in general.
It increases the amount of code shared between the x86 and ARM64
architectures, and reduces the size of the new code for supporting
Hyper-V on ARM64. A new version of the Hyper-V on ARM64 patches will
follow once this clocksource restructuring is accepted.
The code is diff'ed against the upstream tip tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/vdso
Changes in v4:
* Revised commit messages
* Rebased to upstream tip tree
Changes in v3:
* Removed boolean argument to hv_init_clocksource(). Always call
sched_clock_register, which is needed on ARM64 but a no-op on x86.
* Removed separate cpuhp setup in hv_stimer_alloc() and instead
directly call hv_stimer_init() and hv_stimer_cleanup() from
corresponding VMbus functions. This more closely matches original
code and avoids clocksource stop/restart problems on ARM64 when
VMbus code denies CPU offlining request.
Changes in v2:
* Revised commit short descriptions so the distinction between
the first and second patches is clearer [GregKH]
* Renamed new clocksource driver files and functions to use
existing "timer" and "stimer" names instead of introducing
"syntimer". [Vitaly Kuznetsov]
* Introduced CONFIG_HYPER_TIMER to fix build problem when
CONFIG_HYPERV=m [Vitaly Kuznetsov]
* Added "Suggested-by: Marc Zyngier"
Michael Kelley (2):
Drivers: hv: Create Hyper-V clocksource driver from existing
clockevents code
Drivers: hv: Move Hyper-V clocksource code to new clocksource driver
MAINTAINERS | 2 +
arch/x86/entry/vdso/vma.c | 2 +-
arch/x86/hyperv/hv_init.c | 91 +--------
arch/x86/include/asm/hyperv-tlfs.h | 6 +
arch/x86/include/asm/mshyperv.h | 81 ++------
arch/x86/include/asm/vdso/gettimeofday.h | 2 +-
arch/x86/kernel/cpu/mshyperv.c | 2 +
arch/x86/kvm/x86.c | 1 +
drivers/clocksource/Makefile | 1 +
drivers/clocksource/hyperv_timer.c | 321 +++++++++++++++++++++++++++++++
drivers/hv/Kconfig | 3 +
drivers/hv/hv.c | 156 +--------------
drivers/hv/hyperv_vmbus.h | 3 -
drivers/hv/vmbus_drv.c | 42 ++--
include/clocksource/hyperv_timer.h | 105 ++++++++++
15 files changed, 483 insertions(+), 335 deletions(-)
create mode 100644 drivers/clocksource/hyperv_timer.c
create mode 100644 include/clocksource/hyperv_timer.h
--
1.8.3.1
The blockdev book basically contains user-faced documentation.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung(a)kernel.org>
---
.../blockdev/drbd/DRBD-8.3-data-packets.svg | 0
.../blockdev/drbd/DRBD-data-packets.svg | 0
.../blockdev/drbd/conn-states-8.dot | 0
.../blockdev/drbd/data-structure-v9.rst | 0
.../blockdev/drbd/disk-states-8.dot | 0
.../drbd/drbd-connection-state-overview.dot | 0
.../blockdev/drbd/figures.rst | 0
.../{ => admin-guide}/blockdev/drbd/index.rst | 0
.../blockdev/drbd/node-states-8.dot | 1 -
.../{ => admin-guide}/blockdev/floppy.rst | 0
.../{ => admin-guide}/blockdev/index.rst | 2 --
.../{ => admin-guide}/blockdev/nbd.rst | 0
.../{ => admin-guide}/blockdev/paride.rst | 0
.../{ => admin-guide}/blockdev/ramdisk.rst | 0
.../{ => admin-guide}/blockdev/zram.rst | 0
Documentation/admin-guide/index.rst | 1 +
.../admin-guide/kernel-parameters.txt | 18 +++++++++---------
MAINTAINERS | 10 +++++-----
drivers/block/Kconfig | 8 ++++----
drivers/block/floppy.c | 2 +-
drivers/block/zram/Kconfig | 6 +++---
tools/testing/selftests/zram/README | 2 +-
22 files changed, 24 insertions(+), 26 deletions(-)
rename Documentation/{ => admin-guide}/blockdev/drbd/DRBD-8.3-data-packets.svg (100%)
rename Documentation/{ => admin-guide}/blockdev/drbd/DRBD-data-packets.svg (100%)
rename Documentation/{ => admin-guide}/blockdev/drbd/conn-states-8.dot (100%)
rename Documentation/{ => admin-guide}/blockdev/drbd/data-structure-v9.rst (100%)
rename Documentation/{ => admin-guide}/blockdev/drbd/disk-states-8.dot (100%)
rename Documentation/{ => admin-guide}/blockdev/drbd/drbd-connection-state-overview.dot (100%)
rename Documentation/{ => admin-guide}/blockdev/drbd/figures.rst (100%)
rename Documentation/{ => admin-guide}/blockdev/drbd/index.rst (100%)
rename Documentation/{ => admin-guide}/blockdev/drbd/node-states-8.dot (99%)
rename Documentation/{ => admin-guide}/blockdev/floppy.rst (100%)
rename Documentation/{ => admin-guide}/blockdev/index.rst (94%)
rename Documentation/{ => admin-guide}/blockdev/nbd.rst (100%)
rename Documentation/{ => admin-guide}/blockdev/paride.rst (100%)
rename Documentation/{ => admin-guide}/blockdev/ramdisk.rst (100%)
rename Documentation/{ => admin-guide}/blockdev/zram.rst (100%)
diff --git a/Documentation/blockdev/drbd/DRBD-8.3-data-packets.svg b/Documentation/admin-guide/blockdev/drbd/DRBD-8.3-data-packets.svg
similarity index 100%
rename from Documentation/blockdev/drbd/DRBD-8.3-data-packets.svg
rename to Documentation/admin-guide/blockdev/drbd/DRBD-8.3-data-packets.svg
diff --git a/Documentation/blockdev/drbd/DRBD-data-packets.svg b/Documentation/admin-guide/blockdev/drbd/DRBD-data-packets.svg
similarity index 100%
rename from Documentation/blockdev/drbd/DRBD-data-packets.svg
rename to Documentation/admin-guide/blockdev/drbd/DRBD-data-packets.svg
diff --git a/Documentation/blockdev/drbd/conn-states-8.dot b/Documentation/admin-guide/blockdev/drbd/conn-states-8.dot
similarity index 100%
rename from Documentation/blockdev/drbd/conn-states-8.dot
rename to Documentation/admin-guide/blockdev/drbd/conn-states-8.dot
diff --git a/Documentation/blockdev/drbd/data-structure-v9.rst b/Documentation/admin-guide/blockdev/drbd/data-structure-v9.rst
similarity index 100%
rename from Documentation/blockdev/drbd/data-structure-v9.rst
rename to Documentation/admin-guide/blockdev/drbd/data-structure-v9.rst
diff --git a/Documentation/blockdev/drbd/disk-states-8.dot b/Documentation/admin-guide/blockdev/drbd/disk-states-8.dot
similarity index 100%
rename from Documentation/blockdev/drbd/disk-states-8.dot
rename to Documentation/admin-guide/blockdev/drbd/disk-states-8.dot
diff --git a/Documentation/blockdev/drbd/drbd-connection-state-overview.dot b/Documentation/admin-guide/blockdev/drbd/drbd-connection-state-overview.dot
similarity index 100%
rename from Documentation/blockdev/drbd/drbd-connection-state-overview.dot
rename to Documentation/admin-guide/blockdev/drbd/drbd-connection-state-overview.dot
diff --git a/Documentation/blockdev/drbd/figures.rst b/Documentation/admin-guide/blockdev/drbd/figures.rst
similarity index 100%
rename from Documentation/blockdev/drbd/figures.rst
rename to Documentation/admin-guide/blockdev/drbd/figures.rst
diff --git a/Documentation/blockdev/drbd/index.rst b/Documentation/admin-guide/blockdev/drbd/index.rst
similarity index 100%
rename from Documentation/blockdev/drbd/index.rst
rename to Documentation/admin-guide/blockdev/drbd/index.rst
diff --git a/Documentation/blockdev/drbd/node-states-8.dot b/Documentation/admin-guide/blockdev/drbd/node-states-8.dot
similarity index 99%
rename from Documentation/blockdev/drbd/node-states-8.dot
rename to Documentation/admin-guide/blockdev/drbd/node-states-8.dot
index 4a2b00c23547..bfa54e1f8016 100644
--- a/Documentation/blockdev/drbd/node-states-8.dot
+++ b/Documentation/admin-guide/blockdev/drbd/node-states-8.dot
@@ -11,4 +11,3 @@ digraph peer_states {
Unknown -> Primary [ label = "connected" ]
Unknown -> Secondary [ label = "connected" ]
}
-
diff --git a/Documentation/blockdev/floppy.rst b/Documentation/admin-guide/blockdev/floppy.rst
similarity index 100%
rename from Documentation/blockdev/floppy.rst
rename to Documentation/admin-guide/blockdev/floppy.rst
diff --git a/Documentation/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst
similarity index 94%
rename from Documentation/blockdev/index.rst
rename to Documentation/admin-guide/blockdev/index.rst
index a9af6ed8b4aa..20a738d9d047 100644
--- a/Documentation/blockdev/index.rst
+++ b/Documentation/admin-guide/blockdev/index.rst
@@ -1,5 +1,3 @@
-:orphan:
-
===========================
The Linux RapidIO Subsystem
===========================
diff --git a/Documentation/blockdev/nbd.rst b/Documentation/admin-guide/blockdev/nbd.rst
similarity index 100%
rename from Documentation/blockdev/nbd.rst
rename to Documentation/admin-guide/blockdev/nbd.rst
diff --git a/Documentation/blockdev/paride.rst b/Documentation/admin-guide/blockdev/paride.rst
similarity index 100%
rename from Documentation/blockdev/paride.rst
rename to Documentation/admin-guide/blockdev/paride.rst
diff --git a/Documentation/blockdev/ramdisk.rst b/Documentation/admin-guide/blockdev/ramdisk.rst
similarity index 100%
rename from Documentation/blockdev/ramdisk.rst
rename to Documentation/admin-guide/blockdev/ramdisk.rst
diff --git a/Documentation/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
similarity index 100%
rename from Documentation/blockdev/zram.rst
rename to Documentation/admin-guide/blockdev/zram.rst
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 65e821a03aca..c073af461cdf 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -73,6 +73,7 @@ configure specific aspects of kernel behavior to your liking.
java
ras
bcache
+ blockdev/index
ext4
pm/index
thunderbolt
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9b535c0e22f3..49ad034c4675 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1249,7 +1249,7 @@
See also Documentation/fault-injection/.
floppy= [HW]
- See Documentation/blockdev/floppy.rst.
+ See Documentation/admin-guide/blockdev/floppy.rst.
force_pal_cache_flush
[IA-64] Avoid check_sal_cache_flush which may hang on
@@ -2247,7 +2247,7 @@
memblock=debug [KNL] Enable memblock debug messages.
load_ramdisk= [RAM] List of ramdisks to load from floppy
- See Documentation/blockdev/ramdisk.rst.
+ See Documentation/admin-guide/blockdev/ramdisk.rst.
lockd.nlm_grace_period=P [NFS] Assign grace period.
Format: <integer>
@@ -3294,7 +3294,7 @@
pcd. [PARIDE]
See header of drivers/block/paride/pcd.c.
- See also Documentation/blockdev/paride.rst.
+ See also Documentation/admin-guide/blockdev/paride.rst.
pci=option[,option...] [PCI] various PCI subsystem options.
@@ -3538,7 +3538,7 @@
needed on a platform with proper driver support.
pd. [PARIDE]
- See Documentation/blockdev/paride.rst.
+ See Documentation/admin-guide/blockdev/paride.rst.
pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at
boot time.
@@ -3553,10 +3553,10 @@
and performance comparison.
pf. [PARIDE]
- See Documentation/blockdev/paride.rst.
+ See Documentation/admin-guide/blockdev/paride.rst.
pg. [PARIDE]
- See Documentation/blockdev/paride.rst.
+ See Documentation/admin-guide/blockdev/paride.rst.
pirq= [SMP,APIC] Manual mp-table setup
See Documentation/x86/i386/IO-APIC.rst.
@@ -3668,7 +3668,7 @@
prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk
before loading.
- See Documentation/blockdev/ramdisk.rst.
+ See Documentation/admin-guide/blockdev/ramdisk.rst.
psi= [KNL] Enable or disable pressure stall information
tracking.
@@ -3690,7 +3690,7 @@
pstore.backend= Specify the name of the pstore backend to use
pt. [PARIDE]
- See Documentation/blockdev/paride.rst.
+ See Documentation/admin-guide/blockdev/paride.rst.
pti= [X86_64] Control Page Table Isolation of user and
kernel address spaces. Disabling this feature
@@ -3719,7 +3719,7 @@
See Documentation/admin-guide/md.rst.
ramdisk_size= [RAM] Sizes of RAM disks in kilobytes
- See Documentation/blockdev/ramdisk.rst.
+ See Documentation/admin-guide/blockdev/ramdisk.rst.
random.trust_cpu={on,off}
[KNL] Enable or disable trusting the use of the
diff --git a/MAINTAINERS b/MAINTAINERS
index 4c622a19ab7d..3f0f654d1166 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4974,7 +4974,7 @@ T: git git://git.linbit.com/drbd-8.4.git
S: Supported
F: drivers/block/drbd/
F: lib/lru_cache.c
-F: Documentation/blockdev/drbd/
+F: Documentation/admin-guide/blockdev/
DRIVER CORE, KOBJECTS, DEBUGFS AND SYSFS
M: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
@@ -11024,7 +11024,7 @@ M: Josef Bacik <josef(a)toxicpanda.com>
S: Maintained
L: linux-block(a)vger.kernel.org
L: nbd(a)other.debian.org
-F: Documentation/blockdev/nbd.rst
+F: Documentation/admin-guide/blockdev/nbd.rst
F: drivers/block/nbd.c
F: include/trace/events/nbd.h
F: include/uapi/linux/nbd.h
@@ -12028,7 +12028,7 @@ PARIDE DRIVERS FOR PARALLEL PORT IDE DEVICES
M: Tim Waugh <tim(a)cyberelk.net>
L: linux-parport(a)lists.infradead.org (subscribers-only)
S: Maintained
-F: Documentation/blockdev/paride.rst
+F: Documentation/admin-guide/blockdev/paride.rst
F: drivers/block/paride/
PARISC ARCHITECTURE
@@ -13310,7 +13310,7 @@ F: drivers/net/wireless/ralink/rt2x00/
RAMDISK RAM BLOCK DEVICE DRIVER
M: Jens Axboe <axboe(a)kernel.dk>
S: Maintained
-F: Documentation/blockdev/ramdisk.rst
+F: Documentation/admin-guide/blockdev/ramdisk.rst
F: drivers/block/brd.c
RANCHU VIRTUAL BOARD FOR MIPS
@@ -17672,7 +17672,7 @@ R: Sergey Senozhatsky <sergey.senozhatsky.work(a)gmail.com>
L: linux-kernel(a)vger.kernel.org
S: Maintained
F: drivers/block/zram/
-F: Documentation/blockdev/zram.rst
+F: Documentation/admin-guide/blockdev/zram.rst
ZS DECSTATION Z85C30 SERIAL DRIVER
M: "Maciej W. Rozycki" <macro(a)linux-mips.org>
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index c43690b973d8..1bb8ec575352 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -31,7 +31,7 @@ config BLK_DEV_FD
If you want to use the floppy disk drive(s) of your PC under Linux,
say Y. Information about this driver, especially important for IBM
Thinkpad users, is contained in
- <file:Documentation/blockdev/floppy.rst>.
+ <file:Documentation/admin-guide/blockdev/floppy.rst>.
That file also contains the location of the Floppy driver FAQ as
well as location of the fdutils package used to configure additional
parameters of the driver at run time.
@@ -96,7 +96,7 @@ config PARIDE
your computer's parallel port. Most of them are actually IDE devices
using a parallel port IDE adapter. This option enables the PARIDE
subsystem which contains drivers for many of these external drives.
- Read <file:Documentation/blockdev/paride.rst> for more information.
+ Read <file:Documentation/admin-guide/blockdev/paride.rst> for more information.
If you have said Y to the "Parallel-port support" configuration
option, you may share a single port between your printer and other
@@ -261,7 +261,7 @@ config BLK_DEV_NBD
userland (making server and client physically the same computer,
communicating using the loopback network device).
- Read <file:Documentation/blockdev/nbd.rst> for more information,
+ Read <file:Documentation/admin-guide/blockdev/nbd.rst> for more information,
especially about where to find the server code, which runs in user
space and does not need special kernel support.
@@ -303,7 +303,7 @@ config BLK_DEV_RAM
during the initial install of Linux.
Note that the kernel command line option "ramdisk=XX" is now obsolete.
- For details, read <file:Documentation/blockdev/ramdisk.rst>.
+ For details, read <file:Documentation/admin-guide/blockdev/ramdisk.rst>.
To compile this driver as a module, choose M here: the
module will be called brd. An alias "rd" has been defined
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 5c99e52f9dc1..f652c1ac3ae9 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4424,7 +4424,7 @@ static int __init floppy_setup(char *str)
pr_cont("\n");
} else
DPRINT("botched floppy option\n");
- DPRINT("Read Documentation/blockdev/floppy.rst\n");
+ DPRINT("Read Documentation/admin-guide/blockdev/floppy.rst\n");
return 0;
}
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index e06b99d54816..fe7a4b7d30cf 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -12,7 +12,7 @@ config ZRAM
It has several use cases, for example: /tmp storage, use as swap
disks and maybe many more.
- See Documentation/blockdev/zram.rst for more information.
+ See Documentation/admin-guide/blockdev/zram.rst for more information.
config ZRAM_WRITEBACK
bool "Write back incompressible or idle page to backing device"
@@ -26,7 +26,7 @@ config ZRAM_WRITEBACK
With /sys/block/zramX/{idle,writeback}, application could ask
idle page's writeback to the backing device to save in memory.
- See Documentation/blockdev/zram.rst for more information.
+ See Documentation/admin-guide/blockdev/zram.rst for more information.
config ZRAM_MEMORY_TRACKING
bool "Track zRam block status"
@@ -36,4 +36,4 @@ config ZRAM_MEMORY_TRACKING
of zRAM. Admin could see the information via
/sys/kernel/debug/zram/zramX/block_state.
- See Documentation/blockdev/zram.rst for more information.
+ See Documentation/admin-guide/blockdev/zram.rst for more information.
diff --git a/tools/testing/selftests/zram/README b/tools/testing/selftests/zram/README
index 5fa378391d3b..110b34834a6f 100644
--- a/tools/testing/selftests/zram/README
+++ b/tools/testing/selftests/zram/README
@@ -37,4 +37,4 @@ Commands required for testing:
- mkfs/ mkfs.ext4
For more information please refer:
-kernel-source-tree/Documentation/blockdev/zram.rst
+kernel-source-tree/Documentation/admin-guide/blockdev/zram.rst
--
2.21.0
Rename the blockdev documentation files to ReST, add an
index for them and adjust in order to produce a nice html
output via the Sphinx build system.
The drbd sub-directory contains some graphs and data flows.
Add those too to the documentation.
At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung(a)kernel.org>
---
.../admin-guide/kernel-parameters.txt | 18 +-
...structure-v9.txt => data-structure-v9.rst} | 6 +-
Documentation/blockdev/drbd/figures.rst | 28 +++
.../blockdev/drbd/{README.txt => index.rst} | 15 +-
.../blockdev/{floppy.txt => floppy.rst} | 88 ++++----
Documentation/blockdev/index.rst | 16 ++
Documentation/blockdev/{nbd.txt => nbd.rst} | 2 +-
.../blockdev/{paride.txt => paride.rst} | 194 +++++++++--------
.../blockdev/{ramdisk.txt => ramdisk.rst} | 55 ++---
Documentation/blockdev/{zram.txt => zram.rst} | 195 ++++++++++++------
MAINTAINERS | 8 +-
drivers/block/Kconfig | 8 +-
drivers/block/floppy.c | 2 +-
drivers/block/zram/Kconfig | 6 +-
tools/testing/selftests/zram/README | 2 +-
15 files changed, 398 insertions(+), 245 deletions(-)
rename Documentation/blockdev/drbd/{data-structure-v9.txt => data-structure-v9.rst} (94%)
create mode 100644 Documentation/blockdev/drbd/figures.rst
rename Documentation/blockdev/drbd/{README.txt => index.rst} (55%)
rename Documentation/blockdev/{floppy.txt => floppy.rst} (81%)
create mode 100644 Documentation/blockdev/index.rst
rename Documentation/blockdev/{nbd.txt => nbd.rst} (96%)
rename Documentation/blockdev/{paride.txt => paride.rst} (81%)
rename Documentation/blockdev/{ramdisk.txt => ramdisk.rst} (84%)
rename Documentation/blockdev/{zram.txt => zram.rst} (76%)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 92d335837c52..7ed94527a4a0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1249,7 +1249,7 @@
See also Documentation/fault-injection/.
floppy= [HW]
- See Documentation/blockdev/floppy.txt.
+ See Documentation/blockdev/floppy.rst.
force_pal_cache_flush
[IA-64] Avoid check_sal_cache_flush which may hang on
@@ -2247,7 +2247,7 @@
memblock=debug [KNL] Enable memblock debug messages.
load_ramdisk= [RAM] List of ramdisks to load from floppy
- See Documentation/blockdev/ramdisk.txt.
+ See Documentation/blockdev/ramdisk.rst.
lockd.nlm_grace_period=P [NFS] Assign grace period.
Format: <integer>
@@ -3294,7 +3294,7 @@
pcd. [PARIDE]
See header of drivers/block/paride/pcd.c.
- See also Documentation/blockdev/paride.txt.
+ See also Documentation/blockdev/paride.rst.
pci=option[,option...] [PCI] various PCI subsystem options.
@@ -3538,7 +3538,7 @@
needed on a platform with proper driver support.
pd. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/blockdev/paride.rst.
pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at
boot time.
@@ -3553,10 +3553,10 @@
and performance comparison.
pf. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/blockdev/paride.rst.
pg. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/blockdev/paride.rst.
pirq= [SMP,APIC] Manual mp-table setup
See Documentation/x86/i386/IO-APIC.rst.
@@ -3668,7 +3668,7 @@
prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk
before loading.
- See Documentation/blockdev/ramdisk.txt.
+ See Documentation/blockdev/ramdisk.rst.
psi= [KNL] Enable or disable pressure stall information
tracking.
@@ -3690,7 +3690,7 @@
pstore.backend= Specify the name of the pstore backend to use
pt. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/blockdev/paride.rst.
pti= [X86_64] Control Page Table Isolation of user and
kernel address spaces. Disabling this feature
@@ -3719,7 +3719,7 @@
See Documentation/admin-guide/md.rst.
ramdisk_size= [RAM] Sizes of RAM disks in kilobytes
- See Documentation/blockdev/ramdisk.txt.
+ See Documentation/blockdev/ramdisk.rst.
random.trust_cpu={on,off}
[KNL] Enable or disable trusting the use of the
diff --git a/Documentation/blockdev/drbd/data-structure-v9.txt b/Documentation/blockdev/drbd/data-structure-v9.rst
similarity index 94%
rename from Documentation/blockdev/drbd/data-structure-v9.txt
rename to Documentation/blockdev/drbd/data-structure-v9.rst
index 1e52a0e32624..66036b901644 100644
--- a/Documentation/blockdev/drbd/data-structure-v9.txt
+++ b/Documentation/blockdev/drbd/data-structure-v9.rst
@@ -1,3 +1,7 @@
+================================
+kernel data structure for DRBD-9
+================================
+
This describes the in kernel data structure for DRBD-9. Starting with
Linux v3.14 we are reorganizing DRBD to use this data structure.
@@ -10,7 +14,7 @@ device is represented by a block device locally.
The DRBD objects are interconnected to form a matrix as depicted below; a
drbd_peer_device object sits at each intersection between a drbd_device and a
-drbd_connection:
+drbd_connection::
/--------------+---------------+.....+---------------\
| resource | device | | device |
diff --git a/Documentation/blockdev/drbd/figures.rst b/Documentation/blockdev/drbd/figures.rst
new file mode 100644
index 000000000000..3e3fd4b8a478
--- /dev/null
+++ b/Documentation/blockdev/drbd/figures.rst
@@ -0,0 +1,28 @@
+.. The here included files are intended to help understand the implementation
+
+Data flows that Relate some functions, and write packets
+========================================================
+
+.. kernel-figure:: DRBD-8.3-data-packets.svg
+ :alt: DRBD-8.3-data-packets.svg
+ :align: center
+
+.. kernel-figure:: DRBD-data-packets.svg
+ :alt: DRBD-data-packets.svg
+ :align: center
+
+
+Sub graphs of DRBD's state transitions
+======================================
+
+.. kernel-figure:: conn-states-8.dot
+ :alt: conn-states-8.dot
+ :align: center
+
+.. kernel-figure:: disk-states-8.dot
+ :alt: disk-states-8.dot
+ :align: center
+
+.. kernel-figure:: node-states-8.dot
+ :alt: node-states-8.dot
+ :align: center
diff --git a/Documentation/blockdev/drbd/README.txt b/Documentation/blockdev/drbd/index.rst
similarity index 55%
rename from Documentation/blockdev/drbd/README.txt
rename to Documentation/blockdev/drbd/index.rst
index 627b0a1bf35e..68ecd5c113e9 100644
--- a/Documentation/blockdev/drbd/README.txt
+++ b/Documentation/blockdev/drbd/index.rst
@@ -1,4 +1,9 @@
+==========================================
+Distributed Replicated Block Device - DRBD
+==========================================
+
Description
+===========
DRBD is a shared-nothing, synchronously replicated block device. It
is designed to serve as a building block for high availability
@@ -7,10 +12,8 @@ Description
Please visit http://www.drbd.org to find out more.
-The here included files are intended to help understand the implementation
+.. toctree::
+ :maxdepth: 1
-DRBD-8.3-data-packets.svg, DRBD-data-packets.svg
- relates some functions, and write packets.
-
-conn-states-8.dot, disk-states-8.dot, node-states-8.dot
- The sub graphs of DRBD's state transitions
+ data-structure-v9
+ figures
diff --git a/Documentation/blockdev/floppy.txt b/Documentation/blockdev/floppy.rst
similarity index 81%
rename from Documentation/blockdev/floppy.txt
rename to Documentation/blockdev/floppy.rst
index e2240f5ab64d..4a8f31cf4139 100644
--- a/Documentation/blockdev/floppy.txt
+++ b/Documentation/blockdev/floppy.rst
@@ -1,35 +1,37 @@
-This file describes the floppy driver.
+=============
+Floppy Driver
+=============
FAQ list:
=========
- A FAQ list may be found in the fdutils package (see below), and also
+A FAQ list may be found in the fdutils package (see below), and also
at <http://fdutils.linux.lu/faq.html>.
LILO configuration options (Thinkpad users, read this)
======================================================
- The floppy driver is configured using the 'floppy=' option in
+The floppy driver is configured using the 'floppy=' option in
lilo. This option can be typed at the boot prompt, or entered in the
lilo configuration file.
- Example: If your kernel is called linux-2.6.9, type the following line
-at the lilo boot prompt (if you have a thinkpad):
+Example: If your kernel is called linux-2.6.9, type the following line
+at the lilo boot prompt (if you have a thinkpad)::
linux-2.6.9 floppy=thinkpad
You may also enter the following line in /etc/lilo.conf, in the description
-of linux-2.6.9:
+of linux-2.6.9::
append = "floppy=thinkpad"
- Several floppy related options may be given, example:
+Several floppy related options may be given, example::
linux-2.6.9 floppy=daring floppy=two_fdc
append = "floppy=daring floppy=two_fdc"
- If you give options both in the lilo config file and on the boot
+If you give options both in the lilo config file and on the boot
prompt, the option strings of both places are concatenated, the boot
prompt options coming last. That's why there are also options to
restore the default behavior.
@@ -38,21 +40,23 @@ restore the default behavior.
Module configuration options
============================
- If you use the floppy driver as a module, use the following syntax:
-modprobe floppy floppy="<options>"
+If you use the floppy driver as a module, use the following syntax::
-Example:
- modprobe floppy floppy="omnibook messages"
+ modprobe floppy floppy="<options>"
- If you need certain options enabled every time you load the floppy driver,
-you can put:
+Example::
- options floppy floppy="omnibook messages"
+ modprobe floppy floppy="omnibook messages"
+
+If you need certain options enabled every time you load the floppy driver,
+you can put::
+
+ options floppy floppy="omnibook messages"
in a configuration file in /etc/modprobe.d/.
- The floppy driver related options are:
+The floppy driver related options are:
floppy=asus_pci
Sets the bit mask to allow only units 0 and 1. (default)
@@ -70,8 +74,7 @@ in a configuration file in /etc/modprobe.d/.
Tells the floppy driver that you have only one floppy controller.
(default)
- floppy=two_fdc
- floppy=<address>,two_fdc
+ floppy=two_fdc / floppy=<address>,two_fdc
Tells the floppy driver that you have two floppy controllers.
The second floppy controller is assumed to be at <address>.
This option is not needed if the second controller is at address
@@ -84,8 +87,7 @@ in a configuration file in /etc/modprobe.d/.
floppy=0,thinkpad
Tells the floppy driver that you don't have a Thinkpad.
- floppy=omnibook
- floppy=nodma
+ floppy=omnibook / floppy=nodma
Tells the floppy driver not to use Dma for data transfers.
This is needed on HP Omnibooks, which don't have a workable
DMA channel for the floppy driver. This option is also useful
@@ -144,14 +146,16 @@ in a configuration file in /etc/modprobe.d/.
described in the physical CMOS), or if your BIOS uses
non-standard CMOS types. The CMOS types are:
- 0 - Use the value of the physical CMOS
- 1 - 5 1/4 DD
- 2 - 5 1/4 HD
- 3 - 3 1/2 DD
- 4 - 3 1/2 HD
- 5 - 3 1/2 ED
- 6 - 3 1/2 ED
- 16 - unknown or not installed
+ == ==================================
+ 0 Use the value of the physical CMOS
+ 1 5 1/4 DD
+ 2 5 1/4 HD
+ 3 3 1/2 DD
+ 4 3 1/2 HD
+ 5 3 1/2 ED
+ 6 3 1/2 ED
+ 16 unknown or not installed
+ == ==================================
(Note: there are two valid types for ED drives. This is because 5 was
initially chosen to represent floppy *tapes*, and 6 for ED drives.
@@ -162,8 +166,7 @@ in a configuration file in /etc/modprobe.d/.
Print a warning message when an unexpected interrupt is received.
(default)
- floppy=no_unexpected_interrupts
- floppy=L40SX
+ floppy=no_unexpected_interrupts / floppy=L40SX
Don't print a message when an unexpected interrupt is received. This
is needed on IBM L40SX laptops in certain video modes. (There seems
to be an interaction between video and floppy. The unexpected
@@ -199,47 +202,54 @@ in a configuration file in /etc/modprobe.d/.
Sets the floppy DMA channel to <nr> instead of 2.
floppy=slow
- Use PS/2 stepping rate:
- " PS/2 floppies have much slower step rates than regular floppies.
+ Use PS/2 stepping rate::
+
+ PS/2 floppies have much slower step rates than regular floppies.
It's been recommended that take about 1/4 of the default speed
- in some more extreme cases."
+ in some more extreme cases.
Supporting utilities and additional documentation:
==================================================
- Additional parameters of the floppy driver can be configured at
+Additional parameters of the floppy driver can be configured at
runtime. Utilities which do this can be found in the fdutils package.
This package also contains a new version of mtools which allows to
access high capacity disks (up to 1992K on a high density 3 1/2 disk!).
It also contains additional documentation about the floppy driver.
The latest version can be found at fdutils homepage:
+
http://fdutils.linux.lu
The fdutils releases can be found at:
+
http://fdutils.linux.lu/download.html
+
http://www.tux.org/pub/knaff/fdutils/
+
ftp://metalab.unc.edu/pub/Linux/utils/disk-management/
Reporting problems about the floppy driver
==========================================
- If you have a question or a bug report about the floppy driver, mail
+If you have a question or a bug report about the floppy driver, mail
me at Alain.Knaff(a)poboxes.com . If you post to Usenet, preferably use
comp.os.linux.hardware. As the volume in these groups is rather high,
be sure to include the word "floppy" (or "FLOPPY") in the subject
line. If the reported problem happens when mounting floppy disks, be
sure to mention also the type of the filesystem in the subject line.
- Be sure to read the FAQ before mailing/posting any bug reports!
+Be sure to read the FAQ before mailing/posting any bug reports!
- Alain
+Alain
Changelog
=========
-10-30-2004 : Cleanup, updating, add reference to module configuration.
+10-30-2004 :
+ Cleanup, updating, add reference to module configuration.
James Nelson <james4765(a)gmail.com>
-6-3-2000 : Original Document
+6-3-2000 :
+ Original Document
diff --git a/Documentation/blockdev/index.rst b/Documentation/blockdev/index.rst
new file mode 100644
index 000000000000..a9af6ed8b4aa
--- /dev/null
+++ b/Documentation/blockdev/index.rst
@@ -0,0 +1,16 @@
+:orphan:
+
+===========================
+The Linux RapidIO Subsystem
+===========================
+
+.. toctree::
+ :maxdepth: 1
+
+ floppy
+ nbd
+ paride
+ ramdisk
+ zram
+
+ drbd/index
diff --git a/Documentation/blockdev/nbd.txt b/Documentation/blockdev/nbd.rst
similarity index 96%
rename from Documentation/blockdev/nbd.txt
rename to Documentation/blockdev/nbd.rst
index db242ea2bce8..d78dfe559dcf 100644
--- a/Documentation/blockdev/nbd.txt
+++ b/Documentation/blockdev/nbd.rst
@@ -1,3 +1,4 @@
+==================================
Network Block Device (TCP version)
==================================
@@ -28,4 +29,3 @@ max_part
nbds_max
Number of block devices that should be initialized (default: 16).
-
diff --git a/Documentation/blockdev/paride.txt b/Documentation/blockdev/paride.rst
similarity index 81%
rename from Documentation/blockdev/paride.txt
rename to Documentation/blockdev/paride.rst
index ee6717e3771d..87b4278bf314 100644
--- a/Documentation/blockdev/paride.txt
+++ b/Documentation/blockdev/paride.rst
@@ -1,15 +1,17 @@
-
- Linux and parallel port IDE devices
+===================================
+Linux and parallel port IDE devices
+===================================
PARIDE v1.03 (c) 1997-8 Grant Guenther <grant(a)torque.net>
1. Introduction
+===============
Owing to the simplicity and near universality of the parallel port interface
to personal computers, many external devices such as portable hard-disk,
CD-ROM, LS-120 and tape drives use the parallel port to connect to their
host computer. While some devices (notably scanners) use ad-hoc methods
-to pass commands and data through the parallel port interface, most
+to pass commands and data through the parallel port interface, most
external devices are actually identical to an internal model, but with
a parallel-port adapter chip added in. Some of the original parallel port
adapters were little more than mechanisms for multiplexing a SCSI bus.
@@ -28,47 +30,50 @@ were to open up a parallel port CD-ROM drive, for instance, one would
find a standard ATAPI CD-ROM drive, a power supply, and a single adapter
that interconnected a standard PC parallel port cable and a standard
IDE cable. It is usually possible to exchange the CD-ROM device with
-any other device using the IDE interface.
+any other device using the IDE interface.
The document describes the support in Linux for parallel port IDE
devices. It does not cover parallel port SCSI devices, "ditto" tape
-drives or scanners. Many different devices are supported by the
+drives or scanners. Many different devices are supported by the
parallel port IDE subsystem, including:
- MicroSolutions backpack CD-ROM
- MicroSolutions backpack PD/CD
- MicroSolutions backpack hard-drives
- MicroSolutions backpack 8000t tape drive
- SyQuest EZ-135, EZ-230 & SparQ drives
- Avatar Shark
- Imation Superdisk LS-120
- Maxell Superdisk LS-120
- FreeCom Power CD
- Hewlett-Packard 5GB and 8GB tape drives
- Hewlett-Packard 7100 and 7200 CD-RW drives
+ - MicroSolutions backpack CD-ROM
+ - MicroSolutions backpack PD/CD
+ - MicroSolutions backpack hard-drives
+ - MicroSolutions backpack 8000t tape drive
+ - SyQuest EZ-135, EZ-230 & SparQ drives
+ - Avatar Shark
+ - Imation Superdisk LS-120
+ - Maxell Superdisk LS-120
+ - FreeCom Power CD
+ - Hewlett-Packard 5GB and 8GB tape drives
+ - Hewlett-Packard 7100 and 7200 CD-RW drives
as well as most of the clone and no-name products on the market.
To support such a wide range of devices, PARIDE, the parallel port IDE
subsystem, is actually structured in three parts. There is a base
paride module which provides a registry and some common methods for
-accessing the parallel ports. The second component is a set of
-high-level drivers for each of the different types of supported devices:
+accessing the parallel ports. The second component is a set of
+high-level drivers for each of the different types of supported devices:
+ === =============
pd IDE disk
pcd ATAPI CD-ROM
pf ATAPI disk
pt ATAPI tape
pg ATAPI generic
+ === =============
(Currently, the pg driver is only used with CD-R drives).
The high-level drivers function according to the relevant standards.
The third component of PARIDE is a set of low-level protocol drivers
for each of the parallel port IDE adapter chips. Thanks to the interest
-and encouragement of Linux users from many parts of the world,
+and encouragement of Linux users from many parts of the world,
support is available for almost all known adapter protocols:
+ ==== ====================================== ====
aten ATEN EH-100 (HK)
bpck Microsolutions backpack (US)
comm DataStor (old-type) "commuter" adapter (TW)
@@ -83,9 +88,11 @@ support is available for almost all known adapter protocols:
ktti KT Technology PHd adapter (SG)
on20 OnSpec 90c20 (US)
on26 OnSpec 90c26 (US)
+ ==== ====================================== ====
2. Using the PARIDE subsystem
+=============================
While configuring the Linux kernel, you may choose either to build
the PARIDE drivers into your kernel, or to build them as modules.
@@ -94,10 +101,10 @@ In either case, you will need to select "Parallel port IDE device support"
as well as at least one of the high-level drivers and at least one
of the parallel port communication protocols. If you do not know
what kind of parallel port adapter is used in your drive, you could
-begin by checking the file names and any text files on your DOS
+begin by checking the file names and any text files on your DOS
installation floppy. Alternatively, you can look at the markings on
the adapter chip itself. That's usually sufficient to identify the
-correct device.
+correct device.
You can actually select all the protocol modules, and allow the PARIDE
subsystem to try them all for you.
@@ -105,8 +112,9 @@ subsystem to try them all for you.
For the "brand-name" products listed above, here are the protocol
and high-level drivers that you would use:
+ ================ ============ ====== ========
Manufacturer Model Driver Protocol
-
+ ================ ============ ====== ========
MicroSolutions CD-ROM pcd bpck
MicroSolutions PD drive pf bpck
MicroSolutions hard-drive pd bpck
@@ -119,8 +127,10 @@ and high-level drivers that you would use:
Hewlett-Packard 5GB Tape pt epat
Hewlett-Packard 7200e (CD) pcd epat
Hewlett-Packard 7200e (CD-R) pg epat
+ ================ ============ ====== ========
2.1 Configuring built-in drivers
+---------------------------------
We recommend that you get to know how the drivers work and how to
configure them as loadable modules, before attempting to compile a
@@ -143,7 +153,7 @@ protocol identification number and, for some devices, the drive's
chain ID. While your system is booting, a number of messages are
displayed on the console. Like all such messages, they can be
reviewed with the 'dmesg' command. Among those messages will be
-some lines like:
+some lines like::
paride: bpck registered as protocol 0
paride: epat registered as protocol 1
@@ -158,10 +168,10 @@ the last two digits of the drive's serial number (but read MicroSolutions'
documentation about this).
As an example, let's assume that you have a MicroSolutions PD/CD drive
-with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
-EZ-135 connected to the chained port on the PD/CD drive and also an
-Imation Superdisk connected to port 0x278. You could give the following
-options on your boot command:
+with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
+EZ-135 connected to the chained port on the PD/CD drive and also an
+Imation Superdisk connected to port 0x278. You could give the following
+options on your boot command::
pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36
@@ -169,24 +179,27 @@ In the last option, pf.drive1 configures device /dev/pf1, the 0x378
is the parallel port base address, the 0 is the protocol registration
number and 36 is the chain ID.
-Please note: while PARIDE will work both with and without the
+Please note: while PARIDE will work both with and without the
PARPORT parallel port sharing system that is included by the
"Parallel port support" option, PARPORT must be included and enabled
if you want to use chains of devices on the same parallel port.
2.2 Loading and configuring PARIDE as modules
+----------------------------------------------
It is much faster and simpler to get to understand the PARIDE drivers
-if you use them as loadable kernel modules.
+if you use them as loadable kernel modules.
-Note 1: using these drivers with the "kerneld" automatic module loading
-system is not recommended for beginners, and is not documented here.
+Note 1:
+ using these drivers with the "kerneld" automatic module loading
+ system is not recommended for beginners, and is not documented here.
-Note 2: if you build PARPORT support as a loadable module, PARIDE must
-also be built as loadable modules, and PARPORT must be loaded before the
-PARIDE modules.
+Note 2:
+ if you build PARPORT support as a loadable module, PARIDE must
+ also be built as loadable modules, and PARPORT must be loaded before
+ the PARIDE modules.
-To use PARIDE, you must begin by
+To use PARIDE, you must begin by::
insmod paride
@@ -195,8 +208,8 @@ among other tasks.
Then, load as many of the protocol modules as you think you might need.
As you load each module, it will register the protocols that it supports,
-and print a log message to your kernel log file and your console. For
-example:
+and print a log message to your kernel log file and your console. For
+example::
# insmod epat
paride: epat registered as protocol 0
@@ -205,22 +218,22 @@ example:
paride: k971 registered as protocol 2
Finally, you can load high-level drivers for each kind of device that
-you have connected. By default, each driver will autoprobe for a single
+you have connected. By default, each driver will autoprobe for a single
device, but you can support up to four similar devices by giving their
individual co-ordinates when you load the driver.
For example, if you had two no-name CD-ROM drives both using the
KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc
-you could give the following command:
+you could give the following command::
# insmod pcd drive0=0x378,1 drive1=0x3bc,1
For most adapters, giving a port address and protocol number is sufficient,
-but check the source files in linux/drivers/block/paride for more
+but check the source files in linux/drivers/block/paride for more
information. (Hopefully someone will write some man pages one day !).
As another example, here's what happens when PARPORT is installed, and
-a SyQuest EZ-135 is attached to port 0x378:
+a SyQuest EZ-135 is attached to port 0x378::
# insmod paride
paride: version 1.0 installed
@@ -237,46 +250,47 @@ Note that the last line is the output from the generic partition table
scanner - in this case it reports that it has found a disk with one partition.
2.3 Using a PARIDE device
+--------------------------
Once the drivers have been loaded, you can access PARIDE devices in the
same way as their traditional counterparts. You will probably need to
create the device "special files". Here is a simple script that you can
-cut to a file and execute:
+cut to a file and execute::
-#!/bin/bash
-#
-# mkd -- a script to create the device special files for the PARIDE subsystem
-#
-function mkdev {
- mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
-}
-#
-function pd {
- D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) )
- mkdev pd$D b 45 $[ $1 * 16 ]
- for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
- do mkdev pd$D$P b 45 $[ $1 * 16 + $P ]
- done
-}
-#
-cd /dev
-#
-for u in 0 1 2 3 ; do pd $u ; done
-for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
-for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
-for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
-for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done
-for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
-#
-# end of mkd
+ #!/bin/bash
+ #
+ # mkd -- a script to create the device special files for the PARIDE subsystem
+ #
+ function mkdev {
+ mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
+ }
+ #
+ function pd {
+ D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) )
+ mkdev pd$D b 45 $[ $1 * 16 ]
+ for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
+ do mkdev pd$D$P b 45 $[ $1 * 16 + $P ]
+ done
+ }
+ #
+ cd /dev
+ #
+ for u in 0 1 2 3 ; do pd $u ; done
+ for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
+ for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
+ for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
+ for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done
+ for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
+ #
+ # end of mkd
With the device files and drivers in place, you can access PARIDE devices
-like any other Linux device. For example, to mount a CD-ROM in pcd0, use:
+like any other Linux device. For example, to mount a CD-ROM in pcd0, use::
mount /dev/pcd0 /cdrom
If you have a fresh Avatar Shark cartridge, and the drive is pda, you
-might do something like:
+might do something like::
fdisk /dev/pda -- make a new partition table with
partition 1 of type 83
@@ -289,41 +303,46 @@ might do something like:
Devices like the Imation superdisk work in the same way, except that
they do not have a partition table. For example to make a 120MB
-floppy that you could share with a DOS system:
+floppy that you could share with a DOS system::
mkdosfs /dev/pf0
mount /dev/pf0 /mnt
2.4 The pf driver
+------------------
The pf driver is intended for use with parallel port ATAPI disk
devices. The most common devices in this category are PD drives
and LS-120 drives. Traditionally, media for these devices are not
partitioned. Consequently, the pf driver does not support partitioned
-media. This may be changed in a future version of the driver.
+media. This may be changed in a future version of the driver.
2.5 Using the pt driver
+------------------------
The pt driver for parallel port ATAPI tape drives is a minimal driver.
-It does not yet support many of the standard tape ioctl operations.
+It does not yet support many of the standard tape ioctl operations.
For best performance, a block size of 32KB should be used. You will
probably want to set the parallel port delay to 0, if you can.
2.6 Using the pg driver
+------------------------
The pg driver can be used in conjunction with the cdrecord program
to create CD-ROMs. Please get cdrecord version 1.6.1 or later
-from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
-your parallel port should ideally be set to EPP mode, and the "port delay"
-should be set to 0. With those settings it is possible to record at 2x
+from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
+your parallel port should ideally be set to EPP mode, and the "port delay"
+should be set to 0. With those settings it is possible to record at 2x
speed without any buffer underruns. If you cannot get the driver to work
in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.
3. Troubleshooting
+==================
3.1 Use EPP mode if you can
+----------------------------
The most common problems that people report with the PARIDE drivers
concern the parallel port CMOS settings. At this time, none of the
@@ -332,6 +351,7 @@ If you are able to do so, please set your parallel port into EPP mode
using your CMOS setup procedure.
3.2 Check the port delay
+-------------------------
Some parallel ports cannot reliably transfer data at full speed. To
offset the errors, the PARIDE protocol modules introduce a "port
@@ -347,23 +367,25 @@ read the comments at the beginning of the driver source files in
linux/drivers/block/paride.
3.3 Some drives need a printer reset
+-------------------------------------
There appear to be a number of "noname" external drives on the market
that do not always power up correctly. We have noticed this with some
drives based on OnSpec and older Freecom adapters. In these rare cases,
the adapter can often be reinitialised by issuing a "printer reset" on
-the parallel port. As the reset operation is potentially disruptive in
-multiple device environments, the PARIDE drivers will not do it
-automatically. You can however, force a printer reset by doing:
+the parallel port. As the reset operation is potentially disruptive in
+multiple device environments, the PARIDE drivers will not do it
+automatically. You can however, force a printer reset by doing::
insmod lp reset=1
rmmod lp
If you have one of these marginal cases, you should probably build
your paride drivers as modules, and arrange to do the printer reset
-before loading the PARIDE drivers.
+before loading the PARIDE drivers.
3.4 Use the verbose option and dmesg if you need help
+------------------------------------------------------
While a lot of testing has gone into these drivers to make them work
as smoothly as possible, problems will arise. If you do have problems,
@@ -373,7 +395,7 @@ clues, then please make sure that only one drive is hooked to your system,
and that either (a) PARPORT is enabled or (b) no other device driver
is using your parallel port (check in /proc/ioports). Then, load the
appropriate drivers (you can load several protocol modules if you want)
-as in:
+as in::
# insmod paride
# insmod epat
@@ -394,12 +416,14 @@ by e-mail to grant(a)torque.net, or join the linux-parport mailing list
and post your report there.
3.5 For more information or help
+---------------------------------
You can join the linux-parport mailing list by sending a mail message
-to
+to:
+
linux-parport-request(a)torque.net
-with the single word
+with the single word::
subscribe
@@ -412,6 +436,4 @@ have in your mail headers, when sending mail to the list server.
You might also find some useful information on the linux-parport
web pages (although they are not always up to date) at
- http://web.archive.org/web/*/http://www.torque.net/parport/
-
-
+ http://web.archive.org/web/%2E/http://www.torque.net/parport/
diff --git a/Documentation/blockdev/ramdisk.txt b/Documentation/blockdev/ramdisk.rst
similarity index 84%
rename from Documentation/blockdev/ramdisk.txt
rename to Documentation/blockdev/ramdisk.rst
index 501e12e0323e..b7c2268f8dec 100644
--- a/Documentation/blockdev/ramdisk.txt
+++ b/Documentation/blockdev/ramdisk.rst
@@ -1,7 +1,8 @@
+==========================================
Using the RAM disk block device with Linux
-------------------------------------------
+==========================================
-Contents:
+.. Contents:
1) Overview
2) Kernel Command Line Parameters
@@ -42,7 +43,7 @@ rescue floppy disk.
2a) Kernel Command Line Parameters
ramdisk_size=N
- ==============
+ Size of the ramdisk.
This parameter tells the RAM disk driver to set up RAM disks of N k size. The
default is 4096 (4 MB).
@@ -50,16 +51,13 @@ default is 4096 (4 MB).
2b) Module parameters
rd_nr
- =====
- /dev/ramX devices created.
+ /dev/ramX devices created.
max_part
- ========
- Maximum partition number.
+ Maximum partition number.
rd_size
- =======
- See ramdisk_size.
+ See ramdisk_size.
3) Using "rdev -r"
------------------
@@ -71,11 +69,11 @@ to 2 MB (2^11) of where to find the RAM disk (this used to be the size). Bit
prompt/wait sequence is to be given before trying to read the RAM disk. Since
the RAM disk dynamically grows as data is being written into it, a size field
is not required. Bits 11 to 13 are not currently used and may as well be zero.
-These numbers are no magical secrets, as seen below:
+These numbers are no magical secrets, as seen below::
-./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
-./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000
-./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000
+ ./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
+ ./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000
+ ./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000
Consider a typical two floppy disk setup, where you will have the
kernel on disk one, and have already put a RAM disk image onto disk #2.
@@ -92,20 +90,23 @@ sequence so that you have a chance to switch floppy disks.
The command line equivalent is: "prompt_ramdisk=1"
Putting that together gives 2^15 + 2^14 + 0 = 49152 for an rdev word.
-So to create disk one of the set, you would do:
+So to create disk one of the set, you would do::
/usr/src/linux# cat arch/x86/boot/zImage > /dev/fd0
/usr/src/linux# rdev /dev/fd0 /dev/fd0
/usr/src/linux# rdev -r /dev/fd0 49152
-If you make a boot disk that has LILO, then for the above, you would use:
+If you make a boot disk that has LILO, then for the above, you would use::
+
append = "ramdisk_start=0 load_ramdisk=1 prompt_ramdisk=1"
-Since the default start = 0 and the default prompt = 1, you could use:
+
+Since the default start = 0 and the default prompt = 1, you could use::
+
append = "load_ramdisk=1"
4) An Example of Creating a Compressed RAM Disk
-----------------------------------------------
+-----------------------------------------------
To create a RAM disk image, you will need a spare block device to
construct it on. This can be the RAM disk device itself, or an
@@ -120,11 +121,11 @@ a) Decide on the RAM disk size that you want. Say 2 MB for this example.
Create it by writing to the RAM disk device. (This step is not currently
required, but may be in the future.) It is wise to zero out the
area (esp. for disks) so that maximal compression is achieved for
- the unused blocks of the image that you are about to create.
+ the unused blocks of the image that you are about to create::
dd if=/dev/zero of=/dev/ram0 bs=1k count=2048
-b) Make a filesystem on it. Say ext2fs for this example.
+b) Make a filesystem on it. Say ext2fs for this example::
mke2fs -vm0 /dev/ram0 2048
@@ -133,11 +134,11 @@ c) Mount it, copy the files you want to it (eg: /etc/* /dev/* ...)
d) Compress the contents of the RAM disk. The level of compression
will be approximately 50% of the space used by the files. Unused
- space on the RAM disk will compress to almost nothing.
+ space on the RAM disk will compress to almost nothing::
dd if=/dev/ram0 bs=1k count=2048 | gzip -v9 > /tmp/ram_image.gz
-e) Put the kernel onto the floppy
+e) Put the kernel onto the floppy::
dd if=zImage of=/dev/fd0 bs=1k
@@ -146,13 +147,13 @@ f) Put the RAM disk image onto the floppy, after the kernel. Use an offset
(possibly larger) kernel onto the same floppy later without overlapping
the RAM disk image. An offset of 400 kB for kernels about 350 kB in
size would be reasonable. Make sure offset+size of ram_image.gz is
- not larger than the total space on your floppy (usually 1440 kB).
+ not larger than the total space on your floppy (usually 1440 kB)::
dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400
g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
For prompt_ramdisk=1, load_ramdisk=1, ramdisk_start=400, one would
- have 2^15 + 2^14 + 400 = 49552.
+ have 2^15 + 2^14 + 400 = 49552::
rdev /dev/fd0 /dev/fd0
rdev -r /dev/fd0 49552
@@ -160,15 +161,17 @@ g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
That is it. You now have your boot/root compressed RAM disk floppy. Some
users may wish to combine steps (d) and (f) by using a pipe.
---------------------------------------------------------------------------
+
Paul Gortmaker 12/95
Changelog:
----------
-10-22-04 : Updated to reflect changes in command line options, remove
+10-22-04 :
+ Updated to reflect changes in command line options, remove
obsolete references, general cleanup.
James Nelson (james4765(a)gmail.com)
-12-95 : Original Document
+12-95 :
+ Original Document
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.rst
similarity index 76%
rename from Documentation/blockdev/zram.txt
rename to Documentation/blockdev/zram.rst
index 4df0ce271085..2111231c9c0f 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.rst
@@ -1,7 +1,9 @@
+========================================
zram: Compressed RAM based block devices
-----------------------------------------
+========================================
-* Introduction
+Introduction
+============
The zram module creates RAM based block devices named /dev/zram<id>
(<id> = 0, 1, ...). Pages written to these disks are compressed and stored
@@ -12,9 +14,11 @@ use as swap disks, various caches under /var and maybe many more :)
Statistics for individual zram devices are exported through sysfs nodes at
/sys/block/zram<id>/
-* Usage
+Usage
+=====
There are several ways to configure and manage zram device(-s):
+
a) using zram and zram_control sysfs attributes
b) using zramctl utility, provided by util-linux (util-linux(a)vger.kernel.org).
@@ -22,7 +26,7 @@ In this document we will describe only 'manual' zram configuration steps,
IOW, zram and zram_control sysfs attributes.
In order to get a better idea about zramctl please consult util-linux
-documentation, zramctl man-page or `zramctl --help'. Please be informed
+documentation, zramctl man-page or `zramctl --help`. Please be informed
that zram maintainers do not develop/maintain util-linux or zramctl, should
you have any questions please contact util-linux(a)vger.kernel.org
@@ -30,19 +34,23 @@ Following shows a typical sequence of steps for using zram.
WARNING
=======
+
For the sake of simplicity we skip error checking parts in most of the
examples below. However, it is your sole responsibility to handle errors.
zram sysfs attributes always return negative values in case of errors.
The list of possible return codes:
--EBUSY -- an attempt to modify an attribute that cannot be changed once
-the device has been initialised. Please reset device first;
--ENOMEM -- zram was not able to allocate enough memory to fulfil your
-needs;
--EINVAL -- invalid input has been provided.
+
+======== =============================================================
+-EBUSY an attempt to modify an attribute that cannot be changed once
+ the device has been initialised. Please reset device first;
+-ENOMEM zram was not able to allocate enough memory to fulfil your
+ needs;
+-EINVAL invalid input has been provided.
+======== =============================================================
If you use 'echo', the returned value that is changed by 'echo' utility,
-and, in general case, something like:
+and, in general case, something like::
echo 3 > /sys/block/zram0/max_comp_streams
if [ $? -ne 0 ];
@@ -51,7 +59,11 @@ and, in general case, something like:
should suffice.
-1) Load Module:
+1) Load Module
+==============
+
+::
+
modprobe zram num_devices=4
This creates 4 devices: /dev/zram{0,1,2,3}
@@ -59,6 +71,8 @@ num_devices parameter is optional and tells zram how many devices should be
pre-created. Default: 1.
2) Set max number of compression streams
+========================================
+
Regardless the value passed to this attribute, ZRAM will always
allocate multiple compression streams - one per online CPUs - thus
allowing several concurrent compression operations. The number of
@@ -66,16 +80,20 @@ allocated compression streams goes down when some of the CPUs
become offline. There is no single-compression-stream mode anymore,
unless you are running a UP system or has only 1 CPU online.
-To find out how many streams are currently available:
+To find out how many streams are currently available::
+
cat /sys/block/zram0/max_comp_streams
3) Select compression algorithm
+===============================
+
Using comp_algorithm device attribute one can see available and
currently selected (shown in square brackets) compression algorithms,
change selected compression algorithm (once the device is initialised
there is no way to change compression algorithm).
-Examples:
+Examples::
+
#show supported compression algorithms
cat /sys/block/zram0/comp_algorithm
lzo [lz4]
@@ -83,20 +101,23 @@ Examples:
#select lzo compression algorithm
echo lzo > /sys/block/zram0/comp_algorithm
-For the time being, the `comp_algorithm' content does not necessarily
+For the time being, the `comp_algorithm` content does not necessarily
show every compression algorithm supported by the kernel. We keep this
list primarily to simplify device configuration and one can configure
a new device with a compression algorithm that is not listed in
-`comp_algorithm'. The thing is that, internally, ZRAM uses Crypto API
+`comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API
and, if some of the algorithms were built as modules, it's impossible
to list all of them using, for instance, /proc/crypto or any other
method. This, however, has an advantage of permitting the usage of
custom crypto compression modules (implementing S/W or H/W compression).
4) Set Disksize
+===============
+
Set disk size by writing the value to sysfs node 'disksize'.
The value can be either in bytes or you can use mem suffixes.
-Examples:
+Examples::
+
# Initialize /dev/zram0 with 50MB disksize
echo $((50*1024*1024)) > /sys/block/zram0/disksize
@@ -111,10 +132,13 @@ since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.
5) Set memory limit: Optional
+=============================
+
Set memory limit by writing the value to sysfs node 'mem_limit'.
The value can be either in bytes or you can use mem suffixes.
In addition, you could change the value in runtime.
-Examples:
+Examples::
+
# limit /dev/zram0 with 50MB memory
echo $((50*1024*1024)) > /sys/block/zram0/mem_limit
@@ -126,7 +150,11 @@ Examples:
# To disable memory limit
echo 0 > /sys/block/zram0/mem_limit
-6) Activate:
+6) Activate
+===========
+
+::
+
mkswap /dev/zram0
swapon /dev/zram0
@@ -134,6 +162,7 @@ Examples:
mount /dev/zram1 /tmp
7) Add/remove zram devices
+==========================
zram provides a control interface, which enables dynamic (on-demand) device
addition and removal.
@@ -142,37 +171,44 @@ In order to add a new /dev/zramX device, perform read operation on hot_add
attribute. This will return either new device's device id (meaning that you
can use /dev/zram<id>) or error code.
-Example:
+Example::
+
cat /sys/class/zram-control/hot_add
1
To remove the existing /dev/zramX device (where X is a device id)
-execute
+execute::
+
echo X > /sys/class/zram-control/hot_remove
-8) Stats:
+8) Stats
+========
+
Per-device statistics are exported as various nodes under /sys/block/zram<id>/
A brief description of exported device attributes. For more details please
read Documentation/ABI/testing/sysfs-block-zram.
+====================== ====== ===============================================
Name access description
----- ------ -----------
+====================== ====== ===============================================
disksize RW show and set the device's disk size
initstate RO shows the initialization state of the device
reset WO trigger device reset
-mem_used_max WO reset the `mem_used_max' counter (see later)
-mem_limit WO specifies the maximum amount of memory ZRAM can use
- to store the compressed data
-writeback_limit WO specifies the maximum amount of write IO zram can
- write out to backing device as 4KB unit
+mem_used_max WO reset the `mem_used_max` counter (see later)
+mem_limit WO specifies the maximum amount of memory ZRAM can
+ use to store the compressed data
+writeback_limit WO specifies the maximum amount of write IO zram
+ can write out to backing device as 4KB unit
writeback_limit_enable RW show and set writeback_limit feature
-max_comp_streams RW the number of possible concurrent compress operations
+max_comp_streams RW the number of possible concurrent compress
+ operations
comp_algorithm RW show and change the compression algorithm
compact WO trigger memory compaction
debug_stat RO this file is used for zram debugging purposes
backing_dev RW set up backend storage for zram to write out
idle WO mark allocated slot as idle
+====================== ====== ===============================================
User space is advised to use the following files to read the device statistics.
@@ -188,23 +224,31 @@ The stat file represents device's I/O statistics not accounted by block
layer and, thus, not available in zram<id>/stat file. It consists of a
single line of text and contains the following stats separated by
whitespace:
- failed_reads the number of failed reads
- failed_writes the number of failed writes
- invalid_io the number of non-page-size-aligned I/O requests
+
+ ============= =============================================================
+ failed_reads The number of failed reads
+ failed_writes The number of failed writes
+ invalid_io The number of non-page-size-aligned I/O requests
notify_free Depending on device usage scenario it may account
+
a) the number of pages freed because of swap slot free
- notifications or b) the number of pages freed because of
- REQ_OP_DISCARD requests sent by bio. The former ones are
- sent to a swap block device when a swap slot is freed,
- which implies that this disk is being used as a swap disk.
+ notifications
+ b) the number of pages freed because of
+ REQ_OP_DISCARD requests sent by bio. The former ones are
+ sent to a swap block device when a swap slot is freed,
+ which implies that this disk is being used as a swap disk.
+
The latter ones are sent by filesystem mounted with
discard option, whenever some data blocks are getting
discarded.
+ ============= =============================================================
File /sys/block/zram<id>/mm_stat
The stat file represents device's mm statistics. It consists of a single
line of text and contains the following stats separated by whitespace:
+
+ ================ =============================================================
orig_data_size uncompressed size of data stored in this disk.
This excludes same-element-filled pages (same_pages) since
no memory is allocated for them.
@@ -223,58 +267,71 @@ line of text and contains the following stats separated by whitespace:
No memory is allocated for such pages.
pages_compacted the number of pages freed during compaction
huge_pages the number of incompressible pages
+ ================ =============================================================
File /sys/block/zram<id>/bd_stat
The stat file represents device's backing device statistics. It consists of
a single line of text and contains the following stats separated by whitespace:
+
+ ============== =============================================================
bd_count size of data written in backing device.
Unit: 4K bytes
bd_reads the number of reads from backing device
Unit: 4K bytes
bd_writes the number of writes to backing device
Unit: 4K bytes
+ ============== =============================================================
+
+9) Deactivate
+=============
+
+::
-9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
-10) Reset:
- Write any positive value to 'reset' sysfs node
- echo 1 > /sys/block/zram0/reset
- echo 1 > /sys/block/zram1/reset
+10) Reset
+=========
+
+ Write any positive value to 'reset' sysfs node::
+
+ echo 1 > /sys/block/zram0/reset
+ echo 1 > /sys/block/zram1/reset
This frees all the memory allocated for the given device and
resets the disksize to zero. You must set the disksize again
before reusing the device.
-* Optional Feature
+Optional Feature
+================
-= writeback
+writeback
+---------
With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
to backing storage rather than keeping it in memory.
-To use the feature, admin should set up backing device via
+To use the feature, admin should set up backing device via::
- "echo /dev/sda5 > /sys/block/zramX/backing_dev"
+ echo /dev/sda5 > /sys/block/zramX/backing_dev
before disksize setting. It supports only partition at this moment.
-If admin want to use incompressible page writeback, they could do via
+If admin want to use incompressible page writeback, they could do via::
- "echo huge > /sys/block/zramX/write"
+ echo huge > /sys/block/zramX/write
To use idle page writeback, first, user need to declare zram pages
-as idle.
+as idle::
- "echo all > /sys/block/zramX/idle"
+ echo all > /sys/block/zramX/idle
From now on, any pages on zram are idle pages. The idle mark
will be removed until someone request access of the block.
IOW, unless there is access request, those pages are still idle pages.
-Admin can request writeback of those idle pages at right timing via
+Admin can request writeback of those idle pages at right timing via::
- "echo idle > /sys/block/zramX/writeback"
+ echo idle > /sys/block/zramX/writeback
With the command, zram writeback idle pages from memory to the storage.
@@ -285,7 +342,7 @@ to guarantee storage health for entire product life.
To overcome the concern, zram supports "writeback_limit" feature.
The "writeback_limit_enable"'s default value is 0 so that it doesn't limit
any writeback. IOW, if admin want to apply writeback budget, he should
-enable writeback_limit_enable via
+enable writeback_limit_enable via::
$ echo 1 > /sys/block/zramX/writeback_limit_enable
@@ -296,7 +353,7 @@ until admin set the budget via /sys/block/zramX/writeback_limit.
assigned via /sys/block/zramX/writeback_limit is meaninless.)
If admin want to limit writeback as per-day 400M, he could do it
-like below.
+like below::
$ MB_SHIFT=20
$ 4K_SHIFT=12
@@ -305,16 +362,16 @@ like below.
$ echo 1 > /sys/block/zram0/writeback_limit_enable
If admin want to allow further write again once the bugdet is exausted,
-he could do it like below
+he could do it like below::
$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit
-If admin want to see remaining writeback budget since he set,
+If admin want to see remaining writeback budget since he set::
$ cat /sys/block/zramX/writeback_limit
-If admin want to disable writeback limit, he could do
+If admin want to disable writeback limit, he could do::
$ echo 0 > /sys/block/zramX/writeback_limit_enable
@@ -326,25 +383,35 @@ budget in next setting is user's job.
If admin want to measure writeback count in a certain period, he could
know it via /sys/block/zram0/bd_stat's 3rd column.
-= memory tracking
+memory tracking
+===============
With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the
zram block. It could be useful to catch cold or incompressible
pages of the process with*pagemap.
+
If you enable the feature, you could see block state via
-/sys/kernel/debug/zram/zram0/block_state". The output is as follows,
+/sys/kernel/debug/zram/zram0/block_state". The output is as follows::
300 75.033841 .wh.
301 63.806904 s...
302 63.806919 ..hi
-First column is zram's block index.
-Second column is access time since the system was booted
-Third column is state of the block.
-(s: same page
-w: written page to backing store
-h: huge page
-i: idle page)
+First column
+ zram's block index.
+Second column
+ access time since the system was booted
+Third column
+ state of the block:
+
+ s:
+ same page
+ w:
+ written page to backing store
+ h:
+ huge page
+ i:
+ idle page
First line of above example says 300th block is accessed at 75.033841sec
and the block's state is huge so it is written back to the backing
diff --git a/MAINTAINERS b/MAINTAINERS
index 85648a944446..808f65e06ad8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11024,7 +11024,7 @@ M: Josef Bacik <josef(a)toxicpanda.com>
S: Maintained
L: linux-block(a)vger.kernel.org
L: nbd(a)other.debian.org
-F: Documentation/blockdev/nbd.txt
+F: Documentation/blockdev/nbd.rst
F: drivers/block/nbd.c
F: include/trace/events/nbd.h
F: include/uapi/linux/nbd.h
@@ -12028,7 +12028,7 @@ PARIDE DRIVERS FOR PARALLEL PORT IDE DEVICES
M: Tim Waugh <tim(a)cyberelk.net>
L: linux-parport(a)lists.infradead.org (subscribers-only)
S: Maintained
-F: Documentation/blockdev/paride.txt
+F: Documentation/blockdev/paride.rst
F: drivers/block/paride/
PARISC ARCHITECTURE
@@ -13310,7 +13310,7 @@ F: drivers/net/wireless/ralink/rt2x00/
RAMDISK RAM BLOCK DEVICE DRIVER
M: Jens Axboe <axboe(a)kernel.dk>
S: Maintained
-F: Documentation/blockdev/ramdisk.txt
+F: Documentation/blockdev/ramdisk.rst
F: drivers/block/brd.c
RANCHU VIRTUAL BOARD FOR MIPS
@@ -17672,7 +17672,7 @@ R: Sergey Senozhatsky <sergey.senozhatsky.work(a)gmail.com>
L: linux-kernel(a)vger.kernel.org
S: Maintained
F: drivers/block/zram/
-F: Documentation/blockdev/zram.txt
+F: Documentation/blockdev/zram.rst
ZS DECSTATION Z85C30 SERIAL DRIVER
M: "Maciej W. Rozycki" <macro(a)linux-mips.org>
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 96ec7e0fc1ea..c43690b973d8 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -31,7 +31,7 @@ config BLK_DEV_FD
If you want to use the floppy disk drive(s) of your PC under Linux,
say Y. Information about this driver, especially important for IBM
Thinkpad users, is contained in
- <file:Documentation/blockdev/floppy.txt>.
+ <file:Documentation/blockdev/floppy.rst>.
That file also contains the location of the Floppy driver FAQ as
well as location of the fdutils package used to configure additional
parameters of the driver at run time.
@@ -96,7 +96,7 @@ config PARIDE
your computer's parallel port. Most of them are actually IDE devices
using a parallel port IDE adapter. This option enables the PARIDE
subsystem which contains drivers for many of these external drives.
- Read <file:Documentation/blockdev/paride.txt> for more information.
+ Read <file:Documentation/blockdev/paride.rst> for more information.
If you have said Y to the "Parallel-port support" configuration
option, you may share a single port between your printer and other
@@ -261,7 +261,7 @@ config BLK_DEV_NBD
userland (making server and client physically the same computer,
communicating using the loopback network device).
- Read <file:Documentation/blockdev/nbd.txt> for more information,
+ Read <file:Documentation/blockdev/nbd.rst> for more information,
especially about where to find the server code, which runs in user
space and does not need special kernel support.
@@ -303,7 +303,7 @@ config BLK_DEV_RAM
during the initial install of Linux.
Note that the kernel command line option "ramdisk=XX" is now obsolete.
- For details, read <file:Documentation/blockdev/ramdisk.txt>.
+ For details, read <file:Documentation/blockdev/ramdisk.rst>.
To compile this driver as a module, choose M here: the
module will be called brd. An alias "rd" has been defined
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index b933a7eea52b..5c99e52f9dc1 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4424,7 +4424,7 @@ static int __init floppy_setup(char *str)
pr_cont("\n");
} else
DPRINT("botched floppy option\n");
- DPRINT("Read Documentation/blockdev/floppy.txt\n");
+ DPRINT("Read Documentation/blockdev/floppy.rst\n");
return 0;
}
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 1ffc64770643..e06b99d54816 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -12,7 +12,7 @@ config ZRAM
It has several use cases, for example: /tmp storage, use as swap
disks and maybe many more.
- See Documentation/blockdev/zram.txt for more information.
+ See Documentation/blockdev/zram.rst for more information.
config ZRAM_WRITEBACK
bool "Write back incompressible or idle page to backing device"
@@ -26,7 +26,7 @@ config ZRAM_WRITEBACK
With /sys/block/zramX/{idle,writeback}, application could ask
idle page's writeback to the backing device to save in memory.
- See Documentation/blockdev/zram.txt for more information.
+ See Documentation/blockdev/zram.rst for more information.
config ZRAM_MEMORY_TRACKING
bool "Track zRam block status"
@@ -36,4 +36,4 @@ config ZRAM_MEMORY_TRACKING
of zRAM. Admin could see the information via
/sys/kernel/debug/zram/zramX/block_state.
- See Documentation/blockdev/zram.txt for more information.
+ See Documentation/blockdev/zram.rst for more information.
diff --git a/tools/testing/selftests/zram/README b/tools/testing/selftests/zram/README
index 7972cc512408..5fa378391d3b 100644
--- a/tools/testing/selftests/zram/README
+++ b/tools/testing/selftests/zram/README
@@ -37,4 +37,4 @@ Commands required for testing:
- mkfs/ mkfs.ext4
For more information please refer:
-kernel-source-tree/Documentation/blockdev/zram.txt
+kernel-source-tree/Documentation/blockdev/zram.rst
--
2.21.0
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:
int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a problem for
Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes we enable them to adopt this api.
In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Acked-by: Arnd Bergmann <arnd(a)arndb.de>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Andy Lutomirsky <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-api(a)vger.kernel.org
---
v1:
- kbuild test robot <lkp(a)intel.com>:
- add missing entry for pidfd_open to arch/arm/tools/syscall.tbl
- Oleg Nesterov <oleg(a)redhat.com>:
- use simpler thread-group leader check
v2:
- Oleg Nesterov <oleg(a)redhat.com>:
- avoid using additional variable
- remove unneeded comment
- Arnd Bergmann <arnd(a)arndb.de>:
- switch from 428 to 434 since the new mount api has taken it
- bump syscall numbers in arch/arm64/include/asm/unistd.h
- Joel Fernandes (Google) <joel(a)joelfernandes.org>:
- switch from ESRCH to EINVAL when the passed-in pid does not refer to a
thread-group leader
- Christian Brauner <christian(a)brauner.io>:
- rebase on v5.2-rc1
- adapt syscall number to account for new mount api syscalls
v3:
- Arnd Bergmann <arnd(a)arndb.de>:
- add missing syscall entries for mips-o32 and mips-n64
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/pid.h | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 2 +-
kernel/pid.c | 43 +++++++++++++++++++++
23 files changed, 68 insertions(+), 3 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..1db9bbcfb84e 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
541 common fsconfig sys_fsconfig
542 common fsmount sys_fsmount
543 common fspick sys_fspick
+544 common pidfd_open sys_pidfd_open
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..81e6e1817c45 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 70e6882853c0..e8f7d95a1481 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -44,7 +44,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 434
+#define __NR_compat_syscalls 435
#endif
#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..7a3158ccd68e 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_pidfd_open 434
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..ecc44926737b 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..9a3eb2558568 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..ad706f83c755 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..97035e19ad03 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
431 n32 fsconfig sys_fsconfig
432 n32 fsmount sys_fsmount
433 n32 fspick sys_fspick
+434 n32 pidfd_open sys_pidfd_open
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 5eebfa0d155c..d7292722d3b0 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -348,3 +348,4 @@
431 n64 fsconfig sys_fsconfig
432 n64 fsmount sys_fsmount
433 n64 fspick sys_fspick
+434 n64 pidfd_open sys_pidfd_open
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 3cc1374e02d0..dba084c92f14 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -421,3 +421,4 @@
431 o32 fsconfig sys_fsconfig
432 o32 fsmount sys_fsmount
433 o32 fspick sys_fspick
+434 o32 pidfd_open sys_pidfd_open
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..5022b9e179c2 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..f2c3bda2d39f 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..6ebacfeaf853 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig sys_fsconfig
432 common fsmount sys_fsmount sys_fsmount
433 common fspick sys_fspick sys_fspick
+434 common pidfd_open sys_pidfd_open sys_pidfd_open
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..834c9c7d79fa 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..c58e71f21129 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..43e4429a5272 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
432 i386 fsmount sys_fsmount __ia32_sys_fsmount
433 i386 fspick sys_fspick __ia32_sys_fspick
+434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..1bee0a77fdd3 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
431 common fsconfig __x64_sys_fsconfig
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
+434 common pidfd_open __x64_sys_pidfd_open
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..782b81945ccc 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..c938a92eab99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -67,6 +67,7 @@ struct pid
extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
+extern int pidfd_create(struct pid *pid);
static inline struct pid *get_pid(struct pid *pid)
{
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..989055e0b501 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
struct old_timex32 __user *tx);
asmlinkage long sys_syncfs(int fd);
asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
unsigned int vlen, unsigned flags);
asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..e5684a4512c0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_pidfd_open 434
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 435
/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index b4cba953040a..c3df226f47a1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1724,7 +1724,7 @@ const struct file_operations pidfd_fops = {
* Return: On success, a cloexec pidfd is returned.
* On error, a negative errno number will be returned.
*/
-static int pidfd_create(struct pid *pid)
+int pidfd_create(struct pid *pid)
{
int fd;
diff --git a/kernel/pid.c b/kernel/pid.c
index 89548d35eefb..8fc9d94f6ac1 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -37,6 +37,7 @@
#include <linux/syscalls.h>
#include <linux/proc_ns.h>
#include <linux/proc_fs.h>
+#include <linux/sched/signal.h>
#include <linux/sched/task.h>
#include <linux/idr.h>
@@ -450,6 +451,48 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
return idr_get_next(&ns->idr, &nr);
}
+/**
+ * pidfd_open() - Open new pid file descriptor.
+ *
+ * @pid: pid for which to retrieve a pidfd
+ * @flags: flags to pass
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set for
+ * the process identified by @pid. Currently, the process identified by
+ * @pid must be a thread-group leader. This restriction currently exists
+ * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
+ * be used with CLONE_THREAD) and pidfd polling (only supports thread group
+ * leaders).
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
+{
+ int fd, ret;
+ struct pid *p;
+
+ if (flags)
+ return -EINVAL;
+
+ if (pid <= 0)
+ return -EINVAL;
+
+ p = find_get_pid(pid);
+ if (!p)
+ return -ESRCH;
+
+ ret = 0;
+ rcu_read_lock();
+ if (!pid_task(p, PIDTYPE_TGID))
+ ret = -EINVAL;
+ rcu_read_unlock();
+
+ fd = ret ?: pidfd_create(p);
+ put_pid(p);
+ return fd;
+}
+
void __init pid_idr_init(void)
{
/* Verify no one has done anything silly: */
--
2.21.0
This reverts commit 2b845d4b4acd9422bbb668989db8dc36dfc8f438.
That commit introduces build issues for programs compiled in Thumb mode.
Rather than try to be clever and emit a valid trap instruction on arm32,
which requires special care about big/little endian handling on that
architecture, just emit plain data. Data in the instruction stream is
technically expected on arm32: this is how literal pools are
implemented. Reverting to the prior behavior does exactly that.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
CC: Peter Zijlstra <peterz(a)infradead.org>
CC: Thomas Gleixner <tglx(a)linutronix.de>
CC: Joel Fernandes <joelaf(a)google.com>
CC: Catalin Marinas <catalin.marinas(a)arm.com>
CC: Dave Watson <davejwatson(a)fb.com>
CC: Will Deacon <will.deacon(a)arm.com>
CC: Shuah Khan <shuah(a)kernel.org>
CC: Andi Kleen <andi(a)firstfloor.org>
CC: linux-kselftest(a)vger.kernel.org
CC: "H . Peter Anvin" <hpa(a)zytor.com>
CC: Chris Lameter <cl(a)linux.com>
CC: Russell King <linux(a)arm.linux.org.uk>
CC: Michael Kerrisk <mtk.manpages(a)gmail.com>
CC: "Paul E . McKenney" <paulmck(a)linux.vnet.ibm.com>
CC: Paul Turner <pjt(a)google.com>
CC: Boqun Feng <boqun.feng(a)gmail.com>
CC: Josh Triplett <josh(a)joshtriplett.org>
CC: Steven Rostedt <rostedt(a)goodmis.org>
CC: Ben Maurer <bmaurer(a)fb.com>
CC: linux-api(a)vger.kernel.org
CC: Andy Lutomirski <luto(a)amacapital.net>
CC: Andrew Morton <akpm(a)linux-foundation.org>
CC: Linus Torvalds <torvalds(a)linux-foundation.org>
CC: Carlos O'Donell <carlos(a)redhat.com>
CC: Florian Weimer <fweimer(a)redhat.com>
---
tools/testing/selftests/rseq/rseq-arm.h | 52 ++-------------------------------
1 file changed, 2 insertions(+), 50 deletions(-)
diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
index 84f28f147fb6..5f262c54364f 100644
--- a/tools/testing/selftests/rseq/rseq-arm.h
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -5,54 +5,7 @@
* (C) Copyright 2016-2018 - Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
*/
-/*
- * RSEQ_SIG uses the udf A32 instruction with an uncommon immediate operand
- * value 0x5de3. This traps if user-space reaches this instruction by mistake,
- * and the uncommon operand ensures the kernel does not move the instruction
- * pointer to attacker-controlled code on rseq abort.
- *
- * The instruction pattern in the A32 instruction set is:
- *
- * e7f5def3 udf #24035 ; 0x5de3
- *
- * This translates to the following instruction pattern in the T16 instruction
- * set:
- *
- * little endian:
- * def3 udf #243 ; 0xf3
- * e7f5 b.n <7f5>
- *
- * pre-ARMv6 big endian code:
- * e7f5 b.n <7f5>
- * def3 udf #243 ; 0xf3
- *
- * ARMv6+ -mbig-endian generates mixed endianness code vs data: little-endian
- * code and big-endian data. Ensure the RSEQ_SIG data signature matches code
- * endianness. Prior to ARMv6, -mbig-endian generates big-endian code and data
- * (which match), so there is no need to reverse the endianness of the data
- * representation of the signature. However, the choice between BE32 and BE8
- * is done by the linker, so we cannot know whether code and data endianness
- * will be mixed before the linker is invoked.
- */
-
-#define RSEQ_SIG_CODE 0xe7f5def3
-
-#ifndef __ASSEMBLER__
-
-#define RSEQ_SIG_DATA \
- ({ \
- int sig; \
- asm volatile ("b 2f\n\t" \
- "1: .inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
- "2:\n\t" \
- "ldr %[sig], 1b\n\t" \
- : [sig] "=r" (sig)); \
- sig; \
- })
-
-#define RSEQ_SIG RSEQ_SIG_DATA
-
-#endif
+#define RSEQ_SIG 0x53053053
#define rseq_smp_mb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
#define rseq_smp_rmb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
@@ -125,8 +78,7 @@ do { \
__rseq_str(table_label) ":\n\t" \
".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
- ".arm\n\t" \
- ".inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
+ ".word " __rseq_str(RSEQ_SIG) "\n\t" \
__rseq_str(label) ":\n\t" \
teardown \
"b %l[" __rseq_str(abort_label) "]\n\t"
--
2.11.0
selftests: bpf test_libbpf.sh failed running Linux -next kernel
20190618 and 20190619.
Here is the log from x86_64,
# selftests bpf test_libbpf.sh
bpf: test_libbpf.sh_ #
# [0] libbpf BTF is required, but is missing or corrupted.
libbpf: BTF_is #
# test_libbpf failed at file test_l4lb.o
failed: at_file #
# selftests test_libbpf [FAILED]
test_libbpf: [FAILED]_ #
[FAIL] 29 selftests bpf test_libbpf.sh
selftests: bpf_test_libbpf.sh [FAIL]
Full test log,
https://qa-reports.linaro.org/lkft/linux-next-oe/build/next-20190619/testru…
Test results comparison,
https://qa-reports.linaro.org/lkft/linux-next-oe/tests/kselftest/bpf_test_l…
Good linux -next tag: next-20190617
Bad linux -next tag: next-20190618
git branch master
git commit 1c6b40509daf5190b1fd2c758649f7df1da4827b
git repo
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
Best regards
Naresh Kamboju
---------- Forwarded message ---------
From: Jeffrin Thalakkottoor <jeffrin(a)rajagiritech.edu.in>
Date: Fri, Jun 14, 2019 at 12:06 AM
Subject: BUG: KASAN: global-out-of-bounds in ata_exec_internal_sg+0x50f/0xc70
To: <axboe(a)kernel.dk>, <jejb(a)linux.ibm.com>, <martin.petersen(a)oracle.com>
Cc: lkml <linux-kernel(a)vger.kernel.org>, <linux-ide(a)vger.kernel.org>,
<linux-scsi(a)vger.kernel.org>
hello ,
[ 55.169278] ==================================================================
[ 55.169899] BUG: KASAN: global-out-of-bounds in
ata_exec_internal_sg+0x50f/0xc70 [libata]
[ 55.170039] Read of size 16 at addr ffffffffc0723500 by task scsi_eh_1/149
[ 55.186354] The buggy address belongs to the variable:
[ 55.186972] cdb.48295+0x0/0xfffffffffffeab00 [libata]
[ 55.187171] Memory state around the buggy address:
[ 55.187290] ffffffffc0723400: 00 00 fa fa fa fa fa fa 00 00 fa fa
fa fa fa fa
[ 55.187417] ffffffffc0723480: 00 00 00 00 00 00 00 00 05 fa fa fa
fa fa fa fa
[ 55.187544] >ffffffffc0723500: 00 04 fa fa fa fa fa fa 00 00 fa fa
fa fa fa fa
[ 55.187686] ^
[ 55.187810] ffffffffc0723580: 00 00 05 fa fa fa fa fa 00 00 00 00
00 00 00 00
[ 55.187940] ffffffffc0723600: 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00
[ 55.188060] ==================================================================
output of dmesg with the filename kasan.txt attached
--
software engineer
rajagiri school of engineering and technology
=== Overview
arm64 has a feature called Top Byte Ignore, which allows to embed pointer
tags into the top byte of each pointer. Userspace programs (such as
HWASan, a memory debugging tool [1]) might use this feature and pass
tagged user pointers to the kernel through syscalls or other interfaces.
Right now the kernel is already able to handle user faults with tagged
pointers, due to these patches:
1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
tagged pointer")
2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
pointers")
3. 276e9327 ("arm64: entry: improve data abort handling of tagged
pointers")
This patchset extends tagged pointer support to syscall arguments.
As per the proposed ABI change [3], tagged pointers are only allowed to be
passed to syscalls when they point to memory ranges obtained by anonymous
mmap() or sbrk() (see the patchset [3] for more details).
For non-memory syscalls this is done by untaging user pointers when the
kernel performs pointer checking to find out whether the pointer comes
from userspace (most notably in access_ok). The untagging is done only
when the pointer is being checked, the tag is preserved as the pointer
makes its way through the kernel and stays tagged when the kernel
dereferences the pointer when perfoming user memory accesses.
The mmap and mremap (only new_addr) syscalls do not currently accept
tagged addresses. Architectures may interpret the tag as a background
colour for the corresponding vma.
Other memory syscalls (mprotect, etc.) don't do user memory accesses but
rather deal with memory ranges, and untagged pointers are better suited to
describe memory ranges internally. Thus for memory syscalls we untag
pointers completely when they enter the kernel.
=== Other approaches
One of the alternative approaches to untagging that was considered is to
completely strip the pointer tag as the pointer enters the kernel with
some kind of a syscall wrapper, but that won't work with the countless
number of different ioctl calls. With this approach we would need a custom
wrapper for each ioctl variation, which doesn't seem practical.
An alternative approach to untagging pointers in memory syscalls prologues
is to inspead allow tagged pointers to be passed to find_vma() (and other
vma related functions) and untag them there. Unfortunately, a lot of
find_vma() callers then compare or subtract the returned vma start and end
fields against the pointer that was being searched. Thus this approach
would still require changing all find_vma() callers.
=== Testing
The following testing approaches has been taken to find potential issues
with user pointer untagging:
1. Static testing (with sparse [2] and separately with a custom static
analyzer based on Clang) to track casts of __user pointers to integer
types to find places where untagging needs to be done.
2. Static testing with grep to find parts of the kernel that call
find_vma() (and other similar functions) or directly compare against
vm_start/vm_end fields of vma.
3. Static testing with grep to find parts of the kernel that compare
user pointers with TASK_SIZE or other similar consts and macros.
4. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
a modified syzkaller version that passes tagged pointers to the kernel.
Based on the results of the testing the requried patches have been added
to the patchset.
=== Notes
This patchset is meant to be merged together with "arm64 relaxed ABI" [3].
This patchset is a prerequisite for ARM's memory tagging hardware feature
support [4].
This patchset has been merged into the Pixel 2 & 3 kernel trees and is
now being used to enable testing of Pixel phones with HWASan.
Thanks!
[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060…
[3] https://lkml.org/lkml/2019/3/18/819
[4] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architectur…
=== History
Changes in v17:
- The "uaccess: add noop untagged_addr definition" patch is dropped, as it
was merged into upstream named as "uaccess: add noop untagged_addr
definition".
- Merged "mm, arm64: untag user pointers in do_pages_move" into
"mm, arm64: untag user pointers passed to memory syscalls".
- Added "arm64: Introduce prctl() options to control the tagged user
addresses ABI" patch from Catalin.
- Add tags_lib.so to tools/testing/selftests/arm64/.gitignore.
- Added a comment clarifying untagged in mremap.
- Moved untagging back into mlx4_get_umem_mr() for the IB patch.
Changes in v16:
- Moved untagging for memory syscalls from arm64 wrappers back to generic
code.
- Dropped untagging for the following memory syscalls: brk, mmap, munmap;
mremap (only dropped for new_address); mmap_pgoff (not used on arm64);
remap_file_pages (deprecated); shmat, shmdt (work on shared memory).
- Changed kselftest to LD_PRELOAD a shared library that overrides malloc
to return tagged pointers.
- Rebased onto 5.2-rc3.
Changes in v15:
- Removed unnecessary untagging from radeon_ttm_tt_set_userptr().
- Removed unnecessary untagging from amdgpu_ttm_tt_set_userptr().
- Moved untagging to validate_range() in userfaultfd code.
- Moved untagging to ib_uverbs_(re)reg_mr() from mlx4_get_umem_mr().
- Rebased onto 5.1.
Changes in v14:
- Moved untagging for most memory syscalls to an arm64 specific
implementation, instead of doing that in the common code.
- Dropped "net, arm64: untag user pointers in tcp_zerocopy_receive", since
the provided user pointers don't come from an anonymous map and thus are
not covered by this ABI relaxation.
- Dropped "kernel, arm64: untag user pointers in prctl_set_mm*".
- Moved untagging from __check_mem_type() to tee_shm_register().
- Updated untagging for the amdgpu and radeon drivers to cover the MMU
notifier, as suggested by Felix.
- Since this ABI relaxation doesn't actually allow tagged instruction
pointers, dropped the following patches:
- Dropped "tracing, arm64: untag user pointers in seq_print_user_ip".
- Dropped "uprobes, arm64: untag user pointers in find_active_uprobe".
- Dropped "bpf, arm64: untag user pointers in stack_map_get_build_id_offset".
- Rebased onto 5.1-rc7 (37624b58).
Changes in v13:
- Simplified untagging in tcp_zerocopy_receive().
- Looked at find_vma() callers in drivers/, which allowed to identify a
few other places where untagging is needed.
- Added patch "mm, arm64: untag user pointers in get_vaddr_frames".
- Added patch "drm/amdgpu, arm64: untag user pointers in
amdgpu_ttm_tt_get_user_pages".
- Added patch "drm/radeon, arm64: untag user pointers in
radeon_ttm_tt_pin_userptr".
- Added patch "IB/mlx4, arm64: untag user pointers in mlx4_get_umem_mr".
- Added patch "media/v4l2-core, arm64: untag user pointers in
videobuf_dma_contig_user_get".
- Added patch "tee/optee, arm64: untag user pointers in check_mem_type".
- Added patch "vfio/type1, arm64: untag user pointers".
Changes in v12:
- Changed untagging in tcp_zerocopy_receive() to also untag zc->address.
- Fixed untagging in prctl_set_mm* to only untag pointers for vma lookups
and validity checks, but leave them as is for actual user space accesses.
- Updated the link to the v2 of the "arm64 relaxed ABI" patchset [3].
- Dropped the documentation patch, as the "arm64 relaxed ABI" patchset [3]
handles that.
Changes in v11:
- Added "uprobes, arm64: untag user pointers in find_active_uprobe" patch.
- Added "bpf, arm64: untag user pointers in stack_map_get_build_id_offset"
patch.
- Fixed "tracing, arm64: untag user pointers in seq_print_user_ip" to
correctly perform subtration with a tagged addr.
- Moved untagged_addr() from SYSCALL_DEFINE3(mprotect) and
SYSCALL_DEFINE4(pkey_mprotect) to do_mprotect_pkey().
- Moved untagged_addr() definition for other arches from
include/linux/memory.h to include/linux/mm.h.
- Changed untagging in strn*_user() to perform userspace accesses through
tagged pointers.
- Updated the documentation to mention that passing tagged pointers to
memory syscalls is allowed.
- Updated the test to use malloc'ed memory instead of stack memory.
Changes in v10:
- Added "mm, arm64: untag user pointers passed to memory syscalls" back.
- New patch "fs, arm64: untag user pointers in fs/userfaultfd.c".
- New patch "net, arm64: untag user pointers in tcp_zerocopy_receive".
- New patch "kernel, arm64: untag user pointers in prctl_set_mm*".
- New patch "tracing, arm64: untag user pointers in seq_print_user_ip".
Changes in v9:
- Rebased onto 4.20-rc6.
- Used u64 instead of __u64 in type casts in the untagged_addr macro for
arm64.
- Added braces around (addr) in the untagged_addr macro for other arches.
Changes in v8:
- Rebased onto 65102238 (4.20-rc1).
- Added a note to the cover letter on why syscall wrappers/shims that untag
user pointers won't work.
- Added a note to the cover letter that this patchset has been merged into
the Pixel 2 kernel tree.
- Documentation fixes, in particular added a list of syscalls that don't
support tagged user pointers.
Changes in v7:
- Rebased onto 17b57b18 (4.19-rc6).
- Dropped the "arm64: untag user address in __do_user_fault" patch, since
the existing patches already handle user faults properly.
- Dropped the "usb, arm64: untag user addresses in devio" patch, since the
passed pointer must come from a vma and therefore be untagged.
- Dropped the "arm64: annotate user pointers casts detected by sparse"
patch (see the discussion to the replies of the v6 of this patchset).
- Added more context to the cover letter.
- Updated Documentation/arm64/tagged-pointers.txt.
Changes in v6:
- Added annotations for user pointer casts found by sparse.
- Rebased onto 050cdc6c (4.19-rc1+).
Changes in v5:
- Added 3 new patches that add untagging to places found with static
analysis.
- Rebased onto 44c929e1 (4.18-rc8).
Changes in v4:
- Added a selftest for checking that passing tagged pointers to the
kernel succeeds.
- Rebased onto 81e97f013 (4.18-rc1+).
Changes in v3:
- Rebased onto e5c51f30 (4.17-rc6+).
- Added linux-arch@ to the list of recipients.
Changes in v2:
- Rebased onto 2d618bdf (4.17-rc3+).
- Removed excessive untagging in gup.c.
- Removed untagging pointers returned from __uaccess_mask_ptr.
Changes in v1:
- Rebased onto 4.17-rc1.
Changes in RFC v2:
- Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of
defining it for each arch individually.
- Updated Documentation/arm64/tagged-pointers.txt.
- Dropped "mm, arm64: untag user addresses in memory syscalls".
- Rebased onto 3eb2ce82 (4.16-rc7).
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
Andrey Konovalov (14):
arm64: untag user pointers in access_ok and __uaccess_mask_ptr
lib, arm64: untag user pointers in strn*_user
mm, arm64: untag user pointers passed to memory syscalls
mm, arm64: untag user pointers in mm/gup.c
mm, arm64: untag user pointers in get_vaddr_frames
fs, arm64: untag user pointers in copy_mount_options
userfaultfd, arm64: untag user pointers
drm/amdgpu, arm64: untag user pointers
drm/radeon, arm64: untag user pointers in radeon_gem_userptr_ioctl
IB/mlx4, arm64: untag user pointers in mlx4_get_umem_mr
media/v4l2-core, arm64: untag user pointers in
videobuf_dma_contig_user_get
tee/shm, arm64: untag user pointers in tee_shm_register
vfio/type1, arm64: untag user pointers in vaddr_get_pfn
selftests, arm64: add a selftest for passing tagged pointers to kernel
Catalin Marinas (1):
arm64: Introduce prctl() options to control the tagged user addresses
ABI
arch/arm64/include/asm/processor.h | 6 ++
arch/arm64/include/asm/thread_info.h | 1 +
arch/arm64/include/asm/uaccess.h | 11 ++-
arch/arm64/kernel/process.c | 67 +++++++++++++++++++
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +
drivers/gpu/drm/radeon/radeon_gem.c | 2 +
drivers/infiniband/hw/mlx4/mr.c | 7 +-
drivers/media/v4l2-core/videobuf-dma-contig.c | 9 +--
drivers/tee/tee_shm.c | 1 +
drivers/vfio/vfio_iommu_type1.c | 2 +
fs/namespace.c | 2 +-
fs/userfaultfd.c | 22 +++---
include/uapi/linux/prctl.h | 5 ++
kernel/sys.c | 16 +++++
lib/strncpy_from_user.c | 3 +-
lib/strnlen_user.c | 3 +-
mm/frame_vector.c | 2 +
mm/gup.c | 4 ++
mm/madvise.c | 2 +
mm/mempolicy.c | 3 +
mm/migrate.c | 2 +-
mm/mincore.c | 2 +
mm/mlock.c | 4 ++
mm/mprotect.c | 2 +
mm/mremap.c | 7 ++
mm/msync.c | 2 +
tools/testing/selftests/arm64/.gitignore | 2 +
tools/testing/selftests/arm64/Makefile | 22 ++++++
.../testing/selftests/arm64/run_tags_test.sh | 12 ++++
tools/testing/selftests/arm64/tags_lib.c | 62 +++++++++++++++++
tools/testing/selftests/arm64/tags_test.c | 18 +++++
32 files changed, 282 insertions(+), 25 deletions(-)
create mode 100644 tools/testing/selftests/arm64/.gitignore
create mode 100644 tools/testing/selftests/arm64/Makefile
create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
create mode 100644 tools/testing/selftests/arm64/tags_lib.c
create mode 100644 tools/testing/selftests/arm64/tags_test.c
--
2.22.0.rc2.383.gf4fbbf30c2-goog
[ Upstream commit 00e38a5d753d7788852f81703db804a60a84c26e ]
The cgroup testing relys on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, some test cases will be failed
as following:
$sudo ./test_core
not ok 1 test_cgcore_internal_process_constraint
ok 2 test_cgcore_top_down_constraint_enable
not ok 3 test_cgcore_top_down_constraint_disable
...
To correct this unexpected failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Claudio Zumbo <claudioz(a)fb.com>
Cc: Claudio <claudiozumbo(a)gmail.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/cgroup/test_core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index be59f9c34ea2..d78f1c5366d3 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -376,6 +376,11 @@ int main(int argc, char *argv[])
if (cg_find_unified_root(root, sizeof(root)))
ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.20.1
[ Upstream commit f6131f28057d4fd8922599339e701a2504e0f23d ]
The cgroup testing relies on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, all test cases will be failed
as following:
$ sudo ./test_memcontrol
not ok 1 test_memcg_subtree_control
not ok 2 test_memcg_current
ok 3 # skip test_memcg_min
not ok 4 test_memcg_low
not ok 5 test_memcg_high
not ok 6 test_memcg_max
not ok 7 test_memcg_oom_events
ok 8 # skip test_memcg_swap_max
not ok 9 test_memcg_sock
not ok 10 test_memcg_oom_group_leaf_events
not ok 11 test_memcg_oom_group_parent_events
not ok 12 test_memcg_oom_group_score_events
To correct this unexpected failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Jay Kamat <jgkamat(a)fb.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 6f339882a6ca..c19a97dd02d4 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1205,6 +1205,10 @@ int main(int argc, char **argv)
if (cg_read_strstr(root, "cgroup.controllers", "memory"))
ksft_exit_skip("memory controller isn't available\n");
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.20.1
[ Upstream commit 00e38a5d753d7788852f81703db804a60a84c26e ]
The cgroup testing relys on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, some test cases will be failed
as following:
$sudo ./test_core
not ok 1 test_cgcore_internal_process_constraint
ok 2 test_cgcore_top_down_constraint_enable
not ok 3 test_cgcore_top_down_constraint_disable
...
To correct this unexpected failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Claudio Zumbo <claudioz(a)fb.com>
Cc: Claudio <claudiozumbo(a)gmail.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/cgroup/test_core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index be59f9c34ea2..d78f1c5366d3 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -376,6 +376,11 @@ int main(int argc, char *argv[])
if (cg_find_unified_root(root, sizeof(root)))
ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.20.1
[ Upstream commit f6131f28057d4fd8922599339e701a2504e0f23d ]
The cgroup testing relies on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, all test cases will be failed
as following:
$ sudo ./test_memcontrol
not ok 1 test_memcg_subtree_control
not ok 2 test_memcg_current
ok 3 # skip test_memcg_min
not ok 4 test_memcg_low
not ok 5 test_memcg_high
not ok 6 test_memcg_max
not ok 7 test_memcg_oom_events
ok 8 # skip test_memcg_swap_max
not ok 9 test_memcg_sock
not ok 10 test_memcg_oom_group_leaf_events
not ok 11 test_memcg_oom_group_parent_events
not ok 12 test_memcg_oom_group_score_events
To correct this unexpected failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Jay Kamat <jgkamat(a)fb.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 6f339882a6ca..c19a97dd02d4 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1205,6 +1205,10 @@ int main(int argc, char **argv)
if (cg_read_strstr(root, "cgroup.controllers", "memory"))
ksft_exit_skip("memory controller isn't available\n");
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.20.1
When ordinarily running the tests, upon `make install', the
following error is encountered:
ImportError: No module named tpm2_tests
because the Python files are not installed at the moment.
Fix this by adding both Python modules as accompanying
TEST_FILES in the Makefile.
Signed-off-by: Daniel Díaz <daniel.diaz(a)linaro.org>
---
tools/testing/selftests/tpm2/Makefile | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/tpm2/Makefile b/tools/testing/selftests/tpm2/Makefile
index 9dd848427a7b..bf401f725eef 100644
--- a/tools/testing/selftests/tpm2/Makefile
+++ b/tools/testing/selftests/tpm2/Makefile
@@ -2,3 +2,4 @@
include ../lib.mk
TEST_PROGS := test_smoke.sh test_space.sh
+TEST_FILES := tpm2.py tpm2_tests.py
--
2.20.1
[resubmitted with the missing patch]
Hi,
here are the rest and the main part of patches to add the support for
loading the compressed firmware files. The patch was slightly
refactored for more easily enhancing for other compression formats (if
anyone wants). Also the selftest patch is included. The
functionality doesn't change from the previous patchset.
thanks,
Takashi
===
Takashi Iwai (3):
firmware: Factor out the paged buffer handling code
firmware: Add support for loading compressed files
selftests: firmware: Add compressed firmware tests
drivers/base/firmware_loader/Kconfig | 18 ++
drivers/base/firmware_loader/fallback.c | 61 +------
drivers/base/firmware_loader/firmware.h | 12 +-
drivers/base/firmware_loader/main.c | 199 +++++++++++++++++++++-
tools/testing/selftests/firmware/fw_filesystem.sh | 73 ++++++--
tools/testing/selftests/firmware/fw_lib.sh | 7 +
tools/testing/selftests/firmware/fw_run_tests.sh | 1 +
7 files changed, 295 insertions(+), 76 deletions(-)
--
2.16.4
Rename the blockdev documentation files to ReST, add an
index for them and adjust in order to produce a nice html
output via the Sphinx build system.
The drbd sub-directory contains some graphs and data flows.
Add those too to the documentation.
At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung(a)kernel.org>
---
.../admin-guide/kernel-parameters.txt | 18 +-
...structure-v9.txt => data-structure-v9.rst} | 6 +-
Documentation/blockdev/drbd/figures.rst | 28 +++
.../blockdev/drbd/{README.txt => index.rst} | 15 +-
.../blockdev/{floppy.txt => floppy.rst} | 88 ++++----
Documentation/blockdev/index.rst | 16 ++
Documentation/blockdev/{nbd.txt => nbd.rst} | 2 +-
.../blockdev/{paride.txt => paride.rst} | 194 +++++++++--------
.../blockdev/{ramdisk.txt => ramdisk.rst} | 55 ++---
Documentation/blockdev/{zram.txt => zram.rst} | 195 ++++++++++++------
MAINTAINERS | 8 +-
drivers/block/Kconfig | 8 +-
drivers/block/floppy.c | 2 +-
drivers/block/zram/Kconfig | 6 +-
tools/testing/selftests/zram/README | 2 +-
15 files changed, 398 insertions(+), 245 deletions(-)
rename Documentation/blockdev/drbd/{data-structure-v9.txt => data-structure-v9.rst} (94%)
create mode 100644 Documentation/blockdev/drbd/figures.rst
rename Documentation/blockdev/drbd/{README.txt => index.rst} (55%)
rename Documentation/blockdev/{floppy.txt => floppy.rst} (81%)
create mode 100644 Documentation/blockdev/index.rst
rename Documentation/blockdev/{nbd.txt => nbd.rst} (96%)
rename Documentation/blockdev/{paride.txt => paride.rst} (81%)
rename Documentation/blockdev/{ramdisk.txt => ramdisk.rst} (84%)
rename Documentation/blockdev/{zram.txt => zram.rst} (76%)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 873062810484..20780fbc948d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1249,7 +1249,7 @@
See also Documentation/fault-injection/.
floppy= [HW]
- See Documentation/blockdev/floppy.txt.
+ See Documentation/blockdev/floppy.rst.
force_pal_cache_flush
[IA-64] Avoid check_sal_cache_flush which may hang on
@@ -2238,7 +2238,7 @@
memblock=debug [KNL] Enable memblock debug messages.
load_ramdisk= [RAM] List of ramdisks to load from floppy
- See Documentation/blockdev/ramdisk.txt.
+ See Documentation/blockdev/ramdisk.rst.
lockd.nlm_grace_period=P [NFS] Assign grace period.
Format: <integer>
@@ -3283,7 +3283,7 @@
pcd. [PARIDE]
See header of drivers/block/paride/pcd.c.
- See also Documentation/blockdev/paride.txt.
+ See also Documentation/blockdev/paride.rst.
pci=option[,option...] [PCI] various PCI subsystem options.
@@ -3527,7 +3527,7 @@
needed on a platform with proper driver support.
pd. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/blockdev/paride.rst.
pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at
boot time.
@@ -3542,10 +3542,10 @@
and performance comparison.
pf. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/blockdev/paride.rst.
pg. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/blockdev/paride.rst.
pirq= [SMP,APIC] Manual mp-table setup
See Documentation/x86/i386/IO-APIC.rst.
@@ -3657,7 +3657,7 @@
prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk
before loading.
- See Documentation/blockdev/ramdisk.txt.
+ See Documentation/blockdev/ramdisk.rst.
psi= [KNL] Enable or disable pressure stall information
tracking.
@@ -3679,7 +3679,7 @@
pstore.backend= Specify the name of the pstore backend to use
pt. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/blockdev/paride.rst.
pti= [X86_64] Control Page Table Isolation of user and
kernel address spaces. Disabling this feature
@@ -3708,7 +3708,7 @@
See Documentation/admin-guide/md.rst.
ramdisk_size= [RAM] Sizes of RAM disks in kilobytes
- See Documentation/blockdev/ramdisk.txt.
+ See Documentation/blockdev/ramdisk.rst.
random.trust_cpu={on,off}
[KNL] Enable or disable trusting the use of the
diff --git a/Documentation/blockdev/drbd/data-structure-v9.txt b/Documentation/blockdev/drbd/data-structure-v9.rst
similarity index 94%
rename from Documentation/blockdev/drbd/data-structure-v9.txt
rename to Documentation/blockdev/drbd/data-structure-v9.rst
index 1e52a0e32624..66036b901644 100644
--- a/Documentation/blockdev/drbd/data-structure-v9.txt
+++ b/Documentation/blockdev/drbd/data-structure-v9.rst
@@ -1,3 +1,7 @@
+================================
+kernel data structure for DRBD-9
+================================
+
This describes the in kernel data structure for DRBD-9. Starting with
Linux v3.14 we are reorganizing DRBD to use this data structure.
@@ -10,7 +14,7 @@ device is represented by a block device locally.
The DRBD objects are interconnected to form a matrix as depicted below; a
drbd_peer_device object sits at each intersection between a drbd_device and a
-drbd_connection:
+drbd_connection::
/--------------+---------------+.....+---------------\
| resource | device | | device |
diff --git a/Documentation/blockdev/drbd/figures.rst b/Documentation/blockdev/drbd/figures.rst
new file mode 100644
index 000000000000..3e3fd4b8a478
--- /dev/null
+++ b/Documentation/blockdev/drbd/figures.rst
@@ -0,0 +1,28 @@
+.. The here included files are intended to help understand the implementation
+
+Data flows that Relate some functions, and write packets
+========================================================
+
+.. kernel-figure:: DRBD-8.3-data-packets.svg
+ :alt: DRBD-8.3-data-packets.svg
+ :align: center
+
+.. kernel-figure:: DRBD-data-packets.svg
+ :alt: DRBD-data-packets.svg
+ :align: center
+
+
+Sub graphs of DRBD's state transitions
+======================================
+
+.. kernel-figure:: conn-states-8.dot
+ :alt: conn-states-8.dot
+ :align: center
+
+.. kernel-figure:: disk-states-8.dot
+ :alt: disk-states-8.dot
+ :align: center
+
+.. kernel-figure:: node-states-8.dot
+ :alt: node-states-8.dot
+ :align: center
diff --git a/Documentation/blockdev/drbd/README.txt b/Documentation/blockdev/drbd/index.rst
similarity index 55%
rename from Documentation/blockdev/drbd/README.txt
rename to Documentation/blockdev/drbd/index.rst
index 627b0a1bf35e..68ecd5c113e9 100644
--- a/Documentation/blockdev/drbd/README.txt
+++ b/Documentation/blockdev/drbd/index.rst
@@ -1,4 +1,9 @@
+==========================================
+Distributed Replicated Block Device - DRBD
+==========================================
+
Description
+===========
DRBD is a shared-nothing, synchronously replicated block device. It
is designed to serve as a building block for high availability
@@ -7,10 +12,8 @@ Description
Please visit http://www.drbd.org to find out more.
-The here included files are intended to help understand the implementation
+.. toctree::
+ :maxdepth: 1
-DRBD-8.3-data-packets.svg, DRBD-data-packets.svg
- relates some functions, and write packets.
-
-conn-states-8.dot, disk-states-8.dot, node-states-8.dot
- The sub graphs of DRBD's state transitions
+ data-structure-v9
+ figures
diff --git a/Documentation/blockdev/floppy.txt b/Documentation/blockdev/floppy.rst
similarity index 81%
rename from Documentation/blockdev/floppy.txt
rename to Documentation/blockdev/floppy.rst
index e2240f5ab64d..4a8f31cf4139 100644
--- a/Documentation/blockdev/floppy.txt
+++ b/Documentation/blockdev/floppy.rst
@@ -1,35 +1,37 @@
-This file describes the floppy driver.
+=============
+Floppy Driver
+=============
FAQ list:
=========
- A FAQ list may be found in the fdutils package (see below), and also
+A FAQ list may be found in the fdutils package (see below), and also
at <http://fdutils.linux.lu/faq.html>.
LILO configuration options (Thinkpad users, read this)
======================================================
- The floppy driver is configured using the 'floppy=' option in
+The floppy driver is configured using the 'floppy=' option in
lilo. This option can be typed at the boot prompt, or entered in the
lilo configuration file.
- Example: If your kernel is called linux-2.6.9, type the following line
-at the lilo boot prompt (if you have a thinkpad):
+Example: If your kernel is called linux-2.6.9, type the following line
+at the lilo boot prompt (if you have a thinkpad)::
linux-2.6.9 floppy=thinkpad
You may also enter the following line in /etc/lilo.conf, in the description
-of linux-2.6.9:
+of linux-2.6.9::
append = "floppy=thinkpad"
- Several floppy related options may be given, example:
+Several floppy related options may be given, example::
linux-2.6.9 floppy=daring floppy=two_fdc
append = "floppy=daring floppy=two_fdc"
- If you give options both in the lilo config file and on the boot
+If you give options both in the lilo config file and on the boot
prompt, the option strings of both places are concatenated, the boot
prompt options coming last. That's why there are also options to
restore the default behavior.
@@ -38,21 +40,23 @@ restore the default behavior.
Module configuration options
============================
- If you use the floppy driver as a module, use the following syntax:
-modprobe floppy floppy="<options>"
+If you use the floppy driver as a module, use the following syntax::
-Example:
- modprobe floppy floppy="omnibook messages"
+ modprobe floppy floppy="<options>"
- If you need certain options enabled every time you load the floppy driver,
-you can put:
+Example::
- options floppy floppy="omnibook messages"
+ modprobe floppy floppy="omnibook messages"
+
+If you need certain options enabled every time you load the floppy driver,
+you can put::
+
+ options floppy floppy="omnibook messages"
in a configuration file in /etc/modprobe.d/.
- The floppy driver related options are:
+The floppy driver related options are:
floppy=asus_pci
Sets the bit mask to allow only units 0 and 1. (default)
@@ -70,8 +74,7 @@ in a configuration file in /etc/modprobe.d/.
Tells the floppy driver that you have only one floppy controller.
(default)
- floppy=two_fdc
- floppy=<address>,two_fdc
+ floppy=two_fdc / floppy=<address>,two_fdc
Tells the floppy driver that you have two floppy controllers.
The second floppy controller is assumed to be at <address>.
This option is not needed if the second controller is at address
@@ -84,8 +87,7 @@ in a configuration file in /etc/modprobe.d/.
floppy=0,thinkpad
Tells the floppy driver that you don't have a Thinkpad.
- floppy=omnibook
- floppy=nodma
+ floppy=omnibook / floppy=nodma
Tells the floppy driver not to use Dma for data transfers.
This is needed on HP Omnibooks, which don't have a workable
DMA channel for the floppy driver. This option is also useful
@@ -144,14 +146,16 @@ in a configuration file in /etc/modprobe.d/.
described in the physical CMOS), or if your BIOS uses
non-standard CMOS types. The CMOS types are:
- 0 - Use the value of the physical CMOS
- 1 - 5 1/4 DD
- 2 - 5 1/4 HD
- 3 - 3 1/2 DD
- 4 - 3 1/2 HD
- 5 - 3 1/2 ED
- 6 - 3 1/2 ED
- 16 - unknown or not installed
+ == ==================================
+ 0 Use the value of the physical CMOS
+ 1 5 1/4 DD
+ 2 5 1/4 HD
+ 3 3 1/2 DD
+ 4 3 1/2 HD
+ 5 3 1/2 ED
+ 6 3 1/2 ED
+ 16 unknown or not installed
+ == ==================================
(Note: there are two valid types for ED drives. This is because 5 was
initially chosen to represent floppy *tapes*, and 6 for ED drives.
@@ -162,8 +166,7 @@ in a configuration file in /etc/modprobe.d/.
Print a warning message when an unexpected interrupt is received.
(default)
- floppy=no_unexpected_interrupts
- floppy=L40SX
+ floppy=no_unexpected_interrupts / floppy=L40SX
Don't print a message when an unexpected interrupt is received. This
is needed on IBM L40SX laptops in certain video modes. (There seems
to be an interaction between video and floppy. The unexpected
@@ -199,47 +202,54 @@ in a configuration file in /etc/modprobe.d/.
Sets the floppy DMA channel to <nr> instead of 2.
floppy=slow
- Use PS/2 stepping rate:
- " PS/2 floppies have much slower step rates than regular floppies.
+ Use PS/2 stepping rate::
+
+ PS/2 floppies have much slower step rates than regular floppies.
It's been recommended that take about 1/4 of the default speed
- in some more extreme cases."
+ in some more extreme cases.
Supporting utilities and additional documentation:
==================================================
- Additional parameters of the floppy driver can be configured at
+Additional parameters of the floppy driver can be configured at
runtime. Utilities which do this can be found in the fdutils package.
This package also contains a new version of mtools which allows to
access high capacity disks (up to 1992K on a high density 3 1/2 disk!).
It also contains additional documentation about the floppy driver.
The latest version can be found at fdutils homepage:
+
http://fdutils.linux.lu
The fdutils releases can be found at:
+
http://fdutils.linux.lu/download.html
+
http://www.tux.org/pub/knaff/fdutils/
+
ftp://metalab.unc.edu/pub/Linux/utils/disk-management/
Reporting problems about the floppy driver
==========================================
- If you have a question or a bug report about the floppy driver, mail
+If you have a question or a bug report about the floppy driver, mail
me at Alain.Knaff(a)poboxes.com . If you post to Usenet, preferably use
comp.os.linux.hardware. As the volume in these groups is rather high,
be sure to include the word "floppy" (or "FLOPPY") in the subject
line. If the reported problem happens when mounting floppy disks, be
sure to mention also the type of the filesystem in the subject line.
- Be sure to read the FAQ before mailing/posting any bug reports!
+Be sure to read the FAQ before mailing/posting any bug reports!
- Alain
+Alain
Changelog
=========
-10-30-2004 : Cleanup, updating, add reference to module configuration.
+10-30-2004 :
+ Cleanup, updating, add reference to module configuration.
James Nelson <james4765(a)gmail.com>
-6-3-2000 : Original Document
+6-3-2000 :
+ Original Document
diff --git a/Documentation/blockdev/index.rst b/Documentation/blockdev/index.rst
new file mode 100644
index 000000000000..a9af6ed8b4aa
--- /dev/null
+++ b/Documentation/blockdev/index.rst
@@ -0,0 +1,16 @@
+:orphan:
+
+===========================
+The Linux RapidIO Subsystem
+===========================
+
+.. toctree::
+ :maxdepth: 1
+
+ floppy
+ nbd
+ paride
+ ramdisk
+ zram
+
+ drbd/index
diff --git a/Documentation/blockdev/nbd.txt b/Documentation/blockdev/nbd.rst
similarity index 96%
rename from Documentation/blockdev/nbd.txt
rename to Documentation/blockdev/nbd.rst
index db242ea2bce8..d78dfe559dcf 100644
--- a/Documentation/blockdev/nbd.txt
+++ b/Documentation/blockdev/nbd.rst
@@ -1,3 +1,4 @@
+==================================
Network Block Device (TCP version)
==================================
@@ -28,4 +29,3 @@ max_part
nbds_max
Number of block devices that should be initialized (default: 16).
-
diff --git a/Documentation/blockdev/paride.txt b/Documentation/blockdev/paride.rst
similarity index 81%
rename from Documentation/blockdev/paride.txt
rename to Documentation/blockdev/paride.rst
index ee6717e3771d..87b4278bf314 100644
--- a/Documentation/blockdev/paride.txt
+++ b/Documentation/blockdev/paride.rst
@@ -1,15 +1,17 @@
-
- Linux and parallel port IDE devices
+===================================
+Linux and parallel port IDE devices
+===================================
PARIDE v1.03 (c) 1997-8 Grant Guenther <grant(a)torque.net>
1. Introduction
+===============
Owing to the simplicity and near universality of the parallel port interface
to personal computers, many external devices such as portable hard-disk,
CD-ROM, LS-120 and tape drives use the parallel port to connect to their
host computer. While some devices (notably scanners) use ad-hoc methods
-to pass commands and data through the parallel port interface, most
+to pass commands and data through the parallel port interface, most
external devices are actually identical to an internal model, but with
a parallel-port adapter chip added in. Some of the original parallel port
adapters were little more than mechanisms for multiplexing a SCSI bus.
@@ -28,47 +30,50 @@ were to open up a parallel port CD-ROM drive, for instance, one would
find a standard ATAPI CD-ROM drive, a power supply, and a single adapter
that interconnected a standard PC parallel port cable and a standard
IDE cable. It is usually possible to exchange the CD-ROM device with
-any other device using the IDE interface.
+any other device using the IDE interface.
The document describes the support in Linux for parallel port IDE
devices. It does not cover parallel port SCSI devices, "ditto" tape
-drives or scanners. Many different devices are supported by the
+drives or scanners. Many different devices are supported by the
parallel port IDE subsystem, including:
- MicroSolutions backpack CD-ROM
- MicroSolutions backpack PD/CD
- MicroSolutions backpack hard-drives
- MicroSolutions backpack 8000t tape drive
- SyQuest EZ-135, EZ-230 & SparQ drives
- Avatar Shark
- Imation Superdisk LS-120
- Maxell Superdisk LS-120
- FreeCom Power CD
- Hewlett-Packard 5GB and 8GB tape drives
- Hewlett-Packard 7100 and 7200 CD-RW drives
+ - MicroSolutions backpack CD-ROM
+ - MicroSolutions backpack PD/CD
+ - MicroSolutions backpack hard-drives
+ - MicroSolutions backpack 8000t tape drive
+ - SyQuest EZ-135, EZ-230 & SparQ drives
+ - Avatar Shark
+ - Imation Superdisk LS-120
+ - Maxell Superdisk LS-120
+ - FreeCom Power CD
+ - Hewlett-Packard 5GB and 8GB tape drives
+ - Hewlett-Packard 7100 and 7200 CD-RW drives
as well as most of the clone and no-name products on the market.
To support such a wide range of devices, PARIDE, the parallel port IDE
subsystem, is actually structured in three parts. There is a base
paride module which provides a registry and some common methods for
-accessing the parallel ports. The second component is a set of
-high-level drivers for each of the different types of supported devices:
+accessing the parallel ports. The second component is a set of
+high-level drivers for each of the different types of supported devices:
+ === =============
pd IDE disk
pcd ATAPI CD-ROM
pf ATAPI disk
pt ATAPI tape
pg ATAPI generic
+ === =============
(Currently, the pg driver is only used with CD-R drives).
The high-level drivers function according to the relevant standards.
The third component of PARIDE is a set of low-level protocol drivers
for each of the parallel port IDE adapter chips. Thanks to the interest
-and encouragement of Linux users from many parts of the world,
+and encouragement of Linux users from many parts of the world,
support is available for almost all known adapter protocols:
+ ==== ====================================== ====
aten ATEN EH-100 (HK)
bpck Microsolutions backpack (US)
comm DataStor (old-type) "commuter" adapter (TW)
@@ -83,9 +88,11 @@ support is available for almost all known adapter protocols:
ktti KT Technology PHd adapter (SG)
on20 OnSpec 90c20 (US)
on26 OnSpec 90c26 (US)
+ ==== ====================================== ====
2. Using the PARIDE subsystem
+=============================
While configuring the Linux kernel, you may choose either to build
the PARIDE drivers into your kernel, or to build them as modules.
@@ -94,10 +101,10 @@ In either case, you will need to select "Parallel port IDE device support"
as well as at least one of the high-level drivers and at least one
of the parallel port communication protocols. If you do not know
what kind of parallel port adapter is used in your drive, you could
-begin by checking the file names and any text files on your DOS
+begin by checking the file names and any text files on your DOS
installation floppy. Alternatively, you can look at the markings on
the adapter chip itself. That's usually sufficient to identify the
-correct device.
+correct device.
You can actually select all the protocol modules, and allow the PARIDE
subsystem to try them all for you.
@@ -105,8 +112,9 @@ subsystem to try them all for you.
For the "brand-name" products listed above, here are the protocol
and high-level drivers that you would use:
+ ================ ============ ====== ========
Manufacturer Model Driver Protocol
-
+ ================ ============ ====== ========
MicroSolutions CD-ROM pcd bpck
MicroSolutions PD drive pf bpck
MicroSolutions hard-drive pd bpck
@@ -119,8 +127,10 @@ and high-level drivers that you would use:
Hewlett-Packard 5GB Tape pt epat
Hewlett-Packard 7200e (CD) pcd epat
Hewlett-Packard 7200e (CD-R) pg epat
+ ================ ============ ====== ========
2.1 Configuring built-in drivers
+---------------------------------
We recommend that you get to know how the drivers work and how to
configure them as loadable modules, before attempting to compile a
@@ -143,7 +153,7 @@ protocol identification number and, for some devices, the drive's
chain ID. While your system is booting, a number of messages are
displayed on the console. Like all such messages, they can be
reviewed with the 'dmesg' command. Among those messages will be
-some lines like:
+some lines like::
paride: bpck registered as protocol 0
paride: epat registered as protocol 1
@@ -158,10 +168,10 @@ the last two digits of the drive's serial number (but read MicroSolutions'
documentation about this).
As an example, let's assume that you have a MicroSolutions PD/CD drive
-with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
-EZ-135 connected to the chained port on the PD/CD drive and also an
-Imation Superdisk connected to port 0x278. You could give the following
-options on your boot command:
+with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
+EZ-135 connected to the chained port on the PD/CD drive and also an
+Imation Superdisk connected to port 0x278. You could give the following
+options on your boot command::
pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36
@@ -169,24 +179,27 @@ In the last option, pf.drive1 configures device /dev/pf1, the 0x378
is the parallel port base address, the 0 is the protocol registration
number and 36 is the chain ID.
-Please note: while PARIDE will work both with and without the
+Please note: while PARIDE will work both with and without the
PARPORT parallel port sharing system that is included by the
"Parallel port support" option, PARPORT must be included and enabled
if you want to use chains of devices on the same parallel port.
2.2 Loading and configuring PARIDE as modules
+----------------------------------------------
It is much faster and simpler to get to understand the PARIDE drivers
-if you use them as loadable kernel modules.
+if you use them as loadable kernel modules.
-Note 1: using these drivers with the "kerneld" automatic module loading
-system is not recommended for beginners, and is not documented here.
+Note 1:
+ using these drivers with the "kerneld" automatic module loading
+ system is not recommended for beginners, and is not documented here.
-Note 2: if you build PARPORT support as a loadable module, PARIDE must
-also be built as loadable modules, and PARPORT must be loaded before the
-PARIDE modules.
+Note 2:
+ if you build PARPORT support as a loadable module, PARIDE must
+ also be built as loadable modules, and PARPORT must be loaded before
+ the PARIDE modules.
-To use PARIDE, you must begin by
+To use PARIDE, you must begin by::
insmod paride
@@ -195,8 +208,8 @@ among other tasks.
Then, load as many of the protocol modules as you think you might need.
As you load each module, it will register the protocols that it supports,
-and print a log message to your kernel log file and your console. For
-example:
+and print a log message to your kernel log file and your console. For
+example::
# insmod epat
paride: epat registered as protocol 0
@@ -205,22 +218,22 @@ example:
paride: k971 registered as protocol 2
Finally, you can load high-level drivers for each kind of device that
-you have connected. By default, each driver will autoprobe for a single
+you have connected. By default, each driver will autoprobe for a single
device, but you can support up to four similar devices by giving their
individual co-ordinates when you load the driver.
For example, if you had two no-name CD-ROM drives both using the
KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc
-you could give the following command:
+you could give the following command::
# insmod pcd drive0=0x378,1 drive1=0x3bc,1
For most adapters, giving a port address and protocol number is sufficient,
-but check the source files in linux/drivers/block/paride for more
+but check the source files in linux/drivers/block/paride for more
information. (Hopefully someone will write some man pages one day !).
As another example, here's what happens when PARPORT is installed, and
-a SyQuest EZ-135 is attached to port 0x378:
+a SyQuest EZ-135 is attached to port 0x378::
# insmod paride
paride: version 1.0 installed
@@ -237,46 +250,47 @@ Note that the last line is the output from the generic partition table
scanner - in this case it reports that it has found a disk with one partition.
2.3 Using a PARIDE device
+--------------------------
Once the drivers have been loaded, you can access PARIDE devices in the
same way as their traditional counterparts. You will probably need to
create the device "special files". Here is a simple script that you can
-cut to a file and execute:
+cut to a file and execute::
-#!/bin/bash
-#
-# mkd -- a script to create the device special files for the PARIDE subsystem
-#
-function mkdev {
- mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
-}
-#
-function pd {
- D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) )
- mkdev pd$D b 45 $[ $1 * 16 ]
- for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
- do mkdev pd$D$P b 45 $[ $1 * 16 + $P ]
- done
-}
-#
-cd /dev
-#
-for u in 0 1 2 3 ; do pd $u ; done
-for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
-for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
-for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
-for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done
-for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
-#
-# end of mkd
+ #!/bin/bash
+ #
+ # mkd -- a script to create the device special files for the PARIDE subsystem
+ #
+ function mkdev {
+ mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
+ }
+ #
+ function pd {
+ D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) )
+ mkdev pd$D b 45 $[ $1 * 16 ]
+ for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
+ do mkdev pd$D$P b 45 $[ $1 * 16 + $P ]
+ done
+ }
+ #
+ cd /dev
+ #
+ for u in 0 1 2 3 ; do pd $u ; done
+ for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
+ for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
+ for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
+ for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done
+ for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
+ #
+ # end of mkd
With the device files and drivers in place, you can access PARIDE devices
-like any other Linux device. For example, to mount a CD-ROM in pcd0, use:
+like any other Linux device. For example, to mount a CD-ROM in pcd0, use::
mount /dev/pcd0 /cdrom
If you have a fresh Avatar Shark cartridge, and the drive is pda, you
-might do something like:
+might do something like::
fdisk /dev/pda -- make a new partition table with
partition 1 of type 83
@@ -289,41 +303,46 @@ might do something like:
Devices like the Imation superdisk work in the same way, except that
they do not have a partition table. For example to make a 120MB
-floppy that you could share with a DOS system:
+floppy that you could share with a DOS system::
mkdosfs /dev/pf0
mount /dev/pf0 /mnt
2.4 The pf driver
+------------------
The pf driver is intended for use with parallel port ATAPI disk
devices. The most common devices in this category are PD drives
and LS-120 drives. Traditionally, media for these devices are not
partitioned. Consequently, the pf driver does not support partitioned
-media. This may be changed in a future version of the driver.
+media. This may be changed in a future version of the driver.
2.5 Using the pt driver
+------------------------
The pt driver for parallel port ATAPI tape drives is a minimal driver.
-It does not yet support many of the standard tape ioctl operations.
+It does not yet support many of the standard tape ioctl operations.
For best performance, a block size of 32KB should be used. You will
probably want to set the parallel port delay to 0, if you can.
2.6 Using the pg driver
+------------------------
The pg driver can be used in conjunction with the cdrecord program
to create CD-ROMs. Please get cdrecord version 1.6.1 or later
-from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
-your parallel port should ideally be set to EPP mode, and the "port delay"
-should be set to 0. With those settings it is possible to record at 2x
+from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
+your parallel port should ideally be set to EPP mode, and the "port delay"
+should be set to 0. With those settings it is possible to record at 2x
speed without any buffer underruns. If you cannot get the driver to work
in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.
3. Troubleshooting
+==================
3.1 Use EPP mode if you can
+----------------------------
The most common problems that people report with the PARIDE drivers
concern the parallel port CMOS settings. At this time, none of the
@@ -332,6 +351,7 @@ If you are able to do so, please set your parallel port into EPP mode
using your CMOS setup procedure.
3.2 Check the port delay
+-------------------------
Some parallel ports cannot reliably transfer data at full speed. To
offset the errors, the PARIDE protocol modules introduce a "port
@@ -347,23 +367,25 @@ read the comments at the beginning of the driver source files in
linux/drivers/block/paride.
3.3 Some drives need a printer reset
+-------------------------------------
There appear to be a number of "noname" external drives on the market
that do not always power up correctly. We have noticed this with some
drives based on OnSpec and older Freecom adapters. In these rare cases,
the adapter can often be reinitialised by issuing a "printer reset" on
-the parallel port. As the reset operation is potentially disruptive in
-multiple device environments, the PARIDE drivers will not do it
-automatically. You can however, force a printer reset by doing:
+the parallel port. As the reset operation is potentially disruptive in
+multiple device environments, the PARIDE drivers will not do it
+automatically. You can however, force a printer reset by doing::
insmod lp reset=1
rmmod lp
If you have one of these marginal cases, you should probably build
your paride drivers as modules, and arrange to do the printer reset
-before loading the PARIDE drivers.
+before loading the PARIDE drivers.
3.4 Use the verbose option and dmesg if you need help
+------------------------------------------------------
While a lot of testing has gone into these drivers to make them work
as smoothly as possible, problems will arise. If you do have problems,
@@ -373,7 +395,7 @@ clues, then please make sure that only one drive is hooked to your system,
and that either (a) PARPORT is enabled or (b) no other device driver
is using your parallel port (check in /proc/ioports). Then, load the
appropriate drivers (you can load several protocol modules if you want)
-as in:
+as in::
# insmod paride
# insmod epat
@@ -394,12 +416,14 @@ by e-mail to grant(a)torque.net, or join the linux-parport mailing list
and post your report there.
3.5 For more information or help
+---------------------------------
You can join the linux-parport mailing list by sending a mail message
-to
+to:
+
linux-parport-request(a)torque.net
-with the single word
+with the single word::
subscribe
@@ -412,6 +436,4 @@ have in your mail headers, when sending mail to the list server.
You might also find some useful information on the linux-parport
web pages (although they are not always up to date) at
- http://web.archive.org/web/*/http://www.torque.net/parport/
-
-
+ http://web.archive.org/web/%2E/http://www.torque.net/parport/
diff --git a/Documentation/blockdev/ramdisk.txt b/Documentation/blockdev/ramdisk.rst
similarity index 84%
rename from Documentation/blockdev/ramdisk.txt
rename to Documentation/blockdev/ramdisk.rst
index 501e12e0323e..b7c2268f8dec 100644
--- a/Documentation/blockdev/ramdisk.txt
+++ b/Documentation/blockdev/ramdisk.rst
@@ -1,7 +1,8 @@
+==========================================
Using the RAM disk block device with Linux
-------------------------------------------
+==========================================
-Contents:
+.. Contents:
1) Overview
2) Kernel Command Line Parameters
@@ -42,7 +43,7 @@ rescue floppy disk.
2a) Kernel Command Line Parameters
ramdisk_size=N
- ==============
+ Size of the ramdisk.
This parameter tells the RAM disk driver to set up RAM disks of N k size. The
default is 4096 (4 MB).
@@ -50,16 +51,13 @@ default is 4096 (4 MB).
2b) Module parameters
rd_nr
- =====
- /dev/ramX devices created.
+ /dev/ramX devices created.
max_part
- ========
- Maximum partition number.
+ Maximum partition number.
rd_size
- =======
- See ramdisk_size.
+ See ramdisk_size.
3) Using "rdev -r"
------------------
@@ -71,11 +69,11 @@ to 2 MB (2^11) of where to find the RAM disk (this used to be the size). Bit
prompt/wait sequence is to be given before trying to read the RAM disk. Since
the RAM disk dynamically grows as data is being written into it, a size field
is not required. Bits 11 to 13 are not currently used and may as well be zero.
-These numbers are no magical secrets, as seen below:
+These numbers are no magical secrets, as seen below::
-./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
-./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000
-./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000
+ ./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
+ ./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000
+ ./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000
Consider a typical two floppy disk setup, where you will have the
kernel on disk one, and have already put a RAM disk image onto disk #2.
@@ -92,20 +90,23 @@ sequence so that you have a chance to switch floppy disks.
The command line equivalent is: "prompt_ramdisk=1"
Putting that together gives 2^15 + 2^14 + 0 = 49152 for an rdev word.
-So to create disk one of the set, you would do:
+So to create disk one of the set, you would do::
/usr/src/linux# cat arch/x86/boot/zImage > /dev/fd0
/usr/src/linux# rdev /dev/fd0 /dev/fd0
/usr/src/linux# rdev -r /dev/fd0 49152
-If you make a boot disk that has LILO, then for the above, you would use:
+If you make a boot disk that has LILO, then for the above, you would use::
+
append = "ramdisk_start=0 load_ramdisk=1 prompt_ramdisk=1"
-Since the default start = 0 and the default prompt = 1, you could use:
+
+Since the default start = 0 and the default prompt = 1, you could use::
+
append = "load_ramdisk=1"
4) An Example of Creating a Compressed RAM Disk
-----------------------------------------------
+-----------------------------------------------
To create a RAM disk image, you will need a spare block device to
construct it on. This can be the RAM disk device itself, or an
@@ -120,11 +121,11 @@ a) Decide on the RAM disk size that you want. Say 2 MB for this example.
Create it by writing to the RAM disk device. (This step is not currently
required, but may be in the future.) It is wise to zero out the
area (esp. for disks) so that maximal compression is achieved for
- the unused blocks of the image that you are about to create.
+ the unused blocks of the image that you are about to create::
dd if=/dev/zero of=/dev/ram0 bs=1k count=2048
-b) Make a filesystem on it. Say ext2fs for this example.
+b) Make a filesystem on it. Say ext2fs for this example::
mke2fs -vm0 /dev/ram0 2048
@@ -133,11 +134,11 @@ c) Mount it, copy the files you want to it (eg: /etc/* /dev/* ...)
d) Compress the contents of the RAM disk. The level of compression
will be approximately 50% of the space used by the files. Unused
- space on the RAM disk will compress to almost nothing.
+ space on the RAM disk will compress to almost nothing::
dd if=/dev/ram0 bs=1k count=2048 | gzip -v9 > /tmp/ram_image.gz
-e) Put the kernel onto the floppy
+e) Put the kernel onto the floppy::
dd if=zImage of=/dev/fd0 bs=1k
@@ -146,13 +147,13 @@ f) Put the RAM disk image onto the floppy, after the kernel. Use an offset
(possibly larger) kernel onto the same floppy later without overlapping
the RAM disk image. An offset of 400 kB for kernels about 350 kB in
size would be reasonable. Make sure offset+size of ram_image.gz is
- not larger than the total space on your floppy (usually 1440 kB).
+ not larger than the total space on your floppy (usually 1440 kB)::
dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400
g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
For prompt_ramdisk=1, load_ramdisk=1, ramdisk_start=400, one would
- have 2^15 + 2^14 + 400 = 49552.
+ have 2^15 + 2^14 + 400 = 49552::
rdev /dev/fd0 /dev/fd0
rdev -r /dev/fd0 49552
@@ -160,15 +161,17 @@ g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
That is it. You now have your boot/root compressed RAM disk floppy. Some
users may wish to combine steps (d) and (f) by using a pipe.
---------------------------------------------------------------------------
+
Paul Gortmaker 12/95
Changelog:
----------
-10-22-04 : Updated to reflect changes in command line options, remove
+10-22-04 :
+ Updated to reflect changes in command line options, remove
obsolete references, general cleanup.
James Nelson (james4765(a)gmail.com)
-12-95 : Original Document
+12-95 :
+ Original Document
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.rst
similarity index 76%
rename from Documentation/blockdev/zram.txt
rename to Documentation/blockdev/zram.rst
index 4df0ce271085..2111231c9c0f 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.rst
@@ -1,7 +1,9 @@
+========================================
zram: Compressed RAM based block devices
-----------------------------------------
+========================================
-* Introduction
+Introduction
+============
The zram module creates RAM based block devices named /dev/zram<id>
(<id> = 0, 1, ...). Pages written to these disks are compressed and stored
@@ -12,9 +14,11 @@ use as swap disks, various caches under /var and maybe many more :)
Statistics for individual zram devices are exported through sysfs nodes at
/sys/block/zram<id>/
-* Usage
+Usage
+=====
There are several ways to configure and manage zram device(-s):
+
a) using zram and zram_control sysfs attributes
b) using zramctl utility, provided by util-linux (util-linux(a)vger.kernel.org).
@@ -22,7 +26,7 @@ In this document we will describe only 'manual' zram configuration steps,
IOW, zram and zram_control sysfs attributes.
In order to get a better idea about zramctl please consult util-linux
-documentation, zramctl man-page or `zramctl --help'. Please be informed
+documentation, zramctl man-page or `zramctl --help`. Please be informed
that zram maintainers do not develop/maintain util-linux or zramctl, should
you have any questions please contact util-linux(a)vger.kernel.org
@@ -30,19 +34,23 @@ Following shows a typical sequence of steps for using zram.
WARNING
=======
+
For the sake of simplicity we skip error checking parts in most of the
examples below. However, it is your sole responsibility to handle errors.
zram sysfs attributes always return negative values in case of errors.
The list of possible return codes:
--EBUSY -- an attempt to modify an attribute that cannot be changed once
-the device has been initialised. Please reset device first;
--ENOMEM -- zram was not able to allocate enough memory to fulfil your
-needs;
--EINVAL -- invalid input has been provided.
+
+======== =============================================================
+-EBUSY an attempt to modify an attribute that cannot be changed once
+ the device has been initialised. Please reset device first;
+-ENOMEM zram was not able to allocate enough memory to fulfil your
+ needs;
+-EINVAL invalid input has been provided.
+======== =============================================================
If you use 'echo', the returned value that is changed by 'echo' utility,
-and, in general case, something like:
+and, in general case, something like::
echo 3 > /sys/block/zram0/max_comp_streams
if [ $? -ne 0 ];
@@ -51,7 +59,11 @@ and, in general case, something like:
should suffice.
-1) Load Module:
+1) Load Module
+==============
+
+::
+
modprobe zram num_devices=4
This creates 4 devices: /dev/zram{0,1,2,3}
@@ -59,6 +71,8 @@ num_devices parameter is optional and tells zram how many devices should be
pre-created. Default: 1.
2) Set max number of compression streams
+========================================
+
Regardless the value passed to this attribute, ZRAM will always
allocate multiple compression streams - one per online CPUs - thus
allowing several concurrent compression operations. The number of
@@ -66,16 +80,20 @@ allocated compression streams goes down when some of the CPUs
become offline. There is no single-compression-stream mode anymore,
unless you are running a UP system or has only 1 CPU online.
-To find out how many streams are currently available:
+To find out how many streams are currently available::
+
cat /sys/block/zram0/max_comp_streams
3) Select compression algorithm
+===============================
+
Using comp_algorithm device attribute one can see available and
currently selected (shown in square brackets) compression algorithms,
change selected compression algorithm (once the device is initialised
there is no way to change compression algorithm).
-Examples:
+Examples::
+
#show supported compression algorithms
cat /sys/block/zram0/comp_algorithm
lzo [lz4]
@@ -83,20 +101,23 @@ Examples:
#select lzo compression algorithm
echo lzo > /sys/block/zram0/comp_algorithm
-For the time being, the `comp_algorithm' content does not necessarily
+For the time being, the `comp_algorithm` content does not necessarily
show every compression algorithm supported by the kernel. We keep this
list primarily to simplify device configuration and one can configure
a new device with a compression algorithm that is not listed in
-`comp_algorithm'. The thing is that, internally, ZRAM uses Crypto API
+`comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API
and, if some of the algorithms were built as modules, it's impossible
to list all of them using, for instance, /proc/crypto or any other
method. This, however, has an advantage of permitting the usage of
custom crypto compression modules (implementing S/W or H/W compression).
4) Set Disksize
+===============
+
Set disk size by writing the value to sysfs node 'disksize'.
The value can be either in bytes or you can use mem suffixes.
-Examples:
+Examples::
+
# Initialize /dev/zram0 with 50MB disksize
echo $((50*1024*1024)) > /sys/block/zram0/disksize
@@ -111,10 +132,13 @@ since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.
5) Set memory limit: Optional
+=============================
+
Set memory limit by writing the value to sysfs node 'mem_limit'.
The value can be either in bytes or you can use mem suffixes.
In addition, you could change the value in runtime.
-Examples:
+Examples::
+
# limit /dev/zram0 with 50MB memory
echo $((50*1024*1024)) > /sys/block/zram0/mem_limit
@@ -126,7 +150,11 @@ Examples:
# To disable memory limit
echo 0 > /sys/block/zram0/mem_limit
-6) Activate:
+6) Activate
+===========
+
+::
+
mkswap /dev/zram0
swapon /dev/zram0
@@ -134,6 +162,7 @@ Examples:
mount /dev/zram1 /tmp
7) Add/remove zram devices
+==========================
zram provides a control interface, which enables dynamic (on-demand) device
addition and removal.
@@ -142,37 +171,44 @@ In order to add a new /dev/zramX device, perform read operation on hot_add
attribute. This will return either new device's device id (meaning that you
can use /dev/zram<id>) or error code.
-Example:
+Example::
+
cat /sys/class/zram-control/hot_add
1
To remove the existing /dev/zramX device (where X is a device id)
-execute
+execute::
+
echo X > /sys/class/zram-control/hot_remove
-8) Stats:
+8) Stats
+========
+
Per-device statistics are exported as various nodes under /sys/block/zram<id>/
A brief description of exported device attributes. For more details please
read Documentation/ABI/testing/sysfs-block-zram.
+====================== ====== ===============================================
Name access description
----- ------ -----------
+====================== ====== ===============================================
disksize RW show and set the device's disk size
initstate RO shows the initialization state of the device
reset WO trigger device reset
-mem_used_max WO reset the `mem_used_max' counter (see later)
-mem_limit WO specifies the maximum amount of memory ZRAM can use
- to store the compressed data
-writeback_limit WO specifies the maximum amount of write IO zram can
- write out to backing device as 4KB unit
+mem_used_max WO reset the `mem_used_max` counter (see later)
+mem_limit WO specifies the maximum amount of memory ZRAM can
+ use to store the compressed data
+writeback_limit WO specifies the maximum amount of write IO zram
+ can write out to backing device as 4KB unit
writeback_limit_enable RW show and set writeback_limit feature
-max_comp_streams RW the number of possible concurrent compress operations
+max_comp_streams RW the number of possible concurrent compress
+ operations
comp_algorithm RW show and change the compression algorithm
compact WO trigger memory compaction
debug_stat RO this file is used for zram debugging purposes
backing_dev RW set up backend storage for zram to write out
idle WO mark allocated slot as idle
+====================== ====== ===============================================
User space is advised to use the following files to read the device statistics.
@@ -188,23 +224,31 @@ The stat file represents device's I/O statistics not accounted by block
layer and, thus, not available in zram<id>/stat file. It consists of a
single line of text and contains the following stats separated by
whitespace:
- failed_reads the number of failed reads
- failed_writes the number of failed writes
- invalid_io the number of non-page-size-aligned I/O requests
+
+ ============= =============================================================
+ failed_reads The number of failed reads
+ failed_writes The number of failed writes
+ invalid_io The number of non-page-size-aligned I/O requests
notify_free Depending on device usage scenario it may account
+
a) the number of pages freed because of swap slot free
- notifications or b) the number of pages freed because of
- REQ_OP_DISCARD requests sent by bio. The former ones are
- sent to a swap block device when a swap slot is freed,
- which implies that this disk is being used as a swap disk.
+ notifications
+ b) the number of pages freed because of
+ REQ_OP_DISCARD requests sent by bio. The former ones are
+ sent to a swap block device when a swap slot is freed,
+ which implies that this disk is being used as a swap disk.
+
The latter ones are sent by filesystem mounted with
discard option, whenever some data blocks are getting
discarded.
+ ============= =============================================================
File /sys/block/zram<id>/mm_stat
The stat file represents device's mm statistics. It consists of a single
line of text and contains the following stats separated by whitespace:
+
+ ================ =============================================================
orig_data_size uncompressed size of data stored in this disk.
This excludes same-element-filled pages (same_pages) since
no memory is allocated for them.
@@ -223,58 +267,71 @@ line of text and contains the following stats separated by whitespace:
No memory is allocated for such pages.
pages_compacted the number of pages freed during compaction
huge_pages the number of incompressible pages
+ ================ =============================================================
File /sys/block/zram<id>/bd_stat
The stat file represents device's backing device statistics. It consists of
a single line of text and contains the following stats separated by whitespace:
+
+ ============== =============================================================
bd_count size of data written in backing device.
Unit: 4K bytes
bd_reads the number of reads from backing device
Unit: 4K bytes
bd_writes the number of writes to backing device
Unit: 4K bytes
+ ============== =============================================================
+
+9) Deactivate
+=============
+
+::
-9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
-10) Reset:
- Write any positive value to 'reset' sysfs node
- echo 1 > /sys/block/zram0/reset
- echo 1 > /sys/block/zram1/reset
+10) Reset
+=========
+
+ Write any positive value to 'reset' sysfs node::
+
+ echo 1 > /sys/block/zram0/reset
+ echo 1 > /sys/block/zram1/reset
This frees all the memory allocated for the given device and
resets the disksize to zero. You must set the disksize again
before reusing the device.
-* Optional Feature
+Optional Feature
+================
-= writeback
+writeback
+---------
With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
to backing storage rather than keeping it in memory.
-To use the feature, admin should set up backing device via
+To use the feature, admin should set up backing device via::
- "echo /dev/sda5 > /sys/block/zramX/backing_dev"
+ echo /dev/sda5 > /sys/block/zramX/backing_dev
before disksize setting. It supports only partition at this moment.
-If admin want to use incompressible page writeback, they could do via
+If admin want to use incompressible page writeback, they could do via::
- "echo huge > /sys/block/zramX/write"
+ echo huge > /sys/block/zramX/write
To use idle page writeback, first, user need to declare zram pages
-as idle.
+as idle::
- "echo all > /sys/block/zramX/idle"
+ echo all > /sys/block/zramX/idle
From now on, any pages on zram are idle pages. The idle mark
will be removed until someone request access of the block.
IOW, unless there is access request, those pages are still idle pages.
-Admin can request writeback of those idle pages at right timing via
+Admin can request writeback of those idle pages at right timing via::
- "echo idle > /sys/block/zramX/writeback"
+ echo idle > /sys/block/zramX/writeback
With the command, zram writeback idle pages from memory to the storage.
@@ -285,7 +342,7 @@ to guarantee storage health for entire product life.
To overcome the concern, zram supports "writeback_limit" feature.
The "writeback_limit_enable"'s default value is 0 so that it doesn't limit
any writeback. IOW, if admin want to apply writeback budget, he should
-enable writeback_limit_enable via
+enable writeback_limit_enable via::
$ echo 1 > /sys/block/zramX/writeback_limit_enable
@@ -296,7 +353,7 @@ until admin set the budget via /sys/block/zramX/writeback_limit.
assigned via /sys/block/zramX/writeback_limit is meaninless.)
If admin want to limit writeback as per-day 400M, he could do it
-like below.
+like below::
$ MB_SHIFT=20
$ 4K_SHIFT=12
@@ -305,16 +362,16 @@ like below.
$ echo 1 > /sys/block/zram0/writeback_limit_enable
If admin want to allow further write again once the bugdet is exausted,
-he could do it like below
+he could do it like below::
$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit
-If admin want to see remaining writeback budget since he set,
+If admin want to see remaining writeback budget since he set::
$ cat /sys/block/zramX/writeback_limit
-If admin want to disable writeback limit, he could do
+If admin want to disable writeback limit, he could do::
$ echo 0 > /sys/block/zramX/writeback_limit_enable
@@ -326,25 +383,35 @@ budget in next setting is user's job.
If admin want to measure writeback count in a certain period, he could
know it via /sys/block/zram0/bd_stat's 3rd column.
-= memory tracking
+memory tracking
+===============
With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the
zram block. It could be useful to catch cold or incompressible
pages of the process with*pagemap.
+
If you enable the feature, you could see block state via
-/sys/kernel/debug/zram/zram0/block_state". The output is as follows,
+/sys/kernel/debug/zram/zram0/block_state". The output is as follows::
300 75.033841 .wh.
301 63.806904 s...
302 63.806919 ..hi
-First column is zram's block index.
-Second column is access time since the system was booted
-Third column is state of the block.
-(s: same page
-w: written page to backing store
-h: huge page
-i: idle page)
+First column
+ zram's block index.
+Second column
+ access time since the system was booted
+Third column
+ state of the block:
+
+ s:
+ same page
+ w:
+ written page to backing store
+ h:
+ huge page
+ i:
+ idle page
First line of above example says 300th block is accessed at 75.033841sec
and the block's state is huge so it is written back to the backing
diff --git a/MAINTAINERS b/MAINTAINERS
index b2254bc8e495..163327d6a856 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10961,7 +10961,7 @@ M: Josef Bacik <josef(a)toxicpanda.com>
S: Maintained
L: linux-block(a)vger.kernel.org
L: nbd(a)other.debian.org
-F: Documentation/blockdev/nbd.txt
+F: Documentation/blockdev/nbd.rst
F: drivers/block/nbd.c
F: include/trace/events/nbd.h
F: include/uapi/linux/nbd.h
@@ -11963,7 +11963,7 @@ PARIDE DRIVERS FOR PARALLEL PORT IDE DEVICES
M: Tim Waugh <tim(a)cyberelk.net>
L: linux-parport(a)lists.infradead.org (subscribers-only)
S: Maintained
-F: Documentation/blockdev/paride.txt
+F: Documentation/blockdev/paride.rst
F: drivers/block/paride/
PARISC ARCHITECTURE
@@ -13245,7 +13245,7 @@ F: drivers/net/wireless/ralink/rt2x00/
RAMDISK RAM BLOCK DEVICE DRIVER
M: Jens Axboe <axboe(a)kernel.dk>
S: Maintained
-F: Documentation/blockdev/ramdisk.txt
+F: Documentation/blockdev/ramdisk.rst
F: drivers/block/brd.c
RANCHU VIRTUAL BOARD FOR MIPS
@@ -17584,7 +17584,7 @@ R: Sergey Senozhatsky <sergey.senozhatsky.work(a)gmail.com>
L: linux-kernel(a)vger.kernel.org
S: Maintained
F: drivers/block/zram/
-F: Documentation/blockdev/zram.txt
+F: Documentation/blockdev/zram.rst
ZS DECSTATION Z85C30 SERIAL DRIVER
M: "Maciej W. Rozycki" <macro(a)linux-mips.org>
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 96ec7e0fc1ea..c43690b973d8 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -31,7 +31,7 @@ config BLK_DEV_FD
If you want to use the floppy disk drive(s) of your PC under Linux,
say Y. Information about this driver, especially important for IBM
Thinkpad users, is contained in
- <file:Documentation/blockdev/floppy.txt>.
+ <file:Documentation/blockdev/floppy.rst>.
That file also contains the location of the Floppy driver FAQ as
well as location of the fdutils package used to configure additional
parameters of the driver at run time.
@@ -96,7 +96,7 @@ config PARIDE
your computer's parallel port. Most of them are actually IDE devices
using a parallel port IDE adapter. This option enables the PARIDE
subsystem which contains drivers for many of these external drives.
- Read <file:Documentation/blockdev/paride.txt> for more information.
+ Read <file:Documentation/blockdev/paride.rst> for more information.
If you have said Y to the "Parallel-port support" configuration
option, you may share a single port between your printer and other
@@ -261,7 +261,7 @@ config BLK_DEV_NBD
userland (making server and client physically the same computer,
communicating using the loopback network device).
- Read <file:Documentation/blockdev/nbd.txt> for more information,
+ Read <file:Documentation/blockdev/nbd.rst> for more information,
especially about where to find the server code, which runs in user
space and does not need special kernel support.
@@ -303,7 +303,7 @@ config BLK_DEV_RAM
during the initial install of Linux.
Note that the kernel command line option "ramdisk=XX" is now obsolete.
- For details, read <file:Documentation/blockdev/ramdisk.txt>.
+ For details, read <file:Documentation/blockdev/ramdisk.rst>.
To compile this driver as a module, choose M here: the
module will be called brd. An alias "rd" has been defined
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 9fb9b312ab6b..af02ca97dcd6 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4424,7 +4424,7 @@ static int __init floppy_setup(char *str)
pr_cont("\n");
} else
DPRINT("botched floppy option\n");
- DPRINT("Read Documentation/blockdev/floppy.txt\n");
+ DPRINT("Read Documentation/blockdev/floppy.rst\n");
return 0;
}
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 1ffc64770643..e06b99d54816 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -12,7 +12,7 @@ config ZRAM
It has several use cases, for example: /tmp storage, use as swap
disks and maybe many more.
- See Documentation/blockdev/zram.txt for more information.
+ See Documentation/blockdev/zram.rst for more information.
config ZRAM_WRITEBACK
bool "Write back incompressible or idle page to backing device"
@@ -26,7 +26,7 @@ config ZRAM_WRITEBACK
With /sys/block/zramX/{idle,writeback}, application could ask
idle page's writeback to the backing device to save in memory.
- See Documentation/blockdev/zram.txt for more information.
+ See Documentation/blockdev/zram.rst for more information.
config ZRAM_MEMORY_TRACKING
bool "Track zRam block status"
@@ -36,4 +36,4 @@ config ZRAM_MEMORY_TRACKING
of zRAM. Admin could see the information via
/sys/kernel/debug/zram/zramX/block_state.
- See Documentation/blockdev/zram.txt for more information.
+ See Documentation/blockdev/zram.rst for more information.
diff --git a/tools/testing/selftests/zram/README b/tools/testing/selftests/zram/README
index 7972cc512408..5fa378391d3b 100644
--- a/tools/testing/selftests/zram/README
+++ b/tools/testing/selftests/zram/README
@@ -37,4 +37,4 @@ Commands required for testing:
- mkfs/ mkfs.ext4
For more information please refer:
-kernel-source-tree/Documentation/blockdev/zram.txt
+kernel-source-tree/Documentation/blockdev/zram.rst
--
2.21.0
Hello,
As my very first contribution to the Linux Kernel, I would like to fix six Python 3 syntax errors in the file ./tools/testing/selftests/tpm2/tpm2_tests.py
All six of these errors are of the same form: except ProtocolError, e: To fix the syntax errors, I propose to change the comma (,) to “as” like: except ProtocolError as e:
These changes are important because the current form is compatible with Python 2 but is a syntax error in Python 3. The proposed form is compatible with both Python 2 and Python3. This conversion is required because Python 2 will reach its end of life in less than 200 days.
The kernel contains at least five other files where I am able to detect Python 3 syntax errors but after studying in detail the process of making kernel modification, I believe that it is best to start with tpm2_tests.py because the changes are straightforward and uncontroversial — all issues are of the same form, and I can detect no other issues (such as undefined names) to fix in that file.
If I succeed in getting the modifications to tpm2_tests.py through the review process then I can try the remaining files in turn.
I would be interested to know if this is a worthwhile effort. Is there already another initiative to resolve these issue before yearend?
Thanks for any advise that you can provide, Chris Clauss
Updates to UDP GSO selftests ot optionally stress test CMSG
subsytem, and report the reliability and performance of both
TX Timestamping and ZEROCOPY messages.
Fred Klassen (3):
net/udpgso_bench_tx: options to exercise TX CMSG
net/udpgso_bench.sh add UDP GSO audit tests
net/udpgso_bench.sh test fails on error
tools/testing/selftests/net/udpgso_bench.sh | 52 ++++-
tools/testing/selftests/net/udpgso_bench_tx.c | 291 ++++++++++++++++++++++++--
2 files changed, 327 insertions(+), 16 deletions(-)
--
2.11.0
selftests: ptrace: peeksiginfo failed on x86_64, i386, arm64 and arm.
FAILED on stable rc branches 4.19, 4.14, 4.9 and 4.4.
PASS on mainline, next and 5.1 stable rc branch.
Test output:
------------------
cd /opt/kselftests/mainline/ptrace
./peeksiginfo
Error (peeksiginfo.c:143): Only 0 signals were read
The git bisect show that below commit caused this test to fail.
git bisect bad
5b6b0eac235ef1f915f24eda6d501a754022cbf0 is the first bad commit
commit 5b6b0eac235ef1f915f24eda6d501a754022cbf0
Author: Eric W. Biederman <ebiederm(a)xmission.com>
Date: Tue May 28 18:46:37 2019 -0500
signal/ptrace: Don't leak unitialized kernel memory with PTRACE_PEEK_SIGINFO
commit f6e2aa91a46d2bc79fce9b93a988dbe7655c90c0 upstream.
Recently syzbot in conjunction with KMSAN reported that
ptrace_peek_siginfo can copy an uninitialized siginfo to userspace.
Inspecting ptrace_peek_siginfo confirms this.
The problem is that off when initialized from args.off can be
initialized to a negaive value. At which point the "if (off >= 0)"
test to see if off became negative fails because off started off
negative.
Prevent the core problem by adding a variable found that is only true
if a siginfo is found and copied to a temporary in preparation for
being copied to userspace.
Prevent args.off from being truncated when being assigned to off by
testing that off is <= the maximum possible value of off. Convert off
to an unsigned long so that we should not have to truncate args.off,
we have well defined overflow behavior so if we add another check we
won't risk fighting undefined compiler behavior, and so that we have a
type whose maximum value is easy to test for.
Cc: Andrei Vagin <avagin(a)gmail.com>
Cc: stable(a)vger.kernel.org
Reported-by: syzbot+0d602a1b0d8c95bdf299(a)syzkaller.appspotmail.com
Fixes: 84c751bd4aeb ("ptrace: add ability to retrieve signals
without removing from a queue (v4)")
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
:040000 040000 ff9f3109f210274d0b87851d226c35e7305ce44a
b36de2c855fe2a0b332f145f0966dc1a0304d4bd M kernel
Test case link,
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/too…
Test output log link,
https://lkft.validation.linaro.org/scheduler/job/777223#L1084
Test results comparison on different branches,
https://qa-reports.linaro.org/_/comparetest/?project=22&project=6&project=5…
Best regards
Naresh Kamboju
## TLDR
A quick follow up to yesterday's revision. I got some feedback that I
wanted to incorporate before anyone else read the update. For this
reason, I will leave a TLDR of the biggest changes since v2.
Biggest things to look out for (since v2):
- KUnit core now outputs results in TAP14.
- Heavily reworked tools/testing/kunit/kunit.py
- Changed how parsing works.
- Added testing.
- Greg, Logan, you might want to re-review this.
- Added documentation on how to use KUnit on non-UML kernels. You can
see the docs rendered here[1].
There is still some discussion going on on the [PATCH v2 00/17] thread,
but I wanted to get some of these updates out before they got too stale
(and too difficult for me to keep track of). I hope no one minds.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in under a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitudes faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily solving the classic problem
of difficulty in exercising error handling code.
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[2].
Additionally for convenience, I have applied these patches to a
branch[3].
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.1/v4 branch.
## Changes Since Last Version
As I mentioned above, there are a significant number of updates since
v2:
- Converted KUnit core to print test results in TAP14 format as
suggested by Greg and Frank.
- Heavily reworked tools/testing/kunit/kunit.py
- Changed how parsing works.
- Added testing.
- Added documentation on how to use KUnit on non-UML kernels. You can
see the docs rendered here[1].
- Added a new set of EXPECTs and ASSERTs for pointer comparison.
- Removed more function indirection as suggested by Logan.
- Added a new patch that adds `kunit_try_catch_throw` to objtool's
noreturn list.
- Fixed a number of minorish issues pointed out by Shuah, Masahiro, and
kbuild bot.
Nevertheless, there are only a couple of minor updates since v3:
- Added more context to the changelog on the objtool patch, as per
Peter's request.
- Moved all KUnit documentation under the Documentation/dev-tools/
directory as per Jonathan's suggestion.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#kuni…
[2] https://google.github.io/kunit-docs/third_party/kernel/docs/
[3] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.1/v4
--
2.21.0.1020.gf2820cf01a-goog
From: Alex Shi <alex.shi(a)linux.alibaba.com>
[ Upstream commit 00e38a5d753d7788852f81703db804a60a84c26e ]
The cgroup testing relys on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, some test cases will be failed
as following:
$sudo ./test_core
not ok 1 test_cgcore_internal_process_constraint
ok 2 test_cgcore_top_down_constraint_enable
not ok 3 test_cgcore_top_down_constraint_disable
...
To correct this unexpected failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Claudio Zumbo <claudioz(a)fb.com>
Cc: Claudio <claudiozumbo(a)gmail.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/cgroup/test_core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index be59f9c34ea2..d78f1c5366d3 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -376,6 +376,11 @@ int main(int argc, char *argv[])
if (cg_find_unified_root(root, sizeof(root)))
ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.20.1
From: Alex Shi <alex.shi(a)linux.alibaba.com>
[ Upstream commit f6131f28057d4fd8922599339e701a2504e0f23d ]
The cgroup testing relies on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, all test cases will be failed
as following:
$ sudo ./test_memcontrol
not ok 1 test_memcg_subtree_control
not ok 2 test_memcg_current
ok 3 # skip test_memcg_min
not ok 4 test_memcg_low
not ok 5 test_memcg_high
not ok 6 test_memcg_max
not ok 7 test_memcg_oom_events
ok 8 # skip test_memcg_swap_max
not ok 9 test_memcg_sock
not ok 10 test_memcg_oom_group_leaf_events
not ok 11 test_memcg_oom_group_parent_events
not ok 12 test_memcg_oom_group_score_events
To correct this unexpected failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Jay Kamat <jgkamat(a)fb.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 6f339882a6ca..c19a97dd02d4 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1205,6 +1205,10 @@ int main(int argc, char **argv)
if (cg_read_strstr(root, "cgroup.controllers", "memory"))
ksft_exit_skip("memory controller isn't available\n");
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.20.1
From: Alex Shi <alex.shi(a)linux.alibaba.com>
[ Upstream commit 00e38a5d753d7788852f81703db804a60a84c26e ]
The cgroup testing relys on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, some test cases will be failed
as following:
$sudo ./test_core
not ok 1 test_cgcore_internal_process_constraint
ok 2 test_cgcore_top_down_constraint_enable
not ok 3 test_cgcore_top_down_constraint_disable
...
To correct this unexpected failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Claudio Zumbo <claudioz(a)fb.com>
Cc: Claudio <claudiozumbo(a)gmail.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/cgroup/test_core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index be59f9c34ea2..d78f1c5366d3 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -376,6 +376,11 @@ int main(int argc, char *argv[])
if (cg_find_unified_root(root, sizeof(root)))
ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.20.1
From: Alex Shi <alex.shi(a)linux.alibaba.com>
[ Upstream commit f6131f28057d4fd8922599339e701a2504e0f23d ]
The cgroup testing relies on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, all test cases will be failed
as following:
$ sudo ./test_memcontrol
not ok 1 test_memcg_subtree_control
not ok 2 test_memcg_current
ok 3 # skip test_memcg_min
not ok 4 test_memcg_low
not ok 5 test_memcg_high
not ok 6 test_memcg_max
not ok 7 test_memcg_oom_events
ok 8 # skip test_memcg_swap_max
not ok 9 test_memcg_sock
not ok 10 test_memcg_oom_group_leaf_events
not ok 11 test_memcg_oom_group_parent_events
not ok 12 test_memcg_oom_group_score_events
To correct this unexpected failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Jay Kamat <jgkamat(a)fb.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 6f339882a6ca..c19a97dd02d4 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1205,6 +1205,10 @@ int main(int argc, char **argv)
if (cg_read_strstr(root, "cgroup.controllers", "memory"))
ksft_exit_skip("memory controller isn't available\n");
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.20.1
Updates to UDP GSO selftests ot optionally stress test CMSG
subsytem, and report the reliability and performance of both
TX Timestamping and ZEROCOPY messages.
Fred Klassen (3):
net/udpgso_bench_tx: options to exercise TX CMSG
net/udpgso_bench.sh add UDP GSO audit tests
net/udpgso_bench.sh test fails on error
tools/testing/selftests/net/udpgso_bench.sh | 51 +++-
tools/testing/selftests/net/udpgso_bench_tx.c | 324 ++++++++++++++++++++++++--
2 files changed, 357 insertions(+), 18 deletions(-)
--
2.11.0
Hi,
I've tried to build kselftests for several years now, but I always
find the build broken. Which makes me wonder if the instructions are
broken or something. I follow the instructions in
Documentation/dev-tools/kselftest.rst and start with "make -C
tools/testing/selftests". Here is the errors I get on the upstream
commit 16d72dd4891fecc1e1bf7ca193bb7d5b9804c038:
error: unable to create target: 'No available targets are compatible
with triple "bpf"'
1 error generated.
Makefile:259: recipe for target 'elfdep' failed
Makefile:156: recipe for target 'all' failed
Makefile:106: recipe for target
'/linux/tools/testing/selftests/bpf/libbpf.a' failed
test_execve.c:4:10: fatal error: cap-ng.h: No such file or directory
../lib.mk:138: recipe for target
'/linux/tools/testing/selftests/capabilities/test_execve' failed
gpio-mockup-chardev.c:20:10: fatal error: libmount.h: No such file or directory
<builtin>: recipe for target 'gpio-mockup-chardev' failed
fuse_mnt.c:17:10: fatal error: fuse.h: No such file or directory
../lib.mk:138: recipe for target
'/linux/tools/testing/selftests/memfd/fuse_mnt' failed
collect2: error: ld returned 1 exit status
../lib.mk:138: recipe for target
'/linux/tools/testing/selftests/mqueue/mq_open_tests' failed
reuseport_bpf_numa.c:24:10: fatal error: numa.h: No such file or directory
../lib.mk:138: recipe for target
'/linux/tools/testing/selftests/net/reuseport_bpf_numa' failed
mlock-random-test.c:8:10: fatal error: sys/capability.h: No such file
or directory
../lib.mk:138: recipe for target
'/linux/tools/testing/selftests/vm/mlock-random-test' failed
Here is full log:
https://gist.githubusercontent.com/dvyukov/47430636e160f297b657df5ba2efa82b…
I have libelf-dev installed. Do I need to install something else? Or
run some other command?
Thanks
This is another resend as there has been no feedback since v4.
Seems Jon has been MIA this past cycle so hopefully he appears on the
list soon.
I've addressed the feedback so far and rebased on the latest kernel
and would like this to be considered for merging this cycle.
The only outstanding issue I know of is that it still will not work
with IDT hardware, but ntb_transport doesn't work with IDT hardware
and there is still no sensible common infrastructure to support
ntb_peer_mw_set_trans(). Thus, I decline to consider that complication
in this patchset. However, I'll be happy to review work that adds this
feature in the future.
Also, as the port number and resource index stuff is a bit complicated,
I made a quick out of tree test fixture to ensure it's correct[1]. As
an excerise I also wrote some test code[2] using the upcomming KUnit
feature.
Logan
[1] https://repl.it/repls/ExcitingPresentFile
[2] https://github.com/sbates130272/linux-p2pmem/commits/ntb_kunit
--
Changes in v5:
* Rebased onto v5.2-rc1 (plus the patches in ntb-next)
--
Changes in v4:
* Rebased onto v5.1-rc6 (No changes)
* Numerous grammar and spelling mistakes spotted by Bjorn
--
Changes in v3:
* Rebased onto v5.1-rc1 (Dropped the first two patches as they have
been merged, and cleaned up some minor conflicts in the PCI tree)
* Added a new patch (#3) to calculate logical port numbers that
are port numbers from 0 to (number of ports - 1). This is
then used in ntb_peer_resource_idx() to fix the issues brought
up by Serge.
* Fixed missing __iomem and iowrite calls (as noticed by Serge)
* Added patch 10 which describes ntb_msi_test in the documentation
file (as requested by Serge)
* A couple other minor nits and documentation fixes
--
Changes in v2:
* Cleaned up the changes in intel_irq_remapping.c to make them
less confusing and add a comment. (Per discussion with Jacob and
Joerg)
* Fixed a nit from Bjorn and collected his Ack
* Added a Kconfig dependancy on CONFIG_PCI_MSI for CONFIG_NTB_MSI
as the Kbuild robot hit a random config that didn't build
without it.
* Worked in a callback for when the MSI descriptor changes so that
the clients can resend the new address and data values to the peer.
On my test system this was never necessary, but there may be
other platforms where this can occur. I tested this by hacking
in a path to rewrite the MSI descriptor when I change the cpu
affinity of an IRQ. There's a bit of uncertainty over the latency
of the change, but without hardware this can acctually occur on
we can't test this. This was the result of a discussion with Dave.
--
This patch series adds optional support for using MSI interrupts instead
of NTB doorbells in ntb_transport. This is desirable seeing doorbells on
current hardware are quite slow and therefore switching to MSI interrupts
provides a significant performance gain. On switchtec hardware, a simple
apples-to-apples comparison shows ntb_netdev/iperf numbers going from
3.88Gb/s to 14.1Gb/s when switching to MSI interrupts.
To do this, a couple changes are required outside of the NTB tree:
1) The IOMMU must know to accept MSI requests from aliased bused numbers
seeing NTB hardware typically sends proxied request IDs through
additional requester IDs. The first patch in this series adds support
for the Intel IOMMU. A quirk to add these aliases for switchtec hardware
was already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk
for Microsemi Switchtec NTB") for a description of NTB proxy IDs and why
this is necessary.
2) NTB transport (and other clients) may often need more MSI interrupts
than the NTB hardware actually advertises support for. However, seeing
these interrupts will not be triggered by the hardware but through an
NTB memory window, the hardware does not actually need support or need
to know about them. Therefore we add the concept of Virtual MSI
interrupts which are allocated just like any other MSI interrupt but
are not programmed into the hardware's MSI table. This is done in
Patch 2 and then made use of in Patch 3.
The remaining patches in this series add a library for dealing with MSI
interrupts, a test client and finally support in ntb_transport.
The series is based off of v5.1-rc6 plus the patches in ntb-next.
A git repo is available here:
https://github.com/sbates130272/linux-p2pmem/ ntb_transport_msi_v4
Thanks,
Logan
--
Logan Gunthorpe (10):
PCI/MSI: Support allocating virtual MSI interrupts
PCI/switchtec: Add module parameter to request more interrupts
NTB: Introduce helper functions to calculate logical port number
NTB: Introduce functions to calculate multi-port resource index
NTB: Rename ntb.c to support multiple source files in the module
NTB: Introduce MSI library
NTB: Introduce NTB MSI Test Client
NTB: Add ntb_msi_test support to ntb_test
NTB: Add MSI interrupt support to ntb_transport
NTB: Describe the ntb_msi_test client in the documentation.
Documentation/ntb.txt | 27 ++
drivers/ntb/Kconfig | 11 +
drivers/ntb/Makefile | 3 +
drivers/ntb/{ntb.c => core.c} | 0
drivers/ntb/msi.c | 415 +++++++++++++++++++++++
drivers/ntb/ntb_transport.c | 169 ++++++++-
drivers/ntb/test/Kconfig | 9 +
drivers/ntb/test/Makefile | 1 +
drivers/ntb/test/ntb_msi_test.c | 433 ++++++++++++++++++++++++
drivers/pci/msi.c | 54 ++-
drivers/pci/switch/switchtec.c | 12 +-
include/linux/msi.h | 8 +
include/linux/ntb.h | 196 ++++++++++-
include/linux/pci.h | 9 +
tools/testing/selftests/ntb/ntb_test.sh | 54 ++-
15 files changed, 1386 insertions(+), 15 deletions(-)
rename drivers/ntb/{ntb.c => core.c} (100%)
create mode 100644 drivers/ntb/msi.c
create mode 100644 drivers/ntb/test/ntb_msi_test.c
--
2.20.1
clock_getres in the vDSO library has to preserve the same behaviour
of posix_get_hrtimer_res().
In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on the enablement of the high
resolution timers that can happen either at compile or at run time.
A possible fix is to change the vdso implementation of clock_getres,
keeping a copy of hrtimer_resolution in vdso data and using that
directly [1].
This patchset implements the proposed fix for arm64, powerpc, s390,
nds32 and adds a test to verify that the syscall and the vdso library
implementation of clock_getres return the same values.
Even if these patches are unified by the same topic, there is no
dependency between them, hence they can be merged singularly by each
arch maintainer.
Note: arm64 and nds32 respective fixes have been merged in 5.2-rc1,
hence they have been removed from this series.
[1] https://marc.info/?l=linux-arm-kernel&m=155110381930196&w=2
Changes:
--------
v4:
- Address review comments.
v3:
- Rebased on 5.2-rc1.
- Address review comments.
v2:
- Rebased on 5.1-rc5.
- Addressed review comments.
Cc: Christophe Leroy <christophe.leroy(a)c-s.fr>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Vincenzo Frascino (3):
powerpc: Fix vDSO clock_getres()
s390: Fix vDSO clock_getres()
kselftest: Extend vDSO selftest to clock_getres
arch/powerpc/include/asm/vdso_datapage.h | 2 +
arch/powerpc/kernel/asm-offsets.c | 2 +-
arch/powerpc/kernel/time.c | 1 +
arch/powerpc/kernel/vdso32/gettimeofday.S | 7 +-
arch/powerpc/kernel/vdso64/gettimeofday.S | 7 +-
arch/s390/include/asm/vdso.h | 1 +
arch/s390/kernel/asm-offsets.c | 2 +-
arch/s390/kernel/time.c | 1 +
arch/s390/kernel/vdso32/clock_getres.S | 12 +-
arch/s390/kernel/vdso64/clock_getres.S | 10 +-
tools/testing/selftests/vDSO/Makefile | 2 +
.../selftests/vDSO/vdso_clock_getres.c | 124 ++++++++++++++++++
12 files changed, 155 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c
--
2.21.0
The psock_tpacket test will need to access /proc/kallsyms, this would
require the kernel config CONFIG_KALLSYMS to be enabled first.
Check the file existence to determine if we can run this test.
Signed-off-by: Po-Hsu Lin <po-hsu.lin(a)canonical.com>
---
tools/testing/selftests/net/run_afpackettests | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/net/run_afpackettests b/tools/testing/selftests/net/run_afpackettests
index ea5938e..8b42e8b 100755
--- a/tools/testing/selftests/net/run_afpackettests
+++ b/tools/testing/selftests/net/run_afpackettests
@@ -21,12 +21,16 @@ fi
echo "--------------------"
echo "running psock_tpacket test"
echo "--------------------"
-./in_netns.sh ./psock_tpacket
-if [ $? -ne 0 ]; then
- echo "[FAIL]"
- ret=1
+if [ -f /proc/kallsyms ]; then
+ ./in_netns.sh ./psock_tpacket
+ if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ ret=1
+ else
+ echo "[PASS]"
+ fi
else
- echo "[PASS]"
+ echo "[SKIP] CONFIG_KALLSYMS not enabled"
fi
echo "--------------------"
--
2.7.4
=== Overview
arm64 has a feature called Top Byte Ignore, which allows to embed pointer
tags into the top byte of each pointer. Userspace programs (such as
HWASan, a memory debugging tool [1]) might use this feature and pass
tagged user pointers to the kernel through syscalls or other interfaces.
Right now the kernel is already able to handle user faults with tagged
pointers, due to these patches:
1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
tagged pointer")
2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
pointers")
3. 276e9327 ("arm64: entry: improve data abort handling of tagged
pointers")
This patchset extends tagged pointer support to syscall arguments.
As per the proposed ABI change [3], tagged pointers are only allowed to be
passed to syscalls when they point to memory ranges obtained by anonymous
mmap() or sbrk() (see the patchset [3] for more details).
For non-memory syscalls this is done by untaging user pointers when the
kernel performs pointer checking to find out whether the pointer comes
from userspace (most notably in access_ok). The untagging is done only
when the pointer is being checked, the tag is preserved as the pointer
makes its way through the kernel and stays tagged when the kernel
dereferences the pointer when perfoming user memory accesses.
Memory syscalls (mprotect, etc.) don't do user memory accesses but rather
deal with memory ranges, and untagged pointers are better suited to
describe memory ranges internally. Thus for memory syscalls we untag
pointers completely when they enter the kernel.
=== Other approaches
One of the alternative approaches to untagging that was considered is to
completely strip the pointer tag as the pointer enters the kernel with
some kind of a syscall wrapper, but that won't work with the countless
number of different ioctl calls. With this approach we would need a custom
wrapper for each ioctl variation, which doesn't seem practical.
An alternative approach to untagging pointers in memory syscalls prologues
is to inspead allow tagged pointers to be passed to find_vma() (and other
vma related functions) and untag them there. Unfortunately, a lot of
find_vma() callers then compare or subtract the returned vma start and end
fields against the pointer that was being searched. Thus this approach
would still require changing all find_vma() callers.
=== Testing
The following testing approaches has been taken to find potential issues
with user pointer untagging:
1. Static testing (with sparse [2] and separately with a custom static
analyzer based on Clang) to track casts of __user pointers to integer
types to find places where untagging needs to be done.
2. Static testing with grep to find parts of the kernel that call
find_vma() (and other similar functions) or directly compare against
vm_start/vm_end fields of vma.
3. Static testing with grep to find parts of the kernel that compare
user pointers with TASK_SIZE or other similar consts and macros.
4. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
a modified syzkaller version that passes tagged pointers to the kernel.
Based on the results of the testing the requried patches have been added
to the patchset.
=== Notes
This patchset is meant to be merged together with "arm64 relaxed ABI" [3].
This patchset is a prerequisite for ARM's memory tagging hardware feature
support [4].
This patchset has been merged into the Pixel 2 & 3 kernel trees and is
now being used to enable testing of Pixel phones with HWASan.
Thanks!
[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060…
[3] https://lkml.org/lkml/2019/3/18/819
[4] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architectur…
Changes in v16:
- Moved untagging for memory syscalls from arm64 wrappers back to generic
code.
- Dropped untagging for the following memory syscalls: brk, mmap, munmap;
mremap (only dropped for new_address); mmap_pgoff (not used on arm64);
remap_file_pages (deprecated); shmat, shmdt (work on shared memory).
- Changed kselftest to LD_PRELOAD a shared library that overrides malloc
to return tagged pointers.
- Rebased onto 5.2-rc3.
Changes in v15:
- Removed unnecessary untagging from radeon_ttm_tt_set_userptr().
- Removed unnecessary untagging from amdgpu_ttm_tt_set_userptr().
- Moved untagging to validate_range() in userfaultfd code.
- Moved untagging to ib_uverbs_(re)reg_mr() from mlx4_get_umem_mr().
- Rebased onto 5.1.
Changes in v14:
- Moved untagging for most memory syscalls to an arm64 specific
implementation, instead of doing that in the common code.
- Dropped "net, arm64: untag user pointers in tcp_zerocopy_receive", since
the provided user pointers don't come from an anonymous map and thus are
not covered by this ABI relaxation.
- Dropped "kernel, arm64: untag user pointers in prctl_set_mm*".
- Moved untagging from __check_mem_type() to tee_shm_register().
- Updated untagging for the amdgpu and radeon drivers to cover the MMU
notifier, as suggested by Felix.
- Since this ABI relaxation doesn't actually allow tagged instruction
pointers, dropped the following patches:
- Dropped "tracing, arm64: untag user pointers in seq_print_user_ip".
- Dropped "uprobes, arm64: untag user pointers in find_active_uprobe".
- Dropped "bpf, arm64: untag user pointers in stack_map_get_build_id_offset".
- Rebased onto 5.1-rc7 (37624b58).
Changes in v13:
- Simplified untagging in tcp_zerocopy_receive().
- Looked at find_vma() callers in drivers/, which allowed to identify a
few other places where untagging is needed.
- Added patch "mm, arm64: untag user pointers in get_vaddr_frames".
- Added patch "drm/amdgpu, arm64: untag user pointers in
amdgpu_ttm_tt_get_user_pages".
- Added patch "drm/radeon, arm64: untag user pointers in
radeon_ttm_tt_pin_userptr".
- Added patch "IB/mlx4, arm64: untag user pointers in mlx4_get_umem_mr".
- Added patch "media/v4l2-core, arm64: untag user pointers in
videobuf_dma_contig_user_get".
- Added patch "tee/optee, arm64: untag user pointers in check_mem_type".
- Added patch "vfio/type1, arm64: untag user pointers".
Changes in v12:
- Changed untagging in tcp_zerocopy_receive() to also untag zc->address.
- Fixed untagging in prctl_set_mm* to only untag pointers for vma lookups
and validity checks, but leave them as is for actual user space accesses.
- Updated the link to the v2 of the "arm64 relaxed ABI" patchset [3].
- Dropped the documentation patch, as the "arm64 relaxed ABI" patchset [3]
handles that.
Changes in v11:
- Added "uprobes, arm64: untag user pointers in find_active_uprobe" patch.
- Added "bpf, arm64: untag user pointers in stack_map_get_build_id_offset"
patch.
- Fixed "tracing, arm64: untag user pointers in seq_print_user_ip" to
correctly perform subtration with a tagged addr.
- Moved untagged_addr() from SYSCALL_DEFINE3(mprotect) and
SYSCALL_DEFINE4(pkey_mprotect) to do_mprotect_pkey().
- Moved untagged_addr() definition for other arches from
include/linux/memory.h to include/linux/mm.h.
- Changed untagging in strn*_user() to perform userspace accesses through
tagged pointers.
- Updated the documentation to mention that passing tagged pointers to
memory syscalls is allowed.
- Updated the test to use malloc'ed memory instead of stack memory.
Changes in v10:
- Added "mm, arm64: untag user pointers passed to memory syscalls" back.
- New patch "fs, arm64: untag user pointers in fs/userfaultfd.c".
- New patch "net, arm64: untag user pointers in tcp_zerocopy_receive".
- New patch "kernel, arm64: untag user pointers in prctl_set_mm*".
- New patch "tracing, arm64: untag user pointers in seq_print_user_ip".
Changes in v9:
- Rebased onto 4.20-rc6.
- Used u64 instead of __u64 in type casts in the untagged_addr macro for
arm64.
- Added braces around (addr) in the untagged_addr macro for other arches.
Changes in v8:
- Rebased onto 65102238 (4.20-rc1).
- Added a note to the cover letter on why syscall wrappers/shims that untag
user pointers won't work.
- Added a note to the cover letter that this patchset has been merged into
the Pixel 2 kernel tree.
- Documentation fixes, in particular added a list of syscalls that don't
support tagged user pointers.
Changes in v7:
- Rebased onto 17b57b18 (4.19-rc6).
- Dropped the "arm64: untag user address in __do_user_fault" patch, since
the existing patches already handle user faults properly.
- Dropped the "usb, arm64: untag user addresses in devio" patch, since the
passed pointer must come from a vma and therefore be untagged.
- Dropped the "arm64: annotate user pointers casts detected by sparse"
patch (see the discussion to the replies of the v6 of this patchset).
- Added more context to the cover letter.
- Updated Documentation/arm64/tagged-pointers.txt.
Changes in v6:
- Added annotations for user pointer casts found by sparse.
- Rebased onto 050cdc6c (4.19-rc1+).
Changes in v5:
- Added 3 new patches that add untagging to places found with static
analysis.
- Rebased onto 44c929e1 (4.18-rc8).
Changes in v4:
- Added a selftest for checking that passing tagged pointers to the
kernel succeeds.
- Rebased onto 81e97f013 (4.18-rc1+).
Changes in v3:
- Rebased onto e5c51f30 (4.17-rc6+).
- Added linux-arch@ to the list of recipients.
Changes in v2:
- Rebased onto 2d618bdf (4.17-rc3+).
- Removed excessive untagging in gup.c.
- Removed untagging pointers returned from __uaccess_mask_ptr.
Changes in v1:
- Rebased onto 4.17-rc1.
Changes in RFC v2:
- Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of
defining it for each arch individually.
- Updated Documentation/arm64/tagged-pointers.txt.
- Dropped "mm, arm64: untag user addresses in memory syscalls".
- Rebased onto 3eb2ce82 (4.16-rc7).
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
Andrey Konovalov (16):
uaccess: add untagged_addr definition for other arches
arm64: untag user pointers in access_ok and __uaccess_mask_ptr
lib, arm64: untag user pointers in strn*_user
mm: untag user pointers in do_pages_move
arm64: untag user pointers passed to memory syscalls
mm, arm64: untag user pointers in mm/gup.c
mm, arm64: untag user pointers in get_vaddr_frames
fs, arm64: untag user pointers in copy_mount_options
fs, arm64: untag user pointers in fs/userfaultfd.c
drm/amdgpu, arm64: untag user pointers
drm/radeon, arm64: untag user pointers in radeon_gem_userptr_ioctl
IB, arm64: untag user pointers in ib_uverbs_(re)reg_mr()
media/v4l2-core, arm64: untag user pointers in
videobuf_dma_contig_user_get
tee, arm64: untag user pointers in tee_shm_register
vfio/type1, arm64: untag user pointers in vaddr_get_pfn
selftests, arm64: add a selftest for passing tagged pointers to kernel
arch/arm64/include/asm/uaccess.h | 10 +++--
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +
drivers/gpu/drm/radeon/radeon_gem.c | 2 +
drivers/infiniband/core/uverbs_cmd.c | 4 ++
drivers/media/v4l2-core/videobuf-dma-contig.c | 9 ++--
drivers/tee/tee_shm.c | 1 +
drivers/vfio/vfio_iommu_type1.c | 2 +
fs/namespace.c | 2 +-
fs/userfaultfd.c | 22 +++++-----
include/linux/mm.h | 4 ++
lib/strncpy_from_user.c | 3 +-
lib/strnlen_user.c | 3 +-
mm/frame_vector.c | 2 +
mm/gup.c | 4 ++
mm/madvise.c | 2 +
mm/mempolicy.c | 3 ++
mm/migrate.c | 1 +
mm/mincore.c | 2 +
mm/mlock.c | 4 ++
mm/mprotect.c | 2 +
mm/mremap.c | 2 +
mm/msync.c | 2 +
tools/testing/selftests/arm64/.gitignore | 1 +
tools/testing/selftests/arm64/Makefile | 22 ++++++++++
.../testing/selftests/arm64/run_tags_test.sh | 12 ++++++
tools/testing/selftests/arm64/tags_lib.c | 42 +++++++++++++++++++
tools/testing/selftests/arm64/tags_test.c | 18 ++++++++
28 files changed, 163 insertions(+), 22 deletions(-)
create mode 100644 tools/testing/selftests/arm64/.gitignore
create mode 100644 tools/testing/selftests/arm64/Makefile
create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
create mode 100644 tools/testing/selftests/arm64/tags_lib.c
create mode 100644 tools/testing/selftests/arm64/tags_test.c
--
2.22.0.rc1.311.g5d7573a151-goog
Hi,
here are the rest and the main part of patches to add the support for
loading the compressed firmware files. The patch was slightly
refactored for more easily enhancing for other compression formats (if
anyone wants). Also the selftest patch is included. The
functionality doesn't change from the previous patchset.
thanks,
Takashi
===
Takashi Iwai (2):
firmware: Add support for loading compressed files
selftests: firmware: Add compressed firmware tests
drivers/base/firmware_loader/Kconfig | 18 +++
drivers/base/firmware_loader/firmware.h | 8 +-
drivers/base/firmware_loader/main.c | 147 ++++++++++++++++++++--
tools/testing/selftests/firmware/fw_filesystem.sh | 73 +++++++++--
tools/testing/selftests/firmware/fw_lib.sh | 7 ++
tools/testing/selftests/firmware/fw_run_tests.sh | 1 +
6 files changed, 232 insertions(+), 22 deletions(-)
--
2.16.4
This document is used by multiple architectures:
$ echo $(git grep -l pkey_mprotect arch|cut -d'/' -f 2|sort|uniq)
alpha arm arm64 ia64 m68k microblaze mips parisc powerpc s390 sh sparc x86 xtensa
So, let's move it to the core book and adjust the links to it
accordingly.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung(a)kernel.org>
---
Documentation/core-api/index.rst | 1 +
Documentation/{x86 => core-api}/protection-keys.rst | 0
Documentation/x86/index.rst | 1 -
arch/powerpc/Kconfig | 2 +-
arch/x86/Kconfig | 2 +-
tools/testing/selftests/x86/protection_keys.c | 2 +-
6 files changed, 4 insertions(+), 4 deletions(-)
rename Documentation/{x86 => core-api}/protection-keys.rst (100%)
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ee1bb8983a88..2466a4c51031 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -34,6 +34,7 @@ Core utilities
timekeeping
boot-time-mm
memory-hotplug
+ protection-keys
Interfaces for kernel debugging
diff --git a/Documentation/x86/protection-keys.rst b/Documentation/core-api/protection-keys.rst
similarity index 100%
rename from Documentation/x86/protection-keys.rst
rename to Documentation/core-api/protection-keys.rst
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index ae36fc5fc649..f2de1b2d3ac7 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -19,7 +19,6 @@ x86-specific Documentation
tlb
mtrr
pat
- protection-keys
intel_mpx
amd-memory-encryption
pti
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8c1c636308c8..3b795a0cab62 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -898,7 +898,7 @@ config PPC_MEM_KEYS
page-based protections, but without requiring modification of the
page tables when an application changes protection domains.
- For details, see Documentation/vm/protection-keys.rst
+ For details, see Documentation/core-api/protection-keys.rst
If unsure, say y.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bbbd4d1ba31..d87d53fcd261 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1911,7 +1911,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
page-based protections, but without requiring modification of the
page tables when an application changes protection domains.
- For details, see Documentation/x86/protection-keys.txt
+ For details, see Documentation/core-api/protection-keys.rst
If unsure, say y.
diff --git a/tools/testing/selftests/x86/protection_keys.c b/tools/testing/selftests/x86/protection_keys.c
index 5d546dcdbc80..480995bceefa 100644
--- a/tools/testing/selftests/x86/protection_keys.c
+++ b/tools/testing/selftests/x86/protection_keys.c
@@ -1,6 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
/*
- * Tests x86 Memory Protection Keys (see Documentation/x86/protection-keys.txt)
+ * Tests x86 Memory Protection Keys (see Documentation/core-api/protection-keys.rst)
*
* There are examples in here of:
* * how to set protection keys on memory
--
2.21.0
Hi Linus,
Please pull the following second Kselftest fixes update for
Linux 5.2 rc4.
This Kselftest second fixes update for Linux 5.2-rc4 consists of a
single fix for vm test build failure regression when it is built by
itself.
I found this while I was sanity checking the first fixes update for
Linux 5.2. Would like to get this into rc4.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit bc2cce3f2ebcae02aa4bb29e3436bf75ee674c32:
selftests: vm: install test_vmalloc.sh for run_vmtests (2019-05-30
08:32:57 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-5.2-rc4-2
for you to fetch changes up to e2e88325f4bcaea51f454723971f7b5ee0e1aa80:
selftests: vm: Fix test build failure when built by itself
(2019-06-05 16:05:40 -0600)
----------------------------------------------------------------
linux-kselftest-5.2-rc4-2
This Kselftest second fixes update for Linux 5.2-rc4 consists of a single
fix for vm test build failure regression when it is built by itself.
----------------------------------------------------------------
Shuah Khan (1):
selftests: vm: Fix test build failure when built by itself
tools/testing/selftests/vm/Makefile | 4 ----
1 file changed, 4 deletions(-)
----------------------------------------------------------------
Use udf as the guard instruction for the restartable sequence abort
handler.
Previously, the chosen signature was not a valid instruction, based
on the assumption that it could always sit in a literal pool. However,
there are compilation environments in which literal pools are not
availble, for instance execute-only code. Therefore, we need to
choose a signature value that is also a valid instruction.
Handle compiling with -mbig-endian on ARMv6+, which generates binaries
with mixed code vs data endianness (little endian code, big endian
data).
Else mismatch between code endianness for the generated signatures and
data endianness for the RSEQ_SIG parameter passed to the rseq
registration will trigger application segmentation faults when the
kernel try to abort rseq critical sections.
Prior to ARMv6, -mbig-endian generates big-endian code and data, so
endianness should not be reversed in that case.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
CC: Peter Zijlstra <peterz(a)infradead.org>
CC: Thomas Gleixner <tglx(a)linutronix.de>
CC: Joel Fernandes <joelaf(a)google.com>
CC: Catalin Marinas <catalin.marinas(a)arm.com>
CC: Dave Watson <davejwatson(a)fb.com>
CC: Will Deacon <will.deacon(a)arm.com>
CC: Shuah Khan <shuah(a)kernel.org>
CC: Andi Kleen <andi(a)firstfloor.org>
CC: linux-kselftest(a)vger.kernel.org
CC: "H . Peter Anvin" <hpa(a)zytor.com>
CC: Chris Lameter <cl(a)linux.com>
CC: Russell King <linux(a)arm.linux.org.uk>
CC: Michael Kerrisk <mtk.manpages(a)gmail.com>
CC: "Paul E . McKenney" <paulmck(a)linux.vnet.ibm.com>
CC: Paul Turner <pjt(a)google.com>
CC: Boqun Feng <boqun.feng(a)gmail.com>
CC: Josh Triplett <josh(a)joshtriplett.org>
CC: Steven Rostedt <rostedt(a)goodmis.org>
CC: Ben Maurer <bmaurer(a)fb.com>
CC: linux-api(a)vger.kernel.org
CC: Andy Lutomirski <luto(a)amacapital.net>
CC: Andrew Morton <akpm(a)linux-foundation.org>
CC: Linus Torvalds <torvalds(a)linux-foundation.org>
---
tools/testing/selftests/rseq/rseq-arm.h | 52 +++++++++++++++++++++++++++++++--
1 file changed, 50 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
index 5f262c54364f..e8ccfc37d685 100644
--- a/tools/testing/selftests/rseq/rseq-arm.h
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -5,7 +5,54 @@
* (C) Copyright 2016-2018 - Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
*/
-#define RSEQ_SIG 0x53053053
+/*
+ * RSEQ_SIG uses the udf A32 instruction with an uncommon immediate operand
+ * value 0x5de3. This traps if user-space reaches this instruction by mistake,
+ * and the uncommon operand ensures the kernel does not move the instruction
+ * pointer to attacker-controlled code on rseq abort.
+ *
+ * The instruction pattern in the A32 instruction set is:
+ *
+ * e7f5def3 udf #24035 ; 0x5de3
+ *
+ * This translates to the following instruction pattern in the T16 instruction
+ * set:
+ *
+ * little endian:
+ * def3 udf #243 ; 0xf3
+ * e7f5 b.n <7f5>
+ *
+ * pre-ARMv6 big endian code:
+ * e7f5 b.n <7f5>
+ * def3 udf #243 ; 0xf3
+ *
+ * ARMv6+ -mbig-endian generates mixed endianness code vs data: little-endian
+ * code and big-endian data. Ensure the RSEQ_SIG data signature matches code
+ * endianness. Prior to ARMv6, -mbig-endian generates big-endian code and data
+ * (which match), so there is no need to reverse the endianness of the data
+ * representation of the signature. However, the choice between BE32 and BE8
+ * is done by the linker, so we cannot know whether code and data endianness
+ * will be mixed before the linker is invoked.
+ */
+
+#define RSEQ_SIG_CODE 0xe7f5def3
+
+#ifndef __ASSEMBLER__
+
+#define RSEQ_SIG_DATA \
+ ({ \
+ int sig; \
+ asm volatile ( "b 2f\n\t" \
+ "1: .inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
+ "2:\n\t" \
+ "ldr %[sig], 1b\n\t" \
+ : [sig] "=r" (sig)); \
+ sig; \
+ })
+
+#define RSEQ_SIG RSEQ_SIG_DATA
+
+#endif
#define rseq_smp_mb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
#define rseq_smp_rmb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
@@ -78,7 +125,8 @@ do { \
__rseq_str(table_label) ":\n\t" \
".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
- ".word " __rseq_str(RSEQ_SIG) "\n\t" \
+ ".arm\n\t" \
+ ".inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
__rseq_str(label) ":\n\t" \
teardown \
"b %l[" __rseq_str(abort_label) "]\n\t"
--
2.11.0
From: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
[ Upstream commit 82ce6eb1dd13fd12e449b2ee2c2ec051e6f52c43 ]
A test for the basic NAT functionality uses ip command which needs veth
device. There is a condition where the kernel support for veth is not
compiled into the kernel and the test script breaks. This patch contains
code for reasonable error display and correct code exit.
Signed-off-by: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
Acked-by: Florian Westphal <fw(a)strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/netfilter/nft_nat.sh | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/netfilter/nft_nat.sh b/tools/testing/selftests/netfilter/nft_nat.sh
index 8ec76681605c..f25f72a75cf3 100755
--- a/tools/testing/selftests/netfilter/nft_nat.sh
+++ b/tools/testing/selftests/netfilter/nft_nat.sh
@@ -23,7 +23,11 @@ ip netns add ns0
ip netns add ns1
ip netns add ns2
-ip link add veth0 netns ns0 type veth peer name eth0 netns ns1
+ip link add veth0 netns ns0 type veth peer name eth0 netns ns1 > /dev/null 2>&1
+if [ $? -ne 0 ];then
+ echo "SKIP: No virtual ethernet pair device support in kernel"
+ exit $ksft_skip
+fi
ip link add veth1 netns ns0 type veth peer name eth0 netns ns2
ip -net ns0 link set lo up
--
2.20.1
From: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
[ Upstream commit 82ce6eb1dd13fd12e449b2ee2c2ec051e6f52c43 ]
A test for the basic NAT functionality uses ip command which needs veth
device. There is a condition where the kernel support for veth is not
compiled into the kernel and the test script breaks. This patch contains
code for reasonable error display and correct code exit.
Signed-off-by: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
Acked-by: Florian Westphal <fw(a)strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/netfilter/nft_nat.sh | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/netfilter/nft_nat.sh b/tools/testing/selftests/netfilter/nft_nat.sh
index 8ec76681605c..f25f72a75cf3 100755
--- a/tools/testing/selftests/netfilter/nft_nat.sh
+++ b/tools/testing/selftests/netfilter/nft_nat.sh
@@ -23,7 +23,11 @@ ip netns add ns0
ip netns add ns1
ip netns add ns2
-ip link add veth0 netns ns0 type veth peer name eth0 netns ns1
+ip link add veth0 netns ns0 type veth peer name eth0 netns ns1 > /dev/null 2>&1
+if [ $? -ne 0 ];then
+ echo "SKIP: No virtual ethernet pair device support in kernel"
+ exit $ksft_skip
+fi
ip link add veth1 netns ns0 type veth peer name eth0 netns ns2
ip -net ns0 link set lo up
--
2.20.1
From: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
[ Upstream commit 82ce6eb1dd13fd12e449b2ee2c2ec051e6f52c43 ]
A test for the basic NAT functionality uses ip command which needs veth
device. There is a condition where the kernel support for veth is not
compiled into the kernel and the test script breaks. This patch contains
code for reasonable error display and correct code exit.
Signed-off-by: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
Acked-by: Florian Westphal <fw(a)strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/netfilter/nft_nat.sh | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/netfilter/nft_nat.sh b/tools/testing/selftests/netfilter/nft_nat.sh
index 8ec76681605c..f25f72a75cf3 100755
--- a/tools/testing/selftests/netfilter/nft_nat.sh
+++ b/tools/testing/selftests/netfilter/nft_nat.sh
@@ -23,7 +23,11 @@ ip netns add ns0
ip netns add ns1
ip netns add ns2
-ip link add veth0 netns ns0 type veth peer name eth0 netns ns1
+ip link add veth0 netns ns0 type veth peer name eth0 netns ns1 > /dev/null 2>&1
+if [ $? -ne 0 ];then
+ echo "SKIP: No virtual ethernet pair device support in kernel"
+ exit $ksft_skip
+fi
ip link add veth1 netns ns0 type veth peer name eth0 netns ns2
ip -net ns0 link set lo up
--
2.20.1
From: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
[ Upstream commit 82ce6eb1dd13fd12e449b2ee2c2ec051e6f52c43 ]
A test for the basic NAT functionality uses ip command which needs veth
device. There is a condition where the kernel support for veth is not
compiled into the kernel and the test script breaks. This patch contains
code for reasonable error display and correct code exit.
Signed-off-by: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
Acked-by: Florian Westphal <fw(a)strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/netfilter/nft_nat.sh | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/netfilter/nft_nat.sh b/tools/testing/selftests/netfilter/nft_nat.sh
index 3194007cf8d1..a59c5fd4e987 100755
--- a/tools/testing/selftests/netfilter/nft_nat.sh
+++ b/tools/testing/selftests/netfilter/nft_nat.sh
@@ -23,7 +23,11 @@ ip netns add ns0
ip netns add ns1
ip netns add ns2
-ip link add veth0 netns ns0 type veth peer name eth0 netns ns1
+ip link add veth0 netns ns0 type veth peer name eth0 netns ns1 > /dev/null 2>&1
+if [ $? -ne 0 ];then
+ echo "SKIP: No virtual ethernet pair device support in kernel"
+ exit $ksft_skip
+fi
ip link add veth1 netns ns0 type veth peer name eth0 netns ns2
ip -net ns0 link set lo up
--
2.20.1
Hi Linus,
Please pull the following Kselftest fixes update for Linux 5.2-rc4.
This Kselftest update for Linux 5.2-rc4 consists of
- Alex Shi's fixes to cgroup tests
- Alakesh Haloi's fix to userfaultfd compiler warning
- Naresh Kamboju's fix to vm install to include test script to run
the test.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit eff82a263b5cfa3427fd9dbfedd96da94fdc9f19:
selftests: rtc: rtctest: specify timeouts (2019-05-24 13:39:58 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-5.2-rc4
for you to fetch changes up to bc2cce3f2ebcae02aa4bb29e3436bf75ee674c32:
selftests: vm: install test_vmalloc.sh for run_vmtests (2019-05-30
08:32:57 -0600)
----------------------------------------------------------------
linux-kselftest-5.2-rc4
This Kselftest update for Linux 5.2-rc4 consists of
- Alex Shi's fixes to cgroup tests
- Alakesh Haloi's fix to userfaultfd compiler warning
- Naresh Kamboju's fix to vm install to include test script to run
the test.
----------------------------------------------------------------
Alakesh Haloi (1):
userfaultfd: selftest: fix compiler warning
Alex Shi (3):
kselftest/cgroup: fix unexpected testing failure on test_memcontrol
kselftest/cgroup: fix unexpected testing failure on test_core
kselftest/cgroup: fix incorrect test_core skip
Naresh Kamboju (1):
selftests: vm: install test_vmalloc.sh for run_vmtests
tools/testing/selftests/cgroup/test_core.c | 7 ++++++-
tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++++
tools/testing/selftests/vm/Makefile | 2 ++
tools/testing/selftests/vm/userfaultfd.c | 2 +-
4 files changed, 13 insertions(+), 2 deletions(-)
----------------------------------------------------------------
Do you see this failure at your end?
Our environment is build on host and install them on target device and
run on Device under test (DUT).
Did i miss any kernel config fragments ?
bpf: test_sock_fields_ #
# libbpf Error in bpf_create_map_xattr(sk_pkt_out_cnt)Invalid
argument(22). Retrying without BTF.
Error: in_bpf_create_map_xattr(sk_pkt_out_cnt)Invalid #
# libbpf failed to create map (name 'sk_pkt_out_cnt') Invalid argument
failed: to_create #
# libbpf failed to load object 'test_sock_fields_kern.o'
failed: to_load #
# main(439)FAILbpf_prog_load_xattr() err-22
err-22: _ #
[FAIL] 22 selftests bpf test_sock_fields
selftests: bpf_test_sock_fields [FAIL]
Full test log,
https://qa-reports.linaro.org/lkft/linux-next-oe/build/next-20190605/testru…
Config:
http://snapshots.linaro.org/openembedded/lkft/lkft/sumo/intel-corei7-64/lkf…
Test results for comparison,
https://qa-reports.linaro.org/lkft/linux-next-oe/tests/kselftest/bpf_test_s…
Best regards
Naresh Kamboju
From: Hangbin Liu <liuhangbin(a)gmail.com>
[ Upstream commit fc82d93e57e3d41f79eff19031588b262fc3d0b6 ]
The IPv4 testing address are all in 192.51.100.0 subnet. It doesn't make
sense to set a 198.51.100.1 local address. Should be a typo.
Fixes: 65b2b4939a64 ("selftests: net: initial fib rule tests")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
Reviewed-by: David Ahern <dsahern(a)gmail.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/net/fib_rule_tests.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/fib_rule_tests.sh b/tools/testing/selftests/net/fib_rule_tests.sh
index d84193bdc307..dbd90ca73e44 100755
--- a/tools/testing/selftests/net/fib_rule_tests.sh
+++ b/tools/testing/selftests/net/fib_rule_tests.sh
@@ -55,7 +55,7 @@ setup()
$IP link add dummy0 type dummy
$IP link set dev dummy0 up
- $IP address add 198.51.100.1/24 dev dummy0
+ $IP address add 192.51.100.1/24 dev dummy0
$IP -6 address add 2001:db8:1::1/64 dev dummy0
set +e
--
2.20.1
From: Hangbin Liu <liuhangbin(a)gmail.com>
[ Upstream commit fc82d93e57e3d41f79eff19031588b262fc3d0b6 ]
The IPv4 testing address are all in 192.51.100.0 subnet. It doesn't make
sense to set a 198.51.100.1 local address. Should be a typo.
Fixes: 65b2b4939a64 ("selftests: net: initial fib rule tests")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
Reviewed-by: David Ahern <dsahern(a)gmail.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/net/fib_rule_tests.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/fib_rule_tests.sh b/tools/testing/selftests/net/fib_rule_tests.sh
index 4b7e107865bf..1ba069967fa2 100755
--- a/tools/testing/selftests/net/fib_rule_tests.sh
+++ b/tools/testing/selftests/net/fib_rule_tests.sh
@@ -55,7 +55,7 @@ setup()
$IP link add dummy0 type dummy
$IP link set dev dummy0 up
- $IP address add 198.51.100.1/24 dev dummy0
+ $IP address add 192.51.100.1/24 dev dummy0
$IP -6 address add 2001:db8:1::1/64 dev dummy0
set +e
--
2.20.1
This patch series enables the KVM selftests for s390x. As a first
test, the sync_regs from x86 has been adapted to s390x, and after
a fix for KVM_CAP_MAX_VCPU_ID on s390x, the kvm_create_max_vcpus
is now enabled here, too.
Please note that the ucall() interface is not used yet - since
s390x neither has PIO nor MMIO, this needs some more work first
before it becomes usable (we likely should use a DIAG hypercall
here, which is what the sync_reg test is currently using, too...
I started working on that topic, but did not finish that work
yet, so I decided to not include it yet).
RFC -> v1:
- Rebase, needed to add the first patch for vcpu_nested_state_get/set
- Added patch to introduce VM_MODE_DEFAULT macro
- Improved/cleaned up the code in processor.c
- Added patch to fix KVM_CAP_MAX_VCPU_ID on s390x
- Added patch to enable the kvm_create_max_vcpus on s390x and aarch64
Andrew Jones (1):
kvm: selftests: aarch64: fix default vm mode
Thomas Huth (8):
KVM: selftests: Wrap vcpu_nested_state_get/set functions with x86
guard
KVM: selftests: Guard struct kvm_vcpu_events with
__KVM_HAVE_VCPU_EVENTS
KVM: selftests: Introduce a VM_MODE_DEFAULT macro for the default bits
KVM: selftests: Align memory region addresses to 1M on s390x
KVM: selftests: Add processor code for s390x
KVM: selftests: Add the sync_regs test for s390x
KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID
KVM: selftests: Move kvm_create_max_vcpus test to generic code
MAINTAINERS | 2 +
arch/mips/kvm/mips.c | 3 +
arch/powerpc/kvm/powerpc.c | 3 +
arch/s390/kvm/kvm-s390.c | 1 +
arch/x86/kvm/x86.c | 3 +
tools/testing/selftests/kvm/Makefile | 7 +-
.../testing/selftests/kvm/include/kvm_util.h | 10 +
.../selftests/kvm/include/s390x/processor.h | 22 ++
.../kvm/{x86_64 => }/kvm_create_max_vcpus.c | 3 +-
.../selftests/kvm/lib/aarch64/processor.c | 2 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 25 +-
.../selftests/kvm/lib/s390x/processor.c | 286 ++++++++++++++++++
.../selftests/kvm/lib/x86_64/processor.c | 2 +-
.../selftests/kvm/s390x/sync_regs_test.c | 151 +++++++++
virt/kvm/arm/arm.c | 3 +
virt/kvm/kvm_main.c | 2 -
16 files changed, 514 insertions(+), 11 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/s390x/processor.h
rename tools/testing/selftests/kvm/{x86_64 => }/kvm_create_max_vcpus.c (93%)
create mode 100644 tools/testing/selftests/kvm/lib/s390x/processor.c
create mode 100644 tools/testing/selftests/kvm/s390x/sync_regs_test.c
--
2.21.0
Current implementation of kselftest-merge only finds config files that
are one level deep using `$(srctree)/tools/testing/selftests/*/config`.
Often, config files are added in nested directories, and do not get
picked up by kselftest-merge.
Use `find` to catch all config files under
`$(srctree)/tools/testing/selftests` instead.
Signed-off-by: Dan Rue <dan.rue(a)linaro.org>
---
Makefile | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/Makefile b/Makefile
index a45f84a7e811..e99e7f9484af 100644
--- a/Makefile
+++ b/Makefile
@@ -1228,9 +1228,8 @@ kselftest-clean:
PHONY += kselftest-merge
kselftest-merge:
$(if $(wildcard $(objtree)/.config),, $(error No .config exists, config your kernel first!))
- $(Q)$(CONFIG_SHELL) $(srctree)/scripts/kconfig/merge_config.sh \
- -m $(objtree)/.config \
- $(srctree)/tools/testing/selftests/*/config
+ $(Q)find $(srctree)/tools/testing/selftests -name config | \
+ xargs $(srctree)/scripts/kconfig/merge_config.sh -m $(objtree)/.config
+$(Q)$(MAKE) -f $(srctree)/Makefile olddefconfig
# ---------------------------------------------------------------------------
--
2.21.0
This document is used by multiple architectures:
$ echo $(git grep -l pkey_mprotect arch|cut -d'/' -f 2|sort|uniq)
alpha arm arm64 ia64 m68k microblaze mips parisc powerpc s390 sh sparc x86 xtensa
So, let's move it to the core book and adjust the links to it
accordingly.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung(a)kernel.org>
---
Documentation/core-api/index.rst | 1 +
Documentation/{x86 => core-api}/protection-keys.rst | 0
Documentation/x86/index.rst | 1 -
arch/powerpc/Kconfig | 2 +-
arch/x86/Kconfig | 2 +-
tools/testing/selftests/x86/protection_keys.c | 2 +-
6 files changed, 4 insertions(+), 4 deletions(-)
rename Documentation/{x86 => core-api}/protection-keys.rst (100%)
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ee1bb8983a88..2466a4c51031 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -34,6 +34,7 @@ Core utilities
timekeeping
boot-time-mm
memory-hotplug
+ protection-keys
Interfaces for kernel debugging
diff --git a/Documentation/x86/protection-keys.rst b/Documentation/core-api/protection-keys.rst
similarity index 100%
rename from Documentation/x86/protection-keys.rst
rename to Documentation/core-api/protection-keys.rst
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index ae36fc5fc649..f2de1b2d3ac7 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -19,7 +19,6 @@ x86-specific Documentation
tlb
mtrr
pat
- protection-keys
intel_mpx
amd-memory-encryption
pti
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1120ff8ac715..e437aa3e78b4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -898,7 +898,7 @@ config PPC_MEM_KEYS
page-based protections, but without requiring modification of the
page tables when an application changes protection domains.
- For details, see Documentation/vm/protection-keys.rst
+ For details, see Documentation/core-api/protection-keys.rst
If unsure, say y.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 23de3b9da480..61244bdb886f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1911,7 +1911,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
page-based protections, but without requiring modification of the
page tables when an application changes protection domains.
- For details, see Documentation/x86/protection-keys.txt
+ For details, see Documentation/core-api/protection-keys.rst
If unsure, say y.
diff --git a/tools/testing/selftests/x86/protection_keys.c b/tools/testing/selftests/x86/protection_keys.c
index 5d546dcdbc80..480995bceefa 100644
--- a/tools/testing/selftests/x86/protection_keys.c
+++ b/tools/testing/selftests/x86/protection_keys.c
@@ -1,6 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
/*
- * Tests x86 Memory Protection Keys (see Documentation/x86/protection-keys.txt)
+ * Tests x86 Memory Protection Keys (see Documentation/core-api/protection-keys.rst)
*
* There are examples in here of:
* * how to set protection keys on memory
--
2.21.0
=== Overview
arm64 has a feature called Top Byte Ignore, which allows to embed pointer
tags into the top byte of each pointer. Userspace programs (such as
HWASan, a memory debugging tool [1]) might use this feature and pass
tagged user pointers to the kernel through syscalls or other interfaces.
Right now the kernel is already able to handle user faults with tagged
pointers, due to these patches:
1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
tagged pointer")
2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
pointers")
3. 276e9327 ("arm64: entry: improve data abort handling of tagged
pointers")
This patchset extends tagged pointer support to syscall arguments.
As per the proposed ABI change [3], tagged pointers are only allowed to be
passed to syscalls when they point to memory ranges obtained by anonymous
mmap() or sbrk() (see the patchset [3] for more details).
For non-memory syscalls this is done by untaging user pointers when the
kernel performs pointer checking to find out whether the pointer comes
from userspace (most notably in access_ok). The untagging is done only
when the pointer is being checked, the tag is preserved as the pointer
makes its way through the kernel and stays tagged when the kernel
dereferences the pointer when perfoming user memory accesses.
Memory syscalls (mmap, mprotect, etc.) don't do user memory accesses but
rather deal with memory ranges, and untagged pointers are better suited to
describe memory ranges internally. Thus for memory syscalls we untag
pointers completely when they enter the kernel.
=== Other approaches
One of the alternative approaches to untagging that was considered is to
completely strip the pointer tag as the pointer enters the kernel with
some kind of a syscall wrapper, but that won't work with the countless
number of different ioctl calls. With this approach we would need a custom
wrapper for each ioctl variation, which doesn't seem practical.
An alternative approach to untagging pointers in memory syscalls prologues
is to inspead allow tagged pointers to be passed to find_vma() (and other
vma related functions) and untag them there. Unfortunately, a lot of
find_vma() callers then compare or subtract the returned vma start and end
fields against the pointer that was being searched. Thus this approach
would still require changing all find_vma() callers.
=== Testing
The following testing approaches has been taken to find potential issues
with user pointer untagging:
1. Static testing (with sparse [2] and separately with a custom static
analyzer based on Clang) to track casts of __user pointers to integer
types to find places where untagging needs to be done.
2. Static testing with grep to find parts of the kernel that call
find_vma() (and other similar functions) or directly compare against
vm_start/vm_end fields of vma.
3. Static testing with grep to find parts of the kernel that compare
user pointers with TASK_SIZE or other similar consts and macros.
4. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
a modified syzkaller version that passes tagged pointers to the kernel.
Based on the results of the testing the requried patches have been added
to the patchset.
=== Notes
This patchset is meant to be merged together with "arm64 relaxed ABI" [3].
This patchset is a prerequisite for ARM's memory tagging hardware feature
support [4].
This patchset has been merged into the Pixel 2 & 3 kernel trees and is
now being used to enable testing of Pixel phones with HWASan.
Thanks!
[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060…
[3] https://lkml.org/lkml/2019/3/18/819
[4] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architectur…
Changes in v15:
- Removed unnecessary untagging from radeon_ttm_tt_set_userptr().
- Removed unnecessary untagging from amdgpu_ttm_tt_set_userptr().
- Moved untagging to validate_range() in userfaultfd code.
- Moved untagging to ib_uverbs_(re)reg_mr() from mlx4_get_umem_mr().
- Rebased onto 5.1.
Changes in v14:
- Moved untagging for most memory syscalls to an arm64 specific
implementation, instead of doing that in the common code.
- Dropped "net, arm64: untag user pointers in tcp_zerocopy_receive", since
the provided user pointers don't come from an anonymous map and thus are
not covered by this ABI relaxation.
- Dropped "kernel, arm64: untag user pointers in prctl_set_mm*".
- Moved untagging from __check_mem_type() to tee_shm_register().
- Updated untagging for the amdgpu and radeon drivers to cover the MMU
notifier, as suggested by Felix.
- Since this ABI relaxation doesn't actually allow tagged instruction
pointers, dropped the following patches:
- Dropped "tracing, arm64: untag user pointers in seq_print_user_ip".
- Dropped "uprobes, arm64: untag user pointers in find_active_uprobe".
- Dropped "bpf, arm64: untag user pointers in stack_map_get_build_id_offset".
- Rebased onto 5.1-rc7 (37624b58).
Changes in v13:
- Simplified untagging in tcp_zerocopy_receive().
- Looked at find_vma() callers in drivers/, which allowed to identify a
few other places where untagging is needed.
- Added patch "mm, arm64: untag user pointers in get_vaddr_frames".
- Added patch "drm/amdgpu, arm64: untag user pointers in
amdgpu_ttm_tt_get_user_pages".
- Added patch "drm/radeon, arm64: untag user pointers in
radeon_ttm_tt_pin_userptr".
- Added patch "IB/mlx4, arm64: untag user pointers in mlx4_get_umem_mr".
- Added patch "media/v4l2-core, arm64: untag user pointers in
videobuf_dma_contig_user_get".
- Added patch "tee/optee, arm64: untag user pointers in check_mem_type".
- Added patch "vfio/type1, arm64: untag user pointers".
Changes in v12:
- Changed untagging in tcp_zerocopy_receive() to also untag zc->address.
- Fixed untagging in prctl_set_mm* to only untag pointers for vma lookups
and validity checks, but leave them as is for actual user space accesses.
- Updated the link to the v2 of the "arm64 relaxed ABI" patchset [3].
- Dropped the documentation patch, as the "arm64 relaxed ABI" patchset [3]
handles that.
Changes in v11:
- Added "uprobes, arm64: untag user pointers in find_active_uprobe" patch.
- Added "bpf, arm64: untag user pointers in stack_map_get_build_id_offset"
patch.
- Fixed "tracing, arm64: untag user pointers in seq_print_user_ip" to
correctly perform subtration with a tagged addr.
- Moved untagged_addr() from SYSCALL_DEFINE3(mprotect) and
SYSCALL_DEFINE4(pkey_mprotect) to do_mprotect_pkey().
- Moved untagged_addr() definition for other arches from
include/linux/memory.h to include/linux/mm.h.
- Changed untagging in strn*_user() to perform userspace accesses through
tagged pointers.
- Updated the documentation to mention that passing tagged pointers to
memory syscalls is allowed.
- Updated the test to use malloc'ed memory instead of stack memory.
Changes in v10:
- Added "mm, arm64: untag user pointers passed to memory syscalls" back.
- New patch "fs, arm64: untag user pointers in fs/userfaultfd.c".
- New patch "net, arm64: untag user pointers in tcp_zerocopy_receive".
- New patch "kernel, arm64: untag user pointers in prctl_set_mm*".
- New patch "tracing, arm64: untag user pointers in seq_print_user_ip".
Changes in v9:
- Rebased onto 4.20-rc6.
- Used u64 instead of __u64 in type casts in the untagged_addr macro for
arm64.
- Added braces around (addr) in the untagged_addr macro for other arches.
Changes in v8:
- Rebased onto 65102238 (4.20-rc1).
- Added a note to the cover letter on why syscall wrappers/shims that untag
user pointers won't work.
- Added a note to the cover letter that this patchset has been merged into
the Pixel 2 kernel tree.
- Documentation fixes, in particular added a list of syscalls that don't
support tagged user pointers.
Changes in v7:
- Rebased onto 17b57b18 (4.19-rc6).
- Dropped the "arm64: untag user address in __do_user_fault" patch, since
the existing patches already handle user faults properly.
- Dropped the "usb, arm64: untag user addresses in devio" patch, since the
passed pointer must come from a vma and therefore be untagged.
- Dropped the "arm64: annotate user pointers casts detected by sparse"
patch (see the discussion to the replies of the v6 of this patchset).
- Added more context to the cover letter.
- Updated Documentation/arm64/tagged-pointers.txt.
Changes in v6:
- Added annotations for user pointer casts found by sparse.
- Rebased onto 050cdc6c (4.19-rc1+).
Changes in v5:
- Added 3 new patches that add untagging to places found with static
analysis.
- Rebased onto 44c929e1 (4.18-rc8).
Changes in v4:
- Added a selftest for checking that passing tagged pointers to the
kernel succeeds.
- Rebased onto 81e97f013 (4.18-rc1+).
Changes in v3:
- Rebased onto e5c51f30 (4.17-rc6+).
- Added linux-arch@ to the list of recipients.
Changes in v2:
- Rebased onto 2d618bdf (4.17-rc3+).
- Removed excessive untagging in gup.c.
- Removed untagging pointers returned from __uaccess_mask_ptr.
Changes in v1:
- Rebased onto 4.17-rc1.
Changes in RFC v2:
- Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of
defining it for each arch individually.
- Updated Documentation/arm64/tagged-pointers.txt.
- Dropped "mm, arm64: untag user addresses in memory syscalls".
- Rebased onto 3eb2ce82 (4.16-rc7).
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
Andrey Konovalov (17):
uaccess: add untagged_addr definition for other arches
arm64: untag user pointers in access_ok and __uaccess_mask_ptr
lib, arm64: untag user pointers in strn*_user
mm: add ksys_ wrappers to memory syscalls
arms64: untag user pointers passed to memory syscalls
mm: untag user pointers in do_pages_move
mm, arm64: untag user pointers in mm/gup.c
mm, arm64: untag user pointers in get_vaddr_frames
fs, arm64: untag user pointers in copy_mount_options
fs, arm64: untag user pointers in fs/userfaultfd.c
drm/amdgpu, arm64: untag user pointers
drm/radeon, arm64: untag user pointers in radeon_gem_userptr_ioctl
IB, arm64: untag user pointers in ib_uverbs_(re)reg_mr()
media/v4l2-core, arm64: untag user pointers in
videobuf_dma_contig_user_get
tee, arm64: untag user pointers in tee_shm_register
vfio/type1, arm64: untag user pointers in vaddr_get_pfn
selftests, arm64: add a selftest for passing tagged pointers to kernel
arch/arm64/include/asm/uaccess.h | 10 +-
arch/arm64/kernel/sys.c | 128 ++++++++++++++++-
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +
drivers/gpu/drm/radeon/radeon_gem.c | 2 +
drivers/infiniband/core/uverbs_cmd.c | 4 +
drivers/media/v4l2-core/videobuf-dma-contig.c | 9 +-
drivers/tee/tee_shm.c | 1 +
drivers/vfio/vfio_iommu_type1.c | 2 +
fs/namespace.c | 2 +-
fs/userfaultfd.c | 22 +--
include/linux/mm.h | 4 +
include/linux/syscalls.h | 22 +++
ipc/shm.c | 7 +-
lib/strncpy_from_user.c | 3 +-
lib/strnlen_user.c | 3 +-
mm/frame_vector.c | 2 +
mm/gup.c | 4 +
mm/madvise.c | 129 +++++++++---------
mm/mempolicy.c | 21 ++-
mm/migrate.c | 1 +
mm/mincore.c | 57 ++++----
mm/mlock.c | 20 ++-
mm/mmap.c | 30 +++-
mm/mprotect.c | 6 +-
mm/mremap.c | 27 ++--
mm/msync.c | 35 +++--
tools/testing/selftests/arm64/.gitignore | 1 +
tools/testing/selftests/arm64/Makefile | 11 ++
.../testing/selftests/arm64/run_tags_test.sh | 12 ++
tools/testing/selftests/arm64/tags_test.c | 21 +++
31 files changed, 436 insertions(+), 164 deletions(-)
create mode 100644 tools/testing/selftests/arm64/.gitignore
create mode 100644 tools/testing/selftests/arm64/Makefile
create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
create mode 100644 tools/testing/selftests/arm64/tags_test.c
--
2.21.0.1020.gf2820cf01a-goog
Fixes an issue where TX Timestamps are not arriving on the error queue
when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
Fred Klassen (1):
net/udp_gso: Allow TX timestamp with UDP GSO
net/ipv4/udp_offload.c | 5 +++++
1 file changed, 5 insertions(+)
--
2.11.0
clock_getres in the vDSO library has to preserve the same behaviour
of posix_get_hrtimer_res().
In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on the enablement of the high
resolution timers that can happen either at compile or at run time.
A possible fix is to change the vdso implementation of clock_getres,
keeping a copy of hrtimer_resolution in vdso data and using that
directly [1].
This patchset implements the proposed fix for arm64, powerpc, s390,
nds32 and adds a test to verify that the syscall and the vdso library
implementation of clock_getres return the same values.
Even if these patches are unified by the same topic, there is no
dependency between them, hence they can be merged singularly by each
arch maintainer.
Note: arm64 and nds32 respective fixes have been merged in 5.2-rc1,
hence they have been removed from this series.
[1] https://marc.info/?l=linux-arm-kernel&m=155110381930196&w=2
Changes:
--------
v5:
- Rebased on 5.2-rc2
- Fixed a bug in kselftest.
v4:
- Address review comments.
v3:
- Rebased on 5.2-rc1.
- Address review comments.
v2:
- Rebased on 5.1-rc5.
- Addressed review comments.
Cc: Christophe Leroy <christophe.leroy(a)c-s.fr>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Vincenzo Frascino (3):
powerpc: Fix vDSO clock_getres()
s390: Fix vDSO clock_getres()
kselftest: Extend vDSO selftest to clock_getres
arch/powerpc/include/asm/vdso_datapage.h | 2 +
arch/powerpc/kernel/asm-offsets.c | 2 +-
arch/powerpc/kernel/time.c | 1 +
arch/powerpc/kernel/vdso32/gettimeofday.S | 7 +-
arch/powerpc/kernel/vdso64/gettimeofday.S | 7 +-
arch/s390/include/asm/vdso.h | 1 +
arch/s390/kernel/asm-offsets.c | 2 +-
arch/s390/kernel/time.c | 1 +
arch/s390/kernel/vdso32/clock_getres.S | 12 +-
arch/s390/kernel/vdso64/clock_getres.S | 10 +-
tools/testing/selftests/vDSO/Makefile | 2 +
.../selftests/vDSO/vdso_clock_getres.c | 124 ++++++++++++++++++
12 files changed, 155 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c
--
2.21.0
Hi Linus,
Please pull the following Kselftest fixes update for Linux 5.2-rc3.
This Kselftest update for Linux 5.2-rc3 consists of
- Alexandre Belloni's fixes to rtc regressions introduced in kselftest
Makefile test run output refactoring work from Kees Cook.
- ftrace test checkbashisms fixes from Masami Hiramatsu
As a note, it is an usual and expected outcome to see a few regressions
when Kselftest run-time scripts are enhanced. No surprises there.
I am glad we are finding these problems early on in the rc cycle.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit a188339ca5a396acc588e5851ed7e19f66b0ebd9:
Linux 5.2-rc1 (2019-05-19 15:47:09 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-5.2-rc3
for you to fetch changes up to eff82a263b5cfa3427fd9dbfedd96da94fdc9f19:
selftests: rtc: rtctest: specify timeouts (2019-05-24 13:39:58 -0600)
----------------------------------------------------------------
linux-kselftest-5.2-rc3
This Kselftest update for Linux 5.2-rc3 consists of
- Alexandre Belloni's fixes to rtc regressions introduced in kselftest
Makefile test run output refactoring work from Kees Cook.
- ftrace test checkbashisms fixes from Masami Hiramatsu
----------------------------------------------------------------
Alexandre Belloni (2):
selftests/harness: Allow test to configure timeout
selftests: rtc: rtctest: specify timeouts
Masami Hiramatsu (2):
selftests/ftrace: Make a script checkbashisms clean
selftests/ftrace: Add checkbashisms meta-testcase
tools/testing/selftests/ftrace/ftracetest | 1 +
.../selftests/ftrace/test.d/kprobe/kprobe_ftrace.tc | 2 +-
.../selftests/ftrace/test.d/selftest/bashisms.tc | 21
+++++++++++++++++++++
tools/testing/selftests/kselftest_harness.h | 17 ++++++++++++-----
tools/testing/selftests/rtc/rtctest.c | 6 +++---
5 files changed, 38 insertions(+), 9 deletions(-)
create mode 100644
tools/testing/selftests/ftrace/test.d/selftest/bashisms.tc
----------------------------------------------------------------
Patch changelog:
v8:
* Default to O_CLOEXEC to match other new fd-creation syscalls
(users can always disable O_CLOEXEC afterwards). [Christian]
* Implement magic-link restrictions based on their mode. This is
done through a series of masks and is designed to avoid breaking
users -- most users don't have chained O_PATH fd re-opens.
* Add O_EMPTYPATH which allows for fd re-opening without needing
procfs. This would help some users of fd re-opening, and with the
changes to magic-link permissions we now have the right semantics
for such a flag.
* Add selftests for resolveat(2), O_EMPTYPATH, and the magic-link
mode semantics.
v7:
* Remove execveat(2) support for these flags since it might
result in some pretty hairy security issues with setuid binaries.
There are other avenues we can go down to solve the issues with
CVE-2019-5736. [Jann]
* Reserve an additional bit in resolveat(2) for the eXecute access
mode if we end up implementing it.
v6:
* Drop O_* flags API to the new LOOKUP_ path scoping bits and
instead introduce resolveat(2) as an alternative method of
obtaining an O_PATH. The justification for this is included in
patch 6 (though switching back to O_* flags is trivial).
v5:
* In response to CVE-2019-5736 (one of the vectors showed that
open(2)+fexec(3) cannot be used to scope binfmt_script's implicit
open_exec()), AT_* flags have been re-added and are now piped
through to binfmt_script (and other binfmt_* that use open_exec)
but are only supported for execveat(2) for now.
v4:
* Remove AT_* flag reservations, as they require more discussion.
* Switch to path_is_under() over __d_path() for breakout checking.
* Make O_XDEV no longer block openat("/tmp", "/", O_XDEV) -- dirfd
is now ignored for absolute paths to match other flags.
* Improve the dirfd_path_init() refactor and move it to a separate
commit.
* Remove reference to Linux-capsicum.
* Switch "proclink" name to magic-link.
v3: [resend]
v2:
* Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with
some semi-aggressive __d_path checking (see patch 3).
* Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
hopes they can be re-enabled once safe.
* Removed the selftests as they will be reimplemented as xfstests.
* Removed stat(2) support, since you can already get it through
O_PATH and fstatat(2).
The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags. However, instead of
being an openat(2) flag it is provided through a new syscall
resolveat(2) which provides an alternative way to get an O_PATH file
descriptor (the reasoning for doing this is included in patch 6). The
following new LOOKUP_ flags are added:
* LOOKUP_XDEV blocks all mountpoint crossings (upwards, downwards, or
through absolute links). Absolute pathnames alone in openat(2) do
not trigger this.
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do resolveat(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using ".." (see patch 4 for more detail).
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks as in
patch 4, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[6] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having O_THISROOT (such as
CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
And further, several semantics of file descriptor "re-opening" are now
changed to prevent attacks like CVE-2019-5736 by restricting how
magic-links can be resolved (based on their mode). This required some
other changes to the semantics of the modes of O_PATH file descriptor's
associated /proc/self/fd magic-links. resolveat(2) has the ability to
further restrict re-opening of its own O_PATH fds, so that users can
make even better use of this feature.
Finally, O_EMPTYPATH was added so that users can do /proc/self/fd-style
re-opening without depending on procfs. The new restricted semantics for
magic-links are applied here too.
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm(a)xmission.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Christian Brauner <christian(a)brauner.io>
Cc: David Drysdale <drysdale(a)google.com>
Cc: Tycho Andersen <tycho(a)tycho.ws>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: <containers(a)lists.linux-foundation.org>
Cc: <linux-fsdevel(a)vger.kernel.org>
Cc: <linux-api(a)vger.kernel.org>
[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin
Aleksa Sarai (10):
namei: obey trailing magic-link DAC permissions
procfs: switch magic-link modes to be more sane
open: O_EMPTYPATH: procfs-less file descriptor re-opening
namei: split out nd->dfd handling to dirfd_path_init
namei: O_BENEATH-style path resolution flags
namei: LOOKUP_IN_ROOT: chroot-like path resolution
namei: aggressively check for nd->root escape on ".." resolution
namei: resolveat(2) syscall
kselftest: save-and-restore errno to allow for %m formatting
selftests: add resolveat(2) selftests
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 3 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/fcntl.c | 2 +-
fs/internal.h | 1 +
fs/namei.c | 397 ++++++++++++++---
fs/open.c | 10 +-
fs/proc/base.c | 20 +-
fs/proc/fd.c | 16 +-
fs/proc/namespaces.c | 2 +-
include/linux/fcntl.h | 10 +-
include/linux/fs.h | 4 +
include/linux/namei.h | 8 +
include/linux/types.h | 2 +-
include/uapi/asm-generic/fcntl.h | 5 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 10 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kselftest.h | 15 +
tools/testing/selftests/resolveat/.gitignore | 1 +
tools/testing/selftests/resolveat/Makefile | 6 +
tools/testing/selftests/resolveat/helpers.h | 195 +++++++++
.../selftests/resolveat/linkmode_test.c | 306 ++++++++++++++
.../selftests/resolveat/resolveat_test.c | 400 ++++++++++++++++++
39 files changed, 1350 insertions(+), 87 deletions(-)
create mode 100644 tools/testing/selftests/resolveat/.gitignore
create mode 100644 tools/testing/selftests/resolveat/Makefile
create mode 100644 tools/testing/selftests/resolveat/helpers.h
create mode 100644 tools/testing/selftests/resolveat/linkmode_test.c
create mode 100644 tools/testing/selftests/resolveat/resolveat_test.c
--
2.21.0
Fixes an issue where TX Timestamps are not arriving on the error queue
when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
Also this updates the UDP GSO selftests to optionally stress test
this condition, and report the reliability and performance of both
TX Timestamping and ZEROCOPY messages.
Fred Klassen (4):
net/udp_gso: Allow TX timestamp with UDP GSO
net/udpgso_bench_tx: options to exercise TX CMSG
net/udpgso_bench_tx: fix sendmmsg on unconnected socket
net/udpgso_bench_tx: audit error queue
net/ipv4/udp_offload.c | 4 +
tools/testing/selftests/net/udpgso_bench_tx.c | 376 ++++++++++++++++++++++++--
2 files changed, 358 insertions(+), 22 deletions(-)
--
2.11.0
Hey,
This is v2 of this patchset.
In accordance with some comments There's a cond_resched() added to the
close loop similar to what is done for close_files().
A common helper pick_file() for __close_fd() and __close_range() has
been split out. This allows to only make a cond_resched() call when
filp_close() has been called similar to what is done in close_files().
Maybe that's not worth it. Jann mentioned that cond_resched() looks
rather cheap.
So it maybe that we could simply do:
while (fd <= max_fd) {
__close(files, fd++);
cond_resched();
}
I also added a missing test for close_range(fd, fd, 0).
Thanks!
Christian
Christian Brauner (2):
open: add close_range()
tests: add close_range() tests
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/file.c | 62 +++++++-
fs/open.c | 20 +++
include/linux/fdtable.h | 2 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/core/.gitignore | 1 +
tools/testing/selftests/core/Makefile | 6 +
.../testing/selftests/core/close_range_test.c | 142 ++++++++++++++++++
26 files changed, 249 insertions(+), 9 deletions(-)
create mode 100644 tools/testing/selftests/core/.gitignore
create mode 100644 tools/testing/selftests/core/Makefile
create mode 100644 tools/testing/selftests/core/close_range_test.c
--
2.21.0
This adds basic tests for the new close_range() syscall.
- test that no invalid flags can be passed
- test that a range of file descriptors is correctly closed
- test that a range of file descriptors is correctly closed if there there
are already closed file descriptors in the range
- test that max_fd is correctly capped to the current fdtable maximum
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Dmitry V. Levin <ldv(a)altlinux.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Florian Weimer <fweimer(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-api(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
---
v1: unchanged
v2:
- Christian Brauner <christian(a)brauner.io>:
- verify that close_range() correctly closes a single file descriptor
v3:
- Christian Brauner <christian(a)brauner.io>:
- add missing Cc for Shuah
- add missing Cc for linux-kselftest
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/core/.gitignore | 1 +
tools/testing/selftests/core/Makefile | 6 +
.../testing/selftests/core/close_range_test.c | 142 ++++++++++++++++++
4 files changed, 150 insertions(+)
create mode 100644 tools/testing/selftests/core/.gitignore
create mode 100644 tools/testing/selftests/core/Makefile
create mode 100644 tools/testing/selftests/core/close_range_test.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..06e57fabbff9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -4,6 +4,7 @@ TARGETS += bpf
TARGETS += breakpoints
TARGETS += capabilities
TARGETS += cgroup
+TARGETS += core
TARGETS += cpufreq
TARGETS += cpu-hotplug
TARGETS += drivers/dma-buf
diff --git a/tools/testing/selftests/core/.gitignore b/tools/testing/selftests/core/.gitignore
new file mode 100644
index 000000000000..6e6712ce5817
--- /dev/null
+++ b/tools/testing/selftests/core/.gitignore
@@ -0,0 +1 @@
+close_range_test
diff --git a/tools/testing/selftests/core/Makefile b/tools/testing/selftests/core/Makefile
new file mode 100644
index 000000000000..de3ae68aa345
--- /dev/null
+++ b/tools/testing/selftests/core/Makefile
@@ -0,0 +1,6 @@
+CFLAGS += -g -I../../../../usr/include/ -I../../../../include
+
+TEST_GEN_PROGS := close_range_test
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
new file mode 100644
index 000000000000..d6e6079d3d53
--- /dev/null
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/kernel.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+
+static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
+ unsigned int flags)
+{
+ return syscall(__NR_close_range, fd, max_fd, flags);
+}
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+int main(int argc, char **argv)
+{
+ const char *test_name = "close_range";
+ int i, ret;
+ int open_fds[101];
+ int fd_max, fd_mid, fd_min;
+
+ ksft_set_plan(9);
+
+ for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
+ int fd;
+
+ fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
+ if (fd < 0) {
+ if (errno == ENOENT)
+ ksft_exit_skip(
+ "%s test: skipping test since /dev/null does not exist\n",
+ test_name);
+
+ ksft_exit_fail_msg(
+ "%s test: %s - failed to open /dev/null\n",
+ strerror(errno), test_name);
+ }
+
+ open_fds[i] = fd;
+ }
+
+ fd_min = open_fds[0];
+ fd_max = open_fds[99];
+
+ ret = sys_close_range(fd_min, fd_max, 1);
+ if (!ret)
+ ksft_exit_fail_msg(
+ "%s test: managed to pass invalid flag value\n",
+ test_name);
+ ksft_test_result_pass("do not allow invalid flag values for close_range()\n");
+
+ fd_mid = open_fds[50];
+ ret = sys_close_range(fd_min, fd_mid, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from %d to %d\n",
+ test_name, fd_min, fd_mid);
+ ksft_test_result_pass("close_range() from %d to %d\n", fd_min, fd_mid);
+
+ for (i = 0; i <= 50; i++) {
+ ret = fcntl(open_fds[i], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from %d to %d\n",
+ test_name, fd_min, fd_mid);
+ }
+ ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_min, fd_mid);
+
+ /* create a couple of gaps */
+ close(57);
+ close(78);
+ close(81);
+ close(82);
+ close(84);
+ close(90);
+
+ fd_mid = open_fds[51];
+ /* Choose slightly lower limit and leave some fds for a later test */
+ fd_max = open_fds[92];
+ ret = sys_close_range(fd_mid, fd_max, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+ for (i = 51; i <= 92; i++) {
+ ret = fcntl(open_fds[i], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ }
+ ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+ fd_mid = open_fds[93];
+ fd_max = open_fds[99];
+ /* test that the kernel caps and still closes all fds */
+ ret = sys_close_range(fd_mid, UINT_MAX, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+ for (i = 93; i < 100; i++) {
+ ret = fcntl(open_fds[i], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ }
+ ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+ ret = sys_close_range(open_fds[100], open_fds[100], 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close single file descriptor\n",
+ test_name);
+ ksft_test_result_pass("close_range() closed single file descriptor\n");
+
+ ret = fcntl(open_fds[100], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close single file descriptor\n",
+ test_name);
+ ksft_test_result_pass("fcntl() verify closed single file descriptor\n");
+
+ return ksft_exit_pass();
+}
--
2.21.0
kselftests exposed a problem in the s390 handling for memory slots.
Right now we only do proper memory slot handling for creation of new
memory slots. Neither MOVE, nor DELETION are handled properly. Let us
implement those.
Signed-off-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
---
arch/s390/kvm/kvm-s390.c | 35 +++++++++++++++++++++--------------
1 file changed, 21 insertions(+), 14 deletions(-)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 871d2e99b156..6ec0685ab2c7 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4525,21 +4525,28 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
- int rc;
+ int rc = 0;
- /* If the basics of the memslot do not change, we do not want
- * to update the gmap. Every update causes several unnecessary
- * segment translation exceptions. This is usually handled just
- * fine by the normal fault handler + gmap, but it will also
- * cause faults on the prefix page of running guest CPUs.
- */
- if (old->userspace_addr == mem->userspace_addr &&
- old->base_gfn * PAGE_SIZE == mem->guest_phys_addr &&
- old->npages * PAGE_SIZE == mem->memory_size)
- return;
-
- rc = gmap_map_segment(kvm->arch.gmap, mem->userspace_addr,
- mem->guest_phys_addr, mem->memory_size);
+ switch (change) {
+ case KVM_MR_DELETE:
+ rc = gmap_unmap_segment(kvm->arch.gmap, old->base_gfn * PAGE_SIZE,
+ old->npages * PAGE_SIZE);
+ break;
+ case KVM_MR_MOVE:
+ rc = gmap_unmap_segment(kvm->arch.gmap, old->base_gfn * PAGE_SIZE,
+ old->npages * PAGE_SIZE);
+ if (rc)
+ break;
+ /* FALLTHROUGH */
+ case KVM_MR_CREATE:
+ rc = gmap_map_segment(kvm->arch.gmap, mem->userspace_addr,
+ mem->guest_phys_addr, mem->memory_size);
+ break;
+ case KVM_MR_FLAGS_ONLY:
+ break;
+ default:
+ WARN(1, "Unknown KVM MR CHANGE: %d\n", change);
+ }
if (rc)
pr_warn("failed to commit memory region\n");
return;
--
2.21.0
The cgroup testing relies on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, all test cases will be failed
as following:
$ sudo ./test_memcontrol
not ok 1 test_memcg_subtree_control
not ok 2 test_memcg_current
ok 3 # skip test_memcg_min
not ok 4 test_memcg_low
not ok 5 test_memcg_high
not ok 6 test_memcg_max
not ok 7 test_memcg_oom_events
ok 8 # skip test_memcg_swap_max
not ok 9 test_memcg_sock
not ok 10 test_memcg_oom_group_leaf_events
not ok 11 test_memcg_oom_group_parent_events
not ok 12 test_memcg_oom_group_score_events
To correct this unexpected failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Jay Kamat <jgkamat(a)fb.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
---
tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 6f339882a6ca..73612d604a2a 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1205,6 +1205,10 @@ int main(int argc, char **argv)
if (cg_read_strstr(root, "cgroup.controllers", "memory"))
ksft_exit_skip("memory controller isn't available\n");
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set root memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.19.1.856.g8858448bb
The cgroup testing relies on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, all test cases will be failed
as following:
$ sudo ./test_memcontrol
not ok 1 test_memcg_subtree_control
not ok 2 test_memcg_current
ok 3 # skip test_memcg_min
not ok 4 test_memcg_low
not ok 5 test_memcg_high
not ok 6 test_memcg_max
not ok 7 test_memcg_oom_events
ok 8 # skip test_memcg_swap_max
not ok 9 test_memcg_sock
not ok 10 test_memcg_oom_group_leaf_events
not ok 11 test_memcg_oom_group_parent_events
not ok 12 test_memcg_oom_group_score_events
To correct this unexpected failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Jay Kamat <jgkamat(a)fb.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
---
tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 6f339882a6ca..c19a97dd02d4 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1205,6 +1205,10 @@ int main(int argc, char **argv)
if (cg_read_strstr(root, "cgroup.controllers", "memory"))
ksft_exit_skip("memory controller isn't available\n");
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.19.1.856.g8858448bb
Fixes an issue where TX Timestamps are not arriving on the error queue
when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
This is version 2 which moves the tests to net-next.
Fred Klassen (1):
net/udp_gso: Allow TX timestamp with UDP GSO
net/ipv4/udp_offload.c | 4 ++++
1 file changed, 4 insertions(+)
--
2.11.0
Hi Linus,
Please pull the following Kselftest fixes update for Linux 5.2-rc2.
This Kselftest fixes update for Linux 5.2-rc2 consists of:
- 2 fixes to regressions introduced in kselftest Makefile test run
output refactoring work from Kees Cook.
- Adding Atom support to syscall_arg_fault test from Tong Bo.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit a188339ca5a396acc588e5851ed7e19f66b0ebd9:
Linux 5.2-rc1 (2019-05-19 15:47:09 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-5.2-rc2
for you to fetch changes up to fe48319243a626c860fd666ca032daacc2ba84a5:
selftests/timers: Add missing fflush(stdout) calls (2019-05-21
09:24:31 -0600)
----------------------------------------------------------------
linux-kselftest-5.2-rc2
This Kselftest fixes update for Linux 5.2-rc2 consists of:
- 2 fixes to regressions introduced in kselftest Makefile test run output
refactoring work from Kees Cook.
- Adding Atom support to syscall_arg_fault test from Tong Bo.
----------------------------------------------------------------
Kees Cook (2):
selftests: Remove forced unbuffering for test running
selftests/timers: Add missing fflush(stdout) calls
Tong Bo (1):
selftests/x86: Support Atom for syscall_arg_fault test
tools/testing/selftests/kselftest/runner.sh | 12 +-----------
tools/testing/selftests/timers/adjtick.c | 1 +
tools/testing/selftests/timers/leapcrash.c | 1 +
tools/testing/selftests/timers/mqueue-lat.c | 1 +
tools/testing/selftests/timers/nanosleep.c | 1 +
tools/testing/selftests/timers/nsleep-lat.c | 1 +
tools/testing/selftests/timers/raw_skew.c | 1 +
tools/testing/selftests/timers/set-tai.c | 1 +
tools/testing/selftests/timers/set-tz.c | 2 ++
tools/testing/selftests/timers/threadtest.c | 1 +
tools/testing/selftests/timers/valid-adjtimex.c | 2 ++
tools/testing/selftests/x86/syscall_arg_fault.c | 10 ++++++++--
12 files changed, 21 insertions(+), 13 deletions(-)
----------------------------------------------------------------
This series of patches move the commonly used bpf_printk macro to
bpf_helpers.h which is already included in all BPF programs which
defined that macro on their own.
v1->v2:
- If HBM_DEBUG is not defined in hbm sample, undefine bpf_printk and set
an empty macro for it.
Michal Rostecki (2):
selftests: bpf: Move bpf_printk to bpf_helpers.h
samples: bpf: Do not define bpf_printk macro
samples/bpf/hbm_kern.h | 11 ++---------
samples/bpf/tcp_basertt_kern.c | 7 -------
samples/bpf/tcp_bufs_kern.c | 7 -------
samples/bpf/tcp_clamp_kern.c | 7 -------
samples/bpf/tcp_cong_kern.c | 7 -------
samples/bpf/tcp_iw_kern.c | 7 -------
samples/bpf/tcp_rwnd_kern.c | 7 -------
samples/bpf/tcp_synrto_kern.c | 7 -------
samples/bpf/tcp_tos_reflect_kern.c | 7 -------
samples/bpf/xdp_sample_pkts_kern.c | 7 -------
tools/testing/selftests/bpf/bpf_helpers.h | 8 ++++++++
.../testing/selftests/bpf/progs/sockmap_parse_prog.c | 7 -------
.../selftests/bpf/progs/sockmap_tcp_msg_prog.c | 7 -------
.../selftests/bpf/progs/sockmap_verdict_prog.c | 7 -------
.../testing/selftests/bpf/progs/test_lwt_seg6local.c | 7 -------
tools/testing/selftests/bpf/progs/test_xdp_noinline.c | 7 -------
tools/testing/selftests/bpf/test_sockmap_kern.h | 7 -------
17 files changed, 10 insertions(+), 114 deletions(-)
--
2.21.0
On 23/05/2019 22:29:36+0530, Jeffrin Thalakkottoor wrote:
> I get the following result when i run "rtctest"
> please see the following....
> -----------------x------------------------x-----------------
> $pwd
> /home/jeffrin/upstream/linux-kselftest/tools/testing/selftests/rtc
> $./rtctest
> [==========] Running 7 tests from 2 test cases.
> [ RUN ] rtc.date_read
> rtctest.c:32:rtc.date_read:Expected -1 (18446744073709551615) !=
> self->fd (18446744073709551615)
> rtc.date_read: Test terminated by assertion
Your user probably doesn't have access to the rtc device file.
--
Alexandre Belloni, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.
The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.
First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):
/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve(....);
The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating
file descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)
Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
- Python (cf. [2])
- Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).
The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:
static int close_all_fds(void)
{
int dir_fd;
DIR *dir;
struct dirent *direntp;
dir = opendir("/proc/self/fd");
if (!dir)
return -1;
dir_fd = dirfd(dir);
while ((direntp = readdir(dir))) {
int fd;
if (strcmp(direntp->d_name, ".") == 0)
continue;
if (strcmp(direntp->d_name, "..") == 0)
continue;
fd = atoi(direntp->d_name);
if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
continue;
close(fd);
}
closedir(dir);
return 0;
}
to close_range() yields:
1. closing 4 open files:
- close_all_fds(): ~280 us
- close_range(): ~24 us
2. closing 1000 open files:
- close_all_fds(): ~5000 us
- close_range(): ~800 us
close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.
>From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
My reasoning behind this is based on the nature of how __close_fd() needs
to release an fd. But maybe I misunderstood specifics:
We take the files_lock and rcu-dereference the fdtable of the calling
task, we find the entry in the fdtable, get the file and need to release
files_lock before calling filp_close().
In the meantime the fdtable might have been altered so we can't just
retake the spinlock and keep the old rcu-reference of the fdtable
around. Instead we need to grab a fresh reference to the fdtable.
If my reasoning is correct then there's really no point in fancyfying
__close_range(): We just need to rcu-dereference the fdtable of the
calling task once to cap the max_fd value correctly and then go on
calling __close_fd() in a loop.
/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d8…
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa…
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/sr…
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/gr…
Note that this is an internal implementation that is not exported.
Currently, libc seems to not provide an exported version of this
because of missing kernel support to do this.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5…
Rust's solution is slightly different but is equally unperformant.
Rust calls getdtablesize() which is a glibc library function that
simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
goes on to call close() on each fd. That's obviously overkill for most
tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
OPEN_MAX.
Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
to 1024. Even in this case, there's a very high chance that in the
common case Rust is calling the close() syscall 1021 times pointlessly
if the task just has 0, 1, and 2 open.
Suggested-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Dmitry V. Levin <ldv(a)altlinux.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Florian Weimer <fweimer(a)redhat.com>
Cc: linux-api(a)vger.kernel.org
---
v1:
- Linus Torvalds <torvalds(a)linux-foundation.org>:
- add cond_resched() to yield cpu when closing a lot of file descriptors
- Al Viro <viro(a)zeniv.linux.org.uk>:
- add cond_resched() to yield cpu when closing a lot of file descriptors
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/file.c | 63 ++++++++++++++++++---
fs/open.c | 20 +++++++
include/linux/fdtable.h | 2 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
22 files changed, 100 insertions(+), 9 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..b55d93af8096 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
541 common fsconfig sys_fsconfig
542 common fsmount sys_fsmount
543 common fspick sys_fspick
+545 common close_range sys_close_range
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..0125c97c75dd 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..9a3270d29b42 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..1a90b464e96f 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..2dee2050f9ef 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..923ef69e5a76 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..967ed9de51cd 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
431 n32 fsconfig sys_fsconfig
432 n32 fsmount sys_fsmount
433 n32 fspick sys_fspick
+435 n32 close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 5eebfa0d155c..71de731102b1 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -348,3 +348,4 @@
431 n64 fsconfig sys_fsconfig
432 n64 fsmount sys_fsmount
433 n64 fspick sys_fspick
+435 n64 close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 3cc1374e02d0..5a325ab29f88 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -421,3 +421,4 @@
431 o32 fsconfig sys_fsconfig
432 o32 fsmount sys_fsmount
433 o32 fspick sys_fspick
+435 o32 close_range sys_close_range
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..dcc0a0879139 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..ba2c1f078cbd 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..d7c9043d2902 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig sys_fsconfig
432 common fsmount sys_fsmount sys_fsmount
433 common fspick sys_fspick sys_fspick
+435 common close_range sys_close_range sys_close_range
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..9b5e6bf0ce32 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..8c674a1e0072 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..7f7a89a96707 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
432 i386 fsmount sys_fsmount __ia32_sys_fsmount
433 i386 fspick sys_fspick __ia32_sys_fspick
+435 i386 close_range sys_close_range __ia32_sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..0f7d47ae921c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
431 common fsconfig __x64_sys_fsconfig
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
+435 common close_range __x64_sys_close_range
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..b489532265d0 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/fs/file.c b/fs/file.c
index 3da91a112bab..54945efa046e 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -615,12 +615,9 @@ void fd_install(unsigned int fd, struct file *file)
EXPORT_SYMBOL(fd_install);
-/*
- * The same warnings as for __alloc_fd()/__fd_install() apply here...
- */
-int __close_fd(struct files_struct *files, unsigned fd)
+static struct file *pick_file(struct files_struct *files, unsigned fd)
{
- struct file *file;
+ struct file *file = NULL;
struct fdtable *fdt;
spin_lock(&files->file_lock);
@@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
goto out_unlock;
rcu_assign_pointer(fdt->fd[fd], NULL);
__put_unused_fd(files, fd);
- spin_unlock(&files->file_lock);
- return filp_close(file, files);
out_unlock:
spin_unlock(&files->file_lock);
- return -EBADF;
+ return file;
+}
+
+/*
+ * The same warnings as for __alloc_fd()/__fd_install() apply here...
+ */
+int __close_fd(struct files_struct *files, unsigned fd)
+{
+ struct file *file;
+
+ file = pick_file(files, fd);
+ if (!file)
+ return -EBADF;
+
+ return filp_close(file, files);
}
EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+{
+ unsigned int cur_max;
+
+ if (fd > max_fd)
+ return -EINVAL;
+
+ rcu_read_lock();
+ cur_max = files_fdtable(files)->max_fds;
+ rcu_read_unlock();
+
+ /* cap to last valid index into fdtable */
+ if (max_fd >= cur_max)
+ max_fd = cur_max - 1;
+
+ while (fd <= max_fd) {
+ struct file *file;
+
+ file = pick_file(files, fd++);
+ if (!file)
+ continue;
+
+ filp_close(file, files);
+ cond_resched();
+ }
+
+ return 0;
+}
+
/*
* variant of __close_fd that gets a ref on the file for later fput
*/
diff --git a/fs/open.c b/fs/open.c
index 9c7d724a6f67..c7baaee7aa47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1174,6 +1174,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
return retval;
}
+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags: reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+ unsigned int, flags)
+{
+ if (flags)
+ return -EINVAL;
+
+ return __close_range(current->files, fd, max_fd);
+}
+
/*
* This routine simulates a hangup on the tty, to arrange that users
* are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..fcd07181a365 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
unsigned int fd, struct file *file);
extern int __close_fd(struct files_struct *files,
unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+ unsigned int max_fd);
extern int __close_fd_get_file(unsigned int fd, struct file **res);
extern struct kmem_cache *files_cachep;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..c0189e223255 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -441,6 +441,8 @@ asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
umode_t mode);
asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+ unsigned int flags);
asmlinkage long sys_vhangup(void);
/* fs/pipe.c */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..3f36c8745d24 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
#undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 436
/*
* 32 bit systems traditionally used different
--
2.21.0
Hi,
Here are patches for making sure the ftracetest testcases
are checkbashisms clean.
This actually needs a patch from Juerg, "selftests/ftrace:
Make the coloring POSIX compliant" to complete the work.
http://lkml.kernel.org/r/20190220161333.28109-1-juergh@canonical.com
(Note that this is still under development)
So as Juerg pointed, recently ftracetest becomes not POSIX
compliant, and such kind of issues happened repeatedly.
To avoid those anymore, I decided to introduce a testcase
which runs checkbasisms on ftracetest and its testcases.
I think this can help us to find out whether it was
written in a way out of POSIX.
Thank you,
---
Masami Hiramatsu (2):
selftests/ftrace: Make a script checkbashisms clean
selftests/ftrace: Add checkbashisms meta-testcase
tools/testing/selftests/ftrace/ftracetest | 1 +
.../ftrace/test.d/kprobe/kprobe_ftrace.tc | 2 +-
.../selftests/ftrace/test.d/selftest/bashisms.tc | 21 ++++++++++++++++++++
3 files changed, 23 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/selftest/bashisms.tc
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
Hi,
Commit a745f7af3cbd ("selftests/harness: Add 30 second timeout per
test") wrongly assumed that no individual test would run for more than
30 seconds and this silently broke rtctest.
Please consider the following patches as fixes for v5.2 to avoid having
any non working release.
Thanks,
Alexandre Belloni (2):
selftests/harness: Allow test to configure timeout
selftests: rtc: rtctest: specify timeouts
tools/testing/selftests/kselftest_harness.h | 17 ++++++++++++-----
tools/testing/selftests/rtc/rtctest.c | 6 +++---
2 files changed, 15 insertions(+), 8 deletions(-)
--
2.21.0
From: Peter Zijlstra <peterz(a)infradead.org>
commit 9e298e8604088a600d8100a111a532a9d342af09 upstream.
Nicolai Stange discovered[1] that if live kernel patching is enabled, and the
function tracer started tracing the same function that was patched, the
conversion of the fentry call site during the translation of going from
calling the live kernel patch trampoline to the iterator trampoline, would
have as slight window where it didn't call anything.
As live kernel patching depends on ftrace to always call its code (to
prevent the function being traced from being called, as it will redirect
it). This small window would allow the old buggy function to be called, and
this can cause undesirable results.
Nicolai submitted new patches[2] but these were controversial. As this is
similar to the static call emulation issues that came up a while ago[3].
But after some debate[4][5] adding a gap in the stack when entering the
breakpoint handler allows for pushing the return address onto the stack to
easily emulate a call.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.154171145…
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=…
[5] http://lkml.kernel.org/r/CAHk-=wjvQxY4DvPrJ6haPgAa6b906h=MwZXO6G8OtiTGe=N7_…
[
Live kernel patching is not implemented on x86_32, thus the emulate
calls are only for x86_64.
]
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Changed to only implement emulated calls for x86_64 ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kernel/ftrace.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
#include <asm/kprobes.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace
}
static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new)
{
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_fun
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -294,13 +298,28 @@ int ftrace_int3_handler(struct pt_regs *
if (WARN_ON_ONCE(!regs))
return 0;
- ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
+ ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1;
+#ifdef CONFIG_X86_64
+ if (ftrace_location(ip)) {
+ int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ if (!ftrace_update_func_call) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ }
+#else
+ if (ftrace_location(ip) || is_ftrace_caller(ip)) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+#endif
- return 1;
+ return 0;
}
NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +878,8 @@ void arch_ftrace_update_trampoline(struc
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -960,6 +981,7 @@ static int ftrace_mod_jmp(unsigned long
{
unsigned char *new;
+ ftrace_update_func_call = 0UL;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
From: Peter Zijlstra <peterz(a)infradead.org>
commit 4b33dadf37666c0860b88f9e52a16d07bf6d0b03 upstream.
In order to allow breakpoints to emulate call instructions, they need to push
the return address onto the stack. The x86_64 int3 handler adds a small gap
to allow the stack to grow some. Use this gap to add the return address to
be able to emulate a call instruction at the breakpoint location.
These helper functions are added:
int3_emulate_jmp(): changes the location of the regs->ip to return there.
(The next two are only for x86_64)
int3_emulate_push(): to push the address onto the gap in the stack
int3_emulate_call(): push the return address and change regs->ip
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Modified to only work for x86_64 and added comment to int3_emulate_push() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/text-patching.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,32 @@ extern int poke_int3_handler(struct pt_r
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
extern int after_bootmem;
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+ regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+#ifdef CONFIG_X86_64
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+ /*
+ * The int3 handler in entry_64.S adds a gap between the
+ * stack where the break point happened, and the saving of
+ * pt_regs. We can extend the original stack because of
+ * this gap. See the idtentry macro's create_gap option.
+ */
+ regs->sp -= sizeof(unsigned long);
+ *(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ int3_emulate_jmp(regs, func);
+}
+#endif
+
#endif /* _ASM_X86_TEXT_PATCHING_H */
From: Peter Zijlstra <peterz(a)infradead.org>
commit 9e298e8604088a600d8100a111a532a9d342af09 upstream.
Nicolai Stange discovered[1] that if live kernel patching is enabled, and the
function tracer started tracing the same function that was patched, the
conversion of the fentry call site during the translation of going from
calling the live kernel patch trampoline to the iterator trampoline, would
have as slight window where it didn't call anything.
As live kernel patching depends on ftrace to always call its code (to
prevent the function being traced from being called, as it will redirect
it). This small window would allow the old buggy function to be called, and
this can cause undesirable results.
Nicolai submitted new patches[2] but these were controversial. As this is
similar to the static call emulation issues that came up a while ago[3].
But after some debate[4][5] adding a gap in the stack when entering the
breakpoint handler allows for pushing the return address onto the stack to
easily emulate a call.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.154171145…
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=…
[5] http://lkml.kernel.org/r/CAHk-=wjvQxY4DvPrJ6haPgAa6b906h=MwZXO6G8OtiTGe=N7_…
[
Live kernel patching is not implemented on x86_32, thus the emulate
calls are only for x86_64.
]
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Changed to only implement emulated calls for x86_64 ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kernel/ftrace.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
#include <asm/kprobes.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace
}
static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new)
{
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_fun
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -294,13 +298,28 @@ int ftrace_int3_handler(struct pt_regs *
if (WARN_ON_ONCE(!regs))
return 0;
- ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
+ ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1;
+#ifdef CONFIG_X86_64
+ if (ftrace_location(ip)) {
+ int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ if (!ftrace_update_func_call) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ }
+#else
+ if (ftrace_location(ip) || is_ftrace_caller(ip)) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+#endif
- return 1;
+ return 0;
}
static int ftrace_write(unsigned long ip, const char *val, int size)
@@ -858,6 +877,8 @@ void arch_ftrace_update_trampoline(struc
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -959,6 +980,7 @@ static int ftrace_mod_jmp(unsigned long
{
unsigned char *new;
+ ftrace_update_func_call = 0UL;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
From: Peter Zijlstra <peterz(a)infradead.org>
commit 4b33dadf37666c0860b88f9e52a16d07bf6d0b03 upstream.
In order to allow breakpoints to emulate call instructions, they need to push
the return address onto the stack. The x86_64 int3 handler adds a small gap
to allow the stack to grow some. Use this gap to add the return address to
be able to emulate a call instruction at the breakpoint location.
These helper functions are added:
int3_emulate_jmp(): changes the location of the regs->ip to return there.
(The next two are only for x86_64)
int3_emulate_push(): to push the address onto the gap in the stack
int3_emulate_call(): push the return address and change regs->ip
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Modified to only work for x86_64 and added comment to int3_emulate_push() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/text-patching.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,32 @@ extern int poke_int3_handler(struct pt_r
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
extern int after_bootmem;
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+ regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+#ifdef CONFIG_X86_64
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+ /*
+ * The int3 handler in entry_64.S adds a gap between the
+ * stack where the break point happened, and the saving of
+ * pt_regs. We can extend the original stack because of
+ * this gap. See the idtentry macro's create_gap option.
+ */
+ regs->sp -= sizeof(unsigned long);
+ *(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ int3_emulate_jmp(regs, func);
+}
+#endif
+
#endif /* _ASM_X86_TEXT_PATCHING_H */
From: Peter Zijlstra <peterz(a)infradead.org>
commit 9e298e8604088a600d8100a111a532a9d342af09 upstream.
Nicolai Stange discovered[1] that if live kernel patching is enabled, and the
function tracer started tracing the same function that was patched, the
conversion of the fentry call site during the translation of going from
calling the live kernel patch trampoline to the iterator trampoline, would
have as slight window where it didn't call anything.
As live kernel patching depends on ftrace to always call its code (to
prevent the function being traced from being called, as it will redirect
it). This small window would allow the old buggy function to be called, and
this can cause undesirable results.
Nicolai submitted new patches[2] but these were controversial. As this is
similar to the static call emulation issues that came up a while ago[3].
But after some debate[4][5] adding a gap in the stack when entering the
breakpoint handler allows for pushing the return address onto the stack to
easily emulate a call.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.154171145…
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=…
[5] http://lkml.kernel.org/r/CAHk-=wjvQxY4DvPrJ6haPgAa6b906h=MwZXO6G8OtiTGe=N7_…
[
Live kernel patching is not implemented on x86_32, thus the emulate
calls are only for x86_64.
]
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Changed to only implement emulated calls for x86_64 ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kernel/ftrace.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
#include <asm/kprobes.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -228,6 +229,7 @@ int ftrace_modify_call(struct dyn_ftrace
}
static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new)
{
@@ -256,6 +258,8 @@ int ftrace_update_ftrace_func(ftrace_fun
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -291,13 +295,28 @@ int ftrace_int3_handler(struct pt_regs *
if (WARN_ON_ONCE(!regs))
return 0;
- ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
+ ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1;
+#ifdef CONFIG_X86_64
+ if (ftrace_location(ip)) {
+ int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ if (!ftrace_update_func_call) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ }
+#else
+ if (ftrace_location(ip) || is_ftrace_caller(ip)) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+#endif
- return 1;
+ return 0;
}
static int ftrace_write(unsigned long ip, const char *val, int size)
@@ -868,6 +887,8 @@ void arch_ftrace_update_trampoline(struc
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -964,6 +985,7 @@ static int ftrace_mod_jmp(unsigned long
{
unsigned char *new;
+ ftrace_update_func_call = 0UL;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
From: Peter Zijlstra <peterz(a)infradead.org>
commit 9e298e8604088a600d8100a111a532a9d342af09 upstream.
Nicolai Stange discovered[1] that if live kernel patching is enabled, and the
function tracer started tracing the same function that was patched, the
conversion of the fentry call site during the translation of going from
calling the live kernel patch trampoline to the iterator trampoline, would
have as slight window where it didn't call anything.
As live kernel patching depends on ftrace to always call its code (to
prevent the function being traced from being called, as it will redirect
it). This small window would allow the old buggy function to be called, and
this can cause undesirable results.
Nicolai submitted new patches[2] but these were controversial. As this is
similar to the static call emulation issues that came up a while ago[3].
But after some debate[4][5] adding a gap in the stack when entering the
breakpoint handler allows for pushing the return address onto the stack to
easily emulate a call.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.154171145…
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=…
[5] http://lkml.kernel.org/r/CAHk-=wjvQxY4DvPrJ6haPgAa6b906h=MwZXO6G8OtiTGe=N7_…
[
Live kernel patching is not implemented on x86_32, thus the emulate
calls are only for x86_64.
]
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Changed to only implement emulated calls for x86_64 ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kernel/ftrace.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -30,6 +30,7 @@
#include <asm/sections.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -229,6 +230,7 @@ int ftrace_modify_call(struct dyn_ftrace
}
static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new)
{
@@ -257,6 +259,8 @@ int ftrace_update_ftrace_func(ftrace_fun
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -292,13 +296,28 @@ int ftrace_int3_handler(struct pt_regs *
if (WARN_ON_ONCE(!regs))
return 0;
- ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
+ ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1;
+#ifdef CONFIG_X86_64
+ if (ftrace_location(ip)) {
+ int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ if (!ftrace_update_func_call) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ }
+#else
+ if (ftrace_location(ip) || is_ftrace_caller(ip)) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+#endif
- return 1;
+ return 0;
}
static int ftrace_write(unsigned long ip, const char *val, int size)
@@ -869,6 +888,8 @@ void arch_ftrace_update_trampoline(struc
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -965,6 +986,7 @@ static int ftrace_mod_jmp(unsigned long
{
unsigned char *new;
+ ftrace_update_func_call = 0UL;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
From: Peter Zijlstra <peterz(a)infradead.org>
commit 4b33dadf37666c0860b88f9e52a16d07bf6d0b03 upstream.
In order to allow breakpoints to emulate call instructions, they need to push
the return address onto the stack. The x86_64 int3 handler adds a small gap
to allow the stack to grow some. Use this gap to add the return address to
be able to emulate a call instruction at the breakpoint location.
These helper functions are added:
int3_emulate_jmp(): changes the location of the regs->ip to return there.
(The next two are only for x86_64)
int3_emulate_push(): to push the address onto the gap in the stack
int3_emulate_call(): push the return address and change regs->ip
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Modified to only work for x86_64 and added comment to int3_emulate_push() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/text-patching.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -38,4 +38,32 @@ extern void *text_poke(void *addr, const
extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+ regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+#ifdef CONFIG_X86_64
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+ /*
+ * The int3 handler in entry_64.S adds a gap between the
+ * stack where the break point happened, and the saving of
+ * pt_regs. We can extend the original stack because of
+ * this gap. See the idtentry macro's create_gap option.
+ */
+ regs->sp -= sizeof(unsigned long);
+ *(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ int3_emulate_jmp(regs, func);
+}
+#endif
+
#endif /* _ASM_X86_TEXT_PATCHING_H */
From: Peter Zijlstra <peterz(a)infradead.org>
commit 4b33dadf37666c0860b88f9e52a16d07bf6d0b03 upstream.
In order to allow breakpoints to emulate call instructions, they need to push
the return address onto the stack. The x86_64 int3 handler adds a small gap
to allow the stack to grow some. Use this gap to add the return address to
be able to emulate a call instruction at the breakpoint location.
These helper functions are added:
int3_emulate_jmp(): changes the location of the regs->ip to return there.
(The next two are only for x86_64)
int3_emulate_push(): to push the address onto the gap in the stack
int3_emulate_call(): push the return address and change regs->ip
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Modified to only work for x86_64 and added comment to int3_emulate_push() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/text-patching.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,32 @@ extern int poke_int3_handler(struct pt_r
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
extern int after_bootmem;
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+ regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+#ifdef CONFIG_X86_64
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+ /*
+ * The int3 handler in entry_64.S adds a gap between the
+ * stack where the break point happened, and the saving of
+ * pt_regs. We can extend the original stack because of
+ * this gap. See the idtentry macro's create_gap option.
+ */
+ regs->sp -= sizeof(unsigned long);
+ *(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ int3_emulate_jmp(regs, func);
+}
+#endif
+
#endif /* _ASM_X86_TEXT_PATCHING_H */
These patches try to test the fix made in commit e2f7fc0ac695 ("bpf:
fix undefined behavior in narrow load handling"). The problem existed
in the generated BPF bytecode that was doing a 32bit narrow read of a
64bit field, so to test it the code would need to be executed.
Currently the only such field exists in BPF_PROG_TYPE_PERF_EVENT,
which is not supported by bpf_prog_test_run().
First commit adds the test, but since I can't run it, I'm not sure if
it is even valid (because endianness, for example). Second commit adds
a message to test_verifier when it couldn't run the program because of
lack permissions or program type being not supported. Third commit
fixes a possible problem with errno.
With these patches, I get the following output on my test:
/kernel/tools/testing/selftests/bpf$ sudo ./test_verifier 920 920
#920/p 32bit loads of a 64bit field (both least and most significant words) Did not run the program (not supported) OK
Summary: 1 PASSED, 0 SKIPPED, 0 FAILED
So it looks like I need to pick a different approach.
Krzesimir Nowak (3):
selftests/bpf: Test correctness of narrow 32bit read on 64bit field
selftests/bpf: Print a message when tester could not run a program
selftests/bpf: Avoid a clobbering of errno
tools/testing/selftests/bpf/test_verifier.c | 19 +++++++++++++++----
.../testing/selftests/bpf/verifier/var_off.c | 15 +++++++++++++++
2 files changed, 30 insertions(+), 4 deletions(-)
--
2.20.1
This patch series enables the KVM selftests for s390x. As a first
test, the sync_regs from x86 has been adapted to s390x.
Please note that the ucall() interface is not used yet - since
s390x neither has PIO nor MMIO, this needs some more work first
before it becomes usable (we likely should use a DIAG hypercall
here, which is what the sync_reg test is currently using, too...).
Thomas Huth (4):
KVM: selftests: Guard struct kvm_vcpu_events with
__KVM_HAVE_VCPU_EVENTS
KVM: selftests: Align memory region addresses to 1M on s390x
KVM: selftests: Add processor code for s390x
KVM: selftests: Add the sync_regs test for s390x
MAINTAINERS | 2 +
tools/testing/selftests/kvm/Makefile | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 2 +
.../selftests/kvm/include/s390x/processor.h | 22 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 24 +-
.../selftests/kvm/lib/s390x/processor.c | 277 ++++++++++++++++++
.../selftests/kvm/s390x/sync_regs_test.c | 151 ++++++++++
7 files changed, 476 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/s390x/processor.h
create mode 100644 tools/testing/selftests/kvm/lib/s390x/processor.c
create mode 100644 tools/testing/selftests/kvm/s390x/sync_regs_test.c
--
2.21.0
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 9e298e8604088a600d8100a111a532a9d342af09 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 1 May 2019 15:11:17 +0200
Subject: [PATCH] ftrace/x86_64: Emulate call function while updating in
breakpoint handler
Nicolai Stange discovered[1] that if live kernel patching is enabled, and the
function tracer started tracing the same function that was patched, the
conversion of the fentry call site during the translation of going from
calling the live kernel patch trampoline to the iterator trampoline, would
have as slight window where it didn't call anything.
As live kernel patching depends on ftrace to always call its code (to
prevent the function being traced from being called, as it will redirect
it). This small window would allow the old buggy function to be called, and
this can cause undesirable results.
Nicolai submitted new patches[2] but these were controversial. As this is
similar to the static call emulation issues that came up a while ago[3].
But after some debate[4][5] adding a gap in the stack when entering the
breakpoint handler allows for pushing the return address onto the stack to
easily emulate a call.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.154171145…
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=…
[5] http://lkml.kernel.org/r/CAHk-=wjvQxY4DvPrJ6haPgAa6b906h=MwZXO6G8OtiTGe=N7_…
[
Live kernel patching is not implemented on x86_32, thus the emulate
calls are only for x86_64.
]
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Changed to only implement emulated calls for x86_64 ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..bd553b3af22e 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
#include <asm/kprobes.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
}
static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new)
{
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -294,13 +298,28 @@ int ftrace_int3_handler(struct pt_regs *regs)
if (WARN_ON_ONCE(!regs))
return 0;
- ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
+ ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1;
+#ifdef CONFIG_X86_64
+ if (ftrace_location(ip)) {
+ int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ if (!ftrace_update_func_call) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ }
+#else
+ if (ftrace_location(ip) || is_ftrace_caller(ip)) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+#endif
- return 1;
+ return 0;
}
NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +878,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -960,6 +981,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
{
unsigned char *new;
+ ftrace_update_func_call = 0UL;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 9e298e8604088a600d8100a111a532a9d342af09 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 1 May 2019 15:11:17 +0200
Subject: [PATCH] ftrace/x86_64: Emulate call function while updating in
breakpoint handler
Nicolai Stange discovered[1] that if live kernel patching is enabled, and the
function tracer started tracing the same function that was patched, the
conversion of the fentry call site during the translation of going from
calling the live kernel patch trampoline to the iterator trampoline, would
have as slight window where it didn't call anything.
As live kernel patching depends on ftrace to always call its code (to
prevent the function being traced from being called, as it will redirect
it). This small window would allow the old buggy function to be called, and
this can cause undesirable results.
Nicolai submitted new patches[2] but these were controversial. As this is
similar to the static call emulation issues that came up a while ago[3].
But after some debate[4][5] adding a gap in the stack when entering the
breakpoint handler allows for pushing the return address onto the stack to
easily emulate a call.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.154171145…
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=…
[5] http://lkml.kernel.org/r/CAHk-=wjvQxY4DvPrJ6haPgAa6b906h=MwZXO6G8OtiTGe=N7_…
[
Live kernel patching is not implemented on x86_32, thus the emulate
calls are only for x86_64.
]
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Changed to only implement emulated calls for x86_64 ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..bd553b3af22e 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
#include <asm/kprobes.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
}
static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new)
{
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -294,13 +298,28 @@ int ftrace_int3_handler(struct pt_regs *regs)
if (WARN_ON_ONCE(!regs))
return 0;
- ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
+ ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1;
+#ifdef CONFIG_X86_64
+ if (ftrace_location(ip)) {
+ int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ if (!ftrace_update_func_call) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ }
+#else
+ if (ftrace_location(ip) || is_ftrace_caller(ip)) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+#endif
- return 1;
+ return 0;
}
NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +878,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -960,6 +981,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
{
unsigned char *new;
+ ftrace_update_func_call = 0UL;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4b33dadf37666c0860b88f9e52a16d07bf6d0b03 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 1 May 2019 15:11:17 +0200
Subject: [PATCH] x86_64: Allow breakpoints to emulate call instructions
In order to allow breakpoints to emulate call instructions, they need to push
the return address onto the stack. The x86_64 int3 handler adds a small gap
to allow the stack to grow some. Use this gap to add the return address to
be able to emulate a call instruction at the breakpoint location.
These helper functions are added:
int3_emulate_jmp(): changes the location of the regs->ip to return there.
(The next two are only for x86_64)
int3_emulate_push(): to push the address onto the gap in the stack
int3_emulate_call(): push the return address and change regs->ip
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Modified to only work for x86_64 and added comment to int3_emulate_push() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..05861cc08787 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,32 @@ extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
extern int after_bootmem;
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+ regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+#ifdef CONFIG_X86_64
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+ /*
+ * The int3 handler in entry_64.S adds a gap between the
+ * stack where the break point happened, and the saving of
+ * pt_regs. We can extend the original stack because of
+ * this gap. See the idtentry macro's create_gap option.
+ */
+ regs->sp -= sizeof(unsigned long);
+ *(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ int3_emulate_jmp(regs, func);
+}
+#endif
+
#endif /* _ASM_X86_TEXT_PATCHING_H */
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4b33dadf37666c0860b88f9e52a16d07bf6d0b03 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 1 May 2019 15:11:17 +0200
Subject: [PATCH] x86_64: Allow breakpoints to emulate call instructions
In order to allow breakpoints to emulate call instructions, they need to push
the return address onto the stack. The x86_64 int3 handler adds a small gap
to allow the stack to grow some. Use this gap to add the return address to
be able to emulate a call instruction at the breakpoint location.
These helper functions are added:
int3_emulate_jmp(): changes the location of the regs->ip to return there.
(The next two are only for x86_64)
int3_emulate_push(): to push the address onto the gap in the stack
int3_emulate_call(): push the return address and change regs->ip
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Tested-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Nicolai Stange <nstange(a)suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
[ Modified to only work for x86_64 and added comment to int3_emulate_push() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..05861cc08787 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,32 @@ extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
extern int after_bootmem;
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+ regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+#ifdef CONFIG_X86_64
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+ /*
+ * The int3 handler in entry_64.S adds a gap between the
+ * stack where the break point happened, and the saving of
+ * pt_regs. We can extend the original stack because of
+ * this gap. See the idtentry macro's create_gap option.
+ */
+ regs->sp -= sizeof(unsigned long);
+ *(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ int3_emulate_jmp(regs, func);
+}
+#endif
+
#endif /* _ASM_X86_TEXT_PATCHING_H */
## TLDR
I rebased the last patchset on 5.1-rc7 in hopes that we can get this in
5.2.
Shuah, I think you, Greg KH, and myself talked off thread, and we agreed
we would merge through your tree when the time came? Am I remembering
correctly?
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
and does not require tests to be written in userspace running on a host
kernel. Additionally, KUnit is fast: From invocation to completion KUnit
can run several dozen tests in under a second. Currently, the entire
KUnit test suite for KUnit runs in under a second from the initial
invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitudes faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily solving the classic problem
of difficulty in exercising error handling code.
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here:
https://google.github.io/kunit-docs/third_party/kernel/docs/
Additionally for convenience, I have applied these patches to a branch:
https://kunit.googlesource.com/linux/+/kunit/rfc/v5.1-rc7/v1
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.1-rc7/v1 branch.
## Changes Since Last Version
None. I just rebased the last patchset on v5.1-rc7.
--
2.21.0.593.g511ec345e18-goog
From: Yonghong Song <yhs(a)fb.com>
[ Upstream commit 6cea33701eb024bc6c920ab83940ee22afd29139 ]
Test test_libbpf.sh failed on my development server with failure
-bash-4.4$ sudo ./test_libbpf.sh
[0] libbpf: Error in bpf_object__probe_name():Operation not permitted(1).
Couldn't load basic 'r0 = 0' BPF program.
test_libbpf: failed at file test_l4lb.o
selftests: test_libbpf [FAILED]
-bash-4.4$
The reason is because my machine has 64KB locked memory by default which
is not enough for this program to get locked memory.
Similar to other bpf selftests, let us increase RLIMIT_MEMLOCK
to infinity, which fixed the issue.
Signed-off-by: Yonghong Song <yhs(a)fb.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/test_libbpf_open.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_libbpf_open.c b/tools/testing/selftests/bpf/test_libbpf_open.c
index 8fcd1c076add0..cbd55f5f8d598 100644
--- a/tools/testing/selftests/bpf/test_libbpf_open.c
+++ b/tools/testing/selftests/bpf/test_libbpf_open.c
@@ -11,6 +11,8 @@ static const char *__doc__ =
#include <bpf/libbpf.h>
#include <getopt.h>
+#include "bpf_rlimit.h"
+
static const struct option long_options[] = {
{"help", no_argument, NULL, 'h' },
{"debug", no_argument, NULL, 'D' },
--
2.20.1
From: Yonghong Song <yhs(a)fb.com>
[ Upstream commit 6cea33701eb024bc6c920ab83940ee22afd29139 ]
Test test_libbpf.sh failed on my development server with failure
-bash-4.4$ sudo ./test_libbpf.sh
[0] libbpf: Error in bpf_object__probe_name():Operation not permitted(1).
Couldn't load basic 'r0 = 0' BPF program.
test_libbpf: failed at file test_l4lb.o
selftests: test_libbpf [FAILED]
-bash-4.4$
The reason is because my machine has 64KB locked memory by default which
is not enough for this program to get locked memory.
Similar to other bpf selftests, let us increase RLIMIT_MEMLOCK
to infinity, which fixed the issue.
Signed-off-by: Yonghong Song <yhs(a)fb.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/test_libbpf_open.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_libbpf_open.c b/tools/testing/selftests/bpf/test_libbpf_open.c
index 8fcd1c076add0..cbd55f5f8d598 100644
--- a/tools/testing/selftests/bpf/test_libbpf_open.c
+++ b/tools/testing/selftests/bpf/test_libbpf_open.c
@@ -11,6 +11,8 @@ static const char *__doc__ =
#include <bpf/libbpf.h>
#include <getopt.h>
+#include "bpf_rlimit.h"
+
static const struct option long_options[] = {
{"help", no_argument, NULL, 'h' },
{"debug", no_argument, NULL, 'D' },
--
2.20.1
From: Yonghong Song <yhs(a)fb.com>
[ Upstream commit 6cea33701eb024bc6c920ab83940ee22afd29139 ]
Test test_libbpf.sh failed on my development server with failure
-bash-4.4$ sudo ./test_libbpf.sh
[0] libbpf: Error in bpf_object__probe_name():Operation not permitted(1).
Couldn't load basic 'r0 = 0' BPF program.
test_libbpf: failed at file test_l4lb.o
selftests: test_libbpf [FAILED]
-bash-4.4$
The reason is because my machine has 64KB locked memory by default which
is not enough for this program to get locked memory.
Similar to other bpf selftests, let us increase RLIMIT_MEMLOCK
to infinity, which fixed the issue.
Signed-off-by: Yonghong Song <yhs(a)fb.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/test_libbpf_open.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_libbpf_open.c b/tools/testing/selftests/bpf/test_libbpf_open.c
index 65cbd30704b5a..9e9db202d218a 100644
--- a/tools/testing/selftests/bpf/test_libbpf_open.c
+++ b/tools/testing/selftests/bpf/test_libbpf_open.c
@@ -11,6 +11,8 @@ static const char *__doc__ =
#include <bpf/libbpf.h>
#include <getopt.h>
+#include "bpf_rlimit.h"
+
static const struct option long_options[] = {
{"help", no_argument, NULL, 'h' },
{"debug", no_argument, NULL, 'D' },
--
2.20.1
clock_getres in the vDSO library has to preserve the same behaviour
of posix_get_hrtimer_res().
In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on the enablement of the high
resolution timers that can happen either at compile or at run time.
A possible fix is to change the vdso implementation of clock_getres,
keeping a copy of hrtimer_resolution in vdso data and using that
directly [1].
This patchset implements the proposed fix for arm64, powerpc, s390,
nds32 and adds a test to verify that the syscall and the vdso library
implementation of clock_getres return the same values.
Even if these patches are unified by the same topic, there is no
dependency between them, hence they can be merged singularly by each
arch maintainer.
Note: arm64 and nds32 respective fixes have been merged in 5.2-rc1,
hence they have been removed from this series.
[1] https://marc.info/?l=linux-arm-kernel&m=155110381930196&w=2
Changes:
--------
v3:
- Rebased on 5.2-rc1.
- Addressed review comments.
v2:
- Rebased on 5.1-rc5.
- Addressed review comments.
Cc: Christophe Leroy <christophe.leroy(a)c-s.fr>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Vincenzo Frascino (3):
powerpc: Fix vDSO clock_getres()
s390: Fix vDSO clock_getres()
kselftest: Extend vDSO selftest to clock_getres
arch/powerpc/include/asm/vdso_datapage.h | 2 +
arch/powerpc/kernel/asm-offsets.c | 2 +-
arch/powerpc/kernel/time.c | 1 +
arch/powerpc/kernel/vdso32/gettimeofday.S | 7 +-
arch/powerpc/kernel/vdso64/gettimeofday.S | 7 +-
arch/s390/include/asm/vdso.h | 1 +
arch/s390/kernel/asm-offsets.c | 2 +-
arch/s390/kernel/time.c | 1 +
arch/s390/kernel/vdso32/clock_getres.S | 12 +-
arch/s390/kernel/vdso64/clock_getres.S | 10 +-
tools/testing/selftests/vDSO/Makefile | 2 +
.../selftests/vDSO/vdso_clock_getres.c | 137 ++++++++++++++++++
12 files changed, 168 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c
--
2.21.0
A test for the basic NAT functionality uses ip command which
needs veth device.There is a condition where the kernel support
for veth is not compiled into the kernel and the test script
breaks.This patch contains code for reasonable error display
and correct code exit.
Signed-off-by: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
---
tools/testing/selftests/netfilter/nft_nat.sh | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/netfilter/nft_nat.sh b/tools/testing/selftests/netfilter/nft_nat.sh
index 8ec76681605c..f25f72a75cf3 100755
--- a/tools/testing/selftests/netfilter/nft_nat.sh
+++ b/tools/testing/selftests/netfilter/nft_nat.sh
@@ -23,7 +23,11 @@ ip netns add ns0
ip netns add ns1
ip netns add ns2
-ip link add veth0 netns ns0 type veth peer name eth0 netns ns1
+ip link add veth0 netns ns0 type veth peer name eth0 netns ns1 > /dev/null 2>&1
+if [ $? -ne 0 ];then
+ echo "SKIP: No virtual ethernet pair device support in kernel"
+ exit $ksft_skip
+fi
ip link add veth1 netns ns0 type veth peer name eth0 netns ns2
ip -net ns0 link set lo up
--
2.20.1
clock_getres in the vDSO library has to preserve the same behaviour
of posix_get_hrtimer_res().
In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on the enablement of the high
resolution timers that can happen either at compile or at run time.
A possible fix is to change the vdso implementation of clock_getres,
keeping a copy of hrtimer_resolution in vdso data and using that
directly [1].
This patchset implements the proposed fix for arm64, powerpc, s390,
nds32 and adds a test to verify that the syscall and the vdso library
implementation of clock_getres return the same values.
Even if these patches are unified by the same topic, there is no
dependency between them, hence they can be merged singularly by each
arch maintainer.
[1] https://marc.info/?l=linux-arm-kernel&m=155110381930196&w=2
Changes:
--------
v2:
- Rebased on 5.1-rc5.
- Addressed review comments.
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Greentime Hu <green.hu(a)gmail.com>
Cc: Vincent Chen <deanbo422(a)gmail.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Vincenzo Frascino (5):
arm64: Fix vDSO clock_getres()
powerpc: Fix vDSO clock_getres()
s390: Fix vDSO clock_getres()
nds32: Fix vDSO clock_getres()
kselftest: Extend vDSO selftest to clock_getres
arch/arm64/include/asm/vdso_datapage.h | 1 +
arch/arm64/kernel/asm-offsets.c | 2 +-
arch/arm64/kernel/vdso.c | 2 +
arch/arm64/kernel/vdso/gettimeofday.S | 22 ++--
arch/nds32/include/asm/vdso_datapage.h | 1 +
arch/nds32/kernel/vdso.c | 1 +
arch/nds32/kernel/vdso/gettimeofday.c | 4 +-
arch/powerpc/include/asm/vdso_datapage.h | 2 +
arch/powerpc/kernel/asm-offsets.c | 2 +-
arch/powerpc/kernel/time.c | 1 +
arch/powerpc/kernel/vdso32/gettimeofday.S | 7 +-
arch/powerpc/kernel/vdso64/gettimeofday.S | 7 +-
arch/s390/include/asm/vdso.h | 1 +
arch/s390/kernel/asm-offsets.c | 2 +-
arch/s390/kernel/time.c | 1 +
arch/s390/kernel/vdso32/clock_getres.S | 12 +-
arch/s390/kernel/vdso64/clock_getres.S | 10 +-
tools/testing/selftests/vDSO/Makefile | 2 +
.../selftests/vDSO/vdso_clock_getres.c | 108 ++++++++++++++++++
19 files changed, 159 insertions(+), 29 deletions(-)
create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c
--
2.21.0
This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.
The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.
First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):
/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve(....);
The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating
file descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)
Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
- Python (cf. [2])
- Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).
The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:
static int close_all_fds(void)
{
DIR *dir;
struct dirent *direntp;
dir = opendir("/proc/self/fd");
if (!dir)
return -1;
while ((direntp = readdir(dir))) {
int fd;
if (strcmp(direntp->d_name, ".") == 0)
continue;
if (strcmp(direntp->d_name, "..") == 0)
continue;
fd = atoi(direntp->d_name);
if (fd == 0 || fd == 1 || fd == 2)
continue;
close(fd);
}
closedir(dir); /* cannot fail */
return 0;
}
to close_range() yields:
1. closing 4 open files:
- close_all_fds(): ~280 us
- close_range(): ~24 us
2. closing 1000 open files:
- close_all_fds(): ~5000 us
- close_range(): ~800 us
close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.
>From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
My reasoning behind this is based on the nature of how __close_fd() needs
to release an fd. But maybe I misunderstood specifics:
We take the files_lock and rcu-dereference the fdtable of the calling
task, we find the entry in the fdtable, get the file and need to release
files_lock before calling filp_close().
In the meantime the fdtable might have been altered so we can't just
retake the spinlock and keep the old rcu-reference of the fdtable
around. Instead we need to grab a fresh reference to the fdtable.
If my reasoning is correct then there's really no point in fancyfying
__close_range(): We just need to rcu-dereference the fdtable of the
calling task once to cap the max_fd value correctly and then go on
calling __close_fd() in a loop.
/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d8…
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa…
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/sr…
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/gr…
Note that this is an internal implementation that is not exported.
Currently, libc seems to not provide an exported version of this
because of missing kernel support to do this.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5…
Rust's solution is slightly different but is equally unperformant.
Rust calls getdtablesize() which is a glibc library function that
simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
goes on to call close() on each fd. That's obviously overkill for most
tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
OPEN_MAX.
Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
to 1024. Even in this case, there's a very high chance that in the
common case Rust is calling the close() syscall 1021 times pointlessly
if the task just has 0, 1, and 2 open.
Suggested-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Dmitry V. Levin <ldv(a)altlinux.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Florian Weimer <fweimer(a)redhat.com>
Cc: linux-api(a)vger.kernel.org
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 ++
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/file.c | 30 +++++++++++++++++++++
fs/open.c | 20 ++++++++++++++
include/linux/fdtable.h | 2 ++
include/linux/syscalls.h | 2 ++
include/uapi/asm-generic/unistd.h | 4 ++-
22 files changed, 75 insertions(+), 1 deletion(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..b55d93af8096 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
541 common fsconfig sys_fsconfig
542 common fsmount sys_fsmount
543 common fspick sys_fspick
+545 common close_range sys_close_range
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..0125c97c75dd 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..9a3270d29b42 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..1a90b464e96f 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..2dee2050f9ef 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..923ef69e5a76 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..967ed9de51cd 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
431 n32 fsconfig sys_fsconfig
432 n32 fsmount sys_fsmount
433 n32 fspick sys_fspick
+435 n32 close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 5eebfa0d155c..71de731102b1 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -348,3 +348,4 @@
431 n64 fsconfig sys_fsconfig
432 n64 fsmount sys_fsmount
433 n64 fspick sys_fspick
+435 n64 close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 3cc1374e02d0..5a325ab29f88 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -421,3 +421,4 @@
431 o32 fsconfig sys_fsconfig
432 o32 fsmount sys_fsmount
433 o32 fspick sys_fspick
+435 o32 close_range sys_close_range
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..dcc0a0879139 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..ba2c1f078cbd 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..d7c9043d2902 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig sys_fsconfig
432 common fsmount sys_fsmount sys_fsmount
433 common fspick sys_fspick sys_fspick
+435 common close_range sys_close_range sys_close_range
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..9b5e6bf0ce32 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..8c674a1e0072 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..7f7a89a96707 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
432 i386 fsmount sys_fsmount __ia32_sys_fsmount
433 i386 fspick sys_fspick __ia32_sys_fspick
+435 i386 close_range sys_close_range __ia32_sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..0f7d47ae921c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
431 common fsconfig __x64_sys_fsconfig
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
+435 common close_range __x64_sys_close_range
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..b489532265d0 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/fs/file.c b/fs/file.c
index 3da91a112bab..3680977a685a 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -641,6 +641,36 @@ int __close_fd(struct files_struct *files, unsigned fd)
}
EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+{
+ unsigned int cur_max;
+
+ if (fd > max_fd)
+ return -EINVAL;
+
+ rcu_read_lock();
+ cur_max = files_fdtable(files)->max_fds;
+ rcu_read_unlock();
+
+ /* cap to last valid index into fdtable */
+ if (max_fd >= cur_max)
+ max_fd = cur_max - 1;
+
+ while (fd <= max_fd)
+ __close_fd(files, fd++);
+
+ return 0;
+}
+
/*
* variant of __close_fd that gets a ref on the file for later fput
*/
diff --git a/fs/open.c b/fs/open.c
index 9c7d724a6f67..c7baaee7aa47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1174,6 +1174,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
return retval;
}
+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags: reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+ unsigned int, flags)
+{
+ if (flags)
+ return -EINVAL;
+
+ return __close_range(current->files, fd, max_fd);
+}
+
/*
* This routine simulates a hangup on the tty, to arrange that users
* are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..fcd07181a365 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
unsigned int fd, struct file *file);
extern int __close_fd(struct files_struct *files,
unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+ unsigned int max_fd);
extern int __close_fd_get_file(unsigned int fd, struct file **res);
extern struct kmem_cache *files_cachep;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..c0189e223255 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -441,6 +441,8 @@ asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
umode_t mode);
asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+ unsigned int flags);
asmlinkage long sys_vhangup(void);
/* fs/pipe.c */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..3f36c8745d24 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
#undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 436
/*
* 32 bit systems traditionally used different
--
2.21.0
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:
int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a problem for
Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes we enable them to adopt this api.
In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Andy Lutomirsky <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-api(a)vger.kernel.org
---
v1:
- kbuild test robot <lkp(a)intel.com>:
- add missing entry for pidfd_open to arch/arm/tools/syscall.tbl
- Oleg Nesterov <oleg(a)redhat.com>:
- use simpler thread-group leader check
v2:
- Oleg Nesterov <oleg(a)redhat.com>:
- avoid using additional variable
- remove unneeded comment
- Arnd Bergmann <arnd(a)arndb.de>:
- switch from 428 to 434 since the new mount api has taken it
- bump syscall numbers in arch/arm64/include/asm/unistd.h
- Joel Fernandes (Google) <joel(a)joelfernandes.org>:
- switch from ESRCH to EINVAL when the passed-in pid does not refer to a
thread-group leader
- Christian Brauner <christian(a)brauner.io>:
- rebase on v5.2-rc1
- adapt syscall number to account for new mount api syscalls
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/pid.h | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 2 +-
kernel/pid.c | 43 +++++++++++++++++++++
21 files changed, 66 insertions(+), 3 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..1db9bbcfb84e 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
541 common fsconfig sys_fsconfig
542 common fsmount sys_fsmount
543 common fspick sys_fspick
+544 common pidfd_open sys_pidfd_open
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..81e6e1817c45 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 70e6882853c0..e8f7d95a1481 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -44,7 +44,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 434
+#define __NR_compat_syscalls 435
#endif
#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..7a3158ccd68e 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_pidfd_open 434
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..ecc44926737b 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..9a3eb2558568 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..ad706f83c755 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..97035e19ad03 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
431 n32 fsconfig sys_fsconfig
432 n32 fsmount sys_fsmount
433 n32 fspick sys_fspick
+434 n32 pidfd_open sys_pidfd_open
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..5022b9e179c2 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..f2c3bda2d39f 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..6ebacfeaf853 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig sys_fsconfig
432 common fsmount sys_fsmount sys_fsmount
433 common fspick sys_fspick sys_fspick
+434 common pidfd_open sys_pidfd_open sys_pidfd_open
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..834c9c7d79fa 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..c58e71f21129 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..43e4429a5272 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
432 i386 fsmount sys_fsmount __ia32_sys_fsmount
433 i386 fspick sys_fspick __ia32_sys_fspick
+434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..1bee0a77fdd3 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
431 common fsconfig __x64_sys_fsconfig
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
+434 common pidfd_open __x64_sys_pidfd_open
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..782b81945ccc 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+434 common pidfd_open sys_pidfd_open
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..c938a92eab99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -67,6 +67,7 @@ struct pid
extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
+extern int pidfd_create(struct pid *pid);
static inline struct pid *get_pid(struct pid *pid)
{
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..989055e0b501 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
struct old_timex32 __user *tx);
asmlinkage long sys_syncfs(int fd);
asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
unsigned int vlen, unsigned flags);
asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..e5684a4512c0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_pidfd_open 434
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 435
/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index b4cba953040a..c3df226f47a1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1724,7 +1724,7 @@ const struct file_operations pidfd_fops = {
* Return: On success, a cloexec pidfd is returned.
* On error, a negative errno number will be returned.
*/
-static int pidfd_create(struct pid *pid)
+int pidfd_create(struct pid *pid)
{
int fd;
diff --git a/kernel/pid.c b/kernel/pid.c
index 89548d35eefb..8fc9d94f6ac1 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -37,6 +37,7 @@
#include <linux/syscalls.h>
#include <linux/proc_ns.h>
#include <linux/proc_fs.h>
+#include <linux/sched/signal.h>
#include <linux/sched/task.h>
#include <linux/idr.h>
@@ -450,6 +451,48 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
return idr_get_next(&ns->idr, &nr);
}
+/**
+ * pidfd_open() - Open new pid file descriptor.
+ *
+ * @pid: pid for which to retrieve a pidfd
+ * @flags: flags to pass
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set for
+ * the process identified by @pid. Currently, the process identified by
+ * @pid must be a thread-group leader. This restriction currently exists
+ * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
+ * be used with CLONE_THREAD) and pidfd polling (only supports thread group
+ * leaders).
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
+{
+ int fd, ret;
+ struct pid *p;
+
+ if (flags)
+ return -EINVAL;
+
+ if (pid <= 0)
+ return -EINVAL;
+
+ p = find_get_pid(pid);
+ if (!p)
+ return -ESRCH;
+
+ ret = 0;
+ rcu_read_lock();
+ if (!pid_task(p, PIDTYPE_TGID))
+ ret = -EINVAL;
+ rcu_read_unlock();
+
+ fd = ret ?: pidfd_create(p);
+ put_pid(p);
+ return fd;
+}
+
void __init pid_idr_init(void)
{
/* Verify no one has done anything silly: */
--
2.21.0
Hi,friend,
This is Daniel Murray and i am from Sinara Group Co.Ltd in Russia.
We are glad to know about your company from the web and we are interested in your products.
Could you kindly send us your Latest catalog and price list for our trial order.
Best Regards,
Daniel Murray
Purchasing Manager
The check for entry->index == 0 is done twice. One time should
be sufficient.
Suggested-by: Vitaly Kuznetsov <vkuznets(a)redhat.com>
Signed-off-by: Thomas Huth <thuth(a)redhat.com>
---
Vitaly already noticed this in his review to the "Fix a condition
in test_hv_cpuid()" patch a couple of days ago, but so far I haven't
seen any patch yet on the list that fixes this ... if I missed it
instead, please simply ignore this patch.
tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c b/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
index 9a21e912097c..8bdf1e7da6cc 100644
--- a/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
+++ b/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
@@ -52,9 +52,6 @@ static void test_hv_cpuid(struct kvm_cpuid2 *hv_cpuid_entries,
TEST_ASSERT(entry->index == 0,
".index field should be zero");
- TEST_ASSERT(entry->index == 0,
- ".index field should be zero");
-
TEST_ASSERT(entry->flags == 0,
".flags field should be zero");
--
2.21.0
The code is trying to check that all the padding is zeroed out and it
does this:
entry->padding[0] == entry->padding[1] == entry->padding[2] == 0
Assume everything is zeroed correctly, then the first comparison is
true, the next comparison is false and false is equal to zero so the
overall condition is true. This bug doesn't affect run time very
badly, but the code should instead just check that all three paddings
are zero individually.
Also the error message was copy and pasted from an earlier error and it
wasn't correct.
Fixes: 7edcb7343327 ("KVM: selftests: Add hyperv_cpuid test")
Signed-off-by: Dan Carpenter <dan.carpenter(a)oracle.com>
---
tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c b/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
index 9a21e912097c..63b9fc3fdfbe 100644
--- a/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
+++ b/tools/testing/selftests/kvm/x86_64/hyperv_cpuid.c
@@ -58,9 +58,8 @@ static void test_hv_cpuid(struct kvm_cpuid2 *hv_cpuid_entries,
TEST_ASSERT(entry->flags == 0,
".flags field should be zero");
- TEST_ASSERT(entry->padding[0] == entry->padding[1]
- == entry->padding[2] == 0,
- ".index field should be zero");
+ TEST_ASSERT(!entry->padding[0] && !entry->padding[1] &&
+ !entry->padding[2], "padding should be zero");
/*
* If needed for debug:
--
2.20.1
> Hi Alex!
>
> Sorry for the late reply, somehow I missed your e-mails.
>
> The patchset looks good to me, except a typo in subjects/commit logs:
> you probably meant "unexpected failures".
>
> Please, fix and feel free to use:
> Reviewed-by: Roman Gushchin <guro(a)fb.com>
> for the series.
>
Thanks Roman!
BTW, do you know other test cases for cgroup2 function or performance?
Thanks
Alex
The test_core will skip the
test_cgcore_no_internal_process_constraint_on_threads test case if the
'cpu' controller missing in root's subtree_control. In fact we need to
set the 'cpu' in subtree_control, to make the testing meaningful.
./test_core
...
ok 4 # skip test_cgcore_no_internal_process_constraint_on_threads
...
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Claudio Zumbo <claudioz(a)fb.com>
Cc: Claudio <claudiozumbo(a)gmail.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
---
tools/testing/selftests/cgroup/test_core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index b1a570d7c557..2ffc3e07d98d 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -198,7 +198,7 @@ static int test_cgcore_no_internal_process_constraint_on_threads(const char *roo
char *parent = NULL, *child = NULL;
if (cg_read_strstr(root, "cgroup.controllers", "cpu") ||
- cg_read_strstr(root, "cgroup.subtree_control", "cpu")) {
+ cg_write(root, "cgroup.subtree_control", "+cpu")) {
ret = KSFT_SKIP;
goto cleanup;
}
--
2.19.1.856.g8858448bb
The cgroup testing relys on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, some test cases will be failed
as following:
$sudo ./test_core
not ok 1 test_cgcore_internal_process_constraint
ok 2 test_cgcore_top_down_constraint_enable
not ok 3 test_cgcore_top_down_constraint_disable
...
To correct this unexpected failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Claudio Zumbo <claudioz(a)fb.com>
Cc: Claudio <claudiozumbo(a)gmail.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Reviewed-by: Roman Gushchin <guro(a)fb.com>
---
tools/testing/selftests/cgroup/test_core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index be59f9c34ea2..b1a570d7c557 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -376,6 +376,11 @@ int main(int argc, char *argv[])
if (cg_find_unified_root(root, sizeof(root)))
ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set root memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.19.1.856.g8858448bb
> Hi Alex!
>
> Sorry for the late reply, somehow I missed your e-mails.
>
> The patchset looks good to me, except a typo in subjects/commit logs:
> you probably meant "unexpected failures".
>
> Please, fix and feel free to use:
> Reviewed-by: Roman Gushchin <guro(a)fb.com>
> for the series.
Hi Roman,
Thanks for the typo correct and review. I will resend them with your suggestion.
BTW, How do you test cgroup2 except kselftest? Any other function/performance testing case?
Thanks a lot!
Alex
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:
int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a huge problem
for Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfd for PID-based processes we enable them to adopt this api.
In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Acked-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Andy Lutomirsky <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-api(a)vger.kernel.org
---
v1:
- kbuild test robot <lkp(a)intel.com>:
- add missing entry for pidfd_open to arch/arm/tools/syscall.tbl
- Oleg Nesterov <oleg(a)redhat.com>:
- use simpler thread-group leader check
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/pid.h | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 2 +-
kernel/pid.c | 50 +++++++++++++++++++++
20 files changed, 72 insertions(+), 2 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 165f268beafc..ddc3c93ad7a7 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -467,3 +467,4 @@
535 common io_uring_setup sys_io_uring_setup
536 common io_uring_enter sys_io_uring_enter
537 common io_uring_register sys_io_uring_register
+538 common pidfd_open sys_pidfd_open
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 0393917eaa57..fc41fb34a636 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -441,3 +441,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 23f1a44acada..350e2049b4a9 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -874,6 +874,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
#define __NR_io_uring_register 427
__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 56e3d0b685e1..7115f6dd347a 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -348,3 +348,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index df4ec3ec71d1..44bf12b16ffe 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -427,3 +427,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 4964947732af..0d32e5152dc0 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 9392dfe33f97..726e107b3c9f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -366,3 +366,4 @@
425 n32 io_uring_setup sys_io_uring_setup
426 n32 io_uring_enter sys_io_uring_enter
427 n32 io_uring_register sys_io_uring_register
+428 n32 pidfd_open sys_pidfd_open
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index fe8ca623add8..83b46b568d51 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -424,3 +424,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 00f5a63c8d9a..5294d04d7fa5 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -509,3 +509,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 061418f787c3..dcdb838adf49 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
425 common io_uring_setup sys_io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open sys_pidfd_open
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 480b057556ee..8e66edfbc521 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index a1dd24307b00..d6f3bc686939 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..1af6b469160a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup
426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter
427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register
+428 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..c18e6ebe3387 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
425 common io_uring_setup __x64_sys_io_uring_setup
426 common io_uring_enter __x64_sys_io_uring_enter
427 common io_uring_register __x64_sys_io_uring_register
+428 common pidfd_open __x64_sys_pidfd_open
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 30084eaf8422..21ee795f3003 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -398,3 +398,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..c938a92eab99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -67,6 +67,7 @@ struct pid
extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
+extern int pidfd_create(struct pid *pid);
static inline struct pid *get_pid(struct pid *pid)
{
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..989055e0b501 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
struct old_timex32 __user *tx);
asmlinkage long sys_syncfs(int fd);
asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
unsigned int vlen, unsigned flags);
asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dee7292e1df6..94a257a93d20 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
#define __NR_io_uring_register 427
__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#undef __NR_syscalls
-#define __NR_syscalls 428
+#define __NR_syscalls 429
/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index 737db1828437..980cc1d2b8d4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1714,7 +1714,7 @@ const struct file_operations pidfd_fops = {
* Return: On success, a cloexec pidfd is returned.
* On error, a negative errno number will be returned.
*/
-static int pidfd_create(struct pid *pid)
+int pidfd_create(struct pid *pid)
{
int fd;
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..4afca3d6dcb8 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -38,6 +38,7 @@
#include <linux/syscalls.h>
#include <linux/proc_ns.h>
#include <linux/proc_fs.h>
+#include <linux/sched/signal.h>
#include <linux/sched/task.h>
#include <linux/idr.h>
@@ -451,6 +452,55 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
return idr_get_next(&ns->idr, &nr);
}
+/**
+ * pidfd_open() - Open new pid file descriptor.
+ *
+ * @pid: pid for which to retrieve a pidfd
+ * @flags: flags to pass
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set for
+ * the process identified by @pid. Currently, the process identified by
+ * @pid must be a thread-group leader. This restriction currently exists
+ * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
+ * be used with CLONE_THREAD) and pidfd polling (only supports thread group
+ * leaders).
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
+{
+ int fd, ret;
+ struct pid *p;
+ struct task_struct *tsk;
+
+ if (flags)
+ return -EINVAL;
+
+ if (pid <= 0)
+ return -EINVAL;
+
+ p = find_get_pid(pid);
+ if (!p)
+ return -ESRCH;
+
+ ret = 0;
+ rcu_read_lock();
+ /*
+ * If this returns non-NULL the pid was used as a thread-group
+ * leader. Note, we race with exec here: If it changes the
+ * thread-group leader we might return the old leader.
+ */
+ tsk = pid_task(p, PIDTYPE_TGID);
+ if (!tsk)
+ ret = -ESRCH;
+ rcu_read_unlock();
+
+ fd = ret ?: pidfd_create(p);
+ put_pid(p);
+ return fd;
+}
+
void __init pid_idr_init(void)
{
/* Verify no one has done anything silly: */
--
2.21.0
Atom-based CPUs trigger stack fault when invoke 32-bit SYSENTER instruction
with invalid register values. So we also need SIGBUS handling in this case.
Following is assembly when the fault exception happens.
(gdb) disassemble $eip
Dump of assembler code for function __kernel_vsyscall:
0xf7fd8fe0 <+0>: push %ecx
0xf7fd8fe1 <+1>: push %edx
0xf7fd8fe2 <+2>: push %ebp
0xf7fd8fe3 <+3>: mov %esp,%ebp
0xf7fd8fe5 <+5>: sysenter
0xf7fd8fe7 <+7>: int $0x80
=> 0xf7fd8fe9 <+9>: pop %ebp
0xf7fd8fea <+10>: pop %edx
0xf7fd8feb <+11>: pop %ecx
0xf7fd8fec <+12>: ret
End of assembler dump.
According to Intel SDM, this could also be a Stack Segment Fault(#SS, 12),
except a normal Page Fault(#PF, 14). Especially, in section 6.9 of Vol.3A,
both stack and page faults are within the 10th(lowest priority) class, and
as it said, "exceptions within each class are implementation-dependent and
may vary from processor to processor". It's expected for processors like
Intel Atom to trigger stack fault(SIGBUS), while we get page fault(SIGSEGV)
from common Core processors.
Signed-off-by: Tong Bo <bo.tong(a)intel.com>
Acked-by: Andy Lutomirski <luto(a)kernel.org>
---
tools/testing/selftests/x86/syscall_arg_fault.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/x86/syscall_arg_fault.c b/tools/testing/selftests/x86/syscall_arg_fault.c
index 7db4fc9..d2548401 100644
--- a/tools/testing/selftests/x86/syscall_arg_fault.c
+++ b/tools/testing/selftests/x86/syscall_arg_fault.c
@@ -43,7 +43,7 @@ static sigjmp_buf jmpbuf;
static volatile sig_atomic_t n_errs;
-static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
+static void sigsegv_or_sigbus(int sig, siginfo_t *info, void *ctx_void)
{
ucontext_t *ctx = (ucontext_t*)ctx_void;
@@ -73,7 +73,13 @@ int main()
if (sigaltstack(&stack, NULL) != 0)
err(1, "sigaltstack");
- sethandler(SIGSEGV, sigsegv, SA_ONSTACK);
+ sethandler(SIGSEGV, sigsegv_or_sigbus, SA_ONSTACK);
+ /*
+ * The actual exception can vary. On Atom CPUs, we get #SS
+ * instead of #PF when the vDSO fails to access the stack when
+ * ESP is too close to 2^32, and #SS causes SIGBUS.
+ */
+ sethandler(SIGBUS, sigsegv_or_sigbus, SA_ONSTACK);
sethandler(SIGILL, sigill, SA_ONSTACK);
/*
--
2.7.4
Hi Linus,
Please pull the following Kselftest update for Linux 5.2-rc1
This kselftest second update for Linux 5.2-rc1 consists of
Kselftest framework fixes from Shuah Khan
- kselftest framework bpf build/test workflow regression fix
- Fix to kselftest install to use default install path
- Fix to kselftest KBUILD_OUTPUT builds to not clutter main
KBUILD_OUTPUT directory with selftest objects
- .gitignore fixes from Kelsey Skunberg
- rseq selftests updates from Mathieu Desnoyers and Martin Schwidefsky:
They change the per-architecture pre-abort signatures to ensure those
are valid trap instructions.
The way exit points are presented to debuggers is enhanced, ensuring
all exit points are present, so debuggers don't have to disassemble
rseq critical section to properly skip over them.
Discussions with the glibc community is reaching a consensus of
exposing a __rseq_handled symbol from glibc to coexist with rseq early
adopters.
Update the rseq selftest code to expose and use this symbol.
Support for compiling asm goto with clang is added with the
"-no-integrated-as" compiler switch, similarly to the top level kernel
Makefile.
- kselftest Makefile test run output refactoring and making test
output TAP13 compliant from Kees Cook:
This re-factors the selftest Makefiles to extract the test running
logic to be reused between "run_tests" and "emit_tests", while also
fixing up the test output to be TAP version 13 compliant:
- added "plan" line
- fixed result line syntax
- moved all test output to be "# "-prefixed as TAP "diagnostic"
lines
The prefixing code includes a fallback mode for limited execution
environments.
Additionally, the plan lines are fixed for all callers of kselftest.h.
diff is attached.
Also, there is a merge conflict with the following commit which went
through the s390 maintainer tree and is now in your master which will
require a minor fix-up.
commit e24e4712efad737ca09ff299276737331cd021d9
Author: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
rseq/selftests: s390: use trap4 for RSEQ_SIG
thanks,
-- Shuah
---------------------------------------------------------------
The following changes since commit d917fb876f6eaeeea8a2b620d2a266ce26372f4d:
selftests: build and run gpio when output directory is the src dir
(2019-04-22 17:02:26 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-5.2-rc1-2
for you to fetch changes up to 61c2018c0743fe0c9ca68e308b5727b8a7c3d544:
selftests: avoid KBUILD_OUTPUT dir cluttering with selftest objects
(2019-05-14 17:37:41 -0600)
----------------------------------------------------------------
linux-kselftest-5.2-rc1-2
This kselftest second update for Linux 5.2-rc1 consists of
Kselftest framework fixes from Shuah Khan
- kselftest framework bpf build/test workflow regression fix
- Fix to kselftest install to use default install path
- Fix to kselftest KBUILD_OUTPUT builds to not clutter main
KBUILD_OUTPUT directory with selftest objects
- .gitignore fixes from Kelsey Skunberg
- rseq selftests updates from Mathieu Desnoyers and Martin Schwidefsky:
They change the per-architecture pre-abort signatures to ensure those
are valid trap instructions.
The way exit points are presented to debuggers is enhanced, ensuring
all exit points are present, so debuggers don't have to disassemble
rseq critical section to properly skip over them.
Discussions with the glibc community is reaching a consensus of exposing
a __rseq_handled symbol from glibc to coexist with rseq early adopters.
Update the rseq selftest code to expose and use this symbol.
Support for compiling asm goto with clang is added with the
"-no-integrated-as" compiler switch, similarly to the top level kernel
Makefile.
- kselftest Makefile test run output refactoring and making test
output TAP13 compliant from Kees Cook:
This re-factors the selftest Makefiles to extract the test running logic
to be reused between "run_tests" and "emit_tests", while also fixing
up the test output to be TAP version 13 compliant:
- added "plan" line
- fixed result line syntax
- moved all test output to be "# "-prefixed as TAP "diagnostic"
lines
The prefixing code includes a fallback mode for limited execution
environments.
Additionally, the plan lines are fixed for all callers of kselftest.h.
----------------------------------------------------------------
Kees Cook (8):
selftests: Extract single-test shell logic from lib.mk
selftests: Use runner.sh for emit targets
selftests: Extract logic for multiple test runs
selftests: Add plan line and fix result line syntax
selftests: Distinguish between missing and non-executable
selftests: Move test output to diagnostic lines
selftests: Remove KSFT_TAP_LEVEL
selftests: Add test plan API to kselftest.h and adjust callers
Kelsey Skunberg (2):
selftests: pidfd: Create .gitignore to include pidfd_test
selftests: drivers: Create .gitignore to include /dma-buf/udmabuf
Martin Schwidefsky (1):
rseq/selftests: s390: use trap4 for RSEQ_SIG
Mathieu Desnoyers (11):
rseq/selftests: x86: Work-around bogus gcc-8 optimisation
rseq/selftests: Add __rseq_exit_point_array section for debuggers
rseq/selftests: Introduce __rseq_cs_ptr_array, rename
__rseq_table to __rseq_cs
rseq/selftests: Use __rseq_handled symbol to coexist with glibc
rseq/selftests: s390: use jg instruction for jumps outside of the asm
rseq/selftests: x86: use ud1 instruction as RSEQ_SIG opcode
rseq/selftests: arm: use udf instruction for RSEQ_SIG
rseq/selftests: aarch64 code signature: handle big-endian environment
rseq/selftests: powerpc code signature: generate valid instructions
rseq/selftests: mips: use break instruction for RSEQ_SIG
rseq/selftests: add -no-integrated-as for clang
Shuah Khan (3):
selftests: fix install target to use default install path
selftests: fix bpf build/test workflow regression when
KBUILD_OUTPUT is set
selftests: avoid KBUILD_OUTPUT dir cluttering with selftest objects
tools/testing/selftests/.gitignore | 1 -
tools/testing/selftests/Makefile | 39 +--
.../selftests/breakpoints/breakpoint_test.c | 15 +-
.../selftests/breakpoints/breakpoint_test_arm64.c | 3 +-
.../breakpoints/step_after_suspend_test.c | 8 +
tools/testing/selftests/capabilities/test_execve.c | 6 +-
tools/testing/selftests/drivers/.gitignore | 1 +
.../selftests/futex/functional/futex_requeue_pi.c | 1 +
.../functional/futex_requeue_pi_mismatched_ops.c | 1 +
.../functional/futex_requeue_pi_signal_restart.c | 1 +
.../functional/futex_wait_private_mapped_file.c | 1 +
.../futex/functional/futex_wait_timeout.c | 1 +
.../functional/futex_wait_uninitialized_heap.c | 1 +
.../futex/functional/futex_wait_wouldblock.c | 1 +
tools/testing/selftests/kselftest.h | 17 +-
tools/testing/selftests/kselftest/prefix.pl | 23 ++
tools/testing/selftests/kselftest/runner.sh | 86 +++++++
tools/testing/selftests/lib.mk | 76 ++----
.../testing/selftests/membarrier/membarrier_test.c | 1 +
tools/testing/selftests/pidfd/.gitignore | 1 +
tools/testing/selftests/pidfd/pidfd_test.c | 1 +
tools/testing/selftests/rseq/Makefile | 8 +-
tools/testing/selftests/rseq/rseq-arm.h | 132 +++++++++--
tools/testing/selftests/rseq/rseq-arm64.h | 74 +++++-
tools/testing/selftests/rseq/rseq-mips.h | 115 ++++++++-
tools/testing/selftests/rseq/rseq-ppc.h | 90 ++++++-
tools/testing/selftests/rseq/rseq-s390.h | 78 +++++-
tools/testing/selftests/rseq/rseq-x86.h | 264
++++++++++++++-------
tools/testing/selftests/rseq/rseq.c | 55 ++++-
tools/testing/selftests/rseq/rseq.h | 1 +
tools/testing/selftests/sigaltstack/sas.c | 1 +
tools/testing/selftests/sync/sync_test.c | 1 +
32 files changed, 883 insertions(+), 221 deletions(-)
create mode 100644 tools/testing/selftests/drivers/.gitignore
create mode 100755 tools/testing/selftests/kselftest/prefix.pl
create mode 100644 tools/testing/selftests/kselftest/runner.sh
create mode 100644 tools/testing/selftests/pidfd/.gitignore
---------------------------------------------------------------
This series of patches move the commonly used bpf_printk macro to
bpf_helpers.h which is already included in all BPF programs which
defined that macro on their own.
v1->v2:
- If HBM_DEBUG is not defined in hbm sample, undefine bpf_printk and set
an empty macro for it.
Michal Rostecki (2):
selftests: bpf: Move bpf_printk to bpf_helpers.h
samples: bpf: Do not define bpf_printk macro
samples/bpf/hbm_kern.h | 11 ++---------
samples/bpf/tcp_basertt_kern.c | 7 -------
samples/bpf/tcp_bufs_kern.c | 7 -------
samples/bpf/tcp_clamp_kern.c | 7 -------
samples/bpf/tcp_cong_kern.c | 7 -------
samples/bpf/tcp_iw_kern.c | 7 -------
samples/bpf/tcp_rwnd_kern.c | 7 -------
samples/bpf/tcp_synrto_kern.c | 7 -------
samples/bpf/tcp_tos_reflect_kern.c | 7 -------
samples/bpf/xdp_sample_pkts_kern.c | 7 -------
tools/testing/selftests/bpf/bpf_helpers.h | 8 ++++++++
.../testing/selftests/bpf/progs/sockmap_parse_prog.c | 7 -------
.../selftests/bpf/progs/sockmap_tcp_msg_prog.c | 7 -------
.../selftests/bpf/progs/sockmap_verdict_prog.c | 7 -------
.../testing/selftests/bpf/progs/test_lwt_seg6local.c | 7 -------
tools/testing/selftests/bpf/progs/test_xdp_noinline.c | 7 -------
tools/testing/selftests/bpf/test_sockmap_kern.h | 7 -------
17 files changed, 10 insertions(+), 114 deletions(-)
--
2.21.0
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:
int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a huge problem
for Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfd for PID-based processes we enable them to adopt this api.
In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Andy Lutomirsky <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-api(a)vger.kernel.org
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/pid.h | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 2 +-
kernel/pid.c | 48 +++++++++++++++++++++
19 files changed, 69 insertions(+), 2 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 165f268beafc..ddc3c93ad7a7 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -467,3 +467,4 @@
535 common io_uring_setup sys_io_uring_setup
536 common io_uring_enter sys_io_uring_enter
537 common io_uring_register sys_io_uring_register
+538 common pidfd_open sys_pidfd_open
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 23f1a44acada..350e2049b4a9 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -874,6 +874,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
#define __NR_io_uring_register 427
__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 56e3d0b685e1..7115f6dd347a 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -348,3 +348,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index df4ec3ec71d1..44bf12b16ffe 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -427,3 +427,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 4964947732af..0d32e5152dc0 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 9392dfe33f97..726e107b3c9f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -366,3 +366,4 @@
425 n32 io_uring_setup sys_io_uring_setup
426 n32 io_uring_enter sys_io_uring_enter
427 n32 io_uring_register sys_io_uring_register
+428 n32 pidfd_open sys_pidfd_open
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index fe8ca623add8..83b46b568d51 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -424,3 +424,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 00f5a63c8d9a..5294d04d7fa5 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -509,3 +509,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 061418f787c3..dcdb838adf49 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
425 common io_uring_setup sys_io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open sys_pidfd_open
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 480b057556ee..8e66edfbc521 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index a1dd24307b00..d6f3bc686939 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..1af6b469160a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup
426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter
427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register
+428 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..c18e6ebe3387 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
425 common io_uring_setup __x64_sys_io_uring_setup
426 common io_uring_enter __x64_sys_io_uring_enter
427 common io_uring_register __x64_sys_io_uring_register
+428 common pidfd_open __x64_sys_pidfd_open
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 30084eaf8422..21ee795f3003 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -398,3 +398,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..c938a92eab99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -67,6 +67,7 @@ struct pid
extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
+extern int pidfd_create(struct pid *pid);
static inline struct pid *get_pid(struct pid *pid)
{
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..989055e0b501 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
struct old_timex32 __user *tx);
asmlinkage long sys_syncfs(int fd);
asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
unsigned int vlen, unsigned flags);
asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dee7292e1df6..94a257a93d20 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
#define __NR_io_uring_register 427
__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#undef __NR_syscalls
-#define __NR_syscalls 428
+#define __NR_syscalls 429
/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index 737db1828437..980cc1d2b8d4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1714,7 +1714,7 @@ const struct file_operations pidfd_fops = {
* Return: On success, a cloexec pidfd is returned.
* On error, a negative errno number will be returned.
*/
-static int pidfd_create(struct pid *pid)
+int pidfd_create(struct pid *pid)
{
int fd;
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..237d18d6ecb8 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -38,6 +38,7 @@
#include <linux/syscalls.h>
#include <linux/proc_ns.h>
#include <linux/proc_fs.h>
+#include <linux/sched/signal.h>
#include <linux/sched/task.h>
#include <linux/idr.h>
@@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
return idr_get_next(&ns->idr, &nr);
}
+/**
+ * pidfd_open() - Open new pid file descriptor.
+ *
+ * @pid: pid for which to retrieve a pidfd
+ * @flags: flags to pass
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set for
+ * the process identified by @pid. Currently, the process identified by
+ * @pid must be a thread-group leader. This restriction currently exists
+ * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
+ * be used with CLONE_THREAD) and pidfd polling (only supports thread group
+ * leaders).
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
+{
+ int fd, ret;
+ struct pid *p;
+ struct task_struct *tsk;
+
+ if (flags)
+ return -EINVAL;
+
+ if (pid <= 0)
+ return -EINVAL;
+
+ p = find_get_pid(pid);
+ if (!p)
+ return -ESRCH;
+
+ rcu_read_lock();
+ tsk = pid_task(p, PIDTYPE_PID);
+ if (!tsk)
+ ret = -ESRCH;
+ else if (unlikely(!thread_group_leader(tsk)))
+ ret = -EINVAL;
+ else
+ ret = 0;
+ rcu_read_unlock();
+
+ fd = ret ?: pidfd_create(p);
+ put_pid(p);
+ return fd;
+}
+
void __init pid_idr_init(void)
{
/* Verify no one has done anything silly: */
--
2.21.0
From: Paolo Bonzini <pbonzini(a)redhat.com>
[ Upstream commit 76d58e0f07ec203bbdfcaabd9a9fc10a5a3ed5ea ]
If a memory slot's size is not a multiple of 64 pages (256K), then
the KVM_CLEAR_DIRTY_LOG API is unusable: clearing the final 64 pages
either requires the requested page range to go beyond memslot->npages,
or requires log->num_pages to be unaligned, and kvm_clear_dirty_log_protect
requires log->num_pages to be both in range and aligned.
To allow this case, allow log->num_pages not to be a multiple of 64 if
it ends exactly on the last page of the slot.
Reported-by: Peter Xu <peterx(a)redhat.com>
Fixes: 98938aa8edd6 ("KVM: validate userspace input in kvm_clear_dirty_log_protect()", 2019-01-02)
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
Documentation/virtual/kvm/api.txt | 5 +++--
tools/testing/selftests/kvm/dirty_log_test.c | 9 ++++++---
virt/kvm/kvm_main.c | 7 ++++---
3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ba8927c0d45c4..a1b8e6d92298c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3790,8 +3790,9 @@ The ioctl clears the dirty status of pages in a memory slot, according to
the bitmap that is passed in struct kvm_clear_dirty_log's dirty_bitmap
field. Bit 0 of the bitmap corresponds to page "first_page" in the
memory slot, and num_pages is the size in bits of the input bitmap.
-Both first_page and num_pages must be a multiple of 64. For each bit
-that is set in the input bitmap, the corresponding page is marked "clean"
+first_page must be a multiple of 64; num_pages must also be a multiple of
+64 unless first_page + num_pages is the size of the memory slot. For each
+bit that is set in the input bitmap, the corresponding page is marked "clean"
in KVM's dirty bitmap, and dirty tracking is re-enabled for that page
(for example via write-protection, or by clearing the dirty bit in
a page table entry).
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 4715cfba20dce..93f99c6b7d79e 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -288,8 +288,11 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
#endif
max_gfn = (1ul << (guest_pa_bits - guest_page_shift)) - 1;
guest_page_size = (1ul << guest_page_shift);
- /* 1G of guest page sized pages */
- guest_num_pages = (1ul << (30 - guest_page_shift));
+ /*
+ * A little more than 1G of guest page sized pages. Cover the
+ * case where the size is not aligned to 64 pages.
+ */
+ guest_num_pages = (1ul << (30 - guest_page_shift)) + 3;
host_page_size = getpagesize();
host_num_pages = (guest_num_pages * guest_page_size) / host_page_size +
!!((guest_num_pages * guest_page_size) % host_page_size);
@@ -359,7 +362,7 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
kvm_vm_get_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap);
#ifdef USE_CLEAR_DIRTY_LOG
kvm_vm_clear_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap, 0,
- DIV_ROUND_UP(host_num_pages, 64) * 64);
+ host_num_pages);
#endif
vm_dirty_log_verify(bmap);
iteration++;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b4f2d892a1d36..c9b102e65a3a7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1241,7 +1241,7 @@ int kvm_clear_dirty_log_protect(struct kvm *kvm,
if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
return -EINVAL;
- if ((log->first_page & 63) || (log->num_pages & 63))
+ if (log->first_page & 63)
return -EINVAL;
slots = __kvm_memslots(kvm, as_id);
@@ -1254,8 +1254,9 @@ int kvm_clear_dirty_log_protect(struct kvm *kvm,
n = kvm_dirty_bitmap_bytes(memslot);
if (log->first_page > memslot->npages ||
- log->num_pages > memslot->npages - log->first_page)
- return -EINVAL;
+ log->num_pages > memslot->npages - log->first_page ||
+ (log->num_pages < memslot->npages - log->first_page && (log->num_pages & 63)))
+ return -EINVAL;
*flush = false;
dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
--
2.20.1
Linus, Greg, Masahiro, Here are some simple fixes for the kheaders feature.
Please consider these patches for an rc release. They are based on Linus's
master branch. The only difference between the last series [1] and this one is
I squashed 1/3 and 3/3 and rebased. Thanks!
[1] https://patchwork.kernel.org/cover/10939557/
Joel Fernandes (Google) (2):
kheaders: Move from proc to sysfs
kheaders: Do not regenerate archive if config is not changed
init/Kconfig | 17 +++++----
kernel/Makefile | 4 +--
kernel/{gen_ikh_data.sh => gen_kheaders.sh} | 17 ++++++---
kernel/kheaders.c | 40 +++++++++------------
4 files changed, 38 insertions(+), 40 deletions(-)
rename kernel/{gen_ikh_data.sh => gen_kheaders.sh} (82%)
--
2.21.0.1020.gf2820cf01a-goog
## TLDR
I mostly wanted to incorporate feedback I got over the last week and a
half.
Biggest things to look out for:
- KUnit core now outputs results in TAP14.
- Heavily reworked tools/testing/kunit/kunit.py
- Changed how parsing works.
- Added testing.
- Greg, Logan, you might want to re-review this.
- Added documentation on how to use KUnit on non-UML kernels. You can
see the docs rendered here[1].
There is still some discussion going on on the [PATCH v2 00/17] thread,
but I wanted to get some of these updates out before they got too stale
(and too difficult for me to keep track of). I hope no one minds.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in under a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitudes faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily solving the classic problem
of difficulty in exercising error handling code.
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[2].
Additionally for convenience, I have applied these patches to a
branch[3].
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.1/v3 branch.
## Changes Since Last Version
- Converted KUnit core to print test results in TAP14 format as
suggested by Greg and Frank.
- Heavily reworked tools/testing/kunit/kunit.py
- Changed how parsing works.
- Added testing.
- Added documentation on how to use KUnit on non-UML kernels. You can
see the docs rendered here[1].
- Added a new set of EXPECTs and ASSERTs for pointer comparison.
- Removed more function indirection as suggested by Logan.
- Added a new patch that adds `kunit_try_catch_throw` to objtool's
noreturn list.
- Fixed a number of minorish issues pointed out by Shuah, Masahiro, and
kbuild bot.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#kuni…
[2] https://google.github.io/kunit-docs/third_party/kernel/docs/
[3] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.1/v3
--
2.21.0.1020.gf2820cf01a-goog
The cgroup testing relys on the root cgroup's subtree_control setting,
If the 'memory' controller isn't set, all test cases will be failed
as following:
$ sudo ./test_memcontrol
not ok 1 test_memcg_subtree_control
not ok 2 test_memcg_current
ok 3 # skip test_memcg_min
not ok 4 test_memcg_low
not ok 5 test_memcg_high
not ok 6 test_memcg_max
not ok 7 test_memcg_oom_events
ok 8 # skip test_memcg_swap_max
not ok 9 test_memcg_sock
not ok 10 test_memcg_oom_group_leaf_events
not ok 11 test_memcg_oom_group_parent_events
not ok 12 test_memcg_oom_group_score_events
To correct this unexcepted failure, this patch write the 'memory' to
subtree_control of root to get a right result.
Signed-off-by: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Jay Kamat <jgkamat(a)fb.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
---
tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 6f339882a6ca..73612d604a2a 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1205,6 +1205,10 @@ int main(int argc, char **argv)
if (cg_read_strstr(root, "cgroup.controllers", "memory"))
ksft_exit_skip("memory controller isn't available\n");
+ if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+ if (cg_write(root, "cgroup.subtree_control", "+memory"))
+ ksft_exit_skip("Failed to set root memory controller\n");
+
for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
--
2.19.1.856.g8858448bb
Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
This patch adds polling support to pidfd. Using the polling support, LMK
will be able to get notified when a process exists in race-free and fast
way, and allows the LMK to do other things (such as by polling on other
fds) while awaiting the process being killed to die.
For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when
the tasks waiting on a poll of pidfd are also awakened in this patch.
We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
it in task_struct would not be option in this case because the task can
be reaped and destroyed before the poll returns.
2. By including the struct pid for the waitqueue means that during
de_thread(), the new thread group leader automatically gets the new
waitqueue/pid even though its task_struct is different.
Appropriate test cases are added in the second patch to provide coverage
of all the cases the patch is handling.
Andy had a similar patch [1] in the past which was a good reference
however this patch tries to handle different situations properly related
to thread group existence, and how/where it notifies. And also solves
other bugs (waitqueue lifetime). Daniel had a similar patch [2]
recently which this patch supercedes.
[1] https://lore.kernel.org/patchwork/patch/345098/
[2] https://lore.kernel.org/lkml/20181029175322.189042-1-dancol@google.com/
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Daniel Colascione <dancol(a)google.com>
Cc: Christian Brauner <christian(a)brauner.io>
Cc: Jann Horn <jannh(a)google.com>
Cc: Tim Murray <timmurray(a)google.com>
Cc: Jonathan Kowalski <bl0pbl33p(a)gmail.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: kernel-team(a)android.com
(Oleg improved the code by showing how to avoid tasklist_lock)
Suggested-by: Oleg Nesterov <oleg(a)redhat.com>
Co-developed-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
v1 -> v2:
* Restructure poll code to avoid tasklist_lock (Oleg)
* use task_pid instead of get_pid_task in notify_pidfd (Oleg)
* Added comments to code, commit message nits (Christian)
* Test case nits/improvements (Christian)
RFC -> v1:
* Based on CLONE_PIDFD patches: https://lwn.net/Articles/786244/
* Updated selftests.
* Renamed poll wake function to do_notify_pidfd.
* Removed depending on EXIT flags
* Removed POLLERR flag since semantics are controversial and
we don't have usecases for it right now (later we can add if there's
a need for it).
include/linux/pid.h | 3 +++
kernel/fork.c | 29 +++++++++++++++++++++++++++++
kernel/pid.c | 2 ++
kernel/signal.c | 11 +++++++++++
4 files changed, 45 insertions(+)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..1484db6ca8d1 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -3,6 +3,7 @@
#define _LINUX_PID_H
#include <linux/rculist.h>
+#include <linux/wait.h>
enum pid_type
{
@@ -60,6 +61,8 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+ /* wait queue for pidfd notifications */
+ wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
};
diff --git a/kernel/fork.c b/kernel/fork.c
index 5525837ed80e..721f8c9d2921 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1685,8 +1685,37 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
}
#endif
+/*
+ * Poll support for process exit notification.
+ */
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+ struct task_struct *task;
+ struct pid *pid = file->private_data;
+ int poll_flags = 0;
+
+ poll_wait(file, &pid->wait_pidfd, pts);
+
+ rcu_read_lock();
+ task = pid_task(pid, PIDTYPE_PID);
+ WARN_ON_ONCE(task && !thread_group_leader(task));
+
+ /*
+ * Inform pollers only when the whole thread group exits, if thread
+ * group leader exits before all other threads in the group, then
+ * poll(2) should block, similar to the wait(2) family.
+ */
+ if (!task || (task->exit_state && thread_group_empty(task)))
+ poll_flags = POLLIN | POLLRDNORM;
+ rcu_read_unlock();
+
+ return poll_flags;
+}
+
+
const struct file_operations pidfd_fops = {
.release = pidfd_release,
+ .poll = pidfd_poll,
#ifdef CONFIG_PROC_FS
.show_fdinfo = pidfd_show_fdinfo,
#endif
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..5c90c239242f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
+ init_waitqueue_head(&pid->wait_pidfd);
+
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->pid_allocated & PIDNS_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index 1581140f2d99..a17fff073c3d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1800,6 +1800,14 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
return ret;
}
+static void do_notify_pidfd(struct task_struct *task)
+{
+ struct pid *pid;
+
+ pid = task_pid(task);
+ wake_up_all(&pid->wait_pidfd);
+}
+
/*
* Let a parent know about the death of a child.
* For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1823,6 +1831,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
BUG_ON(!tsk->ptrace &&
(tsk->group_leader != tsk || !thread_group_empty(tsk)));
+ /* Wake up all pidfd waiters */
+ do_notify_pidfd(tsk);
+
if (sig != SIGCHLD) {
/*
* This is only possible if parent == real_parent.
--
2.21.0.593.g511ec345e18-goog
On Wed, May 08, 2019 at 05:39:07PM +0200, Peter Zijlstra wrote:
> On Wed, May 08, 2019 at 07:42:48AM -0500, Josh Poimboeuf wrote:
> > On Wed, May 08, 2019 at 02:04:16PM +0200, Peter Zijlstra wrote:
>
> > > Do the x86_64 variants also want some ORC annotation?
> >
> > Maybe so. Though it looks like regs->ip isn't saved. The saved
> > registers might need to be tweaked. I'll need to look into it.
>
> What all these sites do (and maybe we should look at unifying them
> somehow) is turn a CALL frame (aka RET-IP) into an exception frame (aka
> pt_regs).
>
> So regs->ip will be the return address (which is fixed up to be the CALL
> address in the handler).
But from what I can tell, trampoline_handler() hard-codes regs->ip to
point to kretprobe_trampoline(), and the original return address is
placed in regs->sp.
Masami, is there a reason why regs->ip doesn't have the original return
address and regs->sp doesn't have the original SP? I think that would
help the unwinder understand things.
--
Josh
The following files are generated after building /selftests/bpf/ and
should be added to .gitignore:
- libbpf.pc
- libbpf.so.*
Signed-off-by: Kelsey Skunberg <skunberg.kelsey(a)gmail.com>
---
Change since v1:
- Add libbpf.so.* in replace of libbpf.so.0 and
libbpf.so.0.0.3
- Update commit log to reflect file change
tools/testing/selftests/bpf/.gitignore | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 41e8a689aa77..a877803e4ba8 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -32,3 +32,5 @@ test_tcpnotify_user
test_libbpf
test_tcp_check_syncookie_user
alu32
+libbpf.pc
+libbpf.so.*
--
2.20.1
Hello kbuild, kselftest,
I've been working on a patchset which adds an additional build script to
the toolchain when compiling livepatches. There are a few kernel
section features in which this script does not yet support, but can
detect and abort when it encounters. To test this detection, I've
written a small set of kernel modules that require such sections.
A few questions:
Is build-testing out of scope for kernel selftests? For expediency,
it was really easy to spin out new lib/livepatch kernel modules.
Does kbuild support the notion of expected failure? In this case, the
build script returns a non-zero error and the build stops.
Am I trying to fit a square peg in a round hole? I could easily keep
these build tests in a private branch, but could they exist in a
different format somewhere else in the tree?
Suggestions welcome,
-- Joe