Fixes the issue caused by the fact that in C in the expression
of the form -1234L only 1234L is the actual literal, the unary
minus is an operation applied to the literal. Which means that
to express the lower bound for the type one has to negate the
upper bound and subtract 1.
Original error:
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == -2147483648
timestamp.tv_sec == 2147483648
1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits: msb:1
lower_bound:1 extra_bits: 0
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 2147483648
timestamp.tv_sec == 6442450944
2038-01-19 Lower bound of 32bit <0 timestamp, lo extra sec bit on:
msb:1 lower_bound:1 extra_bits: 1
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 6442450944
timestamp.tv_sec == 10737418240
2174-02-25 Lower bound of 32bit <0 timestamp, hi extra sec bit on:
msb:1 lower_bound:1 extra_bits: 2
not ok 1 - inode_test_xtimestamp_decoding
not ok 1 - ext4_inode_test
Reported-by: Geert Uytterhoeven geert(a)linux-m68k.org
Signed-off-by: Iurii Zaikin <yzaikin(a)google.com>
---
fs/ext4/inode-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/inode-test.c b/fs/ext4/inode-test.c
index 92a9da1774aa..bbce1c328d85 100644
--- a/fs/ext4/inode-test.c
+++ b/fs/ext4/inode-test.c
@@ -25,7 +25,7 @@
* For constructing the negative timestamp lower bound value.
* binary: 10000000 00000000 00000000 00000000
*/
-#define LOWER_MSB_1 (-0x80000000L)
+#define LOWER_MSB_1 (-(UPPER_MSB_0) - 1L) /* avoid overflow */
/*
* For constructing the negative timestamp upper bound value.
* binary: 11111111 11111111 11111111 11111111
--
2.24.0.432.g9d3f5f5b63-goog
Hi,
Here is the 4th version of patches to fix some issues which happens on
the kernel with CONFIG_FUNCTION_TRACER=n or CONFIG_DYNAMIC_FTRACE=n.
In this version I fixed [1/4] to cleanup set_ftrace_notrace (Thanks Steve!)
Thank you,
---
Masami Hiramatsu (4):
selftests/ftrace: Fix to check the existence of set_ftrace_filter
selftests/ftrace: Fix ftrace test cases to check unsupported
selftests/ftrace: Do not to use absolute debugfs path
selftests/ftrace: Fix multiple kprobe testcase
.../ftrace/test.d/ftrace/func-filter-stacktrace.tc | 2 ++
.../selftests/ftrace/test.d/ftrace/func_cpumask.tc | 5 +++++
tools/testing/selftests/ftrace/test.d/functions | 5 ++++-
.../ftrace/test.d/kprobe/multiple_kprobes.tc | 6 +++---
.../inter-event/trigger-action-hist-xfail.tc | 4 ++--
.../inter-event/trigger-onchange-action-hist.tc | 2 +-
.../inter-event/trigger-snapshot-action-hist.tc | 4 ++--
7 files changed, 19 insertions(+), 9 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
Fixes the issue caused by the fact that in C in the expression
of the form -1234L only 1234L is the actual literal, the unary
minus is an operation applied to the literal. Which means that
to express the lower bound for the type one has to negate the
upper bound and subtract 1.
Signed-off-by: Iurii Zaikin <yzaikin(a)google.com>
---
fs/ext4/inode-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/inode-test.c b/fs/ext4/inode-test.c
index 92a9da1774aa..bbce1c328d85 100644
--- a/fs/ext4/inode-test.c
+++ b/fs/ext4/inode-test.c
@@ -25,7 +25,7 @@
* For constructing the negative timestamp lower bound value.
* binary: 10000000 00000000 00000000 00000000
*/
-#define LOWER_MSB_1 (-0x80000000L)
+#define LOWER_MSB_1 (-(UPPER_MSB_0) - 1L) /* avoid overflow */
/*
* For constructing the negative timestamp upper bound value.
* binary: 11111111 11111111 11111111 11111111
--
2.24.0.432.g9d3f5f5b63-goog
Hi Lurii,
On Tue, Nov 26, 2019 at 3:12 AM Linux Kernel Mailing List
<linux-kernel(a)vger.kernel.org> wrote:
> Commit: 1cbeab1b242d16fdb22dc3dab6a7d6afe746ae6d
> Parent: d460623c5fa126dc51bb2571dd7714ca75b0116c
> Refname: refs/heads/master
> Web: https://git.kernel.org/torvalds/c/1cbeab1b242d16fdb22dc3dab6a7d6afe746ae6d
> Author: Iurii Zaikin <yzaikin(a)google.com>
> AuthorDate: Thu Oct 17 15:12:33 2019 -0700
> Committer: Shuah Khan <skhan(a)linuxfoundation.org>
> CommitDate: Wed Oct 23 10:28:23 2019 -0600
>
> ext4: add kunit test for decoding extended timestamps
>
> KUnit tests for decoding extended 64 bit timestamps that verify the
> seconds part of [a/c/m] timestamps in ext4 inode structs are decoded
> correctly.
>
> Test data is derived from the table in the Inode Timestamps section of
> Documentation/filesystems/ext4/inodes.rst.
>
> KUnit tests run during boot and output the results to the debug log
> in TAP format (http://testanything.org/). Only useful for kernel devs
> running KUnit test harness and are not for inclusion into a production
> build.
>
> Signed-off-by: Iurii Zaikin <yzaikin(a)google.com>
> Reviewed-by: Theodore Ts'o <tytso(a)mit.edu>
> Reviewed-by: Brendan Higgins <brendanhiggins(a)google.com>
> Tested-by: Brendan Higgins <brendanhiggins(a)google.com>
> Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
> Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
While this test succeeds on arm64, it fails on m68k and arm32 (presumably
all 32-bit platforms?):
# Subtest: ext4_inode_test
1..1
# inode_test_xtimestamp_decoding: EXPECTATION FAILED at fs/ext4/inode-test.c:250
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == -2147483648
timestamp.tv_sec == 2147483648
1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits: msb:1
lower_bound:1 extra_bits: 0
# inode_test_xtimestamp_decoding: EXPECTATION FAILED at fs/ext4/inode-test.c:250
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 2147483648
timestamp.tv_sec == 6442450944
2038-01-19 Lower bound of 32bit <0 timestamp, lo extra sec bit on:
msb:1 lower_bound:1 extra_bits: 1
# inode_test_xtimestamp_decoding: EXPECTATION FAILED at fs/ext4/inode-test.c:250
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 6442450944
timestamp.tv_sec == 10737418240
2174-02-25 Lower bound of 32bit <0 timestamp, hi extra sec bit on:
msb:1 lower_bound:1 extra_bits: 2
not ok 1 - inode_test_xtimestamp_decoding
not ok 1 - ext4_inode_test
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert(a)linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Previous error message for invalid kunitconfig was vague. Added to it so
that it lists invalid fields and prompts for them to be removed. Added
validate_config function returning whether or not this kconfig is valid.
Signed-off-by: Heidi Fahim <heidifahim(a)google.com>
---
tools/testing/kunit/kunit_kernel.py | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)
diff --git a/tools/testing/kunit/kunit_kernel.py b/tools/testing/kunit/kunit_kernel.py
index bf3876835331..010d3f5030d2 100644
--- a/tools/testing/kunit/kunit_kernel.py
+++ b/tools/testing/kunit/kunit_kernel.py
@@ -93,6 +93,19 @@ class LinuxSourceTree(object):
return False
return True
+ def validate_config(self, build_dir):
+ kconfig_path = get_kconfig_path(build_dir)
+ validated_kconfig = kunit_config.Kconfig()
+ validated_kconfig.read_from_file(kconfig_path)
+ if not self._kconfig.is_subset_of(validated_kconfig):
+ invalid = self._kconfig.entries() - validated_kconfig.entries()
+ message = 'Provided Kconfig is not contained in validated .config. Invalid fields found in kunitconfig: %s' % (
+ ', '.join([str(e) for e in invalid])
+ )
+ logging.error(message)
+ return False
+ return True
+
def build_config(self, build_dir):
kconfig_path = get_kconfig_path(build_dir)
if build_dir and not os.path.exists(build_dir):
@@ -103,12 +116,7 @@ class LinuxSourceTree(object):
except ConfigError as e:
logging.error(e)
return False
- validated_kconfig = kunit_config.Kconfig()
- validated_kconfig.read_from_file(kconfig_path)
- if not self._kconfig.is_subset_of(validated_kconfig):
- logging.error('Provided Kconfig is not contained in validated .config!')
- return False
- return True
+ return self.validate_config(build_dir)
def build_reconfig(self, build_dir):
"""Creates a new .config if it is not a subset of the kunitconfig."""
@@ -133,12 +141,7 @@ class LinuxSourceTree(object):
except (ConfigError, BuildError) as e:
logging.error(e)
return False
- used_kconfig = kunit_config.Kconfig()
- used_kconfig.read_from_file(get_kconfig_path(build_dir))
- if not self._kconfig.is_subset_of(used_kconfig):
- logging.error('Provided Kconfig is not contained in final config!')
- return False
- return True
+ return self.validate_config(build_dir)
def run_kernel(self, args=[], timeout=None, build_dir=None):
args.extend(['mem=256M'])
--
2.24.0.432.g9d3f5f5b63-goog
Support for frequency limits in dev_pm_qos was removed when cpufreq was
switched to freq_qos, this series attempts to restore it by
reimplementing on top of freq_qos.
Discussion about removal is here:
https://lore.kernel.org/linux-pm/VI1PR04MB7023DF47D046AEADB4E051EBEE680@VI1…
The cpufreq core switched away because it needs contraints at the level
of a "cpufreq_policy" which cover multiple cpus so dev_pm_qos coupling
to struct device was not useful. Cpufreq could only use dev_pm_qos by
implementing an additional layer of aggregation anyway.
However in the devfreq subsystem scaling is always performed on a per-device
basis so dev_pm_qos is a very good match. Support for dev_pm_qos in devfreq
core is here:
https://patchwork.kernel.org/cover/11252409/
That series is RFC mostly because it needs these PM core patches.
Earlier versions got entangled in some locking cleanups but those are
not strictly necessary to get dev_pm_qos functionality.
In theory if freq_qos is extended to handle conflicting min/max values then
this sharing would be valuable. Right now freq_qos just ties two unrelated
pm_qos aggregations for min and max freq.
---
This is implemented by embeding a freq_qos_request inside dev_pm_qos_request:
the data field was already an union in order to deal with flag requests.
The internal freq_qos_apply is exported so that it can be called from
dev_pm_qos apply_constraints.
The dev_pm_qos_constraints_destroy function has no obvious equivalent in
freq_qos and the whole approach of "removing requests" is somewhat dubios:
request objects should be owned by consumers and the list of qos requests
should be empty when the target device is deleted.
Changes since v2:
* #define PM_QOS_MAX_FREQUENCY_DEFAULT_VALUE FREQ_QOS_MAX_DEFAULT_VALUE
* #define FREQ_QOS_MAX_DEFAULT_VALUE S32_MAX (in new patch)
* Add initial kunit test for freq_qos, validating the MAX_DEFAULT_VALUE found
by Matthias.
Link to v2: https://patchwork.kernel.org/cover/11250413/
First two patches can be applied separated
Changes since v1:
* Don't rename or EXPORT_SYMBOL_GPL the freq_qos_apply function; just
drop the static marker.
Link to v1: https://patchwork.kernel.org/cover/11212887/
Leonard Crestez (4):
PM / QoS: Initial kunit test
PM / QOS: Redefine FREQ_QOS_MAX_DEFAULT_VALUE to S32_MAX
PM / QoS: Reorder pm_qos/freq_qos/dev_pm_qos structs
PM / QoS: Restore DEV_PM_QOS_MIN/MAX_FREQUENCY
drivers/base/Kconfig | 4 ++
drivers/base/power/Makefile | 1 +
drivers/base/power/qos-test.c | 116 ++++++++++++++++++++++++++++++++++
drivers/base/power/qos.c | 69 ++++++++++++++++++--
include/linux/pm_qos.h | 86 ++++++++++++++-----------
kernel/power/qos.c | 4 +-
6 files changed, 237 insertions(+), 43 deletions(-)
create mode 100644 drivers/base/power/qos-test.c
--
2.17.1
Hi,
Here is the 3rd version of patches to fix some issues which happens on
the kernel with CONFIG_FUNCTION_TRACER=n or CONFIG_DYNAMIC_FTRACE=n.
In this version and v2, I updated the descriptions of the first 2 patches
according to Steve's comment, added Steve's Reviewed-by to the 3rd patch,
and added the 4th patch which was newly found.
Thank you,
---
Masami Hiramatsu (4):
selftests/ftrace: Fix to check the existence of set_ftrace_filter
selftests/ftrace: Fix ftrace test cases to check unsupported
selftests/ftrace: Do not to use absolute debugfs path
selftests/ftrace: Fix multiple kprobe testcase
.../ftrace/test.d/ftrace/func-filter-stacktrace.tc | 2 ++
.../selftests/ftrace/test.d/ftrace/func_cpumask.tc | 5 +++++
tools/testing/selftests/ftrace/test.d/functions | 4 +++-
.../ftrace/test.d/kprobe/multiple_kprobes.tc | 6 +++---
.../inter-event/trigger-action-hist-xfail.tc | 4 ++--
.../inter-event/trigger-onchange-action-hist.tc | 2 +-
.../inter-event/trigger-snapshot-action-hist.tc | 4 ++--
7 files changed, 18 insertions(+), 9 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
Hi Linus,
Please pull the following kselftest fixes update for Linux 5.5-rc1.
This kselftest fixes update for Linux 5.5-rc1 consists of several
fixes to tests and framework. Masami Hiramatsu fixed several tests
to build and run correctly on arm and other 32bit architectures.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit ce3a677802121e038d2f062e90f96f84e7351da0:
selftests: watchdog: Add command line option to show watchdog_info
(2019-10-02 13:44:43 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-5.5-rc1-fixes
for you to fetch changes up to ed2d8fa734e7759ac3788a19f308d3243d0eb164:
selftests: sync: Fix cast warnings on arm (2019-11-07 14:54:37 -0700)
----------------------------------------------------------------
linux-kselftest-5.5-rc1-fixes
This kselftest fixes update for Linux 5.5-rc1 consists of several
fixes to tests and framework. Masami Hiramatsu fixed several tests
to build and run correctly on arm and other 32bit architectures.
----------------------------------------------------------------
Kees Cook (2):
selftests: gen_kselftest_tar.sh: Do not clobber kselftest/
selftests: Move kselftest_module.sh into kselftest/
Masami Hiramatsu (6):
selftests: breakpoints: Fix a typo of function name
selftests: proc: Make va_max 1MB
selftests: vm: Build/Run 64bit tests only on 64bit arch
selftests: net: Use size_t and ssize_t for counting file size
selftests: net: Fix printf format warnings on arm
selftests: sync: Fix cast warnings on arm
Prabhakar Kushwaha (1):
kselftest: Fix NULL INSTALL_PATH for TARGETS runlist
Shuah Khan (1):
selftests: Fix O= and KBUILD_OUTPUT handling for relative paths
tools/testing/selftests/Makefile | 8 +++++---
.../selftests/breakpoints/breakpoint_test_arm64.c | 2 +-
tools/testing/selftests/gen_kselftest_tar.sh | 21
+++++++++++--------
.../{kselftest_module.sh => kselftest/module.sh} | 0
tools/testing/selftests/kselftest_install.sh | 24
+++++++++++-----------
tools/testing/selftests/lib/bitmap.sh | 2 +-
tools/testing/selftests/lib/prime_numbers.sh | 2 +-
tools/testing/selftests/lib/printf.sh | 2 +-
tools/testing/selftests/lib/strscpy.sh | 2 +-
tools/testing/selftests/net/so_txtime.c | 4 ++--
tools/testing/selftests/net/tcp_mmap.c | 8 ++++----
tools/testing/selftests/net/udpgso.c | 3 ++-
tools/testing/selftests/net/udpgso_bench_tx.c | 3 ++-
.../selftests/proc/proc-self-map-files-002.c | 6 +++++-
tools/testing/selftests/sync/sync.c | 6 +++---
tools/testing/selftests/vm/Makefile | 5 +++++
tools/testing/selftests/vm/run_vmtests | 10 +++++++++
17 files changed, 68 insertions(+), 40 deletions(-)
rename tools/testing/selftests/{kselftest_module.sh =>
kselftest/module.sh} (100%)
----------------------------------------------------------------
These counters will track hugetlb reservations rather than hugetlb
memory faulted in. This patch only adds the counter, following patches
add the charging and uncharging of the counter.
Problem:
Currently tasks attempting to allocate more hugetlb memory than is available get
a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
However, if a task attempts to allocate hugetlb memory only more than its
hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
but will SIGBUS the task when it attempts to fault the memory in.
We have developers interested in using hugetlb_cgroups, and they have expressed
dissatisfaction regarding this behavior. We'd like to improve this
behavior such that tasks violating the hugetlb_cgroup limits get an error on
mmap/shmget time, rather than getting SIGBUS'd when they try to fault
the excess memory in.
The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.
Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
the offending task gets SIGBUS'd.
Proposed Solution:
A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
counter has slightly different semantics than
hugetlb.xMB.[limit|usage]_in_bytes:
- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory and
hugetlb memory faulted in without a prior reservation.
- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than reservation_limit_in_bytes, the kernel will fail this
reservation.
This proposal is implemented in this patch series, with tests to verify
functionality and show the usage. We also added cgroup-v2 support to
hugetlb_cgroup so that the new use cases can be extended to v2.
Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to
the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
hugetlb_cgroup seemed cleaner as well.
2. Instead of adding a new counter, we considered adding a sysctl that modifies
the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
reservation time rather than fault time. Adding a new page_counter seems
better as userspace could, if it wants, choose to enforce different cgroups
differently: one via limit_in_bytes, and another via
reservation_limit_in_bytes. This could be very useful if you're
transitioning how hugetlb memory is partitioned on your system one
cgroup at a time, for example. Also, someone may find usage for both
limit_in_bytes and reservation_limit_in_bytes concurrently, and this
approach gives them the option to do so.
Testing:
- Added tests passing.
- libhugetlbfs tests mostly passing, but some tests have trouble with and
without this patch series. Seems environment issue rather than code:
- Overall results:
********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 84 0
* Skipped: 0 0
* PASS: 66 0
* FAIL: 14 0
* Killed by signal: 0 0
* Bad configuration: 4 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********
- Failing tests:
- elflink_rw_and_share_test("linkhuge_rw") segfaults with and without this
patch series.
- LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc (2M: 32):
FAIL Address is not hugepage
- LD_PRELOAD=libhugetlbfs.so HUGETLB_RESTRICT_EXE=unknown:malloc
HUGETLB_MORECORE=yes malloc (2M: 32):
FAIL Address is not hugepage
- LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc_manysmall (2M: 32):
FAIL Address is not hugepage
- GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so
HUGETLB_MORECORE=yes heapshrink (2M: 32):
FAIL Heap not on hugepages
- GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so
libheapshrink.so HUGETLB_MORECORE=yes heapshrink (2M: 32):
FAIL Heap not on hugepages
- HUGETLB_ELFMAP=RW linkhuge_rw (2M: 32): FAIL small_data is not hugepage
- HUGETLB_ELFMAP=RW HUGETLB_MINIMAL_COPY=no linkhuge_rw (2M: 32):
FAIL small_data is not hugepage
- alloc-instantiate-race shared (2M: 32):
Bad configuration: sched_setaffinity(cpu1): Invalid argument -
FAIL Child 1 killed by signal Killed
- shmoverride_linked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- shmoverride_linked_static (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked_static (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so shmoverride_unlinked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so HUGETLB_SHM=yes shmoverride_unlinked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
Signed-off-by: Mina Almasry <almasrymina(a)google.com>
Acked-by: Hillf Danton <hdanton(a)sina.com>
---
include/linux/hugetlb.h | 23 ++++++++-
mm/hugetlb_cgroup.c | 111 ++++++++++++++++++++++++++++++----------
2 files changed, 107 insertions(+), 27 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 53fc34f930d08..9c49a0ba894d3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -320,6 +320,27 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
#ifdef CONFIG_HUGETLB_PAGE
+enum {
+ /* Tracks hugetlb memory faulted in. */
+ HUGETLB_RES_USAGE,
+ /* Tracks hugetlb memory reserved. */
+ HUGETLB_RES_RESERVATION_USAGE,
+ /* Limit for hugetlb memory faulted in. */
+ HUGETLB_RES_LIMIT,
+ /* Limit for hugetlb memory reserved. */
+ HUGETLB_RES_RESERVATION_LIMIT,
+ /* Max usage for hugetlb memory faulted in. */
+ HUGETLB_RES_MAX_USAGE,
+ /* Max usage for hugetlb memory reserved. */
+ HUGETLB_RES_RESERVATION_MAX_USAGE,
+ /* Faulted memory accounting fail count. */
+ HUGETLB_RES_FAILCNT,
+ /* Reserved memory accounting fail count. */
+ HUGETLB_RES_RESERVATION_FAILCNT,
+ HUGETLB_RES_NULL,
+ HUGETLB_RES_MAX,
+};
+
#define HSTATE_NAME_LEN 32
/* Defines one hugetlb page size */
struct hstate {
@@ -340,7 +361,7 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
#ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
- struct cftype cgroup_files[5];
+ struct cftype cgroup_files[HUGETLB_RES_MAX];
#endif
char name[HSTATE_NAME_LEN];
};
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index f1930fa0b445d..1ed4448ca41d3 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -25,6 +25,10 @@ struct hugetlb_cgroup {
* the counter to account for hugepages from hugetlb.
*/
struct page_counter hugepage[HUGE_MAX_HSTATE];
+ /*
+ * the counter to account for hugepage reservations from hugetlb.
+ */
+ struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
};
#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
@@ -33,6 +37,14 @@ struct hugetlb_cgroup {
static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
+static inline struct page_counter *
+hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx, bool reserved)
+{
+ if (reserved)
+ return &h_cg->reserved_hugepage[idx];
+ return &h_cg->hugepage[idx];
+}
+
static inline
struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
{
@@ -254,30 +266,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
return;
}
-enum {
- RES_USAGE,
- RES_LIMIT,
- RES_MAX_USAGE,
- RES_FAILCNT,
-};
-
static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
struct page_counter *counter;
+ struct page_counter *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+ reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];
switch (MEMFILE_ATTR(cft->private)) {
- case RES_USAGE:
+ case HUGETLB_RES_USAGE:
return (u64)page_counter_read(counter) * PAGE_SIZE;
- case RES_LIMIT:
+ case HUGETLB_RES_RESERVATION_USAGE:
+ return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
+ case HUGETLB_RES_LIMIT:
return (u64)counter->max * PAGE_SIZE;
- case RES_MAX_USAGE:
+ case HUGETLB_RES_RESERVATION_LIMIT:
+ return (u64)reserved_counter->max * PAGE_SIZE;
+ case HUGETLB_RES_MAX_USAGE:
return (u64)counter->watermark * PAGE_SIZE;
- case RES_FAILCNT:
+ case HUGETLB_RES_RESERVATION_MAX_USAGE:
+ return (u64)reserved_counter->watermark * PAGE_SIZE;
+ case HUGETLB_RES_FAILCNT:
return counter->failcnt;
+ case HUGETLB_RES_RESERVATION_FAILCNT:
+ return reserved_counter->failcnt;
default:
BUG();
}
@@ -291,6 +306,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
int ret, idx;
unsigned long nr_pages;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+ bool reserved = false;
if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
return -EINVAL;
@@ -304,9 +320,14 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));
switch (MEMFILE_ATTR(of_cft(of)->private)) {
- case RES_LIMIT:
+ case HUGETLB_RES_RESERVATION_LIMIT:
+ reserved = true;
+ /* Fall through. */
+ case HUGETLB_RES_LIMIT:
mutex_lock(&hugetlb_limit_mutex);
- ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+ ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx,
+ reserved),
+ nr_pages);
mutex_unlock(&hugetlb_limit_mutex);
break;
default:
@@ -320,18 +341,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
int ret = 0;
- struct page_counter *counter;
+ struct page_counter *counter, *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
+ reserved_counter =
+ &h_cg->reserved_hugepage[MEMFILE_IDX(of_cft(of)->private)];
switch (MEMFILE_ATTR(of_cft(of)->private)) {
- case RES_MAX_USAGE:
+ case HUGETLB_RES_MAX_USAGE:
page_counter_reset_watermark(counter);
break;
- case RES_FAILCNT:
+ case HUGETLB_RES_RESERVATION_MAX_USAGE:
+ page_counter_reset_watermark(reserved_counter);
+ break;
+ case HUGETLB_RES_FAILCNT:
counter->failcnt = 0;
break;
+ case HUGETLB_RES_RESERVATION_FAILCNT:
+ reserved_counter->failcnt = 0;
+ break;
default:
ret = -EINVAL;
break;
@@ -357,37 +386,67 @@ static void __init __hugetlb_cgroup_file_init(int idx)
struct hstate *h = &hstates[idx];
/* format the size */
- mem_fmt(buf, 32, huge_page_size(h));
+ mem_fmt(buf, sizeof(buf), huge_page_size(h));
/* Add the limit file */
- cft = &h->cgroup_files[0];
+ cft = &h->cgroup_files[HUGETLB_RES_LIMIT];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_LIMIT);
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+ cft->write = hugetlb_cgroup_write;
+
+ /* Add the reservation limit file */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_LIMIT];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes",
+ buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_LIMIT);
cft->read_u64 = hugetlb_cgroup_read_u64;
cft->write = hugetlb_cgroup_write;
/* Add the usage file */
- cft = &h->cgroup_files[1];
+ cft = &h->cgroup_files[HUGETLB_RES_USAGE];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_USAGE);
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
+ /* Add the reservation usage file */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_USAGE];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes",
+ buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_USAGE);
cft->read_u64 = hugetlb_cgroup_read_u64;
/* Add the MAX usage file */
- cft = &h->cgroup_files[2];
+ cft = &h->cgroup_files[HUGETLB_RES_MAX_USAGE];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_MAX_USAGE);
+ cft->write = hugetlb_cgroup_reset;
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
+ /* Add the MAX reservation usage file */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_MAX_USAGE];
+ snprintf(cft->name, MAX_CFTYPE_NAME,
+ "%s.reservation_max_usage_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_MAX_USAGE);
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;
/* Add the failcntfile */
- cft = &h->cgroup_files[3];
+ cft = &h->cgroup_files[HUGETLB_RES_FAILCNT];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_FAILCNT);
+ cft->write = hugetlb_cgroup_reset;
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
+ /* Add the reservation failcntfile */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_FAILCNT];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_FAILCNT);
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;
/* NULL terminate the last cft */
- cft = &h->cgroup_files[4];
+ cft = &h->cgroup_files[HUGETLB_RES_NULL];
memset(cft, 0, sizeof(*cft));
WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
--
2.24.0.rc1.363.gb1bccd3e3d-goog
Hi,
Here is a set of well-reviewed (expect for one patch), lower-risk items
that can go into Linux 5.5. The one patch that wasn't reviewed is the
powerpc conversion, and it's still at this point a no-op, because
tracking isn't yet activated.
This is based on linux-next: b9d3d01405061bb42358fe53f824e894a1922ced
("Add linux-next specific files for 20191122").
This is essentially a cut-down v8 of "mm/gup: track dma-pinned pages:
FOLL_PIN" [1], and with one of the VFIO patches split into two patches.
The idea here is to get this long list of "noise" checked into 5.5, so
that the actual, higher-risk "track FOLL_PIN pages" (which is deferred:
not part of this series) will be a much shorter patchset to review.
For the v4l2-core changes, I've left those here (instead of sending
them separately to the -media tree), in order to get the name change
done now (put_user_page --> unpin_user_page). However, I've added a Cc
stable, as recommended during the last round of reviews.
Here are the relevant notes from the original cover letter, edited to
match the current situation:
This is a prerequisite to tracking dma-pinned pages. That in turn is a
prerequisite to solving the larger problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
[1] https://lore.kernel.org/r/20191121071354.456618-1-jhubbard@nvidia.com
thanks,
John Hubbard
NVIDIA
Dan Williams (1):
mm: Cleanup __put_devmap_managed_page() vs ->page_free()
John Hubbard (18):
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
goldish_pipe: rename local pin_user_pages() routine
mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
media/v4l2-core: set pages dirty upon releasing DMA buffers
media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_user_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm, tree-wide: rename put_user_page*() to unpin_user_page*()
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 233 ++++++++++++++
arch/powerpc/mm/book3s64/iommu_api.c | 12 +-
drivers/gpu/drm/via/via_dmablit.c | 6 +-
drivers/infiniband/core/umem.c | 4 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 8 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 4 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 8 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
drivers/infiniband/sw/siw/siw_mem.c | 4 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 8 +-
drivers/nvdimm/pmem.c | 6 -
drivers/platform/goldfish/goldfish_pipe.c | 35 +--
drivers/vfio/vfio_iommu_type1.c | 35 +--
fs/io_uring.c | 6 +-
include/linux/mm.h | 77 +++--
mm/gup.c | 332 +++++++++++++-------
mm/gup_benchmark.c | 9 +-
mm/memremap.c | 80 ++---
mm/process_vm_access.c | 28 +-
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 6 +-
24 files changed, 642 insertions(+), 285 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.0
Hi,
Here is the 2nd version of patches to fix some issues which happens on
the kernel with CONFIG_FUNCTION_TRACER=n or CONFIG_DYNAMIC_FTRACE=n.
In this version I just updated the descriptions of the first 2 patches
according to Steve's comment and add his Reviewed-by to the last patch.
Thank you,
---
Masami Hiramatsu (3):
selftests/ftrace: Fix to check the existence of set_ftrace_filter
selftests/ftrace: Fix ftrace test cases to check unsupported
selftests/ftrace: Do not to use absolute debugfs path
.../ftrace/test.d/ftrace/func-filter-stacktrace.tc | 2 ++
.../selftests/ftrace/test.d/ftrace/func_cpumask.tc | 5 +++++
tools/testing/selftests/ftrace/test.d/functions | 4 +++-
.../inter-event/trigger-action-hist-xfail.tc | 4 ++--
.../inter-event/trigger-onchange-action-hist.tc | 2 +-
.../inter-event/trigger-snapshot-action-hist.tc | 4 ++--
6 files changed, 15 insertions(+), 6 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
Hi,
Here is a series of patches to fix some issues which happens on
the kernel with CONFIG_FUNCTION_TRACER=n but CONFIG_TRACER=y.
Thank you,
---
Masami Hiramatsu (3):
selftests/ftrace: Fix to check the existence of set_ftrace_filter
selftests/ftrace: Fix ftrace test cases to check unsupported
selftests/ftrace: Do not to use absolute debugfs path
.../ftrace/test.d/ftrace/func-filter-stacktrace.tc | 2 ++
.../selftests/ftrace/test.d/ftrace/func_cpumask.tc | 5 +++++
tools/testing/selftests/ftrace/test.d/functions | 4 +++-
.../inter-event/trigger-action-hist-xfail.tc | 4 ++--
.../inter-event/trigger-onchange-action-hist.tc | 2 +-
.../inter-event/trigger-snapshot-action-hist.tc | 4 ++--
6 files changed, 15 insertions(+), 6 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
If settings files are present in upper directories of a test directory,
load those first before the local settings. This allows a top-level
directory to define settings for subdirectories while still allowing the
subdirectories to override them on a per-directory basis.
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
Note that this depends on Matt's patch:
https://lore.kernel.org/lkml/20191022171223.27934-1-matthieu.baerts@tessare…
---
tools/testing/selftests/kselftest/runner.sh | 30 +++++++++++++++------
1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/kselftest/runner.sh b/tools/testing/selftests/kselftest/runner.sh
index 84de7bc74f2c..666cfeaa8046 100644
--- a/tools/testing/selftests/kselftest/runner.sh
+++ b/tools/testing/selftests/kselftest/runner.sh
@@ -39,6 +39,27 @@ tap_timeout()
fi
}
+load_settings()
+{
+ local fullpath="$1"
+ local upperpath=${fullpath%/*}
+
+ # Load upper path settings first.
+ if [ "$fullpath" != "$upperpath" ] ; then
+ load_settings "$upperpath"
+ fi
+
+ # Load per-test-directory kselftest "settings" file.
+ local settings="$BASE_DIR/$fullpath/settings"
+ if [ -r "$settings" ] ; then
+ while read line ; do
+ local field=$(echo "$line" | cut -d= -f1)
+ local value=$(echo "$line" | cut -d= -f2-)
+ eval "kselftest_$field"="$value"
+ done < "$settings"
+ fi
+}
+
run_one()
{
DIR="$1"
@@ -50,14 +71,7 @@ run_one()
# Reset any "settings"-file variables.
export kselftest_timeout="$kselftest_default_timeout"
# Load per-test-directory kselftest "settings" file.
- settings="$BASE_DIR/$DIR/settings"
- if [ -r "$settings" ] ; then
- while read line ; do
- field=$(echo "$line" | cut -d= -f1)
- value=$(echo "$line" | cut -d= -f2-)
- eval "kselftest_$field"="$value"
- done < "$settings"
- fi
+ load_settings "$DIR"
TEST_HDR_MSG="selftests: $DIR: $BASENAME_TEST"
echo "# $TEST_HDR_MSG"
--
2.17.1
--
Kees Cook
Add documentation for the Python script used to build, run, and collect
results from the kernel known as kunit_tool. kunit_tool
(tools/testing/kunit/kunit.py) was already added in previous commits.
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
---
Documentation/dev-tools/kunit/index.rst | 1 +
Documentation/dev-tools/kunit/kunit-tool.rst | 57 ++++++++++++++++++++
Documentation/dev-tools/kunit/start.rst | 5 +-
3 files changed, 62 insertions(+), 1 deletion(-)
create mode 100644 Documentation/dev-tools/kunit/kunit-tool.rst
diff --git a/Documentation/dev-tools/kunit/index.rst b/Documentation/dev-tools/kunit/index.rst
index 26ffb46bdf99d..c60d760a0eed1 100644
--- a/Documentation/dev-tools/kunit/index.rst
+++ b/Documentation/dev-tools/kunit/index.rst
@@ -9,6 +9,7 @@ KUnit - Unit Testing for the Linux Kernel
start
usage
+ kunit-tool
api/index
faq
diff --git a/Documentation/dev-tools/kunit/kunit-tool.rst b/Documentation/dev-tools/kunit/kunit-tool.rst
new file mode 100644
index 0000000000000..37509527c04e1
--- /dev/null
+++ b/Documentation/dev-tools/kunit/kunit-tool.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+kunit_tool How-To
+=================
+
+What is kunit_tool?
+===================
+
+kunit_tool is a script (``tools/testing/kunit/kunit.py``) that aids in building
+the Linux kernel as UML (`User Mode Linux
+<http://user-mode-linux.sourceforge.net/>`_), running KUnit tests, parsing
+the test results and displaying them in a user friendly manner.
+
+What is a kunitconfig?
+======================
+
+It's just a defconfig that kunit_tool looks for in the base directory.
+kunit_tool uses it to generate a .config as you might expect. In addition, it
+verifies that the generated .config contains the CONFIG options in the
+kunitconfig; the reason it does this is so that it is easy to be sure that a
+CONFIG that enables a test actually ends up in the .config.
+
+How do I use kunit_tool?
+=================================
+
+If a kunitconfig is present at the root directory, all you have to do is:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run
+
+However, you most likely want to use it with the following options:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all`
+
+- ``--timeout`` sets a maximum amount of time to allow tests to run.
+- ``--jobs`` sets the number of threads to use to build the kernel.
+
+If you just want to use the defconfig that ships with the kernel, you can
+append the ``--defconfig`` flag as well:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all` --defconfig
+
+.. note::
+ This command is particularly helpful for getting started because it
+ just works. No kunitconfig needs to be present.
+
+For a list of all the flags supported by kunit_tool, you can run:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --help
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index aeeddfafeea20..f4d9a4fa914f8 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -19,7 +19,10 @@ The wrapper can be run with:
.. code-block:: bash
- ./tools/testing/kunit/kunit.py run
+ ./tools/testing/kunit/kunit.py run --defconfig
+
+For more information on this wrapper (also called kunit_tool) checkout the
+:doc:`kunit-tool` page.
Creating a kunitconfig
======================
--
2.24.0.432.g9d3f5f5b63-goog
The current kunit execution model is to provide base kunit functionality
and tests built-in to the kernel. The aim of this series is to allow
building kunit itself and tests as modules. This in turn allows a
simple form of selective execution; load the module you wish to test.
In doing so, kunit itself (if also built as a module) will be loaded as
an implicit dependency.
Because this requires a core API modification - if a module delivers
multiple suites, they must be declared with the kunit_test_suites()
macro - we're proposing this patch set as a candidate to be applied to the
test tree before too many kunit consumers appear. We attempt to deal
with existing consumers in patch 3.
Changes since v3:
- removed symbol lookup patch for separate submission later
- removed use of sysctl_hung_task_timeout_seconds (patch 4, as discussed
with Brendan and Stephen)
- disabled build of string-stream-test when CONFIG_KUNIT_TEST=m; this
is to avoid having to deal with symbol lookup issues
- changed string-stream-impl.h back to string-stream.h (Brendan)
- added module build support to new list, ext4 tests
Changes since v2:
- moved string-stream.h header to lib/kunit/string-stream-impl.h (Brendan)
(patch 1)
- split out non-exported interfaces in try-catch-impl.h (Brendan)
(patch 2)
- added kunit_find_symbol() and KUNIT_INIT_SYMBOL to lookup non-exported
symbols (patches 3, 4)
- removed #ifdef MODULE around module licenses (Randy, Brendan, Andy)
(patch 4)
- replaced kunit_test_suite() with kunit_test_suites() rather than
supporting both (Brendan) (patch 4)
- lookup sysctl_hung_task_timeout_secs as kunit may be built as a module
and the symbol may not be available (patch 5)
Alan Maguire (6):
kunit: move string-stream.h to lib/kunit
kunit: hide unexported try-catch interface in try-catch-impl.h
kunit: allow kunit tests to be loaded as a module
kunit: remove timeout dependence on sysctl_hung_task_timeout_seconds
kunit: allow kunit to be loaded as a module
kunit: update documentation to describe module-based build
Documentation/dev-tools/kunit/faq.rst | 3 +-
Documentation/dev-tools/kunit/index.rst | 3 +
Documentation/dev-tools/kunit/usage.rst | 16 ++
fs/ext4/Kconfig | 2 +-
fs/ext4/Makefile | 5 +
fs/ext4/inode-test.c | 4 +-
include/kunit/assert.h | 3 +-
include/kunit/string-stream.h | 51 -----
include/kunit/test.h | 35 +++-
include/kunit/try-catch.h | 10 -
kernel/sysctl-test.c | 4 +-
lib/Kconfig.debug | 4 +-
lib/kunit/Kconfig | 6 +-
lib/kunit/Makefile | 14 +-
lib/kunit/assert.c | 10 +
lib/kunit/example-test.c | 88 ---------
lib/kunit/kunit-example-test.c | 90 +++++++++
lib/kunit/kunit-test.c | 334 ++++++++++++++++++++++++++++++++
lib/kunit/string-stream-test.c | 5 +-
lib/kunit/string-stream.c | 3 +-
lib/kunit/string-stream.h | 51 +++++
lib/kunit/test-test.c | 331 -------------------------------
lib/kunit/test.c | 25 ++-
lib/kunit/try-catch-impl.h | 28 +++
lib/kunit/try-catch.c | 37 +---
lib/list-test.c | 4 +-
26 files changed, 628 insertions(+), 538 deletions(-)
delete mode 100644 include/kunit/string-stream.h
delete mode 100644 lib/kunit/example-test.c
create mode 100644 lib/kunit/kunit-example-test.c
create mode 100644 lib/kunit/kunit-test.c
create mode 100644 lib/kunit/string-stream.h
delete mode 100644 lib/kunit/test-test.c
create mode 100644 lib/kunit/try-catch-impl.h
--
1.8.3.1
Summary
------------------------------------------------------------------------
kernel: 5.4.0-rc8
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
git branch: master
git commit: 1fef9976397fc9b951ee54467eccd65e0c508785
git describe: next-20191120
Test details: https://qa-reports.linaro.org/lkft/linux-next-oe/build/next-20191120
Regressions (compared to build next-20191119)
------------------------------------------------------------------------
No regressions
Fixes (compared to build next-20191119)
------------------------------------------------------------------------
No fixes
In total:
------------------------------------------------------------------------
Ran 0 total tests in the following environments and test suites.
pass 0
fail 0
xfail 0
skip 0
Environments
--------------
- x15 - arm
Test Suites
-----------
Failures
------------------------------------------------------------------------
x15:
Skips
------------------------------------------------------------------------
No skips
--
Linaro LKFT
https://lkft.linaro.org
Hi,
Christoph Hellwig has a preference to do things a little differently,
for the devmap cleanup in patch 5 ("mm: devmap: refactor 1-based
refcounting for ZONE_DEVICE pages"). That came up in a different
review thread, because the patch is out for review in two locations.
Here's that review thread:
https://lore.kernel.org/r/20191118070826.GB3099@infradead.org
...and I'm hoping that we can defer that request, because otherwise
it derails this series, which is starting to otherwise look like
it could be ready for 5.5.
There is a git repo and branch, for convenience:
git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v6
Changes since v5:
* Fixed the refcounting for huge pages: in most cases, it was
only taking one GUP_PIN_COUNTING_BIAS's worth of refs, when it
should have been taking one GUP_PIN_COUNTING_BIAS for each subpage.
(Much thanks to Jan Kara for spotting that one!)
* Renamed user_page_ref_inc() to try_pin_page(), and added a new
try_pin_compound_head(). This definitely improves readability.
* Factored out some more duplication in the FOLL_PIN and FOLL_GET
cases, in gup.c.
* Fixed up some straggling "get_" --> "pin_" references in the comments.
* Added reviewed-by tags.
Changes since v4:
* Renamed put_user_page*() --> unpin_user_page().
* Removed all pin_longterm_pages*() calls. We will use FOLL_LONGTERM
at the call sites. (FOLL_PIN, however, remains an internal gup flag).
This is very nice: many patches just change three characters now:
get_user_pages --> pin_user_pages. I think we've found the right
balance of wrapper calls and gup flags, for the call sites.
* Updated a lot of documentation and commit logs to match the above
two large changes.
* Changed gup_benchmark tests and run_vmtests, to adapt to one less
use case: there is no pin_longterm_pages() call anymore.
* This includes a new devmap cleanup patch from Dan Williams, along
with a rebased follow-up: patches 4 and 5, already mentioned above.
* Fixed patch 10 ("mm/gup: introduce pin_user_pages*() and FOLL_PIN"),
so as to make pin_user_pages*() calls act as placeholders for the
corresponding get_user_pages*() calls, until a later patch fully
implements the DMA-pinning functionality.
Thanks to Jan Kara for noticing that.
* Fixed the implementation of pin_user_pages_remote().
* Further tweaked patch 2 ("mm/gup: factor out duplicate code from four
routines"), in response to Jan Kara's feedback.
* Dropped a few reviewed-by tags due to changes that invalidated
them.
Changes since v3:
* VFIO fix (patch 8): applied further cleanup: removed a pre-existing,
unnecessary release and reacquire of mmap_sem. Moved the DAX vma
checks from the vfio call site, to gup internals, and added comments
(and commit log) to clarify.
* Due to the above, made a corresponding fix to the
pin_longterm_pages_remote(), which was actually calling the wrong
gup internal function.
* Changed put_user_page() comments, to refer to pin*() APIs, rather than
get_user_pages*() APIs.
* Reverted an accidental whitespace-only change in the IB ODP code.
* Added a few more reviewed-by tags.
Changes since v2:
* Added a patch to convert IB/umem from normal gup, to gup_fast(). This
is also posted separately, in order to hopefully get some runtime
testing.
* Changed the page devmap code to be a little clearer,
thanks to Jerome for that.
* Split out the page devmap changes into a separate patch (and moved
Ira's Signed-off-by to that patch).
* Fixed my bug in IB: ODP code does not require pin_user_pages()
semantics. Therefore, revert the put_user_page() calls to put_page(),
and leave the get_user_pages() call as-is.
* As part of the revert, I am proposing here a change directly
from put_user_pages(), to release_pages(). I'd feel better if
someone agrees that this is the best way. It uses the more
efficient release_pages(), instead of put_page() in a loop,
and keep the change to just a few character on one line,
but OTOH it is not a pure revert.
* Loosened the FOLL_LONGTERM restrictions in the
__get_user_pages_locked() implementation, and used that in order
to fix up a VFIO bug. Thanks to Jason for that idea.
* Note the use of release_pages() in IB: is that OK?
* Added a few more WARN's and clarifying comments nearby.
* Many documentation improvements in various comments.
* Moved the new pin_user_pages.rst from Documentation/vm/ to
Documentation/core-api/ .
* Commit descriptions: added clarifying notes to the three patches
(drm/via, fs/io_uring, net/xdp) that already had put_user_page()
calls in place.
* Collected all pending Reviewed-by and Acked-by tags, from v1 and v2
email threads.
* Lot of churn from v2 --> v3, so it's possible that new bugs
sneaked in.
NOT DONE: separate patchset is required:
* __get_user_pages_locked(): stop compensating for
buggy callers who failed to set FOLL_GET. Instead, assert
that FOLL_GET is set (and fail if it's not).
======================================================================
Original cover letter (edited to fix up the patch description numbers)
This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.
This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
============================================================
Patch summary:
* Patches 1-9: refactoring and preparatory cleanup, independent fixes
* Patch 10: introduce pin_user_pages(), FOLL_PIN, but no functional
changes yet
* Patches 11-16: Convert existing put_user_page() callers, to use the
new pin*()
* Patch 17: Activate tracking of FOLL_PIN pages.
* Patches 18-20: convert various callers
* Patches: 21-23: gup_benchmark and run_vmtests support
* Patch 24: rename put_user_page*() --> unpin_user_page*()
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've been
able to runtime test the core get_user_pages() and pin_user_pages() and
related routines, but not so much on several of the call sites--but those
are generally just a couple of lines changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Also, my runtime testing for the call sites so far is very weak:
* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a
good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
============================================================
Next:
* Get the block/bio_vec sites converted to use pin_user_pages().
* Work with Ira and Dave Chinner to weave this together with the
layout lease stuff.
============================================================
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
Dan Williams (1):
mm: Cleanup __put_devmap_managed_page() vs ->page_free()
John Hubbard (23):
mm/gup: pass flags arg to __gup_device_* functions
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
goldish_pipe: rename local pin_user_pages() routine
IB/umem: use get_user_pages_fast() to pin DMA pages
media/v4l2-core: set pages dirty upon releasing DMA buffers
vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
mm/gup: track FOLL_PIN pages
media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_user_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
mm, tree-wide: rename put_user_page*() to unpin_user_page*()
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 233 +++++++++
arch/powerpc/mm/book3s64/iommu_api.c | 12 +-
drivers/gpu/drm/via/via_dmablit.c | 6 +-
drivers/infiniband/core/umem.c | 19 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 8 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 4 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 8 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
drivers/infiniband/sw/siw/siw_mem.c | 4 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 8 +-
drivers/nvdimm/pmem.c | 6 -
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 35 +-
fs/io_uring.c | 6 +-
include/linux/mm.h | 168 +++++-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/gup.c | 548 +++++++++++++++-----
mm/gup_benchmark.c | 74 ++-
mm/huge_memory.c | 54 +-
mm/hugetlb.c | 39 +-
mm/memremap.c | 76 ++-
mm/process_vm_access.c | 28 +-
mm/vmstat.c | 2 +
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 21 +-
tools/testing/selftests/vm/run_vmtests | 22 +
30 files changed, 1104 insertions(+), 350 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.0
This patchset is being developed here:
<https://github.com/cyphar/linux/tree/openat2/master>
This is a re-send of
<https://lore.kernel.org/lkml/20191117011713.13032-1-cyphar@cyphar.com/>
but rebased on top of 5.4-rc8 (also my mails got duplicated the first
time I sent v17 -- hopefully that doesn't happen this time).
Patch changelog:
v17:
* Add a path_is_under() check for LOOKUP_IS_SCOPED in complete_walk(), as a
last line of defence to ensure that namei bugs will not break the contract
of LOOKUP_BENEATH or LOOKUP_IN_ROOT.
* Update based on feedback by Al Viro:
* Make nd_jump_link() free the passed path on error, so that callers don't
need to worry about it in the error path.
* Remove needless m_retry and r_retry variables in handle_dots().
* Always return -ECHILD from follow_dotdot_rcu().
v16: <https://lore.kernel.org/lkml/20191116002802.6663-1-cyphar@cyphar.com/>
v15: <https://lore.kernel.org/lkml/20191105090553.6350-1-cyphar@cyphar.com/>
v14: <https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com>
v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
though stale NFS handles).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVj…
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goo…
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goo…
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME
openat2 - open and possibly create a file (extended)
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The openat2() system call opens the file specified by pathname. If the specified file
does not exist, it may optionally (if O_CREAT is specified in how.flags) be created by
openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the direc-
tory referred to by the file descriptor dirfd (or the current working directory of the
calling process, if dirfd is the special value AT_FDCWD.) If pathname is absolute, then
dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is
resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its
functionality. Rather than taking a single flag argument, an extensible structure (how)
is passed instead to allow for future extensions. size must be set to sizeof(struct
open_how), to facilitate future extensions (see the "Extensibility" section of the NOTES
for more detail on how extensions are handled.)
The open_how structure
The following structure indicates how pathname should be opened, and acts as a superset of
the flag and mode arguments to openat(2).
struct open_how {
__aligned_u64 flags; /* O_* flags. */
__u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
__u16 __padding[3]; /* Must be zeroed. */
__aligned_u64 resolve; /* RESOLVE_* flags. */
};
Any future extensions to openat2() will be implemented as new fields appended to the above
structure (or through reuse of pre-existing padding space), with the zero value of the new
fields acting as though the extension were not present.
The meaning of each field is as follows:
flags
The file creation and status flags to use for this operation. All of the
O_* flags defined for openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting
flags in flags.
mode
File mode for the new file, with identical semantics to the mode argument to
openat(2). However, unlike openat(2), it is an error to provide openat2()
with a mode which contains bits other than 0777.
It is an error to provide openat2() a non-zero mode if flags does not con-
tain O_CREAT or O_TMPFILE.
resolve
Change how the components of pathname will be resolved (see path_resolu-
tion(7) for background information.) The primary use case for these flags
is to allow trusted programs to restrict how untrusted paths (or paths in-
side untrusted directories) are resolved. The full list of resolve flags is
given below.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including
all bind mounts).
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as bind mounts are
very widely used by end-users. Setting this flag indiscrimnately for
all uses of openat2() may result in spurious errors on previously-
functional systems.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This
option implies RESOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
symbolic link will be returned.
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as symbolic links
are very widely used by end-users. Setting this flag indiscrimnately
for all uses of openat2() may result in spurious errors on previ-
ously-functional systems.
RESOLVE_NO_MAGICLINKS
Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
magic link will be returned.
Magic-links are symbolic link-like objects that are most notably
found in proc(5) (examples include /proc/[pid]/exe and
/proc/[pid]/fd/*.) Due to the potential danger of unknowingly open-
ing these magic links, it may be preferable for users to disable
their resolution entirely (see symboliclink(7) for more details.)
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the
resolution is not a descendant of the directory indicated by dirfd.
This results in absolute symbolic links (and absolute values of path-
name) to be rejected.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
RESOLVE_IN_ROOT
Treat dirfd as the root directory while resolving pathname (as though
the user called chroot(2) with dirfd as the argument.) Absolute sym-
bolic links and ".." path components will be scoped to dirfd. If
pathname is an absolute path, it is also treated relative to dirfd.
However, unlike chroot(2) (which changes the filesystem root perma-
nently for a process), RESOLVE_IN_ROOT allows a program to effi-
ciently restrict path resolution for only certain operations. It
also has several hardening features (such detecting escape attempts
during .. resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set
appropriately.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2),
as well as the following additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see
the "Extensibility" section of the NOTES for more detail on how extensions are han-
dled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
not ensure that a ".." component didn't escape (due to a race condition or poten-
tial attack.) Callers may choose to retry the openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
root during path resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount
point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
link (or magic link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic
link.
VERSIONS
openat2() was added to Linux in kernel 5.FOO.
CONFORMING TO
This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES
Glibc does not provide a wrapper for this system call; call it using systemcall(2).
Extensibility
In order to allow for struct open_how to be extended in future kernel revisions, openat2()
requires userspace to specify the size of struct open_how structure they are passing. By
providing this information, it is possible for openat2() to provide both forwards- and
backwards-compatibility — with size acting as an implicit version number (because new ex-
tension fields will always be appended, the size will always increase.) This extensibil-
ity design is very similar to other system calls such as perf_setattr(2),
perf_event_open(2), and clone(3).
If we let usize be the size of the structure according to userspace and ksize be the size
of the structure which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used
verbatim.
* If ksize is larger than usize, then there are some extensions the kernel sup-
ports which the userspace program is unaware of. Because all extensions must
have their zero values be a no-op, the kernel treats all of the extension fields
not set by userspace to have zero values. This provides backwards-compatibil-
ity.
* If ksize is smaller than usize, then there are some extensions which the
userspace program is aware of but the kernel does not support. Because all ex-
tensions must have their zero values be a no-op, the kernel can safely ignore
the unsupported extension fields if they are all-zero. If any unsupported ex-
tension fields are non-zero, then -1 is returned and errno is set to E2BIG.
This provides forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of exten-
sions. However, if a userspace program wishes to determine what extensions the running
kernel supports, they may conduct a binary search on size (to find the largest value which
doesn't produce an error of E2BIG.)
SEE ALSO
openat(2), path_resolution(7), symlink(7)
Linux 2019-11-05 OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (13):
namei: only return -ECHILD from follow_dotdot_rcu()
nsfs: clean-up ns_get_path() signature to return int
namei: allow nd_jump_link() to produce errors
namei: allow set_root() to produce errors
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
Documentation: path-lookup: include new LOOKUP flags
CREDITS | 4 +-
Documentation/filesystems/path-lookup.rst | 68 ++-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 185 +++++--
fs/nsfs.c | 29 +-
fs/open.c | 149 +++--
fs/proc/base.c | 3 +-
fs/proc/namespaces.c | 20 +-
include/linux/fcntl.h | 12 +-
include/linux/namei.h | 12 +-
include/linux/proc_ns.h | 4 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 40 ++
kernel/bpf/offload.c | 12 +-
kernel/events/core.c | 2 +-
security/apparmor/apparmorfs.c | 6 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 316 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
42 files changed, 1686 insertions(+), 113 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
base-commit: af42d3466bdc8f39806b26f593604fdc54140bcb
--
2.24.0
Add documentation for the Python script used to build, run, and collect
results from the kernel known as kunit_tool. kunit_tool
(tools/testing/kunit/kunit.py) was already added in previous commits.
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
---
Documentation/dev-tools/kunit/index.rst | 1 +
Documentation/dev-tools/kunit/kunit-tool.rst | 57 ++++++++++++++++++++
Documentation/dev-tools/kunit/start.rst | 5 +-
3 files changed, 62 insertions(+), 1 deletion(-)
create mode 100644 Documentation/dev-tools/kunit/kunit-tool.rst
diff --git a/Documentation/dev-tools/kunit/index.rst b/Documentation/dev-tools/kunit/index.rst
index 26ffb46bdf99d..c60d760a0eed1 100644
--- a/Documentation/dev-tools/kunit/index.rst
+++ b/Documentation/dev-tools/kunit/index.rst
@@ -9,6 +9,7 @@ KUnit - Unit Testing for the Linux Kernel
start
usage
+ kunit-tool
api/index
faq
diff --git a/Documentation/dev-tools/kunit/kunit-tool.rst b/Documentation/dev-tools/kunit/kunit-tool.rst
new file mode 100644
index 0000000000000..50d46394e97e3
--- /dev/null
+++ b/Documentation/dev-tools/kunit/kunit-tool.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+kunit_tool How-To
+=================
+
+What is kunit_tool?
+===================
+
+kunit_tool is a script (``tools/testing/kunit/kunit.py``) that aids in building
+the Linux kernel as UML (`User Mode Linux
+<http://user-mode-linux.sourceforge.net/>`_), running KUnit tests, parsing
+the test results and displaying them in a user friendly manner.
+
+What is a kunitconfig?
+======================
+
+It's just a defconfig that kunit_tool looks for in the base directory.
+kunit_tool uses it to generate a .config as you might expect. In addition, it
+verifies that the generated .config contains the CONFIG options in the
+kunitconfig; the reason it does this is so that it is easy to be sure that a
+CONFIG that enables a test actually ends up in the .config.
+
+How do I use kunit_tool?
+========================
+
+If a kunitconfig is present at the root directory, all you have to do is:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run
+
+However, you most likely want to use it with the following options:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all`
+
+- ``--timeout`` sets a maximum amount of time to allow tests to run.
+- ``--jobs`` sets the number of threads to use to build the kernel.
+
+If you just want to use the defconfig that ships with the kernel, you can
+append the ``--defconfig`` flag as well:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all` --defconfig
+
+.. note::
+ This command is particularly helpful for getting started because it
+ just works. No kunitconfig needs to be present.
+
+For a list of all the flags supported by kunit_tool, you can run:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --help
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index aeeddfafeea20..f4d9a4fa914f8 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -19,7 +19,10 @@ The wrapper can be run with:
.. code-block:: bash
- ./tools/testing/kunit/kunit.py run
+ ./tools/testing/kunit/kunit.py run --defconfig
+
+For more information on this wrapper (also called kunit_tool) checkout the
+:doc:`kunit-tool` page.
Creating a kunitconfig
======================
--
2.24.0.432.g9d3f5f5b63-goog
Fix typos and gramatical errors in the Getting Started and Usage guide
for KUnit.
Reported-by: Randy Dunlap <rdunlap(a)infradead.org>
Link: https://patchwork.kernel.org/patch/11156481/
Reported-by: Rinat Ibragimov <ibragimovrinat(a)mail.ru>
Link: https://github.com/google/kunit-docs/issues/1
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
---
Documentation/dev-tools/kunit/start.rst | 8 ++++----
Documentation/dev-tools/kunit/usage.rst | 24 ++++++++++++------------
2 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index f4d9a4fa914f8..9d6db892c41c0 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -26,7 +26,7 @@ For more information on this wrapper (also called kunit_tool) checkout the
Creating a kunitconfig
======================
-The Python script is a thin wrapper around Kbuild as such, it needs to be
+The Python script is a thin wrapper around Kbuild. As such, it needs to be
configured with a ``kunitconfig`` file. This file essentially contains the
regular Kernel config, with the specific test targets as well.
@@ -62,8 +62,8 @@ If everything worked correctly, you should see the following:
followed by a list of tests that are run. All of them should be passing.
.. note::
- Because it is building a lot of sources for the first time, the ``Building
- kunit kernel`` step may take a while.
+ Because it is building a lot of sources for the first time, the
+ ``Building KUnit kernel`` step may take a while.
Writing your first test
=======================
@@ -162,7 +162,7 @@ Now you can run the test:
.. code-block:: bash
- ./tools/testing/kunit/kunit.py
+ ./tools/testing/kunit/kunit.py run
You should see the following failure:
diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
index c6e69634e274b..b9a065ab681ee 100644
--- a/Documentation/dev-tools/kunit/usage.rst
+++ b/Documentation/dev-tools/kunit/usage.rst
@@ -16,7 +16,7 @@ Organization of this document
=============================
This document is organized into two main sections: Testing and Isolating
-Behavior. The first covers what a unit test is and how to use KUnit to write
+Behavior. The first covers what unit tests are and how to use KUnit to write
them. The second covers how to use KUnit to isolate code and make it possible
to unit test code that was otherwise un-unit-testable.
@@ -174,13 +174,13 @@ Test Suites
~~~~~~~~~~~
Now obviously one unit test isn't very helpful; the power comes from having
-many test cases covering all of your behaviors. Consequently it is common to
-have many *similar* tests; in order to reduce duplication in these closely
-related tests most unit testing frameworks provide the concept of a *test
-suite*, in KUnit we call it a *test suite*; all it is is just a collection of
-test cases for a unit of code with a set up function that gets invoked before
-every test cases and then a tear down function that gets invoked after every
-test case completes.
+many test cases covering all of a unit's behaviors. Consequently it is common
+to have many *similar* tests; in order to reduce duplication in these closely
+related tests most unit testing frameworks - including KUnit - provide the
+concept of a *test suite*. A *test suite* is just a collection of test cases
+for a unit of code with a set up function that gets invoked before every test
+case and then a tear down function that gets invoked after every test case
+completes.
Example:
@@ -211,7 +211,7 @@ KUnit test framework.
.. note::
A test case will only be run if it is associated with a test suite.
-For a more information on these types of things see the :doc:`api/test`.
+For more information on these types of things see the :doc:`api/test`.
Isolating Behavior
==================
@@ -338,7 +338,7 @@ We can easily test this code by *faking out* the underlying EEPROM:
return count;
}
- ssize_t fake_eeprom_write(struct eeprom *this, size_t offset, const char *buffer, size_t count)
+ ssize_t fake_eeprom_write(struct eeprom *parent, size_t offset, const char *buffer, size_t count)
{
struct fake_eeprom *this = container_of(parent, struct fake_eeprom, parent);
@@ -454,7 +454,7 @@ KUnit on non-UML architectures
By default KUnit uses UML as a way to provide dependencies for code under test.
Under most circumstances KUnit's usage of UML should be treated as an
implementation detail of how KUnit works under the hood. Nevertheless, there
-are instances where being able to run architecture specific code, or test
+are instances where being able to run architecture specific code or test
against real hardware is desirable. For these reasons KUnit supports running on
other architectures.
@@ -557,7 +557,7 @@ run your tests on your hardware setup just by compiling for your architecture.
.. important::
Always prefer tests that run on UML to tests that only run under a particular
architecture, and always prefer tests that run under QEMU or another easy
- (and monitarily free) to obtain software environment to a specific piece of
+ (and monetarily free) to obtain software environment to a specific piece of
hardware.
Nevertheless, there are still valid reasons to write an architecture or hardware
--
2.24.0.432.g9d3f5f5b63-goog
Fix typos and gramatical errors in the Getting Started and Usage guide
for KUnit.
Reported-by: Randy Dunlap <rdunlap(a)infradead.org>
Link: https://patchwork.kernel.org/patch/11156481/
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
---
Documentation/dev-tools/kunit/start.rst | 6 +++---
Documentation/dev-tools/kunit/usage.rst | 22 +++++++++++-----------
2 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index f4d9a4fa914f8..db146c7d77490 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -26,7 +26,7 @@ For more information on this wrapper (also called kunit_tool) checkout the
Creating a kunitconfig
======================
-The Python script is a thin wrapper around Kbuild as such, it needs to be
+The Python script is a thin wrapper around Kbuild. As such, it needs to be
configured with a ``kunitconfig`` file. This file essentially contains the
regular Kernel config, with the specific test targets as well.
@@ -62,8 +62,8 @@ If everything worked correctly, you should see the following:
followed by a list of tests that are run. All of them should be passing.
.. note::
- Because it is building a lot of sources for the first time, the ``Building
- kunit kernel`` step may take a while.
+ Because it is building a lot of sources for the first time, the
+ ``Building KUnit kernel`` step may take a while.
Writing your first test
=======================
diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
index c6e69634e274b..ae42a0d128c27 100644
--- a/Documentation/dev-tools/kunit/usage.rst
+++ b/Documentation/dev-tools/kunit/usage.rst
@@ -16,7 +16,7 @@ Organization of this document
=============================
This document is organized into two main sections: Testing and Isolating
-Behavior. The first covers what a unit test is and how to use KUnit to write
+Behavior. The first covers what unit tests are and how to use KUnit to write
them. The second covers how to use KUnit to isolate code and make it possible
to unit test code that was otherwise un-unit-testable.
@@ -174,13 +174,13 @@ Test Suites
~~~~~~~~~~~
Now obviously one unit test isn't very helpful; the power comes from having
-many test cases covering all of your behaviors. Consequently it is common to
-have many *similar* tests; in order to reduce duplication in these closely
-related tests most unit testing frameworks provide the concept of a *test
-suite*, in KUnit we call it a *test suite*; all it is is just a collection of
-test cases for a unit of code with a set up function that gets invoked before
-every test cases and then a tear down function that gets invoked after every
-test case completes.
+many test cases covering all of a unit's behaviors. Consequently it is common
+to have many *similar* tests; in order to reduce duplication in these closely
+related tests most unit testing frameworks - including KUnit - provide the
+concept of a *test suite*. A *test suite* is just a collection of test cases
+for a unit of code with a set up function that gets invoked before every test
+case and then a tear down function that gets invoked after every test case
+completes.
Example:
@@ -211,7 +211,7 @@ KUnit test framework.
.. note::
A test case will only be run if it is associated with a test suite.
-For a more information on these types of things see the :doc:`api/test`.
+For more information on these types of things see the :doc:`api/test`.
Isolating Behavior
==================
@@ -454,7 +454,7 @@ KUnit on non-UML architectures
By default KUnit uses UML as a way to provide dependencies for code under test.
Under most circumstances KUnit's usage of UML should be treated as an
implementation detail of how KUnit works under the hood. Nevertheless, there
-are instances where being able to run architecture specific code, or test
+are instances where being able to run architecture specific code or test
against real hardware is desirable. For these reasons KUnit supports running on
other architectures.
@@ -557,7 +557,7 @@ run your tests on your hardware setup just by compiling for your architecture.
.. important::
Always prefer tests that run on UML to tests that only run under a particular
architecture, and always prefer tests that run under QEMU or another easy
- (and monitarily free) to obtain software environment to a specific piece of
+ (and monetarily free) to obtain software environment to a specific piece of
hardware.
Nevertheless, there are still valid reasons to write an architecture or hardware
--
2.24.0.432.g9d3f5f5b63-goog
Hi,
Please note that two of these patches are also out for review
separately, and may go in earlier than this series. This series
requires those, so to ease review and testing, they are also
included here: patches 4 and 5, the devmap cleanups.
There is a git repo and branch, for convenience:
git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v5
Despite the large number of changes, it does feel like the review
comments are converging, btw.
Changes since v4:
* Renamed put_user_page*() --> unpin_user_page().
* Removed all pin_longterm_pages*() calls. We will use FOLL_LONGTERM
at the call sites. (FOLL_PIN, however, remains an internal gup flag).
This is very nice: many patches just change three characters now:
get_user_pages --> pin_user_pages. I think we've found the right
balance of wrapper calls and gup flags, for the call sites.
* Updated a lot of documentation and commit logs to match the above
two large changes.
* Changed gup_benchmark tests and run_vmtests, to adapt to one less
use case: there is no pin_longterm_pages() call anymore.
* This includes a new devmap cleanup patch from Dan Williams, along
with a rebased follow-up: patches 4 and 5, already mentioned above.
* Fixed patch 10 ("mm/gup: introduce pin_user_pages*() and FOLL_PIN"),
so as to make pin_user_pages*() calls act as placeholders for the
corresponding get_user_pages*() calls, until a later patch fully
implements the DMA-pinning functionality.
Thanks to Jan Kara for noticing that.
* Fixed the implementation of pin_user_pages_remote().
* Further tweaked patch 2 ("mm/gup: factor out duplicate code from four
routines"), in response to Jan Kara's feedback.
* Dropped a few reviewed-by tags due to changes that invalidated
them.
Changes since v3:
* VFIO fix (patch 8): applied further cleanup: removed a pre-existing,
unnecessary release and reacquire of mmap_sem. Moved the DAX vma
checks from the vfio call site, to gup internals, and added comments
(and commit log) to clarify.
* Due to the above, made a corresponding fix to the
pin_longterm_pages_remote(), which was actually calling the wrong
gup internal function.
* Changed put_user_page() comments, to refer to pin*() APIs, rather than
get_user_pages*() APIs.
* Reverted an accidental whitespace-only change in the IB ODP code.
* Added a few more reviewed-by tags.
Changes since v2:
* Added a patch to convert IB/umem from normal gup, to gup_fast(). This
is also posted separately, in order to hopefully get some runtime
testing.
* Changed the page devmap code to be a little clearer,
thanks to Jerome for that.
* Split out the page devmap changes into a separate patch (and moved
Ira's Signed-off-by to that patch).
* Fixed my bug in IB: ODP code does not require pin_user_pages()
semantics. Therefore, revert the put_user_page() calls to put_page(),
and leave the get_user_pages() call as-is.
* As part of the revert, I am proposing here a change directly
from put_user_pages(), to release_pages(). I'd feel better if
someone agrees that this is the best way. It uses the more
efficient release_pages(), instead of put_page() in a loop,
and keep the change to just a few character on one line,
but OTOH it is not a pure revert.
* Loosened the FOLL_LONGTERM restrictions in the
__get_user_pages_locked() implementation, and used that in order
to fix up a VFIO bug. Thanks to Jason for that idea.
* Note the use of release_pages() in IB: is that OK?
* Added a few more WARN's and clarifying comments nearby.
* Many documentation improvements in various comments.
* Moved the new pin_user_pages.rst from Documentation/vm/ to
Documentation/core-api/ .
* Commit descriptions: added clarifying notes to the three patches
(drm/via, fs/io_uring, net/xdp) that already had put_user_page()
calls in place.
* Collected all pending Reviewed-by and Acked-by tags, from v1 and v2
email threads.
* Lot of churn from v2 --> v3, so it's possible that new bugs
sneaked in.
NOT DONE: separate patchset is required:
* __get_user_pages_locked(): stop compensating for
buggy callers who failed to set FOLL_GET. Instead, assert
that FOLL_GET is set (and fail if it's not).
======================================================================
Original cover letter (edited to fix up the patch description numbers)
This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.
This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
============================================================
Patch summary:
* Patches 1-9: refactoring and preparatory cleanup, independent fixes
* Patch 10: introduce pin_user_pages(), FOLL_PIN, but no functional
changes yet
* Patches 11-16: Convert existing put_user_page() callers, to use the
new pin*()
* Patch 17: Activate tracking of FOLL_PIN pages.
* Patches 18-20: convert various callers
* Patches: 21-23: gup_benchmark and run_vmtests support
* Patch 24: rename put_user_page*() --> unpin_user_page*()
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've been
able to runtime test the core get_user_pages() and pin_user_pages() and
related routines, but not so much on several of the call sites--but those
are generally just a couple of lines changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Also, my runtime testing for the call sites so far is very weak:
* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a
good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
============================================================
Next:
* Get the block/bio_vec sites converted to use pin_user_pages().
* Work with Ira and Dave Chinner to weave this together with the
layout lease stuff.
============================================================
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
Dan Williams (1):
mm: Cleanup __put_devmap_managed_page() vs ->page_free()
John Hubbard (23):
mm/gup: pass flags arg to __gup_device_* functions
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
goldish_pipe: rename local pin_user_pages() routine
IB/umem: use get_user_pages_fast() to pin DMA pages
media/v4l2-core: set pages dirty upon releasing DMA buffers
vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
mm/gup: track FOLL_PIN pages
media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_user_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
mm, tree-wide: rename put_user_page*() to unpin_user_page*()
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 233 ++++++++
arch/powerpc/mm/book3s64/iommu_api.c | 12 +-
drivers/gpu/drm/via/via_dmablit.c | 6 +-
drivers/infiniband/core/umem.c | 19 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 8 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 4 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 8 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
drivers/infiniband/sw/siw/siw_mem.c | 4 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 8 +-
drivers/nvdimm/pmem.c | 6 -
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 35 +-
fs/io_uring.c | 6 +-
include/linux/mm.h | 155 +++++-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/gup.c | 561 +++++++++++++++-----
mm/gup_benchmark.c | 74 ++-
mm/huge_memory.c | 54 +-
mm/hugetlb.c | 39 +-
mm/memremap.c | 76 ++-
mm/process_vm_access.c | 28 +-
mm/vmstat.c | 2 +
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 21 +-
tools/testing/selftests/vm/run_vmtests | 22 +
30 files changed, 1104 insertions(+), 350 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.0
These changes are based on Jason's rdma/hmm branch (5.4.0-rc5).
Patch 1 was previously posted here [1] but was dropped from that orginal
series. Hopefully, the tests will reduce concerns about edge conditions.
I'm sure more tests could be usefully added but I thought this was a good
starting point.
Changes since v3:
patch 1:
Unchanged except rebased on Jason's latest hmm (bbe3329e354d3ab5dc18).
patch 2:
Is now part of Jason's tree.
patch 3 (now 2):
Major changes to incorporate Jason's review feedback.
* drivers/char/hmm_dmirror.c driver moved to lib/test_hmm.c
* XArray used instead of "page tables".
* platform device driver removed.
* remove redundant copyright.
Changes since v2:
patch 1:
Removed hmm_range_needs_fault() and just use hmm_range_need_fault().
Updated the change log to include that it fixes a bug where
hmm_range_fault() incorrectly returned an error when no fault is requested.
patch 2:
Removed the confusing change log wording about DMA.
Changed hmm_range_fault() to return the PFN of the zero page like any other
page.
patch 3:
Adjusted the test code to match the new zero page behavior.
Changes since v1:
Rebased to Jason's rdma/hmm branch (5.4.0-rc1).
Cleaned up locking for the test driver's page tables.
Incorporated Christoph Hellwig's comments.
[1] https://lore.kernel.org/linux-mm/20190726005650.2566-6-rcampbell@nvidia.com/
Ralph Campbell (2):
mm/hmm: make full use of walk_page_range()
mm/hmm/test: add self tests for HMM
MAINTAINERS | 3 +
include/uapi/linux/test_hmm.h | 59 ++
lib/Kconfig.debug | 11 +
lib/Makefile | 1 +
lib/test_hmm.c | 1306 ++++++++++++++++++++++++
mm/hmm.c | 121 ++-
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +
tools/testing/selftests/vm/config | 2 +
tools/testing/selftests/vm/hmm-tests.c | 1295 +++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 16 +
tools/testing/selftests/vm/test_hmm.sh | 97 ++
12 files changed, 2852 insertions(+), 63 deletions(-)
create mode 100644 include/uapi/linux/test_hmm.h
create mode 100644 lib/test_hmm.c
create mode 100644 tools/testing/selftests/vm/hmm-tests.c
create mode 100755 tools/testing/selftests/vm/test_hmm.sh
--
2.20.1
This patchset is being developed here:
<https://github.com/cyphar/linux/tree/openat2/master>
Patch changelog:
v17:
* Add a path_is_under() check for LOOKUP_IS_SCOPED in complete_walk(), as a
last line of defence to ensure that namei bugs will not break the contract
of LOOKUP_BENEATH or LOOKUP_IN_ROOT.
* Update based on feedback by Al Viro:
* Make nd_jump_link() free the passed path on error, so that callers don't
need to worry about it in the error path.
* Remove needless m_retry and r_retry variables in handle_dots().
* Always return -ECHILD from follow_dotdot_rcu().
v16: <https://lore.kernel.org/lkml/20191116002802.6663-1-cyphar@cyphar.com/>
v15: <https://lore.kernel.org/lkml/20191105090553.6350-1-cyphar@cyphar.com/>
v14: <https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com>
v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
though stale NFS handles).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVj…
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goo…
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goo…
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME
openat2 - open and possibly create a file (extended)
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The openat2() system call opens the file specified by pathname. If the specified file
does not exist, it may optionally (if O_CREAT is specified in how.flags) be created by
openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the direc-
tory referred to by the file descriptor dirfd (or the current working directory of the
calling process, if dirfd is the special value AT_FDCWD.) If pathname is absolute, then
dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is
resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its
functionality. Rather than taking a single flag argument, an extensible structure (how)
is passed instead to allow for future extensions. size must be set to sizeof(struct
open_how), to facilitate future extensions (see the "Extensibility" section of the NOTES
for more detail on how extensions are handled.)
The open_how structure
The following structure indicates how pathname should be opened, and acts as a superset of
the flag and mode arguments to openat(2).
struct open_how {
__aligned_u64 flags; /* O_* flags. */
__u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
__u16 __padding[3]; /* Must be zeroed. */
__aligned_u64 resolve; /* RESOLVE_* flags. */
};
Any future extensions to openat2() will be implemented as new fields appended to the above
structure (or through reuse of pre-existing padding space), with the zero value of the new
fields acting as though the extension were not present.
The meaning of each field is as follows:
flags
The file creation and status flags to use for this operation. All of the
O_* flags defined for openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting
flags in flags.
mode
File mode for the new file, with identical semantics to the mode argument to
openat(2). However, unlike openat(2), it is an error to provide openat2()
with a mode which contains bits other than 0777.
It is an error to provide openat2() a non-zero mode if flags does not con-
tain O_CREAT or O_TMPFILE.
resolve
Change how the components of pathname will be resolved (see path_resolu-
tion(7) for background information.) The primary use case for these flags
is to allow trusted programs to restrict how untrusted paths (or paths in-
side untrusted directories) are resolved. The full list of resolve flags is
given below.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including
all bind mounts).
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as bind mounts are
very widely used by end-users. Setting this flag indiscrimnately for
all uses of openat2() may result in spurious errors on previously-
functional systems.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This
option implies RESOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
symbolic link will be returned.
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as symbolic links
are very widely used by end-users. Setting this flag indiscrimnately
for all uses of openat2() may result in spurious errors on previ-
ously-functional systems.
RESOLVE_NO_MAGICLINKS
Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
magic link will be returned.
Magic-links are symbolic link-like objects that are most notably
found in proc(5) (examples include /proc/[pid]/exe and
/proc/[pid]/fd/*.) Due to the potential danger of unknowingly open-
ing these magic links, it may be preferable for users to disable
their resolution entirely (see symboliclink(7) for more details.)
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the
resolution is not a descendant of the directory indicated by dirfd.
This results in absolute symbolic links (and absolute values of path-
name) to be rejected.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
RESOLVE_IN_ROOT
Treat dirfd as the root directory while resolving pathname (as though
the user called chroot(2) with dirfd as the argument.) Absolute sym-
bolic links and ".." path components will be scoped to dirfd. If
pathname is an absolute path, it is also treated relative to dirfd.
However, unlike chroot(2) (which changes the filesystem root perma-
nently for a process), RESOLVE_IN_ROOT allows a program to effi-
ciently restrict path resolution for only certain operations. It
also has several hardening features (such detecting escape attempts
during .. resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set
appropriately.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2),
as well as the following additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see
the "Extensibility" section of the NOTES for more detail on how extensions are han-
dled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
not ensure that a ".." component didn't escape (due to a race condition or poten-
tial attack.) Callers may choose to retry the openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
root during path resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount
point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
link (or magic link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic
link.
VERSIONS
openat2() was added to Linux in kernel 5.FOO.
CONFORMING TO
This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES
Glibc does not provide a wrapper for this system call; call it using systemcall(2).
Extensibility
In order to allow for struct open_how to be extended in future kernel revisions, openat2()
requires userspace to specify the size of struct open_how structure they are passing. By
providing this information, it is possible for openat2() to provide both forwards- and
backwards-compatibility — with size acting as an implicit version number (because new ex-
tension fields will always be appended, the size will always increase.) This extensibil-
ity design is very similar to other system calls such as perf_setattr(2),
perf_event_open(2), and clone(3).
If we let usize be the size of the structure according to userspace and ksize be the size
of the structure which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used
verbatim.
* If ksize is larger than usize, then there are some extensions the kernel sup-
ports which the userspace program is unaware of. Because all extensions must
have their zero values be a no-op, the kernel treats all of the extension fields
not set by userspace to have zero values. This provides backwards-compatibil-
ity.
* If ksize is smaller than usize, then there are some extensions which the
userspace program is aware of but the kernel does not support. Because all ex-
tensions must have their zero values be a no-op, the kernel can safely ignore
the unsupported extension fields if they are all-zero. If any unsupported ex-
tension fields are non-zero, then -1 is returned and errno is set to E2BIG.
This provides forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of exten-
sions. However, if a userspace program wishes to determine what extensions the running
kernel supports, they may conduct a binary search on size (to find the largest value which
doesn't produce an error of E2BIG.)
SEE ALSO
openat(2), path_resolution(7), symlink(7)
Linux 2019-11-05 OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (13):
namei: only return -ECHILD from follow_dotdot_rcu()
nsfs: clean-up ns_get_path() signature to return int
namei: allow nd_jump_link() to produce errors
namei: allow set_root() to produce errors
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
Documentation: path-lookup: include new LOOKUP flags
CREDITS | 4 +-
Documentation/filesystems/path-lookup.rst | 68 ++-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 185 +++++--
fs/nsfs.c | 29 +-
fs/open.c | 149 +++--
fs/proc/base.c | 3 +-
fs/proc/namespaces.c | 20 +-
include/linux/fcntl.h | 12 +-
include/linux/namei.h | 12 +-
include/linux/proc_ns.h | 4 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 40 ++
kernel/bpf/offload.c | 12 +-
kernel/events/core.c | 2 +-
security/apparmor/apparmorfs.c | 6 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 316 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
42 files changed, 1686 insertions(+), 113 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
base-commit: 31f4f5b495a62c9a8b15b1c3581acd5efeb9af8c
--
2.24.0
This patchset is being developed here:
<https://github.com/cyphar/linux/tree/openat2/master>
Patch changelog:
v16:
* Update based on review by Al Viro:
* Handle magic-link related errors from with nd_jump_link() and drop
LOOKUP_MAGICLINK_JUMPED.
* Drop the slash-skipping logic for LOOKUP_IN_ROOT since it's not actually
necessary (link_path_walk() already does it, and it's possible that doing
it slightly breaks downstream code).
* Update outdated open_how documentation to match new semantics.
* Update commit message to further explain why -EAGAIN is preferable for
path_is_under() when checking whether ".." is safe.
* Split out the set_root() and nd_jump_root() errors from the
LOOKUP_BENEATH patch.
* Expand path-lookup documentation changes to describe all new LOOKUP flags.
* Cleanup the signature of ns_get_path() such that it returns an int
(previously it returned a void * -- even though the only two
possible return values were ERR_PTR or NULL).
v15: <https://lore.kernel.org/lkml/20191105090553.6350-1-cyphar@cyphar.com/>
v14: <https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com>
v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
though stale NFS handles).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVj…
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goo…
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goo…
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME
openat2 - open and possibly create a file (extended)
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The openat2() system call opens the file specified by pathname. If the specified file
does not exist, it may optionally (if O_CREAT is specified in how.flags) be created by
openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the direc-
tory referred to by the file descriptor dirfd (or the current working directory of the
calling process, if dirfd is the special value AT_FDCWD.) If pathname is absolute, then
dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is
resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its
functionality. Rather than taking a single flag argument, an extensible structure (how)
is passed instead to allow for future extensions. size must be set to sizeof(struct
open_how), to facilitate future extensions (see the "Extensibility" section of the NOTES
for more detail on how extensions are handled.)
The open_how structure
The following structure indicates how pathname should be opened, and acts as a superset of
the flag and mode arguments to openat(2).
struct open_how {
__aligned_u64 flags; /* O_* flags. */
__u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
__u16 __padding[3]; /* Must be zeroed. */
__aligned_u64 resolve; /* RESOLVE_* flags. */
};
Any future extensions to openat2() will be implemented as new fields appended to the above
structure (or through reuse of pre-existing padding space), with the zero value of the new
fields acting as though the extension were not present.
The meaning of each field is as follows:
flags
The file creation and status flags to use for this operation. All of the
O_* flags defined for openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting
flags in flags.
mode
File mode for the new file, with identical semantics to the mode argument to
openat(2). However, unlike openat(2), it is an error to provide openat2()
with a mode which contains bits other than 0777.
It is an error to provide openat2() a non-zero mode if flags does not con-
tain O_CREAT or O_TMPFILE.
resolve
Change how the components of pathname will be resolved (see path_resolu-
tion(7) for background information.) The primary use case for these flags
is to allow trusted programs to restrict how untrusted paths (or paths in-
side untrusted directories) are resolved. The full list of resolve flags is
given below.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including
all bind mounts).
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as bind mounts are
very widely used by end-users. Setting this flag indiscrimnately for
all uses of openat2() may result in spurious errors on previously-
functional systems.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This
option implies RESOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
symbolic link will be returned.
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as symbolic links
are very widely used by end-users. Setting this flag indiscrimnately
for all uses of openat2() may result in spurious errors on previ-
ously-functional systems.
RESOLVE_NO_MAGICLINKS
Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
magic link will be returned.
Magic-links are symbolic link-like objects that are most notably
found in proc(5) (examples include /proc/[pid]/exe and
/proc/[pid]/fd/*.) Due to the potential danger of unknowingly open-
ing these magic links, it may be preferable for users to disable
their resolution entirely (see symboliclink(7) for more details.)
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the
resolution is not a descendant of the directory indicated by dirfd.
This results in absolute symbolic links (and absolute values of path-
name) to be rejected.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
RESOLVE_IN_ROOT
Treat dirfd as the root directory while resolving pathname (as though
the user called chroot(2) with dirfd as the argument.) Absolute sym-
bolic links and ".." path components will be scoped to dirfd. If
pathname is an absolute path, it is also treated relative to dirfd.
However, unlike chroot(2) (which changes the filesystem root perma-
nently for a process), RESOLVE_IN_ROOT allows a program to effi-
ciently restrict path resolution for only certain operations. It
also has several hardening features (such detecting escape attempts
during .. resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set
appropriately.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2),
as well as the following additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see
the "Extensibility" section of the NOTES for more detail on how extensions are han-
dled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
not ensure that a ".." component didn't escape (due to a race condition or poten-
tial attack.) Callers may choose to retry the openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
root during path resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount
point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
link (or magic link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic
link.
VERSIONS
openat2() was added to Linux in kernel 5.FOO.
CONFORMING TO
This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES
Glibc does not provide a wrapper for this system call; call it using systemcall(2).
Extensibility
In order to allow for struct open_how to be extended in future kernel revisions, openat2()
requires userspace to specify the size of struct open_how structure they are passing. By
providing this information, it is possible for openat2() to provide both forwards- and
backwards-compatibility — with size acting as an implicit version number (because new ex-
tension fields will always be appended, the size will always increase.) This extensibil-
ity design is very similar to other system calls such as perf_setattr(2),
perf_event_open(2), and clone(3).
If we let usize be the size of the structure according to userspace and ksize be the size
of the structure which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used
verbatim.
* If ksize is larger than usize, then there are some extensions the kernel sup-
ports which the userspace program is unaware of. Because all extensions must
have their zero values be a no-op, the kernel treats all of the extension fields
not set by userspace to have zero values. This provides backwards-compatibil-
ity.
* If ksize is smaller than usize, then there are some extensions which the
userspace program is aware of but the kernel does not support. Because all ex-
tensions must have their zero values be a no-op, the kernel can safely ignore
the unsupported extension fields if they are all-zero. If any unsupported ex-
tension fields are non-zero, then -1 is returned and errno is set to E2BIG.
This provides forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of exten-
sions. However, if a userspace program wishes to determine what extensions the running
kernel supports, they may conduct a binary search on size (to find the largest value which
doesn't produce an error of E2BIG.)
SEE ALSO
openat(2), path_resolution(7), symlink(7)
Linux 2019-11-05 OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (12):
nsfs: clean-up ns_get_path() signature to return int
namei: allow nd_jump_link() to produce errors
namei: allow set_root() to produce errors
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
Documentation: path-lookup: include new LOOKUP flags
CREDITS | 4 +-
Documentation/filesystems/path-lookup.rst | 68 ++-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 166 ++++--
fs/nsfs.c | 29 +-
fs/open.c | 149 +++--
fs/proc/base.c | 5 +-
fs/proc/namespaces.c | 23 +-
include/linux/fcntl.h | 12 +-
include/linux/namei.h | 12 +-
include/linux/proc_ns.h | 4 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 40 ++
kernel/bpf/offload.c | 12 +-
kernel/events/core.c | 2 +-
security/apparmor/apparmorfs.c | 8 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 316 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
42 files changed, 1675 insertions(+), 112 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
base-commit: 31f4f5b495a62c9a8b15b1c3581acd5efeb9af8c
--
2.24.0
From: Michael Ellerman <mpe(a)ellerman.id.au>
[ Upstream commit 69f8117f17b332a68cd8f4bf8c2d0d3d5b84efc5 ]
Use TEST_GEN_PROGS and don't redefine all, this makes the out-of-tree
build work. We need to move the extra dependencies below the include
of lib.mk, because it adds the $(OUTPUT) prefix if it's defined.
We can also drop the clean rule, lib.mk does it for us.
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/powerpc/cache_shape/Makefile | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/powerpc/cache_shape/Makefile b/tools/testing/selftests/powerpc/cache_shape/Makefile
index 1be547434a49c..7e0c175b82978 100644
--- a/tools/testing/selftests/powerpc/cache_shape/Makefile
+++ b/tools/testing/selftests/powerpc/cache_shape/Makefile
@@ -1,11 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := cache_shape
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c
+TEST_GEN_PROGS := cache_shape
include ../../lib.mk
-clean:
- rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c
--
2.20.1
From: Joel Stanley <joel(a)jms.id.au>
[ Upstream commit 27825349d7b238533a47e3d98b8bb0efd886b752 ]
We should use TEST_GEN_PROGS, not TEST_PROGS. That tells the selftests
makefile (lib.mk) that those tests are generated (built), and so it
adds the $(OUTPUT) prefix for us, making the out-of-tree build work
correctly.
It also means we don't need our own clean rule, lib.mk does it.
We also have to update the signal_tm rule to use $(OUTPUT).
Signed-off-by: Joel Stanley <joel(a)jms.id.au>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/powerpc/signal/Makefile | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/powerpc/signal/Makefile b/tools/testing/selftests/powerpc/signal/Makefile
index a7cbd5082e271..4213978f3ee2c 100644
--- a/tools/testing/selftests/powerpc/signal/Makefile
+++ b/tools/testing/selftests/powerpc/signal/Makefile
@@ -1,14 +1,9 @@
# SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := signal signal_tm
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c signal.S
+TEST_GEN_PROGS := signal signal_tm
CFLAGS += -maltivec
-signal_tm: CFLAGS += -mhtm
+$(OUTPUT)/signal_tm: CFLAGS += -mhtm
include ../../lib.mk
-clean:
- rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c signal.S
--
2.20.1
From: "Shuah Khan (Samsung OSG)" <shuah(a)kernel.org>
[ Upstream commit 9a244229a4b850b11952a0df79607c69b18fd8df ]
When /dev/watchdog open fails, watchdog exits with "watchdog not enabled"
message. This is incorrect when open fails due to insufficient privilege.
Fix message to clearly state the reason when open fails with EACCESS when
a non-root user runs it.
Signed-off-by: Shuah Khan (Samsung OSG) <shuah(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index 6e290874b70e2..e029e2017280f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -89,7 +89,13 @@ int main(int argc, char *argv[])
fd = open("/dev/watchdog", O_WRONLY);
if (fd == -1) {
- printf("Watchdog device not enabled.\n");
+ if (errno == ENOENT)
+ printf("Watchdog device not enabled.\n");
+ else if (errno == EACCES)
+ printf("Run watchdog as root.\n");
+ else
+ printf("Watchdog device open failed %s\n",
+ strerror(errno));
exit(-1);
}
--
2.20.1
From: Michael Ellerman <mpe(a)ellerman.id.au>
[ Upstream commit 69f8117f17b332a68cd8f4bf8c2d0d3d5b84efc5 ]
Use TEST_GEN_PROGS and don't redefine all, this makes the out-of-tree
build work. We need to move the extra dependencies below the include
of lib.mk, because it adds the $(OUTPUT) prefix if it's defined.
We can also drop the clean rule, lib.mk does it for us.
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/powerpc/cache_shape/Makefile | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/powerpc/cache_shape/Makefile b/tools/testing/selftests/powerpc/cache_shape/Makefile
index ede4d3dae7505..689f6c8ebcd8d 100644
--- a/tools/testing/selftests/powerpc/cache_shape/Makefile
+++ b/tools/testing/selftests/powerpc/cache_shape/Makefile
@@ -1,12 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := cache_shape
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c
+TEST_GEN_PROGS := cache_shape
top_srcdir = ../../../../..
include ../../lib.mk
-clean:
- rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c
--
2.20.1
From: Joel Stanley <joel(a)jms.id.au>
[ Upstream commit 27825349d7b238533a47e3d98b8bb0efd886b752 ]
We should use TEST_GEN_PROGS, not TEST_PROGS. That tells the selftests
makefile (lib.mk) that those tests are generated (built), and so it
adds the $(OUTPUT) prefix for us, making the out-of-tree build work
correctly.
It also means we don't need our own clean rule, lib.mk does it.
We also have to update the signal_tm rule to use $(OUTPUT).
Signed-off-by: Joel Stanley <joel(a)jms.id.au>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/powerpc/signal/Makefile | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/powerpc/signal/Makefile b/tools/testing/selftests/powerpc/signal/Makefile
index 1fca25c6ace06..209a958dca127 100644
--- a/tools/testing/selftests/powerpc/signal/Makefile
+++ b/tools/testing/selftests/powerpc/signal/Makefile
@@ -1,15 +1,10 @@
# SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := signal signal_tm
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c signal.S
+TEST_GEN_PROGS := signal signal_tm
CFLAGS += -maltivec
-signal_tm: CFLAGS += -mhtm
+$(OUTPUT)/signal_tm: CFLAGS += -mhtm
top_srcdir = ../../../../..
include ../../lib.mk
-clean:
- rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c signal.S
--
2.20.1
From: Joel Stanley <joel(a)jms.id.au>
[ Upstream commit c39b79082a38a4f8c801790edecbbb4d62ed2992 ]
We should use TEST_GEN_PROGS, not TEST_PROGS. That tells the selftests
makefile (lib.mk) that those tests are generated (built), and so it
adds the $(OUTPUT) prefix for us, making the out-of-tree build work
correctly.
It also means we don't need our own clean rule, lib.mk does it.
We also have to update the ptrace-pkey and core-pkey rules to use
$(OUTPUT).
Signed-off-by: Joel Stanley <joel(a)jms.id.au>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/powerpc/ptrace/Makefile | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/powerpc/ptrace/Makefile b/tools/testing/selftests/powerpc/ptrace/Makefile
index 923d531265f8c..9f9423430059e 100644
--- a/tools/testing/selftests/powerpc/ptrace/Makefile
+++ b/tools/testing/selftests/powerpc/ptrace/Makefile
@@ -1,5 +1,5 @@
# SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := ptrace-gpr ptrace-tm-gpr ptrace-tm-spd-gpr \
+TEST_GEN_PROGS := ptrace-gpr ptrace-tm-gpr ptrace-tm-spd-gpr \
ptrace-tar ptrace-tm-tar ptrace-tm-spd-tar ptrace-vsx ptrace-tm-vsx \
ptrace-tm-spd-vsx ptrace-tm-spr ptrace-hwbreak ptrace-pkey core-pkey \
perf-hwbreak
@@ -7,14 +7,9 @@ TEST_PROGS := ptrace-gpr ptrace-tm-gpr ptrace-tm-spd-gpr \
top_srcdir = ../../../../..
include ../../lib.mk
-all: $(TEST_PROGS)
-
CFLAGS += -m64 -I../../../../../usr/include -I../tm -mhtm -fno-pie
-ptrace-pkey core-pkey: child.h
-ptrace-pkey core-pkey: LDLIBS += -pthread
-
-$(TEST_PROGS): ../harness.c ../utils.c ../lib/reg.S ptrace.h
+$(OUTPUT)/ptrace-pkey $(OUTPUT)/core-pkey: child.h
+$(OUTPUT)/ptrace-pkey $(OUTPUT)/core-pkey: LDLIBS += -pthread
-clean:
- rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c ../lib/reg.S ptrace.h
--
2.20.1
From: Andrea Parri <andrea.parri(a)amarulasolutions.com>
[ Upstream commit fb363e2d20351e1d16629df19e7bce1a31b3227a ]
Fixes the following warnings:
dirty_log_test.c: In function ‘help’:
dirty_log_test.c:216:9: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘int’ [-Wformat=]
printf(" -i: specify iteration counts (default: %"PRIu64")\n",
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from include/test_util.h:18:0,
from dirty_log_test.c:16:
/usr/include/inttypes.h:105:34: note: format string is defined here
# define PRIu64 __PRI64_PREFIX "u"
dirty_log_test.c:218:9: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘int’ [-Wformat=]
printf(" -I: specify interval in ms (default: %"PRIu64" ms)\n",
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from include/test_util.h:18:0,
from dirty_log_test.c:16:
/usr/include/inttypes.h:105:34: note: format string is defined here
# define PRIu64 __PRI64_PREFIX "u"
Signed-off-by: Andrea Parri <andrea.parri(a)amarulasolutions.com>
Signed-off-by: Shuah Khan (Samsung OSG) <shuah(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/kvm/dirty_log_test.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 0c2cdc105f968..a9c4b5e21d7e7 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -31,9 +31,9 @@
/* How many pages to dirty for each guest loop */
#define TEST_PAGES_PER_LOOP 1024
/* How many host loops to run (one KVM_GET_DIRTY_LOG for each loop) */
-#define TEST_HOST_LOOP_N 32
+#define TEST_HOST_LOOP_N 32UL
/* Interval for each host loop (ms) */
-#define TEST_HOST_LOOP_INTERVAL 10
+#define TEST_HOST_LOOP_INTERVAL 10UL
/*
* Guest variables. We use these variables to share data between host
--
2.20.1