Discussions around time virtualization have been going on for a long time. The first attempt to implement a time namespace was made in 2006 by Jeff Dike. Since then, the topic has come up on and off in various discussions.
There are two main use cases for time namespaces:
1. change date and time inside a container;
2. adjust clocks for a container restored from a checkpoint.
“It seems like this might be one of the last major obstacles keeping migration from being used in production systems, given that not all containers and connections can be migrated as long as a time dependency is capable of messing it up.” (by github.com/dav-ell)
The kernel provides access to several clocks: CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME. The last two clocks are monotonic, but their starting points are not defined and differ on each running system. When a container is migrated from one node to another, all clocks have to be restored to consistent states; in other words, they have to continue running from the same points at which they were dumped.
The main idea behind this patch set is adding per-namespace offsets for system clocks. When a process in a non-root time namespace requests the time of a clock, the namespace offset is added to the current value of that clock on the host and the sum is returned.
All offsets are placed on a separate page; this allows us to map it as part of vvar into user processes and use the offsets from vdso calls.
Currently, offsets are implemented for the CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks.
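To make the offset arithmetic concrete, here is a minimal user-space sketch in C. It is not the kernel code from this series: the helper names `ts_normalized()`, `restore_offset()` and `ns_clock_read()` are hypothetical and exist only for illustration. It assumes the offset installed on restore is the dumped clock value minus the destination host's clock value, and that a read inside the namespace returns the host value plus that offset.

```c
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Build a timespec with 0 <= tv_nsec < NSEC_PER_SEC from raw sec/nsec. */
static struct timespec ts_normalized(int64_t sec, int64_t nsec)
{
	struct timespec ts;

	sec += nsec / NSEC_PER_SEC;
	nsec %= NSEC_PER_SEC;
	if (nsec < 0) {
		nsec += NSEC_PER_SEC;
		sec--;
	}
	ts.tv_sec = (time_t)sec;
	ts.tv_nsec = (long)nsec;
	return ts;
}

/* Offset to install on restore: value at dump minus the new host's value. */
static struct timespec restore_offset(struct timespec dumped,
				      struct timespec host_now)
{
	return ts_normalized((int64_t)dumped.tv_sec - host_now.tv_sec,
			     (int64_t)dumped.tv_nsec - host_now.tv_nsec);
}

/* Read path: a process in the namespace sees host time plus the offset. */
static struct timespec ns_clock_read(struct timespec host_now,
				     struct timespec offset)
{
	return ts_normalized((int64_t)host_now.tv_sec + offset.tv_sec,
			     (int64_t)host_now.tv_nsec + offset.tv_nsec);
}
```

With this convention the offset may be negative (the destination host has been up longer than the source), which is why the sketch normalizes after every addition or subtraction.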
Questions to discuss:
* Clone flags exhaustion. Currently there is only one unused clone flag bit left, and it may be worth using it to extend the arguments of the clone system call.
* Realtime clock implementation details: Is having a simple offset enough? What should happen when the date and time are changed on the host? Is there a need to adjust vfs modification and creation times? Implementation of the adjtime() syscall.
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: containers@lists.linux-foundation.org
Cc: criu@openvz.org
Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Andrei Vagin (12):
  ns: Introduce Time Namespace
  timens: Add timens_offsets
  timens: Introduce CLOCK_MONOTONIC offsets
  timens: Introduce CLOCK_BOOTTIME offset
  timerfd/timens: Take into account ns clock offsets
  kernel: Take into account timens clock offsets in clock_nanosleep
  x86/vdso/timens: Add offsets page in vvar
  x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
  posix-timers/timens: Take into account clock offsets
  selftest/timens: Add test for timerfd
  selftest/timens: Add test for clock_nanosleep
  timens/selftest: Add timer offsets test

Dmitry Safonov (8):
  timens: Shift /proc/uptime
  x86/vdso: Restrict splitting vvar vma
  x86/vdso: Purge timens page on setns()/unshare()/clone()
  x86/vdso: Look for vvar vma to purge timens page
  timens: Add align for timens_offsets
  timens: Optimize zero-offsets
  selftest: Add Time Namespace test for supported clocks
  timens/selftest: Add procfs selftest
 arch/Kconfig                                     |   5 +
 arch/x86/Kconfig                                 |   1 +
 arch/x86/entry/vdso/vclock_gettime.c             |  52 +++++
 arch/x86/entry/vdso/vdso-layout.lds.S            |   9 +-
 arch/x86/entry/vdso/vdso2c.c                     |   3 +
 arch/x86/entry/vdso/vma.c                        |  67 +++++++
 arch/x86/include/asm/vdso.h                      |   2 +
 fs/proc/namespaces.c                             |   3 +
 fs/proc/uptime.c                                 |   3 +
 fs/timerfd.c                                     |  16 +-
 include/linux/nsproxy.h                          |   1 +
 include/linux/proc_ns.h                          |   1 +
 include/linux/time_namespace.h                   |  72 +++++++
 include/linux/timens_offsets.h                   |  25 +++
 include/linux/user_namespace.h                   |   1 +
 include/uapi/linux/sched.h                       |   1 +
 init/Kconfig                                     |   8 +
 kernel/Makefile                                  |   1 +
 kernel/fork.c                                    |   3 +-
 kernel/nsproxy.c                                 |  19 +-
 kernel/time/hrtimer.c                            |   8 +
 kernel/time/posix-timers.c                       |  89 ++++++++-
 kernel/time/posix-timers.h                       |   2 +
 kernel/time_namespace.c                          | 230 +++++++++++++++++++++++
 tools/testing/selftests/timens/.gitignore        |   5 +
 tools/testing/selftests/timens/Makefile          |   6 +
 tools/testing/selftests/timens/clock_nanosleep.c |  98 ++++++++++
 tools/testing/selftests/timens/config            |   1 +
 tools/testing/selftests/timens/log.h             |  21 +++
 tools/testing/selftests/timens/procfs.c          | 145 ++++++++++++++
 tools/testing/selftests/timens/timens.c          | 196 +++++++++++++++++++
 tools/testing/selftests/timens/timer.c           |  95 ++++++++++
 tools/testing/selftests/timens/timerfd.c         |  96 ++++++++++
 33 files changed, 1272 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/time_namespace.h
 create mode 100644 include/linux/timens_offsets.h
 create mode 100644 kernel/time_namespace.c
 create mode 100644 tools/testing/selftests/timens/.gitignore
 create mode 100644 tools/testing/selftests/timens/Makefile
 create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
 create mode 100644 tools/testing/selftests/timens/config
 create mode 100644 tools/testing/selftests/timens/log.h
 create mode 100644 tools/testing/selftests/timens/procfs.c
 create mode 100644 tools/testing/selftests/timens/timens.c
 create mode 100644 tools/testing/selftests/timens/timer.c
 create mode 100644 tools/testing/selftests/timens/timerfd.c
This test checks that all supported clocks can be changed by clock_settime.
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Co-developed-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/timens/.gitignore |   1 +
 tools/testing/selftests/timens/Makefile   |   5 +
 tools/testing/selftests/timens/config     |   1 +
 tools/testing/selftests/timens/log.h      |  21 ++++
 tools/testing/selftests/timens/timens.c   | 196 ++++++++++++++++++++++++++++++
 5 files changed, 224 insertions(+)
 create mode 100644 tools/testing/selftests/timens/.gitignore
 create mode 100644 tools/testing/selftests/timens/Makefile
 create mode 100644 tools/testing/selftests/timens/config
 create mode 100644 tools/testing/selftests/timens/log.h
 create mode 100644 tools/testing/selftests/timens/timens.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
new file mode 100644
index 000000000000..27a693229ce1
--- /dev/null
+++ b/tools/testing/selftests/timens/.gitignore
@@ -0,0 +1 @@
+timens
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
new file mode 100644
index 000000000000..b877efb78974
--- /dev/null
+++ b/tools/testing/selftests/timens/Makefile
@@ -0,0 +1,5 @@
+TEST_GEN_PROGS := timens
+
+CFLAGS := -Wall -Werror
+
+include ../lib.mk
diff --git a/tools/testing/selftests/timens/config b/tools/testing/selftests/timens/config
new file mode 100644
index 000000000000..4480620f6f49
--- /dev/null
+++ b/tools/testing/selftests/timens/config
@@ -0,0 +1 @@
+CONFIG_TIME_NS=y
diff --git a/tools/testing/selftests/timens/log.h b/tools/testing/selftests/timens/log.h
new file mode 100644
index 000000000000..05fec7f97870
--- /dev/null
+++ b/tools/testing/selftests/timens/log.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __SELFTEST_TIMENS_LOG_H__
+#define __SELFTEST_TIMENS_LOG_H__
+
+#define pr_msg(fmt, lvl, ...)						\
+	fprintf(stderr, "[%s] (%s:%d)\t" fmt "\n",			\
+		lvl, __FILE__, __LINE__, ##__VA_ARGS__)
+
+#define pr_p(func, fmt, ...)	func(fmt ": %m", ##__VA_ARGS__)
+
+#define pr_err(fmt, ...)						\
+	({								\
+		pr_msg(fmt, "ERR", ##__VA_ARGS__);			\
+		-1;							\
+	})
+#define pr_fail(fmt, ...)	pr_msg(fmt, "FAIL", ##__VA_ARGS__)
+
+#define pr_perror(fmt, ...)	pr_p(pr_err, fmt, ##__VA_ARGS__)
+
+#endif
diff --git a/tools/testing/selftests/timens/timens.c b/tools/testing/selftests/timens/timens.c
new file mode 100644
index 000000000000..dfa6701214b1
--- /dev/null
+++ b/tools/testing/selftests/timens/timens.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+#include <time.h>
+
+#include "log.h"
+
+#ifndef CLONE_NEWTIME
+# define CLONE_NEWTIME	0x00001000
+#endif
+
+/*
+ * Test shouldn't be run for a day, so add 10 days to child
+ * time and check parent's time to be in the same day.
+ */
+#define DAY_IN_SEC			(60*60*24)
+#define TEN_DAYS_IN_SEC			(10*DAY_IN_SEC)
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+#define CLOCK_TYPES							\
+	ct(CLOCK_BOOTTIME),						\
+	ct(CLOCK_MONOTONIC),						\
+	ct(CLOCK_MONOTONIC_COARSE),					\
+	ct(CLOCK_MONOTONIC_RAW),					\
+
+
+#define ct(clock)	clock
+static clockid_t clocks[] = {
+	CLOCK_TYPES
+};
+#undef ct
+#define ct(clock)	#clock
+static char *clock_names[] = {
+	CLOCK_TYPES
+};
+
+static int child_ns, parent_ns;
+
+static int switch_ns(int fd)
+{
+	if (setns(fd, CLONE_NEWTIME)) {
+		pr_perror("setns()");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int init_namespaces(void)
+{
+	char path[] = "/proc/self/ns/time";
+	struct stat st1, st2;
+
+	parent_ns = open(path, O_RDONLY);
+	if (parent_ns <= 0)
+		return pr_perror("Unable to open %s", path);
+
+	if (fstat(parent_ns, &st1))
+		return pr_perror("Unable to stat the parent timens");
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("Can't unshare() timens");
+
+	child_ns = open(path, O_RDONLY);
+	if (child_ns <= 0)
+		return pr_perror("Unable to open %s", path);
+
+	if (fstat(child_ns, &st2))
+		return pr_perror("Unable to stat the timens");
+
+	if (st1.st_ino == st2.st_ino)
+		return pr_perror("The same child_ns after CLONE_NEWTIME");
+
+	return 0;
+}
+
+static int _gettime(clockid_t clk_id, struct timespec *res, bool raw_syscall)
+{
+	int err;
+
+	if (!raw_syscall) {
+		if (clock_gettime(clk_id, res)) {
+			pr_perror("clock_gettime(%d)", (int)clk_id);
+			return -1;
+		}
+		return 0;
+	}
+
+	err = syscall(SYS_clock_gettime, clk_id, res);
+	if (err)
+		pr_perror("syscall(SYS_clock_gettime(%d))", (int)clk_id);
+
+	return err;
+}
+
+static int _settime(clockid_t clk_id, struct timespec *res, bool raw_syscall)
+{
+	int err;
+
+	if (!raw_syscall) {
+		if (clock_settime(clk_id, res))
+			return pr_perror("clock_settime(%d)", (int)clk_id);
+		return 0;
+	}
+
+	err = syscall(SYS_clock_settime, clk_id, res);
+	if (err)
+		pr_perror("syscall(SYS_clock_settime(%d))", (int)clk_id);
+
+	return err;
+}
+
+static int test_gettime(clockid_t clock_index, bool raw_syscall, time_t offset)
+{
+	struct timespec child_ts_new, parent_ts_old, cur_ts;
+	char *entry = raw_syscall ? "syscall" : "vdso";
+	double precision = 0.0;
+
+	switch (clocks[clock_index]) {
+	case CLOCK_MONOTONIC_COARSE:
+	case CLOCK_MONOTONIC_RAW:
+		precision = -2.0;
+		break;
+	}
+
+	if (switch_ns(parent_ns))
+		return pr_err("switch_ns(%d)", parent_ns);
+
+	if (_gettime(clocks[clock_index], &parent_ts_old, raw_syscall))
+		return -1;
+
+	if (switch_ns(child_ns))
+		return pr_err("switch_ns(%d)", child_ns);
+
+	child_ts_new.tv_nsec = parent_ts_old.tv_nsec;
+	child_ts_new.tv_sec = parent_ts_old.tv_sec + offset;
+
+	if (_settime(clocks[clock_index], &child_ts_new, raw_syscall))
+		return -1;
+
+	if (_gettime(clocks[clock_index], &cur_ts, raw_syscall))
+		return -1;
+
+	if (difftime(cur_ts.tv_sec, child_ts_new.tv_sec) < precision) {
+		pr_fail("Child's %s (%s) time has not changed: %lu -> %lu [%lu]",
+			clock_names[clock_index], entry, parent_ts_old.tv_sec,
+			child_ts_new.tv_sec, cur_ts.tv_sec);
+		return -1;
+	}
+
+	if (switch_ns(parent_ns))
+		return pr_err("switch_ns(%d)", parent_ns);
+
+	if (_gettime(clocks[clock_index], &cur_ts, raw_syscall))
+		return -1;
+
+	if (difftime(cur_ts.tv_sec, parent_ts_old.tv_sec) > DAY_IN_SEC) {
+		pr_fail("Parent's %s (%s) time has changed: %lu -> %lu [%lu]",
+			clock_names[clock_index], entry, parent_ts_old.tv_sec,
+			child_ts_new.tv_sec, cur_ts.tv_sec);
+		/* Let's play nice and put it closer to original */
+		clock_settime(clocks[clock_index], &cur_ts);
+		return -1;
+	}
+
+	pr_msg("Passed for %s (%s)", "OK", clock_names[clock_index], entry);
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	unsigned int i;
+	int ret = 0;
+
+	if (init_namespaces())
+		return 1;
+
+	for (i = 0; i < ARRAY_SIZE(clocks); i++) {
+		ret |= test_gettime(i, true, TEN_DAYS_IN_SEC);
+		ret |= test_gettime(i, true, -TEN_DAYS_IN_SEC);
+		ret |= test_gettime(i, false, TEN_DAYS_IN_SEC);
+		ret |= test_gettime(i, false, -TEN_DAYS_IN_SEC);
+	}
+
+	return !!ret;
+}
Hi Dmitry,
Thanks for adding tests with the kernel changes.
On 09/19/2018 02:50 PM, Dmitry Safonov wrote:
> This test checks that all supported clocks can be changed by clock_settime.
It would be good to elaborate a bit more on the nature of the tests here. Also, a few things to consider.
I noticed that this test isn't added to selftests/Makefile as a TARGET. If it is an oversight, please make that change as well. If not, it is fine.
Please make sure that if the test can't be run because of unmet dependencies, it exits with KSFT_SKIP as opposed to an error. Dependencies include configuration, privilege, and any other unsupported conditions.
This comment applies to all the test patches in this series.
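As an illustration of the KSFT_SKIP convention, here is a minimal sketch in C of a dependency pre-check. The helper name `timens_precheck()` is hypothetical and not part of the series. The exit-code values mirror those in tools/testing/selftests/kselftest.h, and the two conditions checked (presence of /proc/self/ns/time and running as root) are assumptions about what these tests need, not an exhaustive list.

```c
#include <unistd.h>

/* Exit codes as defined in tools/testing/selftests/kselftest.h. */
#ifndef KSFT_PASS
#define KSFT_PASS 0
#endif
#ifndef KSFT_SKIP
#define KSFT_SKIP 4
#endif

/*
 * Hypothetical helper: return KSFT_SKIP when the test cannot run in
 * this environment, KSFT_PASS when it is fine to proceed.
 */
static int timens_precheck(void)
{
	/* Assumed: without time namespace support the ns file is absent. */
	if (access("/proc/self/ns/time", F_OK) != 0)
		return KSFT_SKIP;

	/* Assumed: unshare() and clock_settime() here need root. */
	if (geteuid() != 0)
		return KSFT_SKIP;

	return KSFT_PASS;
}
```

A test's main() could call this first and return its value unchanged whenever it is not KSFT_PASS, so that a missing dependency is reported as a skip rather than a failure.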
thanks,
-- Shuah
From: Andrei Vagin <avagin@gmail.com>

Check that timerfd_create takes into account clock offsets.

Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/timens/.gitignore |  1 +
 tools/testing/selftests/timens/Makefile   |  2 +-
 tools/testing/selftests/timens/timerfd.c  | 96 +++++++++++++++++++++++++++++++
 3 files changed, 98 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/timerfd.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 27a693229ce1..b609f6ee9fb9 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1 +1,2 @@
 timens
+timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index b877efb78974..66b90cd28e5c 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens
+TEST_GEN_PROGS := timens timerfd
 
 CFLAGS := -Wall -Werror
 
diff --git a/tools/testing/selftests/timens/timerfd.c b/tools/testing/selftests/timens/timerfd.c
new file mode 100644
index 000000000000..914a4cd9a0df
--- /dev/null
+++ b/tools/testing/selftests/timens/timerfd.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/timerfd.h>
+#include <sys/syscall.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include "log.h"
+
+#ifndef CLONE_NEWTIME
+# define CLONE_NEWTIME	0x00001000
+#endif
+
+int run_test(int clockid)
+{
+	struct itimerspec new_value;
+	struct timespec now;
+	long long elapsed;
+	int fd, i;
+
+	if (clock_gettime(clockid, &now))
+		return pr_perror("clock_gettime");
+
+	for (i = 0; i < 2; i++) {
+		int flags = 0;
+
+		pr_msg("timerfd_settime: %d", "INFO", clockid);
+		new_value.it_value.tv_sec = 3600;
+		new_value.it_value.tv_nsec = 0;
+		new_value.it_interval.tv_sec = 1;
+		new_value.it_interval.tv_nsec = 0;
+
+		if (i == 1) {
+			new_value.it_value.tv_sec += now.tv_sec;
+			new_value.it_value.tv_nsec += now.tv_nsec;
+		}
+
+		fd = timerfd_create(clockid, 0);
+		if (fd == -1)
+			return pr_perror("timerfd_create");
+
+		if (i == 1)
+			flags |= TFD_TIMER_ABSTIME;
+
+		if (timerfd_settime(fd, flags, &new_value, NULL))
+			return pr_perror("timerfd_settime");
+
+		if (timerfd_gettime(fd, &new_value))
+			return pr_perror("timerfd_gettime");
+
+		elapsed = new_value.it_value.tv_sec;
+		if (abs(elapsed - 3600) > 60) {
+			printf("FAIL\n");
+			return 1;
+		}
+
+		close(fd);
+	}
+
+	printf("PASS\n");
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	struct timespec tp;
+	int ret;
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("unshare");
+
+	if (clock_gettime(CLOCK_MONOTONIC, &tp))
+		return pr_perror("clock_gettime");
+	tp.tv_sec = 7 * 24 * 3600;
+	if (clock_settime(CLOCK_MONOTONIC, &tp))
+		return pr_perror("clock_settime");
+
+	if (clock_gettime(CLOCK_BOOTTIME, &tp))
+		return pr_perror("clock_gettime");
+	tp.tv_sec += 9 * 24 * 3600;
+	tp.tv_nsec = 0;
+	if (clock_settime(CLOCK_BOOTTIME, &tp))
+		return pr_perror("clock_settime");
+
+	ret = 0;
+	ret |= run_test(CLOCK_BOOTTIME);
+	ret |= run_test(CLOCK_MONOTONIC);
+	return ret;
+}
+
From: Andrei Vagin <avagin@gmail.com>

Check that clock_nanosleep() takes into account clock offsets.

Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/timens/.gitignore        |  1 +
 tools/testing/selftests/timens/Makefile          |  2 +-
 tools/testing/selftests/timens/clock_nanosleep.c | 98 ++++++++++++++++++++++++
 3 files changed, 100 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index b609f6ee9fb9..9b6c8ddac2c8 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,2 +1,3 @@
+clock_nanosleep
 timens
 timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index 66b90cd28e5c..76a1dc891184 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd
+TEST_GEN_PROGS := timens timerfd clock_nanosleep
 
 CFLAGS := -Wall -Werror
 
diff --git a/tools/testing/selftests/timens/clock_nanosleep.c b/tools/testing/selftests/timens/clock_nanosleep.c
new file mode 100644
index 000000000000..5af780b4cfe0
--- /dev/null
+++ b/tools/testing/selftests/timens/clock_nanosleep.c
@@ -0,0 +1,98 @@
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/timerfd.h>
+#include <sys/syscall.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include "log.h"
+
+#ifndef CLONE_NEWTIME
+#define CLONE_NEWTIME	0x00001000
+#endif
+
+static long long get_elapsed_time(int clockid, struct timespec *start)
+{
+	struct timespec curr;
+	long long secs, nsecs;
+
+	if (clock_gettime(clockid, &curr) == -1)
+		return pr_perror("clock_gettime");
+
+	secs = curr.tv_sec - start->tv_sec;
+	nsecs = curr.tv_nsec - start->tv_nsec;
+	if (nsecs < 0) {
+		secs--;
+		nsecs += 1000000000;
+	}
+	if (nsecs > 1000000000) {
+		secs++;
+		nsecs -= 1000000000;
+	}
+	return secs * 1000 + nsecs / 1000000;
+}
+
+int run_test(int clockid)
+{
+	long long elapsed;
+	int i;
+
+	for (i = 0; i < 2; i++) {
+		struct timespec now = {};
+		struct timespec start;
+
+		if (clock_gettime(clockid, &start) == -1)
+			return pr_perror("clock_gettime");
+
+
+		if (i == 1) {
+			now.tv_sec = start.tv_sec;
+			now.tv_nsec = start.tv_nsec;
+		}
+
+		printf("clock_nanosleep: %d\n", clockid);
+		now.tv_sec += 2;
+		clock_nanosleep(clockid, i ? TIMER_ABSTIME : 0, &now, NULL);
+
+		elapsed = get_elapsed_time(clockid, &start);
+		if (elapsed < 1900 || elapsed > 2100) {
+			pr_fail("elapsed %lld\n", elapsed);
+			return 1;
+		}
+	}
+
+	printf("PASS\n");
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	struct timespec tp;
+	int ret;
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("unshare");
+
+	if (clock_gettime(CLOCK_MONOTONIC, &tp))
+		return pr_perror("clock_gettime");
+	tp.tv_sec += 7 * 24 * 3600;
+	if (clock_settime(CLOCK_MONOTONIC, &tp))
+		return pr_perror("clock_settime");
+
+	if (clock_gettime(CLOCK_BOOTTIME, &tp))
+		return pr_perror("clock_gettime");
+	tp.tv_sec += 9 * 24 * 3600;
+	tp.tv_nsec = 0;
+	if (clock_settime(CLOCK_BOOTTIME, &tp))
+		return pr_perror("clock_settime");
+
+	ret = 0;
+	ret |= run_test(CLOCK_MONOTONIC);
+	return ret;
+}
+
Currently this only checks uptime, but procfs checks for REALTIME might be added in the future.

Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/timens/.gitignore |   1 +
 tools/testing/selftests/timens/Makefile   |   2 +-
 tools/testing/selftests/timens/procfs.c   | 145 ++++++++++++++++++++++++++++++
 3 files changed, 147 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/procfs.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 9b6c8ddac2c8..94ffdd9cead7 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,3 +1,4 @@
 clock_nanosleep
+procfs
 timens
 timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index 76a1dc891184..f96f50d1fef8 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd clock_nanosleep
+TEST_GEN_PROGS := timens timerfd clock_nanosleep procfs
 
 CFLAGS := -Wall -Werror
 
diff --git a/tools/testing/selftests/timens/procfs.c b/tools/testing/selftests/timens/procfs.c
new file mode 100644
index 000000000000..5067cbbddcc5
--- /dev/null
+++ b/tools/testing/selftests/timens/procfs.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <math.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+#include <time.h>
+
+#include "log.h"
+
+#ifndef CLONE_NEWTIME
+# define CLONE_NEWTIME	0x00001000
+#endif
+
+/*
+ * Test shouldn't be run for a day, so add 10 days to child
+ * time and check parent's time to be in the same day.
+ */
+#define MAX_TEST_TIME_SEC		(60*5)
+#define DAY_IN_SEC			(60*60*24)
+#define TEN_DAYS_IN_SEC			(10*DAY_IN_SEC)
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+static int child_ns, parent_ns;
+
+static int switch_ns(int fd)
+{
+	if (setns(fd, CLONE_NEWTIME))
+		return pr_perror("setns()");
+
+	return 0;
+}
+
+static int init_namespaces(void)
+{
+	char path[] = "/proc/self/ns/time";
+	struct stat st1, st2;
+
+	parent_ns = open(path, O_RDONLY);
+	if (parent_ns <= 0)
+		return pr_perror("Unable to open %s", path);
+
+	if (fstat(parent_ns, &st1))
+		return pr_perror("Unable to stat the parent timens");
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("Can't unshare() timens");
+
+	child_ns = open(path, O_RDONLY);
+	if (child_ns <= 0)
+		return pr_perror("Unable to open %s", path);
+
+	if (fstat(child_ns, &st2))
+		return pr_perror("Unable to stat the timens");
+
+	if (st1.st_ino == st2.st_ino)
+		return pr_err("The same child_ns after CLONE_NEWTIME");
+
+	return 0;
+}
+
+static int read_proc_uptime(struct timespec *uptime)
+{
+	unsigned long up_sec, up_nsec;
+	FILE *proc;
+
+	proc = fopen("/proc/uptime", "r");
+	if (proc == NULL) {
+		pr_perror("Unable to open /proc/uptime");
+		return -1;
+	}
+
+	if (fscanf(proc, "%lu.%02lu", &up_sec, &up_nsec) != 2) {
+		if (errno) {
+			pr_perror("fscanf");
+			return -errno;
+		}
+		pr_err("failed to parse /proc/uptime");
+		return -1;
+	}
+	fclose(proc);
+
+	uptime->tv_sec = up_sec;
+	uptime->tv_nsec = up_nsec;
+	return 0;
+}
+
+static int check_uptime(void)
+{
+	struct timespec ts_btime, uptime_new, uptime_old;
+	time_t uptime_expected;
+	double prec = MAX_TEST_TIME_SEC;
+
+	if (switch_ns(parent_ns))
+		return pr_err("switch_ns(%d)", parent_ns);
+
+	if (clock_gettime(CLOCK_BOOTTIME, &ts_btime))
+		return pr_perror("clock_gettime()");
+
+	if (read_proc_uptime(&uptime_old))
+		return 1;
+
+	ts_btime.tv_sec += TEN_DAYS_IN_SEC;
+
+	if (switch_ns(child_ns))
+		return pr_err("switch_ns(%d)", child_ns);
+
+	if (clock_settime(CLOCK_BOOTTIME, &ts_btime))
+		return pr_perror("clock_settime()");
+
+	if (read_proc_uptime(&uptime_new))
+		return 1;
+
+	uptime_expected = uptime_old.tv_sec + TEN_DAYS_IN_SEC;
+	if (fabs(difftime(uptime_new.tv_sec, uptime_expected)) > prec) {
+		pr_fail("uptime in /proc/uptime: old %ld, new %ld [%ld]",
+			uptime_old.tv_sec, uptime_new.tv_sec,
+			uptime_old.tv_sec + TEN_DAYS_IN_SEC);
+		return 1;
+	}
+
+	pr_msg("Passed for /proc/uptime", "OK");
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = 0;
+
+	if (init_namespaces())
+		return 1;
+
+	ret |= check_uptime();
+
+	return ret;
+}
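
For readers following the uptime check: /proc/uptime prints each value as whole seconds followed by a two-digit fraction, so the fractional part is in hundredths of a second rather than nanoseconds. The sketch below is not part of the patch; the helper name `parse_uptime()` is hypothetical, and it assumes the usual "<sec>.<centisec> <idle>" line layout.

```c
#include <stdio.h>
#include <time.h>

/*
 * Parse the first field of a /proc/uptime line ("<sec>.<centisec> ...")
 * into a timespec. Returns 0 on success, -1 on a malformed line.
 * Assumes the fraction always has exactly two digits.
 */
static int parse_uptime(const char *line, struct timespec *uptime)
{
	unsigned long sec, centisec;

	if (sscanf(line, "%lu.%2lu", &sec, &centisec) != 2)
		return -1;

	uptime->tv_sec = (time_t)sec;
	/* Hundredths of a second to nanoseconds. */
	uptime->tv_nsec = (long)centisec * 10000000L;
	return 0;
}
```

Since the selftest only compares whole seconds, the fraction does not affect its result; the conversion matters only if sub-second precision is ever checked.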
From: Andrei Vagin <avagin@openvz.org>

Check that timer_create takes into account clock offsets.

Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/timens/.gitignore |  1 +
 tools/testing/selftests/timens/Makefile   |  3 +-
 tools/testing/selftests/timens/timer.c    | 95 +++++++++++++++++++++++++++++++
 3 files changed, 98 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/timer.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 94ffdd9cead7..3b7eda8f35ce 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,4 +1,5 @@
 clock_nanosleep
 procfs
 timens
+timer
 timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index f96f50d1fef8..ae1ffd24cc43 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,5 +1,6 @@
-TEST_GEN_PROGS := timens timerfd clock_nanosleep procfs
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs
 
 CFLAGS := -Wall -Werror
+LDFLAGS := -lrt
 
 include ../lib.mk
diff --git a/tools/testing/selftests/timens/timer.c b/tools/testing/selftests/timens/timer.c
new file mode 100644
index 000000000000..e3a0951aadc8
--- /dev/null
+++ b/tools/testing/selftests/timens/timer.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/syscall.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <signal.h>
+#include <time.h>
+
+#include "log.h"
+
+#ifndef CLONE_NEWTIME
+#define CLONE_NEWTIME	0x00001000	/* New time namespace */
+#endif
+
+int run_test(int clockid)
+{
+	struct itimerspec new_value;
+	struct timespec now;
+	long long elapsed;
+	timer_t fd;
+	int i;
+
+	if (clock_gettime(clockid, &now) == -1)
+		return pr_perror("clock_gettime");
+
+	for (i = 0; i < 2; i++) {
+		struct sigevent sevp = {.sigev_notify = SIGEV_NONE};
+		int flags = 0;
+
+		pr_msg("timer_settime: %d", "INFO", clockid);
+		new_value.it_value.tv_sec = 3600;
+		new_value.it_value.tv_nsec = 0;
+		new_value.it_interval.tv_sec = 1;
+		new_value.it_interval.tv_nsec = 0;
+
+		if (i == 1) {
+			new_value.it_value.tv_sec += now.tv_sec;
+			new_value.it_value.tv_nsec += now.tv_nsec;
+		}
+
+		if (timer_create(clockid, &sevp, &fd) == -1)
+			return pr_perror("timer_create");
+
+		if (i == 1)
+			flags |= TIMER_ABSTIME;
+		if (timer_settime(fd, flags, &new_value, NULL) == -1)
+			return pr_perror("timer_settime");
+
+		if (timer_gettime(fd, &new_value) == -1)
+			return pr_perror("timer_gettime");
+
+		elapsed = new_value.it_value.tv_sec;
+		if (abs(elapsed - 3600) > 60) {
+			pr_fail("elapsed: %lld\n", elapsed);
+			return 1;
+		}
+	}
+
+	printf("PASS\n");
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	struct timespec tp;
+	int ret;
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("unshare");
+
+	if (clock_gettime(CLOCK_MONOTONIC, &tp))
+		return pr_perror("clock_gettime");
+	tp.tv_sec -= 70 * 24 * 3600;
+	if (clock_settime(CLOCK_MONOTONIC, &tp))
+		return pr_perror("clock_settime");
+
+	if (clock_gettime(CLOCK_BOOTTIME, &tp))
+		return pr_perror("clock_gettime");
+	tp.tv_sec -= 9 * 24 * 3600;
+	tp.tv_nsec = 0;
+	if (clock_settime(CLOCK_BOOTTIME, &tp))
+		return pr_perror("clock_settime");
+
+	ret = 0;
+	ret |= run_test(CLOCK_BOOTTIME);
+	ret |= run_test(CLOCK_MONOTONIC);
+	return ret;
+}
+
Dmitry Safonov <dima@arista.com> writes:
> Discussions around time virtualization have been going on for a long time. The first attempt to implement a time namespace was made in 2006 by Jeff Dike. Since then, the topic has come up on and off in various discussions.
> There are two main use cases for time namespaces:
> - change date and time inside a container;
> - adjust clocks for a container restored from a checkpoint.
> “It seems like this might be one of the last major obstacles keeping migration from being used in production systems, given that not all containers and connections can be migrated as long as a time dependency is capable of messing it up.” (by github.com/dav-ell)
> The kernel provides access to several clocks: CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME. The last two clocks are monotonic, but their starting points are not defined and differ on each running system. When a container is migrated from one node to another, all clocks have to be restored to consistent states; in other words, they have to continue running from the same points at which they were dumped.
> The main idea behind this patch set is adding per-namespace offsets for system clocks. When a process in a non-root time namespace requests the time of a clock, the namespace offset is added to the current value of that clock on the host and the sum is returned.
> All offsets are placed on a separate page; this allows us to map it as part of vvar into user processes and use the offsets from vdso calls.
> Currently, offsets are implemented for the CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks.
> Questions to discuss:
> - Clone flags exhaustion. Currently there is only one unused clone flag bit left, and it may be worth using it to extend the arguments of the clone system call.
> - Realtime clock implementation details: Is having a simple offset enough? What should happen when the date and time are changed on the host? Is there a need to adjust vfs modification and creation times? Implementation of the adjtime() syscall.
Overall I support this effort. In my quick skim this code looked good.
My feeling is that we need to be able to support running ntpd and support one namespace doing Google's smoothing of leap seconds while another namespace takes the leap second.
What I was imagining when I was last thinking about this was one instance of struct timekeeper aka tk_core per time namespace. That structure already keeps offsets for all of the various clocks from the kernel's internal time sources. What would be needed would be to pass in an appropriate time namespace pointer.
I could be completely wrong as I have not taken the time to completely trace through the code. Have you looked at pushing the time namespace down as far as tk_core?
What I think would be the big advantage (besides ntp working) is that the bulk of the code could be reused, allowing testing of the kernel's time code by setting up a new time namespace. So a person in production could set up a time namespace with the time set ahead a little bit and be able to verify that the kernel handles the upcoming leap second properly.
I don't know about the vfs. I think the danger is being able to write dates in the future or in the past. It appears that utimes(2) and utimensat(2) already allow this, except for the status change time. So it is possible we simply don't care. I seem to remember that what nfs does is take the time stamp from the host writing to the file.
I think the guide for filesystem timestamps should be to first ensure we don't introduce security issues, and then do what distributed filesystems do when dealing with hosts with different clocks.
Given those two guidelines above, I don't think there is a need to change timestamps the way the user namespace changes the uid when displayed.
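The point above about utimensat(2) already permitting arbitrary access and modification times can be illustrated with a small sketch in C. This is only an illustration: the helper names `set_file_times()` and `demo_future_mtime()` and the /tmp path are hypothetical, and the sketch assumes the filesystem under /tmp stores second-granularity timestamps ten days in the future.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Set both atime and mtime of @path to @sec (seconds since the epoch).
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_file_times(const char *path, time_t sec)
{
	struct timespec times[2] = {
		{ .tv_sec = sec, .tv_nsec = 0 },	/* atime */
		{ .tv_sec = sec, .tv_nsec = 0 },	/* mtime */
	};

	return utimensat(AT_FDCWD, path, times, 0);
}

/*
 * Create a temporary file, move its timestamps 10 days ahead and read
 * mtime back. Returns 0 if the requested time was stored, -1 otherwise.
 */
static int demo_future_mtime(void)
{
	char path[] = "/tmp/utimensat-demo-XXXXXX";
	time_t future = time(NULL) + 10 * 24 * 3600;
	struct stat st;
	int fd, ret = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	close(fd);

	if (set_file_times(path, future) == 0 &&
	    stat(path, &st) == 0 && st.st_mtime == future)
		ret = 0;

	unlink(path);
	return ret;
}
```

The file's owner needs no special privilege for this call, which is consistent with the observation that future or past file dates are already possible without a time namespace.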
As for the hardware like the real time clock we definitely should not let a root in a time namespace change it. We might even be able to get away with leaving the real time clock out of the time namespace. If not we need to be very careful how the real time clock is abstracted. I would start by leaving the real time clock hardware out of the time namespace and see if there is any part of userspace that cares.
Eric
Andrei Vagin (12):
  ns: Introduce Time Namespace
  timens: Add timens_offsets
  timens: Introduce CLOCK_MONOTONIC offsets
  timens: Introduce CLOCK_BOOTTIME offset
  timerfd/timens: Take into account ns clock offsets
  kernel: Take into account timens clock offsets in clock_nanosleep
  x86/vdso/timens: Add offsets page in vvar
  x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
  posix-timers/timens: Take into account clock offsets
  selftest/timens: Add test for timerfd
  selftest/timens: Add test for clock_nanosleep
  timens/selftest: Add timer offsets test

Dmitry Safonov (8):
  timens: Shift /proc/uptime
  x86/vdso: Restrict splitting vvar vma
  x86/vdso: Purge timens page on setns()/unshare()/clone()
  x86/vdso: Look for vvar vma to purge timens page
  timens: Add align for timens_offsets
  timens: Optimize zero-offsets
  selftest: Add Time Namespace test for supported clocks
  timens/selftest: Add procfs selftest

 arch/Kconfig                                     |   5 +
 arch/x86/Kconfig                                 |   1 +
 arch/x86/entry/vdso/vclock_gettime.c             |  52 +++++
 arch/x86/entry/vdso/vdso-layout.lds.S            |   9 +-
 arch/x86/entry/vdso/vdso2c.c                     |   3 +
 arch/x86/entry/vdso/vma.c                        |  67 +++++++
 arch/x86/include/asm/vdso.h                      |   2 +
 fs/proc/namespaces.c                             |   3 +
 fs/proc/uptime.c                                 |   3 +
 fs/timerfd.c                                     |  16 +-
 include/linux/nsproxy.h                          |   1 +
 include/linux/proc_ns.h                          |   1 +
 include/linux/time_namespace.h                   |  72 +++++++
 include/linux/timens_offsets.h                   |  25 +++
 include/linux/user_namespace.h                   |   1 +
 include/uapi/linux/sched.h                       |   1 +
 init/Kconfig                                     |   8 +
 kernel/Makefile                                  |   1 +
 kernel/fork.c                                    |   3 +-
 kernel/nsproxy.c                                 |  19 +-
 kernel/time/hrtimer.c                            |   8 +
 kernel/time/posix-timers.c                       |  89 ++++++++-
 kernel/time/posix-timers.h                       |   2 +
 kernel/time_namespace.c                          | 230 +++++++++++++++++++++++
 tools/testing/selftests/timens/.gitignore        |   5 +
 tools/testing/selftests/timens/Makefile          |   6 +
 tools/testing/selftests/timens/clock_nanosleep.c |  98 ++++++++++
 tools/testing/selftests/timens/config            |   1 +
 tools/testing/selftests/timens/log.h             |  21 +++
 tools/testing/selftests/timens/procfs.c          | 145 ++++++++++++++
 tools/testing/selftests/timens/timens.c          | 196 +++++++++++++++++++
 tools/testing/selftests/timens/timer.c           |  95 ++++++++++
 tools/testing/selftests/timens/timerfd.c         |  96 ++++++++++
 33 files changed, 1272 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/time_namespace.h
 create mode 100644 include/linux/timens_offsets.h
 create mode 100644 kernel/time_namespace.c
 create mode 100644 tools/testing/selftests/timens/.gitignore
 create mode 100644 tools/testing/selftests/timens/Makefile
 create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
 create mode 100644 tools/testing/selftests/timens/config
 create mode 100644 tools/testing/selftests/timens/log.h
 create mode 100644 tools/testing/selftests/timens/procfs.c
 create mode 100644 tools/testing/selftests/timens/timens.c
 create mode 100644 tools/testing/selftests/timens/timer.c
 create mode 100644 tools/testing/selftests/timens/timerfd.c
On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
Overall I support this effort. In my quick skim this code looked good.
Hi Eric,
Thank you for the feedback.
It is an interesting idea, but I have a few questions:
1. Does it mean that timekeeping_update() will be called for each namespace? This function is called periodically; it updates the times in the timekeeper structure, updates vsyscall_gtod_data, etc. What would the overhead of this be?
2. What will we do with the vdso? It looks like we will need a separate vsyscall_gtod_data for each namespace and will have to update each of them separately.
Andrey Vagin avagin@virtuozzo.com writes:
- Does it mean that timekeeping_update() will be called for each namespace? This function is called periodically; it updates the times in the timekeeper structure, updates vsyscall_gtod_data, etc. What would the overhead of this be?
I don't know if periodically is a proper characterization. There may be a code path that does that. But from what I can see timekeeping_update is the guts of settimeofday (and a few related functions).
So it appears to make sense for timekeeping_update to be per namespace.
Hmm. Looking at what is updated in the vsyscall_gtod_data, it does look like you would have to periodically update things, but I don't know how big that period would be. As long as the period is reasonably large, or the time namespaces are sufficiently desynchronized, it should not be a problem. But that is the class of problem that could make my ideal impractical if there is measurable overhead.
Where were you seeing timekeeping_update being called periodically?
- What will we do with vdso? It looks like we will have to have a separate vsyscall_gtod_data for each ns and update each of them separately.
Yes. But you don't have to introduce another variable; just make certain vsyscall_gtod_data is a page-aligned thing per time namespace.
If I read the summary of the existing patchset correctly, something very similar is already going on.
Each process would only map one. And unshare of the time namespace would need to act like the pid namespace or be limited to only being allowed when there is only a single task using the mm.
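As a rough user-space illustration of "a page-aligned thing per time namespace", the following C11 sketch allocates one page-sized, page-aligned block per namespace. The structure demo_gtod_page, its fields, the 4096-byte DEMO_PAGE_SIZE and demo_alloc_gtod_page() are all hypothetical stand-ins; this is not the kernel's vsyscall_gtod_data layout and not how the patch set allocates its offsets page.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096  /* assumed page size for this sketch */

/* Hypothetical per-namespace copy of the data a vDSO reader would need.
 * alignas() on the first member forces page alignment of the whole
 * structure, which also rounds its size up to one full page. */
struct demo_gtod_page {
    alignas(DEMO_PAGE_SIZE) uint64_t cycle_last;
    uint32_t mult;
    uint32_t shift;
    uint64_t monotonic_ns;
    uint64_t wall_ns;
};

static_assert(sizeof(struct demo_gtod_page) == DEMO_PAGE_SIZE,
              "exactly one page per time namespace");

/* Allocate a zeroed, page-aligned block for one namespace, so that a
 * process could map exactly the page that belongs to its namespace. */
static struct demo_gtod_page *demo_alloc_gtod_page(void)
{
    struct demo_gtod_page *p = aligned_alloc(DEMO_PAGE_SIZE, sizeof(*p));

    if (p)
        memset(p, 0, sizeof(*p));
    return p;
}
```

Because each block occupies a whole page of its own, mapping one namespace's page never exposes data belonging to another namespace.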
Eric
On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
Where were you seeing timekeeping_update being called periodically?
timekeeping_update() is called HZ times per second:
[ 67.912858] timekeeping_update.cold.26+0x5/0xa
[ 67.913332] timekeeping_advance+0x361/0x5c0
[ 67.913857] ? tick_sched_do_timer+0x55/0x70
[ 67.914409] ? tick_sched_do_timer+0x70/0x70
[ 67.914947] tick_sched_do_timer+0x55/0x70
[ 67.915505] tick_sched_timer+0x27/0x70
[ 67.916042] __hrtimer_run_queues+0x10f/0x440
[ 67.916639] hrtimer_interrupt+0x100/0x220
[ 67.917305] smp_apic_timer_interrupt+0x79/0x220
[ 67.918030] apic_timer_interrupt+0xf/0x20
- What will we do with vdso? It looks like we will have to have a separate vsyscall_gtod_data for each ns and update each of them separately.
Yes. But you don't have to introduce another variable; just make certain vsyscall_gtod_data is a page-aligned thing per time namespace.
If I read the summary of the existing patchset correctly, something very similar is already going on.
I mean that vsyscall_gtod_data has some data which is updated often: there are timestamps for the monotonic and wall clocks. clock_gettime() reads a timestamp from vsyscall_gtod_data and then uses the TSC to approximate the current value of the clock.
Actually, this is not a second question; it is part of the first one: update_vsyscall() is called from timekeeping_update().
Andrey Vagin avagin@virtuozzo.com writes:
Interesting.
Reading the code, the calling sequence there is:
  tick_sched_do_timer
    tick_do_update_jiffies64
      update_wall_time
        timekeeping_advance
          timekeeping_update
If I read that properly, under the right nohz circumstances that update can be delayed indefinitely.
So I think we could prototype a time namespace that was per timekeeping_update and just had update_wall_time iterate through all of the time namespaces.
I don't think the naive version would scale to very many time namespaces.
At the same time using the techniques from the nohz work and a little smarts I expect we could get the code to scale.
I think this direction is definitely worth exploring. My experience with namespaces is that if we don't get the advanced features working there is little to no interest from the core developers of the code, and the namespaces don't solve additional problems. Which makes the namespace a hard sell. Especially when it does not solve problems the developers of the subsystem have.
The advantage of timekeeping_update per time namespace is that it allows different lengths of seconds per time namespace. Which allows testing ntp and the kernel in interesting ways while still having a working production configuration on the same system.
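A toy user-space model of the naive prototype (one update per namespace on every tick) might look like the sketch below. struct demo_timens, ns_per_tick and demo_tick_all() are invented for this illustration and are not kernel interfaces. The sketch shows both sides of the idea: each namespace can count a second of a different length, and the work per tick grows linearly with the number of namespaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-namespace timekeeping state. */
struct demo_timens {
    int64_t  now_ns;       /* current time as seen inside the namespace */
    uint32_t ns_per_tick;  /* how long one tick is in this namespace */
};

/* Naive update: on every tick, walk all namespaces and advance each one
 * by its own tick length. Returns the number of per-namespace updates
 * performed, which is the cost that grows with the namespace count. */
static size_t demo_tick_all(struct demo_timens *ns, size_t count)
{
    assert(ns != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        ns[i].now_ns += ns[i].ns_per_tick;
    return count;
}
```

With HZ = 1000 and three namespaces this already means 3000 updates per second, which is the scaling concern raised above.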
Eric
2018-09-26 18:36 GMT+01:00 Eric W. Biederman ebiederm@xmission.com:
The advantage of timekeeping_update per time namespace is that it allows different lengths of seconds per time namespace. Which allows testing ntp and the kernel in interesting ways while still having a working production configuration on the same system.
Just a quick note: different lengths of a second per namespace sound very interesting from my point of view. I remember seeing this article: http://publish.illinois.edu/science-of-security-lablet/files/2014/05/DSSnet-...
And their implementation, which simulates time running at a different speed per pid (with the vdso disabled): https://github.com/littlepretty/VirtualTimeKernel
Thanks, Dmitry
On Wed, 26 Sep 2018, Eric W. Biederman wrote:
Reading the code, the calling sequence there is:
  tick_sched_do_timer
    tick_do_update_jiffies64
      update_wall_time
        timekeeping_advance
          timekeeping_update
If I read that properly under the right nohz circumstances that update can be delayed indefinitely.
So I think we could prototype a time namespace that was per timekeeping_update and just had update_wall_time iterate through all of the time namespaces.
Please don't go there. timekeeping_update() is already heavy, and walking through a gazillion namespaces will just make it horrible.
I don't think the naive version would scale to very many time namespaces.
:)
At the same time using the techniques from the nohz work and a little smarts I expect we could get the code to scale.
You'd need to invoke the update when the namespace is switched in and hasn't been updated since the last tick happened. That might be doable, but you also need to take the wraparound constraints of the underlying clocksources into account, which again can cause walking all name spaces when they are all idle long enough.
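The wraparound constraint can be put into numbers with a small C sketch. It assumes a free-running counter of a given bit width and frequency. The helper names and the "half the wrap period" safety margin are arbitrary choices for this illustration, not what the kernel's clocksource code computes.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds until a free-running counter that is `bits` wide and
 * ticks at `freq_hz` wraps around. Limited to 32 bits so that the
 * intermediate product fits into 64 bits. */
static uint64_t demo_wrap_ns(unsigned bits, uint64_t freq_hz)
{
    assert(bits >= 1 && bits <= 32 && freq_hz > 0);
    return ((UINT64_C(1) << bits) * UINT64_C(1000000000)) / freq_hz;
}

/* A lazily updated namespace still has to be refreshed before the
 * counter wraps, otherwise elapsed time can no longer be computed.
 * Half of the wrap period is used here as an arbitrary safety margin. */
static int demo_needs_forced_update(uint64_t idle_ns, unsigned bits,
                                    uint64_t freq_hz)
{
    return idle_ns >= demo_wrap_ns(bits, freq_hz) / 2;
}
```

For a 32-bit counter at 1 GHz the wrap period is about 4.29 seconds, so in this model every namespace that has been idle for roughly two seconds would need a forced update, whether or not anything runs in it.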
From there it becomes hairy, because it's not only timekeeping, i.e. reading time; this also affects all timers which are armed from a namespace.
That gets really ugly because when you do settimeofday() or adjtimex() for a particular namespace, then you have to search for all armed timers of that namespace and adjust them.
The original posix timer code had the same issue because it mapped the clock realtime timers to the timer wheel, so any setting of the clock caused a full walk of all armed timers, disarming, adjusting and requeuing them. That's horrible not only performance-wise, it's also a locking nightmare of all sorts.
Add time skew via NTP/PTP into the picture and you might have to adjust timers as well, because you need to guarantee that they are not expiring early.
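Why armed timers have to be adjusted can be shown with a small C model. It assumes the convention from the cover letter (namespace time = host time + offset) and an absolute timer whose expiry is converted to host time once, when it is armed. struct demo_timer and the demo_* helpers are hypothetical and only model the arithmetic, not the hrtimer code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical absolute timer armed from inside a time namespace. */
struct demo_timer {
    int64_t ns_expiry;    /* requested expiry in namespace time */
    int64_t host_expiry;  /* expiry converted to host time when armed */
};

/* Arm: namespace time = host time + offset, so subtract the offset. */
static void demo_arm_abs(struct demo_timer *t, int64_t ns_expiry,
                         int64_t offset)
{
    t->ns_expiry = ns_expiry;
    t->host_expiry = ns_expiry - offset;
}

/* Namespace time at which the timer fires if the offset is changed to
 * `new_offset` after arming and the timer is left alone. */
static int64_t demo_fires_at(const struct demo_timer *t, int64_t new_offset)
{
    return t->host_expiry + new_offset;
}

/* Positive result: the timer fires late. Negative: it fires early. */
static int64_t demo_expiry_error(const struct demo_timer *t,
                                 int64_t new_offset)
{
    return demo_fires_at(t, new_offset) - t->ns_expiry;
}

/* The fix-up that would be needed for every armed timer of the
 * namespace whenever its offset changes. */
static void demo_readjust(struct demo_timer *t, int64_t new_offset)
{
    t->host_expiry = t->ns_expiry - new_offset;
    assert(demo_expiry_error(t, new_offset) == 0);
}
```

In this model, moving the namespace clock forward after arming makes the stale timer fire late by the size of the jump, and moving it backward makes it fire early. Avoiding both requires something like demo_readjust() on every armed timer of that namespace.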
I haven't looked through Dmitry's patches yet, but I don't see how this can work at all without introducing subtle issues all over the place.
Thanks,
tglx
On Thu, 27 Sep 2018, Thomas Gleixner wrote:
Add time skew via NTP/PTP into the picture and you might have to adjust timers as well, because you need to guarantee that they are not expiring early.
I haven't looked through Dmitry's patches yet, but I don't see how this can work at all without introducing subtle issues all over the place.
And just a quick scan tells me that this is broken. Timers will expire early or late. The latter is acceptable to some extent, but larger delays might come as a surprise. Expiring early is an absolute no-no.
Thanks,
tglx
On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
And just a quick scan tells me that this is broken. Timers will expire early or late. The latter is acceptible to some extent, but larger delays might come with surprise. Expiring early is an absolute nono.
Do you mean that we have to adjust all timers after changing offset for CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for monotonic and boot times will be set immediately after creating a time namespace before using any timers.
It is interesting to think what a use-case for changing these offsets after creating timers. It may be useful for testing needs. A user sets a timer in an hour and then change a clock offset forward and check that a test application handles the timer properly.
Thanks,
tglx
On Mon, 1 Oct 2018, Andrey Vagin wrote:
On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
On Thu, 27 Sep 2018, Thomas Gleixner wrote:
Add time skew via NTP/PTP into the picture and you might have to adjust timers as well, because you need to guarantee that they are not expiring early.
I haven't looked through Dmitry's patches yet, but I don't see how this can work at all without introducing subtle issues all over the place.
And just a quick scan tells me that this is broken. Timers will expire early or late. The latter is acceptable to some extent, but larger delays might come as a surprise. Expiring early is an absolute no-no.
Do you mean that we have to adjust all timers after changing offset for CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for monotonic and boot times will be set immediately after creating a time namespace before using any timers.
I explained that in detail in this thread, but it's not about the initial setting of clock mono/boot before any timers have been armed.
It's about setting the offset or clock realtime (via settimeofday) when timers are already armed. Also having an entirely different time domain, e.g. separate NTP adjustments, makes that necessary.
Thanks,
tglx
Hi Thomas, Andrei, Eric,
On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner tglx@linutronix.de wrote:
On Mon, 1 Oct 2018, Andrey Vagin wrote:
On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
On Thu, 27 Sep 2018, Thomas Gleixner wrote:
Add time skew via NTP/PTP into the picture and you might have to adjust timers as well, because you need to guarantee that they are not expiring early.
I haven't looked through Dmitry's patches yet, but I don't see how this can work at all without introducing subtle issues all over the place.
And just a quick scan tells me that this is broken. Timers will expire early or late. The latter is acceptable to some extent, but larger delays might come as a surprise. Expiring early is an absolute no-no.
Do you mean that we have to adjust all timers after changing offset for CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for monotonic and boot times will be set immediately after creating a time namespace before using any timers.
I explained that in detail in this thread, but it's not about the initial setting of clock mono/boot before any timers have been armed.
It's about setting the offset or clock realtime (via settimeofday) when timers are already armed. Also having an entirely different time domain, e.g. separate NTP adjustments, makes that necessary.
It looks like there is a bit of a misunderstanding: Andrei was talking about the current RFC version, where we haven't introduced offsets for clock realtime, while Thomas, IIUC, is looking at how to extend the time namespace to cover realtime.
As CLOCK_REALTIME virtualization raises so many complex questions, like a different length of the second or a list of realtime timers in the ns, we haven't added any implementation for it.
It seems like an initial introduction of timens can be extended later to cover realtime clocks too. While it may seem incomplete, it solves issues with restoring/migrating real-world applications like nodejs and Oracle DB server, which fail after being restored if there is a leap in monotonic time.
While solving the mentioned issues, it doesn't bring overhead. (well, Andy noted that cmp for zero-offsets on vdso can be optimized too, which will be done in v1).
Thomas, thanks much for your input - now we know that we'll need to introduce a list of timers in the namespace when we add realtime clocks. Do you believe that CLOCK_MONOTONIC_SYNC would be an easier concept than per-namespace offsets?
Thanks, Dmitry
Dmitry,
On Tue, 2 Oct 2018, Dmitry Safonov wrote:
On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner tglx@linutronix.de wrote:
I explained that in detail in this thread, but it's not about the initial setting of clock mono/boot before any timers have been armed.
It's about setting the offset or clock realtime (via settimeofday) when timers are already armed. Also having an entirely different time domain, e.g. separate NTP adjustments, makes that necessary.
It looks like there is a bit of a misunderstanding: Andrei was talking about the current RFC version, where we haven't introduced offsets for clock realtime, while Thomas, IIUC, is looking at how to extend the time namespace to cover realtime.
As CLOCK_REALTIME virtualization raises so many complex questions, like a different length of the second or a list of realtime timers in the ns, we haven't added any implementation for it.
It seems like an initial introduction of timens can be extended later to cover realtime clocks too. While it may seem incomplete, it solves issues with restoring/migrating real-world applications like nodejs and Oracle DB server, which fail after being restored if there is a leap in monotonic time.
Well, yes. But you really have to think about the full picture. Just adding part of the overall solution right now, just because it can be glued into the code easily, is not the best approach IMO as it might result in substantial rework of the whole thing sooner rather than later. I really don't want to end up with something which is not extensible and has to be supported forever.
Just for the record, the current approach with name space offsets for monotonic is also prone to malfunction vs. timers, unless you can prevent changing the offset _after_ the namespace has been set up and timers have been armed. I admit that I did not look closely enough to verify that.
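One way to read the constraint Thomas describes is as a write-once rule: the monotonic offset may be set only while nothing in the namespace depends on it yet. The C sketch below illustrates that rule in isolation. The struct `timens_state`, its fields, and `timens_set_offset()` are hypothetical names made up for this illustration; they are not taken from the patch set under discussion.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-namespace state, for illustration only. */
struct timens_state {
    int64_t monotonic_offset_ns; /* namespace time = host time + offset */
    unsigned int armed_timers;   /* timers currently armed from this namespace */
    int frozen;                  /* nonzero once the namespace is in use */
};

/*
 * Accept a new offset only while the namespace is still unused.
 * Returns 0 on success, -EBUSY if changing the offset now could make
 * already armed timers expire early or late.
 */
int timens_set_offset(struct timens_state *ns, int64_t offset_ns)
{
    if (ns->frozen || ns->armed_timers > 0)
        return -EBUSY;
    ns->monotonic_offset_ns = offset_ns;
    return 0;
}
```

With such a guard the offset stays constant for the lifetime of every armed timer, which is the condition under which a plain offset cannot cause early expiry.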
While solving the mentioned issues, it doesn't bring overhead. (well, Andy noted that cmp for zero-offsets on vdso can be optimized too, which will be done in v1).
Thomas, thanks much for your input - now we know that we'll need to introduce a list of timers in the namespace when we add realtime clocks. Do you believe that CLOCK_MONOTONIC_SYNC would be an easier concept than per-namespace offsets?
Haven't thought it through. This was just an idea in reaction to Eric's question whether setting clock monotonic might be feasible. But yes, it might be worth thinking about.
I think you should really define the long term requirements for time namespaces and perhaps set some limitations in functionality upfront.
Thanks,
tglx
Thomas Gleixner tglx@linutronix.de writes:
On Wed, 26 Sep 2018, Eric W. Biederman wrote:
Reading the code the calling sequence there is: tick_sched_do_timer -> tick_do_update_jiffies64 -> update_wall_time -> timekeeping_advance -> timekeeping_update
If I read that properly under the right nohz circumstances that update can be delayed indefinitely.
So I think we could prototype a time namespace that was per timekeeping_update and just had update_wall_time iterate through all of the time namespaces.
Please don't go there. timekeeping_update() is already heavy and walking through a gazillion of namespaces will just make it horrible.
I don't think the naive version would scale to very many time namespaces.
:)
At the same time using the techniques from the nohz work and a little smarts I expect we could get the code to scale.
You'd need to invoke the update when the namespace is switched in and hasn't been updated since the last tick happened. That might be doable, but you also need to take the wraparound constraints of the underlying clocksources into account, which again can cause walking all name spaces when they are all idle long enough.
The wrap around constraints being how long before the time sources wrap around so you have to read them once per wrap around? I have not dug deeply enough into the code to see that yet.
From there it becomes hairy, because it's not only timekeeping, i.e. reading time, this is also affecting all timers which are armed from a namespace.
That gets really ugly because when you do settimeofday() or adjtimex() for a particular namespace, then you have to search for all armed timers of that namespace and adjust them.
The original posix timer code had the same issue because it mapped the clock realtime timers to the timer wheel so any setting of the clock caused a full walk of all armed timers, disarming, adjusting and requeueing them. That's horrible not only performance-wise, it's also a locking nightmare of all sorts.
Add time skew via NTP/PTP into the picture and you might have to adjust timers as well, because you need to guarantee that they are not expiring early.
I haven't looked through Dmitry's patches yet, but I don't see how this can work at all without introducing subtle issues all over the place.
Then it sounds like this will take some more digging.
Please pardon me for thinking out loud.
There are one or more time sources that we use to compute the time and for each time source we have a conversion from ticks of the time source to nanoseconds.
Each time source needs to be sampled at least once per wrap-around and something incremented so that we don't lose time when looking at that time source.
There are several clocks presented to userspace and they all share the same length of second and are all fundamentally offsets from CLOCK_MONOTONIC.
I see two fundamental driving cases for a time namespace.
1) Migration from one node to another node in a cluster in almost real time.
The problem is that CLOCK_MONOTONIC values on different nodes in the cluster have no relationship to each other (except a synchronized length of the second). So applications that migrate can see CLOCK_MONOTONIC and CLOCK_BOOTTIME go backwards.
This is the truly pressing problem and adding some kind of offset sounds like it would be the solution. Possibly by allowing a boot time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
2) Dealing with two separate time management domains. Say a machine that needs to deal with both something inside of Google, where they slew time to avoid leap seconds, and something in the outside world, where proper UTC time is kept as an offset from TAI with the occasional leap seconds.
In the latter case it would fundamentally require having seconds of different length.
A pure 64bit nanosecond counter is good for 500 years. So 64bit variables can be used to hold time, and everything can be converted from there.
This suggests we can have two values for ticks:
- The number of ticks from the time source.
- The number of times the ticks would have rolled over.
That sounds like it may be a little simplistic as it would require being very diligent about firing a timer exactly at rollover and not losing that, but for a handwaving argument is probably enough to generate a 64bit tick counter.
If the focus is on a 64bit tick counter then what update_wall_time has to do is very limited. Just deal with the accounting needed to cope with tick rollover.
Getting the actual time looks like it would be as simple as now, with perhaps an extra addition to account for the number of times the tick counter has rolled over. With limited precision arithmetic and various optimizations I don't think it is that simple to implement but it feels like it should be very little extra work.
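The two-value idea above can be sketched as follows: a narrow hardware counter is widened to 64 bits by counting how often it has wrapped. For scale, an unsigned 64-bit nanosecond counter wraps after 2^64 ns, roughly 584 years. The struct `wide_counter` and the function `wide_counter_update()` are hypothetical names for this illustration, and the sketch only works if the counter is sampled at least once per wraparound period, which is exactly the diligence Eric mentions.

```c
#include <stdint.h>

/* Hypothetical state for one time source narrower than 64 bits. */
struct wide_counter {
    uint64_t mask;      /* e.g. 0xffffffff for a 32-bit counter */
    uint64_t last_raw;  /* previous raw reading, already masked */
    uint64_t rollovers; /* number of wraparounds observed so far */
};

/*
 * Fold a new raw reading into the widened counter and return the
 * 64-bit tick count. A reading smaller than the previous one is
 * taken to mean that exactly one wraparound happened in between.
 */
uint64_t wide_counter_update(struct wide_counter *wc, uint64_t raw)
{
    raw &= wc->mask;
    if (raw < wc->last_raw)
        wc->rollovers++;
    wc->last_raw = raw;
    return wc->rollovers * (wc->mask + 1) + raw;
}
```

If two or more wraparounds pass between calls, the count silently loses time, so the update still has to be driven by something that cannot be delayed indefinitely.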
For timers my inclination would be to assume no adjustments to the current time parameters and set the timer to go off then. If the time on the appropriate clock has been changed since the timer was set and the timer is going off early reschedule so the timer fires at the appropriate time.
With the above I think it is theoretically possible to build a time namespace that supports multiple lengths of second, and does not have much overhead.
Not that I think a final implementation would necessarily look like what I have described. I just think it is possible with extreme care to evolve the current code base into something that can efficiently handle multiple time domains with slightly different lengths of second.
Thomas, does it sound like I am completely out of touch with reality?
It does though sound like it is going to take some serious digging through the code to understand what everything does and how and why everything works the way it does. Not something grafted on top with just a cursory understanding of how the code works.
Eric
Eric,
On Fri, 28 Sep 2018, Eric W. Biederman wrote:
Thomas Gleixner tglx@linutronix.de writes:
On Wed, 26 Sep 2018, Eric W. Biederman wrote:
At the same time using the techniques from the nohz work and a little smarts I expect we could get the code to scale.
You'd need to invoke the update when the namespace is switched in and hasn't been updated since the last tick happened. That might be doable, but you also need to take the wraparound constraints of the underlying clocksources into account, which again can cause walking all name spaces when they are all idle long enough.
The wrap around constraints being how long before the time sources wrap around so you have to read them once per wrap around? I have not dug deeply enough into the code to see that yet.
It's done by limiting the NOHZ idle time when all CPUs are going into deep sleep for a long time, i.e. we make sure that at least one CPU comes back sufficiently _before_ the wraparound happens and invokes the update function.
It's not so much a problem for TSC, but not every clocksource the kernel supports has wraparound times in the range of hundreds of years.
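To give a feel for the wraparound constraint, here is a simplified sketch of how a maximum safe idle period could be derived from a clocksource's counter width and its cycles-to-nanoseconds conversion (`ns = (cycles * mult) >> shift`). The function name `max_idle_ns()` and the fixed 50% safety margin are assumptions for this illustration; the kernel's real calculation is more careful than this.

```c
#include <stdint.h>

/*
 * Longest interval, in nanoseconds, that may pass between two reads
 * of a counter with the given mask before the reading becomes
 * ambiguous. mult must be nonzero.
 */
uint64_t max_idle_ns(uint64_t mask, uint32_t mult, uint32_t shift)
{
    /* Largest cycle count whose product with mult still fits in 64 bits. */
    uint64_t max_cycles = UINT64_MAX / mult;

    /* The counter itself wraps after mask cycles. */
    if (max_cycles > mask)
        max_cycles = mask;

    /* Convert to nanoseconds and wake up well before the limit. */
    return ((max_cycles * mult) >> shift) / 2;
}
```

For a 32-bit counter running at 1 MHz (1000 ns per cycle, expressed as mult = 1024000 and shift = 10) this yields about 2147 seconds, so at least one CPU would have to come back within roughly 36 minutes.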
But yes, your idea of keeping track of wraparounds might work. Tricky, but looks feasible on first sight, but we should be aware of the dragons.
Please pardon me for thinking out loud.
There are one or more time sources that we use to compute the time and for each time source we have a conversion from ticks of the time source to nanoseconds.
Each time source needs to be sampled at least once per wrap-around and something incremented so that we don't lose time when looking at that time source.
There are several clocks presented to userspace and they all share the same length of second and are all fundamentally offsets from CLOCK_MONOTONIC.
Yes. That's the readout side. This one is doable. But now look at timers.
If you arm the timer from a name space, then it needs to be converted to host time in order to sort it into the hrtimer queue and at some point arm the clockevent device for it. This works as long as host and name space time have a constant offset and the same skew.
Once the name space time has a different skew this falls apart because the armed timer will either expire late or early.
Late might be acceptable, early violates the spec. You could do an extra check for rescheduling it, if it's early, but that requires to store the name space time accessor in the hrtimer itself because not every timer expiry happens so that it can be checked in the name space context (think signal based timers). We need to add this extra magic right into __hrtimer_run_queues() which is called from the hard and soft interrupt. We really don't want to touch all relevant callbacks or syscalls. The latter is not sufficient anyway for signal based timer delivery.
That's going to be interesting in terms of synchronization and might also cause substantial overhead at least for the timers which belong to name spaces.
But that also means that anything which is early can and probably will cause rearming of the timer hardware possibly for a very short delta. We need to think about whether this can be abused to create interrupt storms.
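The conversion and the early-expiry check described above can be sketched in a few lines of C. This assumes the simplest model, in which namespace time is host time plus a per-namespace offset; the struct `timens_clock` and both function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: namespace time = host time + offset_ns. */
struct timens_clock {
    int64_t offset_ns;
};

/* Convert a namespace expiry time to the host time base for queueing. */
int64_t timens_to_host(const struct timens_clock *c, int64_t ns_expiry)
{
    return ns_expiry - c->offset_ns;
}

/*
 * Check at expiry whether the timer would fire early in the
 * namespace's view of time, which happens if the offset shrank after
 * the timer was queued. An early timer must be re-armed, not delivered.
 */
bool timer_fired_early(const struct timens_clock *c, int64_t host_now,
                       int64_t ns_expiry)
{
    return host_now + c->offset_ns < ns_expiry;
}
```

As long as the offset never changes, `timer_fired_early()` is always false at the converted host time. The check only matters once offsets or skew can change after arming, and in that case it needs access to the namespace's clock from the expiry path, which is the cost Thomas points out.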
Now if you accept a bit late, which I'm not really happy about, then you surely won't accept very late, i.e. hours, days. But that can happen when settimeofday() comes into play. Right now with a single time domain, this is easy. When settimeofday() or adjtimex() makes time jump, we just go and reprogram the hardware timers accordingly, which might also result in immediate expiry of timers.
But this does not help for time jumps in name spaces because the timer is enqueued on the host time base.
And no, we should not think about creating per name space hrtimer queues and then have to walk through all of them for finding the first expiring timer in order to arm the hardware. That cannot scale.
Walking all hrtimer bases on all CPUs and checking all queued timers whether they belong to the affected name space does not scale either.
So we'd need to keep track of queued timers belonging to a name space and then just handle them. Interesting locking problem and also a scalability issue because this might need to be done on all online CPUs. Haven't thought it through, but it makes me shudder.
I see two fundamental driving cases for a time namespace.
<SNIP>
I completely understand the problem you are trying to solve and yes, the read out of time should be a solvable problem.
For timers my inclination would be to assume no adjustments to the current time parameters and set the timer to go off then. If the time on the appropriate clock has been changed since the timer was set and the timer is going off early reschedule so the timer fires at the appropriate time.
See above.
Not that I think a final implementation would necessarily look like what I have described. I just think it is possible with extreme care to evolve the current code base into something that can efficiently handle multiple time domains with slightly different lengths of second.
Yes, it really needs some serious thoughts and timekeeping is a really complex place especially with NTP/PTP in play. We had quite some quality time to make it work correctly and reliably, now you come along and want to transform it into a multidimensional puzzle. :)
Thomas, does it sound like I am completely out of touch with reality?
Which reality are you talking about? :)
It does though sound like it is going to take some serious digging through the code to understand what everything does and how and why everything works the way it does. Not something grafted on top with just a cursory understanding of how the code works.
I fully agree and I'm happy to help with explanations and ideas and being the one who shoots holes into yours.
Thanks,
tglx
Thomas Gleixner tglx@linutronix.de writes:
Eric,
On Fri, 28 Sep 2018, Eric W. Biederman wrote:
Thomas Gleixner tglx@linutronix.de writes:
On Wed, 26 Sep 2018, Eric W. Biederman wrote:
At the same time using the techniques from the nohz work and a little smarts I expect we could get the code to scale.
You'd need to invoke the update when the namespace is switched in and hasn't been updated since the last tick happened. That might be doable, but you also need to take the wraparound constraints of the underlying clocksources into account, which again can cause walking all name spaces when they are all idle long enough.
The wrap around constraints being how long before the time sources wrap around so you have to read them once per wrap around? I have not dug deeply enough into the code to see that yet.
It's done by limiting the NOHZ idle time when all CPUs are going into deep sleep for a long time, i.e. we make sure that at least one CPU comes back sufficiently _before_ the wraparound happens and invokes the update function.
It's not so much a problem for TSC, but not every clocksource the kernel supports has wraparound times in the range of hundreds of years.
But yes, your idea of keeping track of wraparounds might work. Tricky, but looks feasible on first sight, but we should be aware of the dragons.
Oh. Yes. Definitely. A key enabler of any namespace implementation is figuring out how to tame the dragons.
Please pardon me for thinking out loud.
There are one or more time sources that we use to compute the time and for each time source we have a conversion from ticks of the time source to nanoseconds.
Each time source needs to be sampled at least once per wrap-around and something incremented so that we don't lose time when looking at that time source.
There are several clocks presented to userspace and they all share the same length of second and are all fundamentally offsets from CLOCK_MONOTONIC.
Yes. That's the readout side. This one is doable. But now look at timers.
If you arm the timer from a name space, then it needs to be converted to host time in order to sort it into the hrtimer queue and at some point arm the clockevent device for it. This works as long as host and name space time have a constant offset and the same skew.
Once the name space time has a different skew this falls apart because the armed timer will either expire late or early.
Late might be acceptable, early violates the spec. You could do an extra check for rescheduling it, if it's early, but that requires to store the name space time accessor in the hrtimer itself because not every timer expiry happens so that it can be checked in the name space context (think signal based timers). We need to add this extra magic right into __hrtimer_run_queues() which is called from the hard and soft interrupt. We really don't want to touch all relevant callbacks or syscalls. The latter is not sufficient anyway for signal based timer delivery.
That's going to be interesting in terms of synchronization and might also cause substantial overhead at least for the timers which belong to name spaces.
But that also means that anything which is early can and probably will cause rearming of the timer hardware possibly for a very short delta. We need to think about whether this can be abused to create interrupt storms.
Now if you accept a bit late, which I'm not really happy about, then you surely won't accept very late, i.e. hours, days. But that can happen when settimeofday() comes into play. Right now with a single time domain, this is easy. When settimeofday() or adjtimex() makes time jump, we just go and reprogram the hardware timers accordingly, which might also result in immediate expiry of timers.
But this does not help for time jumps in name spaces because the timer is enqueued on the host time base.
And no, we should not think about creating per name space hrtimer queues and then have to walk through all of them for finding the first expiring timer in order to arm the hardware. That cannot scale.
Walking all hrtimer bases on all CPUs and checking all queued timers whether they belong to the affected name space does not scale either.
So we'd need to keep track of queued timers belonging to a name space and then just handle them. Interesting locking problem and also a scalability issue because this might need to be done on all online CPUs. Haven't thought it through, but it makes me shudder.
Yes. I can see how this is a dragon that we need to figure out how to tame. It already exists somewhat for CLOCK_MONOTONIC vs CLOCK_REALTIME but still.
I see two fundamental driving cases for a time namespace.
<SNIP>
I completely understand the problem you are trying to solve and yes, the read out of time should be a solvable problem.
There is a simplified subproblem that I want to ask about but I will reply separately for that.
Not that I think a final implementation would necessarily look like what I have described. I just think it is possible with extreme care to evolve the current code base into something that can efficiently handle multiple time domains with slightly different lengths of second.
Yes, it really needs some serious thoughts and timekeeping is a really complex place especially with NTP/PTP in play. We had quite some quality time to make it work correctly and reliably, now you come along and want to transform it into a multidimensional puzzle. :)
I thought it was Einstein who pointed out what a puzzle timekeeping is, with the rest of us just playing catch up. ;-)
It does though sound like it is going to take some serious digging through the code to understand what everything does and how and why everything works the way it does. Not something grafted on top with just a cursory understanding of how the code works.
I fully agree and I'm happy to help with explanations and ideas and being the one who shoots holes into yours.
Sounds good.
Eric
In the context of process migration there is a simpler subproblem that I think is worth exploring to see if we can do something about it.
For a cluster of machines all running with synchronized clocks, CLOCK_REALTIME matches. CLOCK_MONOTONIC does not match between machines. Not having a matching CLOCK_MONOTONIC prevents successful process migration between nodes in that cluster.
Would it be possible to allow setting CLOCK_MONOTONIC at the very beginning of time? So that all of the nodes in a cluster can be in sync?
No change in skew, just in offset for CLOCK_MONOTONIC.
There are also dragons involved in coordinating things so that CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't know if allowing CLOCK_MONOTONIC to be set would be practical but it seems worth exploring all on its own.
Dmitry, would setting CLOCK_MONOTONIC exactly once at boot time solve the problem that you are looking at a time namespace to solve?
Eric
On Mon, 1 Oct 2018, Eric W. Biederman wrote:
In the context of process migration there is a simpler subproblem that I think is worth exploring to see if we can do something about it.
For a cluster of machines all running with synchronized clocks, CLOCK_REALTIME matches. CLOCK_MONOTONIC does not match between machines. Not having a matching CLOCK_MONOTONIC prevents successful process migration between nodes in that cluster.
Would it be possible to allow setting CLOCK_MONOTONIC at the very beginning of time? So that all of the nodes in a cluster can be in sync?
No change in skew, just in offset for CLOCK_MONOTONIC.
There are also dragons involved in coordinating things so that CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't know if allowing CLOCK_MONOTONIC to be set would be practical but it seems worth exploring all on its own.
It's used very early on in the kernel, so that would be a major surprise for many things including user space which has expectations on clock monotonic.
It would be reasonably easy to add CLOCK_MONOTONIC_SYNC which can be set in the way you described and then in name spaces make it possible to magically map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
It still wouldn't allow to have different NTP/PTP time domains, but might be a good start to address the main migration headaches.
Thanks,
tglx
On Mon, Oct 1, 2018 at 8:53 PM Thomas Gleixner tglx@linutronix.de wrote:
On Mon, 1 Oct 2018, Eric W. Biederman wrote:
In the context of process migration there is a simpler subproblem that I think is worth exploring to see if we can do something about it.
For a cluster of machines all running with synchronized clocks, CLOCK_REALTIME matches. CLOCK_MONOTONIC does not match between machines. Not having a matching CLOCK_MONOTONIC prevents successful process migration between nodes in that cluster.
Would it be possible to allow setting CLOCK_MONOTONIC at the very beginning of time? So that all of the nodes in a cluster can be in sync?
No change in skew, just in offset for CLOCK_MONOTONIC.
There are also dragons involved in coordinating things so that CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't know if allowing CLOCK_MONOTONIC to be set would be practical but it seems worth exploring all on its own.
It's used very early on in the kernel, so that would be a major surprise for many things including user space which has expectations on clock monotonic.
It would be reasonably easy to add CLOCK_MONOTONIC_SYNC which can be set in the way you described and then in name spaces make it possible to magically map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
It still wouldn't allow to have different NTP/PTP time domains, but might be a good start to address the main migration headaches.
If we make CLOCK_MONOTONIC settable this way in a namespace, do you think that should include device drivers that report timestamps in CLOCK_MONOTONIC base, or only the timekeeping clock and timer interfaces?
Examples for drivers that can report timestamps are input, sound, v4l, and drm. I think most of these can report stamps in either monotonic or realtime base, while socket timestamps notably are always in realtime.
We can probably get away with not setting the timebase for those device drivers as long as the checkpoint/restart and migration features are not expected to restore the state of an open character device in that way. I don't know if that is a reasonable assumption to make for the examples I listed.
Arnd
On Tue, 2 Oct 2018, Arnd Bergmann wrote:
On Mon, Oct 1, 2018 at 8:53 PM Thomas Gleixner tglx@linutronix.de wrote:
On Mon, 1 Oct 2018, Eric W. Biederman wrote:
In the context of process migration there is a simpler subproblem that I think is worth exploring to see if we can do something about it.
For a cluster of machines all running with synchronized clocks, CLOCK_REALTIME matches. CLOCK_MONOTONIC does not match between machines. Not having a matching CLOCK_MONOTONIC prevents successful process migration between nodes in that cluster.
Would it be possible to allow setting CLOCK_MONOTONIC at the very beginning of time? So that all of the nodes in a cluster can be in sync?
No change in skew, just in offset for CLOCK_MONOTONIC.
There are also dragons involved in coordinating things so that CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't know if allowing CLOCK_MONOTONIC to be set would be practical but it seems worth exploring all on its own.
It's used very early on in the kernel, so that would be a major surprise for many things including user space which has expectations on clock monotonic.
It would be reasonably easy to add CLOCK_MONOTONIC_SYNC which can be set in the way you described and then in name spaces make it possible to magically map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
It still wouldn't allow to have different NTP/PTP time domains, but might be a good start to address the main migration headaches.
If we make CLOCK_MONOTONIC settable this way in a namespace, do you think that should include device drivers that report timestamps in CLOCK_MONOTONIC base, or only the timekeeping clock and timer interfaces?
Uurgh. That gets messy very fast.
Examples for drivers that can report timestamps are input, sound, v4l, and drm. I think most of these can report stamps in either monotonic or realtime base, while socket timestamps notably are always in realtime.
We can probably get away with not setting the timebase for those device drivers as long as the checkpoint/restart and migration features are not expected to restore the state of an open character device in that way. I don't know if that is a reasonable assumption to make for the examples I listed.
No idea. I'm not a container migration wizard.
Thanks,
tglx
Thomas Gleixner tglx@linutronix.de writes:
On Tue, 2 Oct 2018, Arnd Bergmann wrote:
On Mon, Oct 1, 2018 at 8:53 PM Thomas Gleixner tglx@linutronix.de wrote:
On Mon, 1 Oct 2018, Eric W. Biederman wrote:
In the context of process migration there is a simpler subproblem that I think it is worth exploring if we can do something about.
For a cluster of machines all running with synchronized clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between machines. Not having a matching CLOCK_MONOTONIC prevents successful process migration between nodes in that cluster.
Would it be possible to allow setting CLOCK_MONOTONIC at the very beginning of time? So that all of the nodes in a cluster can be in sync?
No change in skew just in offset for CLOCK_MONOTONIC.
There are also dragons involved in coordinating things so that CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't know if allowing CLOCK_MONOTONIC to be set would be practical but it seems work exploring all on it's own.
It's used very early on in the kernel, so that would be a major surprise for many things including user space which has expectations on clock monotonic.
It would be reasonably easy to add CLOCK_MONOTONIC_SYNC which can be set in the way you described and then in name spaces make it possible to magically map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
It still wouldn't allow to have different NTP/PTP time domains, but might be a good start to address the main migration headaches.
If we make CLOCK_MONOTONIC settable this way in a namespace, do you think that should include device drivers that report timestamps in CLOCK_MONOTONIC base, or only the timekeeping clock and timer interfaces?
Uurgh. That gets messy very fast.
Examples for drivers that can report timestamps are input, sound, v4l, and drm. I think most of these can report stamps in either monotonic or realtime base, while socket timestamps notably are always in realtime.
We can probably get away with not setting the timebase for those device drivers as long as the checkpoint/restart and migration features are not expected to restore the state of an open character device in that way. I don't know if that is a reasonable assumption to make for the examples I listed.
No idea. I'm not a container migration wizard.
Direct access to hardware/drivers and not through an abstraction like the vfs (an abstraction over block devices) can legitimately be handled by hotplug events. I unplug one keyboard I plug in another.
I don't know if the input layer is more of a general abstraction or more of a hardware device. I have not dug into it but my guess is abstraction from what I have heard.
The scary difficulty here is if after restart input is reporting times in CLOCK_MONOTONIC and the applications in the namespace are talking about times in CLOCK_MONOTONIC_SYNC. Then there is an issue. As even with a fixed offset the times don't match up.
So what a time namespace absolutely needs to do is figure out how to deal with all of the kernel interfaces reporting times and figure out how to report them in the current time namespace.
Eric
On Wed, 3 Oct 2018, Eric W. Biederman wrote:
Direct access to hardware/drivers and not through an abstraction like the vfs (an abstraction over block devices) can legitimately be handled by hotplug events. I unplug one keyboard I plug in another.
I don't know if the input layer is more of a general abstraction or more of a hardware device. I have not dug into it but my guess is abstraction from what I have heard.
The scary difficulty here is if after restart input is reporting times in CLOCK_MONOTONIC and the applications in the namespace are talking about times in CLOCK_MONOTONIC_SYNC. Then there is an issue. As even with a fixed offset the times don't match up.
So what a time namespace absolutely needs to do is figure out how to deal with all of the kernel interfaces reporting times and figure out how to report them in the current time namespace.
So you want to talk to Arnd who is leading the y2038 effort. He knows how many and which interfaces are involved aside of the obvious core timer ones. It's quite an amount and the problem is that you really need to do that at the interface level, because many of those time stamps are taken in contexts which are completely oblivious of name spaces. Ditto for timeouts and similar things which are handed in through these interfaces.
Thanks,
tglx
Thomas Gleixner tglx@linutronix.de writes:
On Wed, 3 Oct 2018, Eric W. Biederman wrote:
Direct access to hardware/drivers and not through an abstraction like the vfs (an abstraction over block devices) can legitimately be handled by hotplug events. I unplug one keyboard I plug in another.
I don't know if the input layer is more of a general abstraction or more of a hardware device. I have not dug into it but my guess is abstraction from what I have heard.
The scary difficulty here is if after restart input is reporting times in CLOCK_MONOTONIC and the applications in the namespace are talking about times in CLOCK_MONOTONIC_SYNC. Then there is an issue. As even with a fixed offset the times don't match up.
So what a time namespace absolutely needs to do is figure out how to deal with all of the kernel interfaces reporting times and figure out how to report them in the current time namespace.
So you want to talk to Arnd who is leading the y2038 effort. He knows how many and which interfaces are involved aside of the obvious core timer ones. It's quite an amount and the problem is that you really need to do that at the interface level, because many of those time stamps are taken in contexts which are completely oblivious of name spaces. Ditto for timeouts and similar things which are handed in through these interfaces.
Yep. That sounds right.
Eric
On Wed, Oct 3, 2018 at 8:14 AM Eric W. Biederman ebiederm@xmission.com wrote:
Thomas Gleixner tglx@linutronix.de writes:
On Wed, 3 Oct 2018, Eric W. Biederman wrote:
Direct access to hardware/drivers and not through an abstraction like the vfs (an abstraction over block devices) can legitimately be handled by hotplug events. I unplug one keyboard I plug in another.
I don't know if the input layer is more of a general abstraction or more of a hardware device. I have not dug into it but my guess is abstraction from what I have heard.
The scary difficulty here is if after restart input is reporting times in CLOCK_MONOTONIC and the applications in the namespace are talking about times in CLOCK_MONOTONIC_SYNC. Then there is an issue. As even with a fixed offset the times don't match up.
So what a time namespace absolutely needs to do is figure out how to deal with all of the kernel interfaces reporting times and figure out how to report them in the current time namespace.
So you want to talk to Arnd who is leading the y2038 effort. He knows how many and which interfaces are involved aside of the obvious core timer ones. It's quite an amount and the problem is that you really need to do that at the interface level, because many of those time stamps are taken in contexts which are completely oblivious of name spaces. Ditto for timeouts and similar things which are handed in through these interfaces.
Yep. That sounds right.
Let's stay with the input event example for the moment: Here, we have a character device, and a user calls read() to retrieve one or more records of type 'struct input_event' using the evdev_read() function. The original timestamp gets put there using this logic:
    ktime_t time;
    struct timespec64 ts;

    time = client->clk_type == EV_CLK_REAL ? ktime_get_real() :
           client->clk_type == EV_CLK_MONO ? ktime_get() :
           ktime_get_boottime();
    ts = ktime_to_timespec64(time);
    ev.input_event_sec = ts.tv_sec;
    ev.input_event_usec = ts.tv_nsec / NSEC_PER_USEC;
clk_type can get set using an ioctl() to real, monotonic or boottime. We have to stop using EV_CLK_REAL in the future because that breaks in y2038, but I guess EV_CLK_MONO and EV_CLK_BOOT should stay.
If we want this to work correctly in a namespace that has a user defined CLOCK_MONOTONIC timebase, one way to do it might be to always call ktime_get() when we record the timestamp in the kernel-internal CLOCK_MONOTONIC base, but then convert it to the correct base when copying to user space.
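The idea in the paragraph above can be sketched as a small user-space model. This is only an illustration under stated assumptions: `struct timens_model`, `struct event_stamp` and `stamp_to_user()` are hypothetical names, not kernel interfaces, and the namespace is modelled as a single signed nanosecond offset for CLOCK_MONOTONIC. The timestamp is stored in the host base and only shifted at the point where it would be copied to user space.

```c
#include <stdint.h>

#define NSEC_PER_SEC  1000000000LL
#define NSEC_PER_USEC 1000LL

/* Hypothetical stand-in for a time namespace: one CLOCK_MONOTONIC offset. */
struct timens_model {
    int64_t monotonic_offset_ns;
};

/* Model of the seconds/microseconds pair reported with an input event. */
struct event_stamp {
    int64_t sec;
    int64_t usec;
};

/*
 * Convert a timestamp recorded in the host CLOCK_MONOTONIC base into the
 * base seen by the reader's namespace. Assumes the shifted value is not
 * negative, so plain division and modulo are sufficient.
 */
static struct event_stamp stamp_to_user(int64_t host_mono_ns,
                                        const struct timens_model *ns)
{
    int64_t t = host_mono_ns + ns->monotonic_offset_ns;
    struct event_stamp s;

    s.sec = t / NSEC_PER_SEC;
    s.usec = (t % NSEC_PER_SEC) / NSEC_PER_USEC;
    return s;
}
```

Because the same offset is applied to every event, the elapsed time between two events is unchanged, which is the property most evdev users are said to rely on.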
Note that AFAIU practically all users of evdev do /not/ actually care about the time base, they only care about the elapsed time between intervals, e.g. to track how fast a pointer should move based on input from a trackpad. I don't see any reason why one would compare this timestamp to a clock_gettime() value, but of course at the moment this has well-defined behavior that would break if we change clock_gettime(), and we have a process in the namespace that opens /dev/input/eventX and relies on meaningful timestamps relative to a particular base.
Arnd
On Wed, 3 Oct 2018, Thomas Gleixner wrote:
On Wed, 3 Oct 2018, Eric W. Biederman wrote:
Direct access to hardware/drivers and not through an abstraction like the vfs (an abstraction over block devices) can legitimately be handled by hotplug events. I unplug one keyboard I plug in another.
I don't know if the input layer is more of a general abstraction or more of a hardware device. I have not dug into it but my guess is abstraction from what I have heard.
The scary difficulty here is if after restart input is reporting times in CLOCK_MONOTONIC and the applications in the namespace are talking about times in CLOCK_MONOTONIC_SYNC. Then there is an issue. As even with a fixed offset the times don't match up.
So what a time namespace absolutely needs to do is figure out how to deal with all of the kernel interfaces reporting times and figure out how to report them in the current time namespace.
So you want to talk to Arnd who is leading the y2038 effort. He knows how many and which interfaces are involved aside of the obvious core timer ones. It's quite an amount and the problem is that you really need to do that at the interface level, because many of those time stamps are taken in contexts which are completely oblivious of name spaces. Ditto for timeouts and similar things which are handed in through these interfaces.
Plus you have to make sure, that any new interface will have that treatment. For y2038 that's easy as we just require to use timespec64 for new ones. For your problem that's not so trivial.
Thanks,
tglx
On Mon, Oct 01, 2018 at 11:15:32AM +0200, Eric W. Biederman wrote:
In the context of process migration there is a simpler subproblem that I think it is worth exploring if we can do something about.
For a cluster of machines all running with synchronized clocks, CLOCK_REALTIME matches. CLOCK_MONOTONIC does not match between machines. Not having a matching CLOCK_MONOTONIC prevents successful process migration between nodes in that cluster.
Would it be possible to allow setting CLOCK_MONOTONIC at the very beginning of time? So that all of the nodes in a cluster can be in sync?
Here is a question about how to synchronize clocks between nodes. It looks like we will need to have a working network for this, but a network configuration may be non-trivial and it can require running a few processes which can use CLOCK_MONOTONIC...
No change in skew just in offset for CLOCK_MONOTONIC.
There are also dragons involved in coordinating things so that CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't know if allowing CLOCK_MONOTONIC to be set would be practical but it seems worth exploring all on its own.
Dmitry, would setting CLOCK_MONOTONIC exactly once at boot time solve the problem that you are looking at a time namespace to solve?
Process migration is only one of the use cases. Another use case is restoring from snapshots. It may be even more popular than process migration. We can't guarantee that all snapshots will be done in one cluster. For example, a user hits a bug, takes a container snapshot and attaches it to a bug report.
Eric
On Mon, 1 Oct 2018, Andrey Vagin wrote:
On Mon, Oct 01, 2018 at 11:15:32AM +0200, Eric W. Biederman wrote:
In the context of process migration there is a simpler subproblem that I think it is worth exploring if we can do something about.
For a cluster of machines all running with synchronized clocks, CLOCK_REALTIME matches. CLOCK_MONOTONIC does not match between machines. Not having a matching CLOCK_MONOTONIC prevents successful process migration between nodes in that cluster.
Would it be possible to allow setting CLOCK_MONOTONIC at the very beginning of time? So that all of the nodes in a cluster can be in sync?
Here is a question about how to synchronize clocks between nodes. It looks like we will need to have a working network for this, but a network configuration may be non-trivial and it can require running a few processes which can use CLOCK_MONOTONIC...
No change in skew just in offset for CLOCK_MONOTONIC.
There are also dragons involved in coordinating things so that CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't know if allowing CLOCK_MONOTONIC to be set would be practical but it seems worth exploring all on its own.
Dmitry, would setting CLOCK_MONOTONIC exactly once at boot time solve the problem that you are looking at a time namespace to solve?
Process migration is only one of the use cases. Another use case is restoring from snapshots. It may be even more popular than process migration. We can't guarantee that all snapshots will be done in one cluster. For example, a user hits a bug, takes a container snapshot and attaches it to a bug report.
Sure, but see my reply to Eric. That could be solved with that extra clock id, which then gets mapped to monotonic for name spaces.
Thanks,
tglx
On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
Thomas Gleixner tglx@linutronix.de writes:
On Wed, 26 Sep 2018, Eric W. Biederman wrote:
Reading the code the calling sequence there is:
    tick_sched_do_timer
    tick_do_update_jiffies64
    update_wall_time
    timekeeping_advance
    timekeeping_update
If I read that properly under the right nohz circumstances that update can be delayed indefinitely.
So I think we could prototype a time namespace that was per timekeeping_update and just had update_wall_time iterate through all of the time namespaces.
Please don't go there. timekeeping_update() is already heavy and walking through a gazillion of namespaces will just make it horrible,
I don't think the naive version would scale to very many time namespaces.
:)
At the same time using the techniques from the nohz work and a little smarts I expect we could get the code to scale.
You'd need to invoke the update when the namespace is switched in and hasn't been updated since the last tick happened. That might be doable, but you also need to take the wraparound constraints of the underlying clocksources into account, which again can cause walking all name spaces when they are all idle long enough.
The wrap around constraints being how long before the time sources wrap around so you have to read them once per wrap around? I have not dug deeply enough into the code to see that yet.
From there it becomes hairy, because it's not only timekeeping, i.e. reading time, this is also affecting all timers which are armed from a namespace.
That gets really ugly because when you do settimeofday() or adjtimex() for a particular namespace, then you have to search for all armed timers of that namespace and adjust them.
The original posix timer code had the same issue because it mapped the clock realtime timers to the timer wheel so any setting of the clock caused a full walk of all armed timers, disarming, adjusting and requeueing them. That's horrible not only performance wise, it's also a locking nightmare of all sorts.
Add time skew via NTP/PTP into the picture and you might have to adjust timers as well, because you need to guarantee that they are not expiring early.
I haven't looked through Dmitry's patches yet, but I don't see how this can work at all without introducing subtle issues all over the place.
Then it sounds like this will take some more digging.
Please pardon me for thinking out loud.
There are one or more time sources that we use to compute the time and for each time source we have a conversion from ticks of the time source to nanoseconds.
Each time source needs to be sampled at least once per wrap-around and something incremented so that we don't lose time when looking at that time source.
There are several clocks presented to userspace and they all share the same length of second and are all fundamentally offsets from CLOCK_MONOTONIC.
I see two fundamental driving cases for a time namespace.
1. Migration from one node to another node in a cluster in almost real time.
The problem is that CLOCK_MONOTONIC between nodes in the cluster has no relationship to each other (except a synchronized length of the second). So applications that migrate can see CLOCK_MONOTONIC and CLOCK_BOOTTIME go backwards.
This is the truly pressing problem and adding some kind of offset sounds like it would be the solution. Possibly by allowing a boot time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
2. Dealing with two separate time management domains. Say a machine that needs to deal with both something inside of google where they slew time to avoid leap seconds and something in the outside world where proper UTC time is kept as an offset from TAI with the occasional leap seconds.
In the latter case it would fundamentally require having seconds of different length.
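To give a feel for how different the "length of second" is in the second case, here is a small arithmetic sketch. It assumes, purely for illustration, that one leap second is spread linearly over a 24-hour window; the window length and the helper names `smear_ppm()` and `smear_drift_s()` are assumptions and not something stated in this thread.

```c
/*
 * Relative stretch of each second, in parts per million, when one leap
 * second is spread linearly over window_s seconds.
 */
static double smear_ppm(double window_s)
{
    return 1.0e6 / window_s;
}

/*
 * Disagreement, in seconds, that accumulates between a smeared and an
 * unsmeared clock after elapsed_s seconds inside the smear window.
 */
static double smear_drift_s(double elapsed_s, double window_s)
{
    return elapsed_s / window_s;
}
```

With an 86400-second window the stretch is roughly 11.6 ppm, and the two domains already disagree by about 42 ms after one hour, so a fixed offset cannot reconcile them.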
I want to add that the second case should be optional.
When a container is migrated to another host, we have to restore its monotonic and boottime clocks, but we still expect that the container will continue using the host real-time clock.
Before starting this series, I thought about this and decided that these cases can be solved independently. Probably, the full isolation of the time sub-system will have much higher overhead than just offsets for a few clocks. And the idea that isolation of the real-time clock should be optional gives us another hint that offsets for monotonic and boot-time clocks can be implemented independently.
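The offsets-only approach can be illustrated with a minimal user-space sketch. The helper names `timens_apply_offset()` and `timens_offset_for_restore()` are hypothetical and are not the functions used in the patch set. The sketch only shows the arithmetic: the offset is chosen at restore time so that the clock continues from the dumped value, and every later read adds that offset to the host clock.

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Add a per-namespace offset to a host clock reading and normalize the
 * result so that 0 <= tv_nsec < NSEC_PER_SEC. The offset may carry a
 * negative tv_nsec, as produced by timens_offset_for_restore() below.
 */
static struct timespec timens_apply_offset(struct timespec host,
                                           struct timespec offset)
{
    struct timespec out;

    out.tv_sec = host.tv_sec + offset.tv_sec;
    out.tv_nsec = host.tv_nsec + offset.tv_nsec;
    if (out.tv_nsec >= NSEC_PER_SEC) {
        out.tv_sec++;
        out.tv_nsec -= NSEC_PER_SEC;
    } else if (out.tv_nsec < 0) {
        out.tv_sec--;
        out.tv_nsec += NSEC_PER_SEC;
    }
    return out;
}

/*
 * Offset to install on restore: the clock value saved at dump time minus
 * the current value of the same clock on the destination host.
 */
static struct timespec timens_offset_for_restore(struct timespec dumped,
                                                 struct timespec host_now)
{
    struct timespec neg = {
        .tv_sec = -host_now.tv_sec,
        .tv_nsec = -host_now.tv_nsec,
    };

    return timens_apply_offset(dumped, neg);
}
```

On a real system `host_now` would come from `clock_gettime(CLOCK_MONOTONIC, ...)` or `clock_gettime(CLOCK_BOOTTIME, ...)` on the destination node.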
Eric and Tomas, what do you think about this? If you agree that these two cases can be implemented separately, what should we do with this series to make it ready to be merged?
I know that we need to:
- look at device drivers that report timestamps in CLOCK_MONOTONIC base.
- forbid changing offsets after creating timers
Anything else?
Thanks, Andrei
A pure 64bit nanosecond counter is good for 500 years. So 64bit variables can be used to hold time, and everything can be converted from there.
This suggests we can for ticks have two values.
- The number of ticks from the time source.
- The number of times the ticks would have rolled over.
That sounds like it may be a little simplistic as it would require being very diligent about firing a timer exactly at rollover and not losing that, but for a handwaving argument is probably enough to generate a 64bit tick counter.
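The two-value scheme above can be sketched as follows. This is a hand-waving model, assuming a 32-bit hardware counter that is sampled at least once per wrap-around; `struct tick_ext` and `tick_ext_read()` are hypothetical names. For scale, 2^64 nanoseconds is roughly 584 years.

```c
#include <stdint.h>

/* Extension state for a narrow, wrapping tick counter. */
struct tick_ext {
    uint32_t last;      /* last raw reading from the time source */
    uint64_t rollovers; /* number of times the raw counter has wrapped */
};

/*
 * Fold a new raw reading into a 64-bit tick count. A reading smaller than
 * the previous one is taken to mean exactly one wrap-around, which is only
 * correct if the source is sampled at least once per wrap.
 */
static uint64_t tick_ext_read(struct tick_ext *t, uint32_t raw)
{
    if (raw < t->last)
        t->rollovers++;
    t->last = raw;
    return (t->rollovers << 32) | raw;
}
```

If a sample is missed for longer than one wrap period, the count silently loses 2^32 ticks, which is the diligence problem mentioned above.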
If the focus is on a 64bit tick counter then what update_wall_time has to do is very limited. Just deal with the accounting needed to cope with tick rollover.
Getting the actual time looks like it would be as simple as now, with perhaps an extra addition to account for the number of times the tick counter has rolled over. With limited precision arithmetic and various optimizations I don't think it is that simple to implement but it feels like it should be very little extra work.
For timers my inclination would be to assume no adjustments to the current time parameters and set the timer to go off then. If the time on the appropriate clock has been changed since the timer was set and the timer is going off early reschedule so the timer fires at the appropriate time.
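The "fire, then re-check" approach for timers might look like the sketch below. `timer_recheck()` is a hypothetical helper; it assumes the expiry and the current reading are both expressed in nanoseconds on the clock the timer was armed against.

```c
#include <stdint.h>

/*
 * Called when a timer goes off. Returns 0 if the timer has really expired
 * on its clock, otherwise the number of nanoseconds to re-arm for because
 * the clock was changed after the timer was set and it fired early.
 */
static uint64_t timer_recheck(uint64_t expiry_ns, uint64_t clock_now_ns)
{
    if (clock_now_ns >= expiry_ns)
        return 0;
    return expiry_ns - clock_now_ns;
}
```

This avoids walking all armed timers when a clock is set, at the cost of an occasional spurious wakeup. It does not help when the clock jumps forward, in which case the timer would expire late unless something else re-arms it.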
With the above I think it is theoretically possible to build a time namespace that supports multiple lengths of second, and does not have much overhead.
Not that I think a final implementation would necessarily look like what I have described. I just think it is possible with extreme care to evolve the current code base into something that can efficiently handle multiple time domains with slightly different lengths of second.
Thomas does it sound like I am completely out of touch with reality?
It does though sound like it is going to take some serious digging through the code to understand what everything does and how and why everything works the way it does. Not something grafted on top with just a cursory understanding of how the code works.
Eric _______________________________________________ Containers mailing list Containers@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/containers
On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
Thomas Gleixner tglx@linutronix.de writes:
On Wed, 26 Sep 2018, Eric W. Biederman wrote:
Reading the code the calling sequence there is:
    tick_sched_do_timer
    tick_do_update_jiffies64
    update_wall_time
    timekeeping_advance
    timekeeping_update
If I read that properly under the right nohz circumstances that update can be delayed indefinitely.
So I think we could prototype a time namespace that was per timekeeping_update and just had update_wall_time iterate through all of the time namespaces.
Please don't go there. timekeeping_update() is already heavy and walking through a gazillion of namespaces will just make it horrible,
I don't think the naive version would scale to very many time namespaces.
:)
At the same time using the techniques from the nohz work and a little smarts I expect we could get the code to scale.
You'd need to invoke the update when the namespace is switched in and hasn't been updated since the last tick happened. That might be doable, but you also need to take the wraparound constraints of the underlying clocksources into account, which again can cause walking all name spaces when they are all idle long enough.
The wrap around constraints being how long before the time sources wrap around so you have to read them once per wrap around? I have not dug deeply enough into the code to see that yet.
From there it becomes hairy, because it's not only timekeeping, i.e. reading time, this is also affecting all timers which are armed from a namespace.
That gets really ugly because when you do settimeofday() or adjtimex() for a particular namespace, then you have to search for all armed timers of that namespace and adjust them.
The original posix timer code had the same issue because it mapped the clock realtime timers to the timer wheel so any setting of the clock caused a full walk of all armed timers, disarming, adjusting and requeueing them. That's horrible not only performance wise, it's also a locking nightmare of all sorts.
Add time skew via NTP/PTP into the picture and you might have to adjust timers as well, because you need to guarantee that they are not expiring early.
I haven't looked through Dmitry's patches yet, but I don't see how this can work at all without introducing subtle issues all over the place.
Then it sounds like this will take some more digging.
Please pardon me for thinking out loud.
There are one or more time sources that we use to compute the time and for each time source we have a conversion from ticks of the time source to nanoseconds.
Each time source needs to be sampled at least once per wrap-around and something incremented so that we don't lose time when looking at that time source.
There are several clocks presented to userspace and they all share the same length of second and are all fundamentally offsets from CLOCK_MONOTONIC.
I see two fundamental driving cases for a time namespace.
1. Migration from one node to another node in a cluster in almost real time.
The problem is that CLOCK_MONOTONIC between nodes in the cluster has no relationship to each other (except a synchronized length of the second). So applications that migrate can see CLOCK_MONOTONIC and CLOCK_BOOTTIME go backwards.
This is the truly pressing problem and adding some kind of offset sounds like it would be the solution. Possibly by allowing a boot time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
2. Dealing with two separate time management domains. Say a machine that needs to deal with both something inside of google where they slew time to avoid leap seconds and something in the outside world where proper UTC time is kept as an offset from TAI with the occasional leap seconds.
In the latter case it would fundamentally require having seconds of different length.
I want to add that the second case should be optional.
When a container is migrated to another host, we have to restore its monotonic and boottime clocks, but we still expect that the container will continue using the host real-time clock.
Before starting this series, I thought about this and decided that these cases can be solved independently. Probably, the full isolation of the time sub-system will have much higher overhead than just offsets for a few clocks. And the idea that isolation of the real-time clock should be optional gives us another hint that offsets for monotonic and boot-time clocks can be implemented independently.
Eric and Tomas, what do you think about this? If you agree that these
Sorry Thomas, I mistyped your name.
two cases can be implemented separately, what should we do with this series to make it ready to be merged?
I know that we need to:
- look at device drivers that report timestamps in CLOCK_MONOTONIC base.
- forbid changing offsets after creating timers
Anything else?
Thanks, Andrei
A pure 64bit nanosecond counter is good for 500 years. So 64bit variables can be used to hold time, and everything can be converted from there.
This suggests we can for ticks have two values.
- The number of ticks from the time source.
- The number of times the ticks would have rolled over.
That sounds like it may be a little simplistic as it would require being very diligent about firing a timer exactly at rollover and not losing that, but for a handwaving argument is probably enough to generate a 64bit tick counter.
If the focus is on a 64bit tick counter then what update_wall_time has to do is very limited. Just deal with the accounting needed to cope with tick rollover.
Getting the actual time looks like it would be as simple as now, with perhaps an extra addition to account for the number of times the tick counter has rolled over. With limited precision arithmetic and various optimizations I don't think it is that simple to implement but it feels like it should be very little extra work.
For timers my inclination would be to assume no adjustments to the current time parameters and set the timer to go off then. If the time on the appropriate clock has been changed since the timer was set and the timer is going off early reschedule so the timer fires at the appropriate time.
With the above I think it is theoretically possible to build a time namespace that supports multiple lengths of second, and does not have much overhead.
Not that I think a final implementation would necessarily look like what I have described. I just think it is possible with extreme care to evolve the current code base into something that can efficiently handle multiple time domains with slightly different lengths of second.
Thomas does it sound like I am completely out of touch with reality?
It does though sound like it is going to take some serious digging through the code to understand what everything does and how and why everything works the way it does. Not something grafted on top with just a cursory understanding of how the code works.
Eric
Andrei,
On Sat, 20 Oct 2018, Andrei Vagin wrote:
When a container is migrated to another host, we have to restore its monotonic and boottime clocks, but we still expect that the container will continue using the host real-time clock.
Before starting this series, I thought about this and decided that these cases can be solved independently. Probably, the full isolation of the time sub-system will have much higher overhead than just offsets for a few clocks. And the idea that isolation of the real-time clock should be optional gives us another hint that offsets for monotonic and boot-time clocks can be implemented independently.
Eric and Tomas, what do you think about this? If you agree that these two cases can be implemented separately, what should we do with this series to make it ready to be merged?
I know that we need to:
- look at device drivers that report timestamps in CLOCK_MONOTONIC base.
and CLOCK_BOOTTIME and that's quite a few.
- forbid changing offsets after creating timers
There are more things to think about. What about interfaces which expose boot time or monotonic time in /proc?
Aside of that (I finally came around to look at the series in more detail) I'm really unhappy about the unconditional overhead once the Time namespace config switch is enabled. This applies especially to the VDSO. We spent quite some time recently to squeeze a few cycles out of those functions and it would be a pity to pointlessly waste cycles for the !namespace case.
I can see the urge for this, but please let us think it through properly before rushing anything in which we are going to regret once we want to do more sophisticated time domain management, e.g. support for isolated clock real time. I'm worried that without a clear plan about the overall picture, we end up with duct tape which is hard to disentangle after the fact.
There have been a few other things brought up versus time management in general, like the TSN folks utilizing grand clock masters which expose random time instead of proper TAI. Plus some requirements for exposing some sort of 'monotonic' clocks which are derived from external synchronization mechanisms, but should not affect the regular time keeping clocks.
While different issues, these all fall into the category of separate time domains, so taking a step back to the drawing board is probably the best thing we can do now.
There are certainly a few things which can be looked at independently, e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole kernel with these name space functions applying offsets left and right. I'd rather have dedicated core functionality which replaces/amends existing timer functions to become time namespace aware.
I'll try to find some time in the next weeks to look deeper into that, but I can't promise anything before returning from LPC. Btw, LPC would be a great opportunity to discuss that. Are you and the other name space wizards there by any chance?
Thanks,
tglx
Thomas Gleixner tglx@linutronix.de writes:
Andrei,
On Sat, 20 Oct 2018, Andrei Vagin wrote:
When a container is migrated to another host, we have to restore its monotonic and boottime clocks, but we still expect that the container will continue using the host real-time clock.
Before starting this series, I thought about this and decided that these cases can be solved independently. Probably, the full isolation of the time sub-system will have much higher overhead than just offsets for a few clocks. And the idea that isolation of the real-time clock should be optional gives us another hint that offsets for monotonic and boot-time clocks can be implemented independently.
Eric and Tomas, what do you think about this? If you agree that these two cases can be implemented separately, what should we do with this series to make it ready to be merged?
I know that we need to:
- look at device drivers that report timestamps in CLOCK_MONOTONIC base.
and CLOCK_BOOTTIME and that's quite a few.
- forbid changing offsets after creating timers
There are more things to think about. What about interfaces which expose boot time or monotonic time in /proc?
Aside of that (I finally came around to look at the series in more detail) I'm really unhappy about the unconditional overhead once the Time namespace config switch is enabled. This applies especially to the VDSO. We spent quite some time recently to squeeze a few cycles out of those functions and it would be a pity to pointlessly waste cycles for the !namespace case.
I can see the urge for this, but please let us think it through properly before rushing anything in which we are going to regret once we want to do more sophisticated time domain management, e.g. support for isolated clock real time. I'm worried that without a clear plan about the overall picture, we end up with duct tape which is hard to disentangle after the fact.
There have been a few other things brought up versus time management in general, like the TSN folks utilizing grand clock masters which expose random time instead of proper TAI. Plus some requirements for exposing some sort of 'monotonic' clocks which are derived from external synchronization mechanisms, but should not affect the regular time keeping clocks.
While different issues, these all fall into the category of separate time domains, so taking a step back to the drawing board is probably the best thing we can do now.
There are certainly a few things which can be looked at independently, e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole kernel with these name space functions applying offsets left and right. I'd rather have dedicated core functionality which replaces/amends existing timer functions to become time namespace aware.
I'll try to find some time in the next weeks to look deeper into that, but I can't promise anything before returning from LPC. Btw, LPC would be a great opportunity to discuss that. Are you and the other name space wizards there by any chance?
I will be, and there are going to be both container and CRIU mini-conferences. So there should be at least some of us around.
Eric
Eric,
On Mon, 29 Oct 2018, Eric W. Biederman wrote:
Thomas Gleixner tglx@linutronix.de writes:
I'll try to find some time in the next weeks to look deeper into that, but I can't promise anything before returning from LPC. Btw, LPC would be a great opportunity to discuss that. Are you and the other name space wizards there by any chance?
I will be, and there are going to be both container and CRIU mini-conferences. So there should be at least some of us around.
So let's try to find a slot for a BOF or similar (there might still be slots for the kernel summit available, I'll ask).
Thanks,
tglx
On Mon, Oct 29, 2018 at 09:33:14PM +0100, Thomas Gleixner wrote:
Andrei,
On Sat, 20 Oct 2018, Andrei Vagin wrote:
When a container is migrated to another host, we have to restore its monotonic and boottime clocks, but we still expect that the container will continue using the host real-time clock.
Before starting this series, I thought about this and decided that these cases can be solved independently. Full isolation of the time subsystem will probably have much higher overhead than offsets for just a few clocks. And the idea that isolation of the real-time clock should be optional gives us another hint that offsets for the monotonic and boot-time clocks can be implemented independently.
Eric and Thomas, what do you think about this? If you agree that these two cases can be implemented separately, what should we do with this series to make it ready to be merged?
I know that we need to:
- look at device drivers that report timestamps in CLOCK_MONOTONIC base.
and CLOCK_BOOTTIME and that's quite a few.
- forbid changing offsets after creating timers
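To make the offset arithmetic and the "no offset changes after timers exist" rule concrete, here is a minimal user-space sketch. It is a toy model, not kernel code: the class `TimeNamespaceModel`, the helper `restore_offset`, and the clock names used as dictionary keys are hypothetical and chosen only for illustration.

```python
class TimeNamespaceModel:
    """Toy model of per-namespace clock offsets (hypothetical, not kernel code)."""

    def __init__(self):
        # One offset per virtualized clock, in seconds.
        self.offsets = {"monotonic": 0.0, "boottime": 0.0}
        self.timers_created = False

    def set_offset(self, clock, offset):
        # Assumed policy from the list above: offsets are frozen
        # as soon as any timer has been created in the namespace.
        if self.timers_created:
            raise PermissionError("offsets are frozen once a timer exists")
        self.offsets[clock] = offset

    def now(self, clock, host_value):
        # Namespace time = host clock value + namespace offset.
        return host_value + self.offsets[clock]

    def create_timer(self, clock, host_value, delay):
        # Return the expiry in namespace time and freeze the offsets.
        self.timers_created = True
        return self.now(clock, host_value) + delay


def restore_offset(dumped_value, host_value_at_restore):
    """Offset that lets a restored clock continue from its dumped value."""
    return dumped_value - host_value_at_restore
```

For example, if the container's monotonic clock read 5000.0 s at dump time and the destination host's monotonic clock reads 120.0 s at restore time, `restore_offset(5000.0, 120.0)` gives 4880.0, and `now("monotonic", 120.0)` then returns 5000.0 inside the namespace.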
There are more things to think about. What about interfaces which expose boot time or monotonic time in /proc?
We didn't find any proc files where boot or monotonic time is reported, but we will double check this.
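One candidate worth including in that double check is /proc/uptime, whose first field reports the seconds elapsed since boot. The sketch below simply prints that value next to CLOCK_BOOTTIME so the two can be compared on a given system; the function names are hypothetical, and it says nothing about how a time namespace should treat the file.

```python
import os
import time


def parse_uptime(text):
    """Return the first field of /proc/uptime (seconds since boot) as a float."""
    return float(text.split()[0])


def uptime_vs_boottime(path="/proc/uptime"):
    """Return (procfs uptime, CLOCK_BOOTTIME) in seconds, or None if unavailable."""
    if not os.path.exists(path) or not hasattr(time, "CLOCK_BOOTTIME"):
        return None
    with open(path) as f:
        proc_value = parse_uptime(f.read())
    return proc_value, time.clock_gettime(time.CLOCK_BOOTTIME)


if __name__ == "__main__":
    result = uptime_vs_boottime()
    if result is None:
        print("procfs or CLOCK_BOOTTIME not available on this system")
    else:
        proc_value, boottime = result
        print(f"/proc/uptime: {proc_value:.2f}s  CLOCK_BOOTTIME: {boottime:.2f}s")
```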
Aside from that (I finally got around to looking at the series in more detail) I'm really unhappy about the unconditional overhead once the Time namespace config switch is enabled. This applies especially to the VDSO. We spent quite some time recently to squeeze a few cycles out of those functions and it would be a pity to pointlessly waste cycles for the !namespace case.
It is a good point. We will work on it.
I can see the urge for this, but please let us think it through properly before rushing anything in that we are going to regret once we want to do more sophisticated time domain management, e.g. support for isolated clock real time. I'm worried that without a clear plan about the overall picture, we end up with duct tape which is hard to disentangle after the fact.
Thomas, there is no rush at all. This functionality is critical for CRIU, but we have enough time to solve it properly.
The only thing I want is for this functionality to keep moving forward and not be put on the back burner.
There have been a few other things brought up versus time management in general, like the TSN folks utilizing grand clock masters which expose random time instead of proper TAI. Plus some requirements for exposing some sort of 'monotonic' clocks which are derived from external synchronization mechanisms, but should not affect the regular time keeping clocks.
While different issues, these all fall into the category of separate time domains, so taking a step back to the drawing board is probably the best thing we can do now.
There are certainly a few things which can be looked at independently, e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole kernel with these name space functions applying offsets left and right. I'd rather have dedicated core functionality which replaces/amends existing timer functions to become time namespace aware.
I'll try to find some time in the next weeks to look deeper into that, but I can't promise anything before returning from LPC. Btw, LPC would be a great opportunity to discuss that. Are you and the other name space wizards there by any chance?
Dmitry and I are going to be there.
Thanks! Andrei
Thanks,
tglx