current_time is one of the few callers of current_kernel_time64(), which is a wrapper around ktime_get_coarse_real_ts64(). This calls the latter directly for consistency with the rest of the kernel that is moving to the ktime_get_ family of time accessors.
An open question is whether we may want to actually call the more accurate ktime_get_real_ts64() for file systems that save high-resolution timestamps in their on-disk format. This would add a small but measurable overhead to each update of the inode stamps, but would let inode timestamps actually have a usable resolution better than one jiffy (normally 1 to 10 milliseconds).
I traced the original addition of the current_kernel_time() call to set the nanosecond fields back to linux-2.5.48, where Andi Kleen added a patch with subject "nanosecond stat timefields". This adds the original call to current_kernel_time and the truncation to the resolution of the file system, but makes no mention of the intended accuracy. At the time, we had a do_gettimeofday() interface that on some architectures could return a microsecond-resolution timestamp, but there was no interface for getting an accurate timestamp in nanosecond resolution, neither inside the kernel nor from user space. This makes me suspect that the use of coarse timestamps was never really a conscious decision but instead a result of whatever API was available 16 years ago.
Signed-off-by: Arnd Bergmann arnd@arndb.de
---
 fs/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c
index 2c300e981796..e27bd9334939 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2133,7 +2133,9 @@ EXPORT_SYMBOL(timespec64_trunc);
  */
 struct timespec64 current_time(struct inode *inode)
 {
-	struct timespec64 now = current_kernel_time64();
+	struct timespec64 now;
+
+	ktime_get_coarse_real_ts64(&now);
 
 	if (unlikely(!inode->i_sb)) {
 		WARN(1, "current_time() called with uninitialized super_block in the inode");
Arnd Bergmann arnd@arndb.de writes:
I traced the original addition of the current_kernel_time() call to set the nanosecond fields back to linux-2.5.48, where Andi Kleen added a patch with subject "nanosecond stat timefields". This adds the original call to current_kernel_time and the truncation to the resolution of the file system, but makes no mention of the intended accuracy. At the time, we had a do_gettimeofday() interface that on some architectures could return a microsecond-resolution timestamp, but there was no interface for getting an accurate timestamp in nanosecond resolution, neither inside the kernel nor from user space. This makes me suspect that the use of coarse timestamps was never really a conscious decision but instead a result of whatever API was available 16 years ago.
Kind of. VFS/system calls are expensive enough that you need multiple us in and out so us resolution was considered good enough.
Also if you do this change you really need to do some benchmarks, especially on setups without lazy atime. This might potentially cause a lot more inode flushes.
-Andi
On Wed, Jun 20, 2018 at 5:40 PM, Andi Kleen ak@linux.intel.com wrote:
Arnd Bergmann arnd@arndb.de writes:
I traced the original addition of the current_kernel_time() call to set the nanosecond fields back to linux-2.5.48, where Andi Kleen added a patch with subject "nanosecond stat timefields". This adds the original call to current_kernel_time and the truncation to the resolution of the file system, but makes no mention of the intended accuracy. At the time, we had a do_gettimeofday() interface that on some architectures could return a microsecond-resolution timestamp, but there was no interface for getting an accurate timestamp in nanosecond resolution, neither inside the kernel nor from user space. This makes me suspect that the use of coarse timestamps was never really a conscious decision but instead a result of whatever API was available 16 years ago.
Kind of. VFS/system calls are expensive enough that you need multiple us in and out so us resolution was considered good enough.
To clarify: current_kernel_time() uses at most millisecond resolution rather than microsecond, as tkr_mono.xtime_nsec only gets updated during the timer tick.
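Conceptually, the coarse accessor just hands back the timestamp that was recorded at the last tick, without touching the clocksource at all. A much simplified sketch (not the actual implementation, which lives in kernel/time/timekeeping.c and reads the seqcount-protected timekeeper structure):

static struct timespec64 tick_time;     /* updated once per timer tick */

static void sketch_coarse_real_ts64(struct timespec64 *ts)
{
        *ts = tick_time;        /* no clocksource read, so at best tick resolution */
}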
Has that time scale changed over the past 16 years, as CPUs got faster (and system call entry got slower again with recent changes)?
I tried a simple test on the shell, in tmpfs here and saw:
$ for i in `seq -w 100000` ; do > $i ; done
$ stat * | less | grep Modify | uniq -c | head
    601 Modify: 2018-06-20 18:04:48.794314629 +0200
    920 Modify: 2018-06-20 18:04:48.798314691 +0200
    936 Modify: 2018-06-20 18:04:48.802314753 +0200
    937 Modify: 2018-06-20 18:04:48.806314816 +0200
    901 Modify: 2018-06-20 18:04:48.810314878 +0200
    929 Modify: 2018-06-20 18:04:48.814314940 +0200
    931 Modify: 2018-06-20 18:04:48.818315002 +0200
    894 Modify: 2018-06-20 18:04:48.822315064 +0200
    952 Modify: 2018-06-20 18:04:48.826315128 +0200
    898 Modify: 2018-06-20 18:04:48.830315190 +0200
which indicates that the result of ktime_get_coarse_real_ts64() gets updated every four milliseconds here (matching the CONFIG_HZ_250 setting in my running kernel), and that we can create around 900 files during that time that each get the same timestamp (strace shows 10 system calls for each new file). Trying the same on btrfs, I get around 260 files per jiffy.
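(Roughly: one jiffy at HZ=250 is 4ms, so ~900 files per jiffy works out to about 4.4us per new file, or around 440ns per system call.)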
Also if you do this change you really need to do some benchmarks, especially on setups without lazy atime. This might potentially cause a lot more inode flushes.
Good point. On the other hand, there may be some reasons to do it even if there is a noticeable overhead, in cases where we actually want hires timestamps, so perhaps this could be a mount option.
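A rough sketch of what that could look like (the SB_HIRES_TIME flag and the corresponding mount option are made up here, nothing like them exists yet):

struct timespec64 current_time(struct inode *inode)
{
        struct timespec64 now;

        /* SB_HIRES_TIME: hypothetical superblock flag set by a new mount option */
        if (inode->i_sb->s_flags & SB_HIRES_TIME)
                ktime_get_real_ts64(&now);
        else
                ktime_get_coarse_real_ts64(&now);

        return timespec64_trunc(now, inode->i_sb->s_time_gran);
}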
Arnd
Arnd Bergmann arnd@arndb.de writes:
To clarify: current_kernel_time() uses at most millisecond resolution rather than microsecond, as tkr_mono.xtime_nsec only gets updated during the timer tick.
Ah you're right. I remember now: the motivation was to make sure there is basically no overhead. In some setups the full gtod can be rather slow, particularly if it falls back to some crappy timer.
I think it would be ok if it falls back to jiffies if TSC or a similar fast timer doesn't work. But the function you're using likely doesn't do that?
Has that time scale changed over the past 16 years, as CPUs got faster (and system call entry got slower again with recent changes)?
Maybe a bit, but not substantially.
-Andi
On Wed, Jun 20, 2018 at 6:19 PM, Andi Kleen ak@linux.intel.com wrote:
Arnd Bergmann arnd@arndb.de writes:
To clarify: current_kernel_time() uses at most millisecond resolution rather than microsecond, as tkr_mono.xtime_nsec only gets updated during the timer tick.
Ah you're right. I remember now: the motivation was to make sure there is basically no overhead. In some setups the full gtod can be rather slow, particularly if it falls back to some crappy timer.
I think it would be ok if it falls back to jiffies if TSC or a similar fast timer doesn't work. But the function you're using likely doesn't do that?
My patch as posted just uses ktime_get_coarse_real_ts64(), which doesn't ever access the hires clocksource; the change is just cosmetic so far.
The timekeeping and clocksource core code (maintainers added to Cc) doesn't yet export an API that we can use to determine whether the clocksource is "fast" or not, but I would expect that we can decide to add that if needed.
This is also something that definitely changed over the years since your patch was originally added. Back then, the x86 TSC probably wasn't reliable enough to depend on, but now I would guess that very few x86 machines in production use are still affected by that. On embedded systems, we used to have all kinds of clocksource drivers with varying characteristics, but nowadays the embedded market is dominated by ARMv7VE (Cortex-A7/A15/A17) or ARMv8, which are required to have a fast clocksource (drivers/clocksource/arm_arch_timer.c), and a lot of the others have it too (risc-v, modern mips, all ppc32, most ARM Cortex-A9, ...). The traditional non-x86 architectures (s390, powerpc, sparc) that are still being used have of course had low-latency clocksource access for a much longer time.
This means we're probably fine with a compile-time option that distros can choose to enable depending on what classes of hardware they are targeting, like
struct timespec64 current_time(struct inode *inode)
{
        struct timespec64 now;
        u64 gran = inode->i_sb->s_time_gran;

        if (IS_ENABLED(CONFIG_HIRES_INODE_TIMES) &&
            gran <= NSEC_PER_JIFFY)
                ktime_get_real_ts64(&now);
        else
                ktime_get_coarse_real_ts64(&now);

        return timespec64_trunc(now, gran);
}
With that implementation, we could still let file systems choose to get coarse timestamps by tuning the granularity in the superblock s_time_gran, which would result in nice round tv_nsec values that represent the actual accuracy.
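For illustration, a file system picks its granularity when it fills in the superblock; "examplefs" below is made up, but the s_time_gran assignment is the existing mechanism:

static int examplefs_fill_super(struct super_block *sb, void *data, int silent)
{
        /* nanosecond on-disk timestamps: would opt in to hires stamps
           when CONFIG_HIRES_INODE_TIMES is enabled */
        sb->s_time_gran = 1;

        /* alternatively, ask for coarse tick-based stamps by using a
           granularity larger than one jiffy, e.g. one second:
           sb->s_time_gran = NSEC_PER_SEC; */

        return 0;
}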
Obviously this still needs performance testing on various bits of real hardware, but I can imagine that the overhead is rather small on hardware from the past five years.
Arnd
On Wed, Jun 20, 2018 at 9:35 PM, Arnd Bergmann arnd@arndb.de wrote:
On Wed, Jun 20, 2018 at 6:19 PM, Andi Kleen ak@linux.intel.com wrote:
Arnd Bergmann arnd@arndb.de writes:
To clarify: current_kernel_time() uses at most millisecond resolution rather than microsecond, as tkr_mono.xtime_nsec only gets updated during the timer tick.
Ah you're right. I remember now: the motivation was to make sure there is basically no overhead. In some setups the full gtod can be rather slow, particularly if it falls back to some crappy timer.
This means we're probably fine with a compile-time option that distros can choose to enable depending on what classes of hardware they are targeting, like
struct timespec64 current_time(struct inode *inode)
{
        struct timespec64 now;
        u64 gran = inode->i_sb->s_time_gran;

        if (IS_ENABLED(CONFIG_HIRES_INODE_TIMES) &&
            gran <= NSEC_PER_JIFFY)
                ktime_get_real_ts64(&now);
        else
                ktime_get_coarse_real_ts64(&now);

        return timespec64_trunc(now, gran);
}
With that implementation, we could still let file systems choose to get coarse timestamps by tuning the granularity in the superblock s_time_gran, which would result in nice round tv_nsec values that represent the actual accuracy.
I've done some simple tests and found that on a variety of x86, arm32 and arm64 CPUs, it takes between 70 and 100 CPU cycles to read the TSC and add it to the coarse clock, e.g. on a 3.1GHz Ryzen, using the little test program below:
vdso hires:    37.18ns
vdso coarse:    6.44ns
sysc hires:   161.62ns
sysc coarse:  133.87ns
On the same machine, it takes around 400ns (1240 cycles) to write one byte into a tmpfs file with pwrite(). Adding 5% to 10% overhead for accurate timestamps would definitely be noticed, so I guess we wouldn't enable that unconditionally, but could do it as an opt-in mount option if someone had a use case.
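(Roughly: 70 to 100 cycles at 3.1GHz is 23ns to 32ns, which is 6% to 8% of the ~400ns pwrite(), hence the 5% to 10% estimate.)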
Arnd
---
/* measure times for high-resolution clocksource access from userspace */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <stdbool.h>
#include <sys/syscall.h>

static int do_clock_gettime(clockid_t clkid, struct timespec *tp, bool vdso)
{
        if (vdso)
                return clock_gettime(clkid, tp);

        return syscall(__NR_clock_gettime, clkid, tp);
}

static int loop1sec(int clkid, bool vdso)
{
        int i;
        struct timespec t, start;

        do_clock_gettime(clkid, &start, vdso);
        i = 0;
        do {
                do_clock_gettime(clkid, &t, vdso);
                i++;
        } while (t.tv_sec == start.tv_sec || t.tv_nsec < start.tv_nsec);

        return i;
}

int main(void)
{
        printf("vdso hires: %7.2fns\n",
                1000000000.0 / loop1sec(CLOCK_REALTIME, true));
        printf("vdso coarse: %7.2fns\n",
                1000000000.0 / loop1sec(CLOCK_REALTIME_COARSE, true));
        printf("sysc hires: %7.2fns\n",
                1000000000.0 / loop1sec(CLOCK_REALTIME, false));
        printf("sysc coarse: %7.2fns\n",
                1000000000.0 / loop1sec(CLOCK_REALTIME_COARSE, false));

        return 0;
}
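To try it, something like "gcc -O2 -o clockbench clockbench.c" should do (the file name is arbitrary; glibc before 2.17 also needs -lrt for clock_gettime).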
On Wed, Jun 20, 2018 at 05:01:24PM +0200, Arnd Bergmann wrote:
current_time is one of the few callers of current_kernel_time64(), which is a wrapper around ktime_get_coarse_real_ts64(). This calls the latter directly for consistency with the rest of the kernel that is moving to the ktime_get_ family of time accessors.
An open question is whether we may want to actually call the more accurate ktime_get_real_ts64() for file systems that save high-resolution timestamps in their on-disk format. This would add a small but measurable overhead to each update of the inode stamps, but would let inode timestamps actually have a usable resolution better than one jiffy (normally 1 to 10 milliseconds).
I traced the original addition of the current_kernel_time() call to set the nanosecond fields back to linux-2.5.48, where Andi Kleen added a patch with subject "nanosecond stat timefields". This adds the original call to current_kernel_time and the truncation to the resolution of the file system, but makes no mention of the intended accuracy. At the time, we had a do_gettimeofday() interface that on some architectures could return a microsecond-resolution timestamp, but there was no interface for getting an accurate timestamp in nanosecond resolution, neither inside the kernel nor from user space. This makes me suspect that the use of coarse timestamps was never really a conscious decision but instead a result of whatever API was available 16 years ago.
Signed-off-by: Arnd Bergmann arnd@arndb.de
 fs/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c
index 2c300e981796..e27bd9334939 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2133,7 +2133,9 @@ EXPORT_SYMBOL(timespec64_trunc);
  */
 struct timespec64 current_time(struct inode *inode)
 {
-	struct timespec64 now = current_kernel_time64();
+	struct timespec64 now;
+
+	ktime_get_coarse_real_ts64(&now);
Can I just say as a filesystem dev who has no idea at all about kernel timer implementations: this is an awful API change. There are hundreds of callers of current_time(), so I'm not going to be the only person looking at this function who has no clue about WTF "ktime_get_coarse_real" actually means or does. Further, this function is not documented, and jumps straight into internal time implementation stuff, so I'm lost straight away if somebody asks me "what does that function do"?. i.e. I have *no clue* what this function returns or why this code uses it.
i.e. the function goes from an obvious self documenting name that has been blessed as the current kernel timestamp to something that only people who work on the time subsystem understand and know when to use. It might make sense to you, but it sucks for everyone else....
Keep the wrapper, please. Change it to ktime_get_current(), if you really must change the function namespace...
Cheers,
Dave.
On Thu, Jun 21, 2018 at 10:23 PM, Dave Chinner david@fromorbit.com wrote:
On Wed, Jun 20, 2018 at 05:01:24PM +0200, Arnd Bergmann wrote:
diff --git a/fs/inode.c b/fs/inode.c
index 2c300e981796..e27bd9334939 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2133,7 +2133,9 @@ EXPORT_SYMBOL(timespec64_trunc);
  */
 struct timespec64 current_time(struct inode *inode)
 {
-	struct timespec64 now = current_kernel_time64();
+	struct timespec64 now;
+
+	ktime_get_coarse_real_ts64(&now);
Can I just say as a filesystem dev who has no idea at all about kernel timer implementations: this is an awful API change. There are hundreds of callers of current_time(), so I'm not going to be the only person looking at this function who has no clue about WTF "ktime_get_coarse_real" actually means or does. Further, this function is not documented, and jumps straight into internal time implementation stuff, so I'm lost straight away if somebody asks me "what does that function do"?. i.e. I have *no clue* what this function returns or why this code uses it.
You definitely have a point about the documentation. I meant to fix that as part of the recent rework of the timekeeping.h header but haven't finished it, partly because the header is still being changed as we get rid of the older interfaces.
i.e. the function goes from an obvious self documenting name that has been blessed as the current kernel timestamp to something that only people who work on the time subsystem understand and know when to use. It might make sense to you, but it sucks for everyone else....
Keep the wrapper, please. Change it to ktime_get_current(), if you really must change the function namespace...
The thing about current_kernel_time/current_kernel_time64/ktime_get_coarse_real_ts64 is that it hasn't been a good choice as a default time accessor for a long time, pretty much everything outside of current_time() has a better option:
- For measuring time intervals, you want a monotonic time source, not a 'real' one, to prevent time from going backwards during leap seconds or settimeofday() adjustments (see the short snippet below).
- For getting a timestamp as fast as possible, using a timespec64 based interface adds extra overhead compared to 'jiffies' or ktime_t (64-bit nanoseconds), especially when passing the struct as the return value.
- As mentioned before, almost all machines these days have a fast clocksource device, so getting a 'coarse' timestamp is barely faster than an accurate one.
- Having an interface with (at best) millisecond precision but nanosecond resolution can be confusing, as it suggests much more precision than we can give (the 'coarse' in the name is meant to clarify that).
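To illustrate the first point, measuring a duration with the monotonic clock (do_work() is just a placeholder here):

        ktime_t start, delta;

        start = ktime_get();    /* monotonic: unaffected by settimeofday() or leap seconds */
        do_work();
        delta = ktime_sub(ktime_get(), start);  /* 64-bit nanoseconds, cheap to pass around */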
While changing over all device drivers from the old-style interfaces with 32-bit time_t (current_kernel_time, get_seconds, do_gettimeofday and getnstimeofday, getrawmonotonic, get_monotonic_coarse, get_monotonic_boottime), we tried to also address those problems, using ktime_get() as the preferred interface, or an interface with a longer name where we had specific reasons.
Outside of file system timestamps, there are only two other files left that ask for a timestamp that is both 'coarse' and 'realtime', and both only do that when user space specifically asks for that combination.
Arnd
On Fri, Jun 22, 2018 at 03:24:48PM +0200, Arnd Bergmann wrote:
On Thu, Jun 21, 2018 at 10:23 PM, Dave Chinner david@fromorbit.com wrote:
On Wed, Jun 20, 2018 at 05:01:24PM +0200, Arnd Bergmann wrote:
diff --git a/fs/inode.c b/fs/inode.c
index 2c300e981796..e27bd9334939 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2133,7 +2133,9 @@ EXPORT_SYMBOL(timespec64_trunc);
  */
 struct timespec64 current_time(struct inode *inode)
 {
-	struct timespec64 now = current_kernel_time64();
+	struct timespec64 now;
+
+	ktime_get_coarse_real_ts64(&now);
Can I just say as a filesystem dev who has no idea at all about kernel timer implementations: this is an awful API change. There are hundreds of callers of current_time(), so I'm not going to be the only person looking at this function who has no clue about WTF "ktime_get_coarse_real" actually means or does. Further, this function is not documented, and jumps straight into internal time implementation stuff, so I'm lost straight away if somebody asks me "what does that function do"?. i.e. I have *no clue* what this function returns or why this code uses it.
You definitely have a point about the documentation. I meant to fix that as part of the recent rework of the timekeeping.h header but haven't finished it, partly because the header is still being changed as we get rid of the older interfaces.
The interface documentation should be introduced with the new interfaces, not left for later as that leaves people like me with no fucking clue about what the changes actually mean or why they are being done. Perhaps you'd have done better to explain this API as an internal implementation of clock_gettime(CLOCK_REALTIME_COARSE), which is clearly documented in the man page as:
"Use when you need very fast, but not fine-grained timestamps."
Put that comment on ktime_get_coarse_real_ts64(), and almost all the questions about "WTF does this whacky function do?" go away....
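i.e. something along these lines (the wording is only a sketch):

/**
 * ktime_get_coarse_real_ts64 - get coarse CLOCK_REALTIME timestamp
 * @ts: pointer to the timespec64 to be filled in
 *
 * Kernel-internal equivalent of clock_gettime(CLOCK_REALTIME_COARSE):
 * returns the wall clock time as of the last timer tick. Use when you
 * need very fast, but not fine-grained timestamps.
 */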
Cheers,
Dave.
On Tue, Jun 26, 2018 at 2:24 AM, Dave Chinner david@fromorbit.com wrote:
On Fri, Jun 22, 2018 at 03:24:48PM +0200, Arnd Bergmann wrote:
On Thu, Jun 21, 2018 at 10:23 PM, Dave Chinner david@fromorbit.com wrote:
On Wed, Jun 20, 2018 at 05:01:24PM +0200, Arnd Bergmann wrote:
You definitely have a point about the documentation. I meant to fix that as part of the recent rework of the timekeeping.h header but haven't finished it, partly because the header is still being changed as we get rid of the older interfaces.
The interface documentation should be introduced with the new interfaces, not left for later as that leaves people like me with no fucking clue about what the changes actually mean or why they are being done. Perhaps you'd have done better to explain this API as an internal implementation of clock_gettime(CLOCK_REALTIME_COARSE), which is clearly documented in the man page as:
"Use when you need very fast, but not fine-grained timestamps."
Put that comment on ktime_get_coarse_real_ts64(), and almost all the questions about "WTF does this whacky function do?" go away....
I've tried to come up with a coherent documentation now and sent a patch for that. For this version, I ended up not documenting each of the ktime_get() functions separately but instead added a new file in the core-api documentation and a reference to that from the header. This seemed easier to understand than duplicating the same text for over 30 very similar interfaces.
Arnd