On Wed, Jun 20, 2018 at 5:40 PM, Andi Kleen <ak@linux.intel.com> wrote:
> Arnd Bergmann <arnd@arndb.de> writes:
>> I traced the original addition of the current_kernel_time() call to set the nanosecond fields back to linux-2.5.48, where Andi Kleen added a patch with subject "nanosecond stat timefields". This adds the original call to current_kernel_time and the truncation to the resolution of the file system, but makes no mention of the intended accuracy. At the time, we had a do_gettimeofday() interface that on some architectures could return a microsecond-resolution timestamp, but there was no interface for getting an accurate timestamp in nanosecond resolution, neither inside the kernel nor from user space. This makes me suspect that the use of coarse timestamps was never really a conscious decision but instead a result of whatever API was available 16 years ago.
> Kind of. VFS/system calls are expensive enough that you need multiple us in and out so us resolution was considered good enough.
To clarify: current_kernel_time() uses at most millisecond resolution rather than microsecond, as tkr_mono.xtime_nsec only gets updated during the timer tick.
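The tick-based update can also be observed from user space. The following is a minimal sketch, assuming a Linux system where CLOCK_REALTIME_COARSE has the clock id 5 and returns the same tick-updated time that the coarse in-kernel accessors use; the helper name distinct_readings() is made up for illustration.

```python
import time

# Assumption: Linux numbers CLOCK_REALTIME_COARSE as 5 (see <linux/time.h>);
# Python's time module may not export a named constant for it.
CLOCK_REALTIME_COARSE = getattr(time, "CLOCK_REALTIME_COARSE", 5)


def distinct_readings(clock_id, duration_s=0.05):
    """Poll clock_id for duration_s seconds and count the distinct values seen."""
    seen = set()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        seen.add(time.clock_gettime_ns(clock_id))
    return len(seen)


if __name__ == "__main__":
    # The coarse clock should report roughly one tick period as its resolution.
    print("coarse resolution:", time.clock_getres(CLOCK_REALTIME_COARSE), "s")
    print("fine resolution:  ", time.clock_getres(time.CLOCK_REALTIME), "s")
    print("distinct coarse readings in 50 ms:",
          distinct_readings(CLOCK_REALTIME_COARSE))
    print("distinct fine readings in 50 ms:  ",
          distinct_readings(time.CLOCK_REALTIME))
```

On a tick-based kernel, the coarse clock should yield only a handful of distinct readings in that window (about one per timer tick), while the fine clock should change on nearly every call.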
Has that time scale changed over the past 16 years as CPUs got faster (and system call entry times slowed down again with recent changes)?
I tried a simple test on the shell, in tmpfs here and saw:
$ for i in `seq -w 100000` ; do > $i ; done
$ stat * | less | grep Modify | uniq -c | head
    601 Modify: 2018-06-20 18:04:48.794314629 +0200
    920 Modify: 2018-06-20 18:04:48.798314691 +0200
    936 Modify: 2018-06-20 18:04:48.802314753 +0200
    937 Modify: 2018-06-20 18:04:48.806314816 +0200
    901 Modify: 2018-06-20 18:04:48.810314878 +0200
    929 Modify: 2018-06-20 18:04:48.814314940 +0200
    931 Modify: 2018-06-20 18:04:48.818315002 +0200
    894 Modify: 2018-06-20 18:04:48.822315064 +0200
    952 Modify: 2018-06-20 18:04:48.826315128 +0200
    898 Modify: 2018-06-20 18:04:48.830315190 +0200
which indicates that the result of ktime_get_coarse_real_ts64() gets updated every four milliseconds here (matching the CONFIG_HZ_250 setting in my running kernel), and that we can create around 900 files during that time that each get the same timestamp (strace shows 10 system calls for each new file). Trying the same on btrfs, I get around 260 files per jiffy.
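For anyone who wants to repeat the measurement without the shell, here is a rough sketch of the same experiment. It assumes the temporary directory sits on the file system you want to test (tempfile's default location may or may not be tmpfs), and the function name files_per_timestamp() is only illustrative.

```python
import os
import tempfile
from collections import Counter


def files_per_timestamp(n, directory):
    """Create n empty files in directory and count how many share each mtime."""
    counts = Counter()
    for i in range(n):
        path = os.path.join(directory, f"{i:06d}")
        # Equivalent of the shell's `> $i`: create (or truncate) an empty file.
        with open(path, "w"):
            pass
        # st_mtime_ns is the modification time in integer nanoseconds.
        counts[os.stat(path).st_mtime_ns] += 1
    return counts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        counts = files_per_timestamp(10000, d)
    # Show the first ten timestamps and how many files carry each one.
    for mtime_ns, n in sorted(counts.items())[:10]:
        print(f"{n:7d} files with mtime {mtime_ns} ns")
```

The per-timestamp counts will likely be lower than in the shell test, since the interpreter and the extra stat() call add overhead per file. For scale, 900 files per 4 ms tick works out to roughly 4.4 us per file, or about 0.44 us per system call at 10 calls per file; the btrfs figure of 260 files per tick is about 15 us per file.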
> Also if you do this change you really need to do some benchmarks, especially on setups without lazy atime. This might potentially cause a lot more inode flushes.
Good point. On the other hand, there may be some reasons to do it even if there is a noticeable overhead, in cases where we actually want hires timestamps, so perhaps this could be a mount option.
Arnd