On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
> I really hate making the traditional unix single-threaded file descriptor case take that lock.
>
> Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
>
> But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
So I got curious what the impact is and checked on quite a modern CPU (Sapphire Rapids), so nobody can claim it's some old yeller and that atomics are nowhere near as expensive on modern uarchs.
I used read1_processes from will-it-scale -- it does 4KB reads at a time over a 1MB file and dodges refing the file, but with the patch at hand it does not dodge the lock.
In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
The kernel was patched with a toggle to force or elide the proposed mandatory locking, like so:

@@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);

 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		v |= FDPUT_POS_UNLOCK;
-		mutex_lock(&file->f_pos_lock);
+		if (file_count(file) > 1 || fdget_pos_mutex) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
 	}
 	return v;
 }
I got rather unstable single-threaded perf, going up and down several % between runs; I don't know yet what that's about. But toggling back and forth while the bench was running consistently gave the aforementioned ~5% difference.
perf top claims:

[snip]
  32.64%  [kernel]          [k] copyout
  10.83%  [kernel]          [k] entry_SYSCALL_64
   6.69%  libc.so.6         [.] read
   6.16%  [kernel]          [k] filemap_get_read_batch
   5.62%  [kernel]          [k] filemap_read
   3.39%  [kernel]          [k] __fget_light
   2.92%  [kernel]          [k] mutex_lock    <-- oh
   2.70%  [kernel]          [k] mutex_unlock  <-- no
   2.33%  [kernel]          [k] vfs_read
   2.18%  [kernel]          [k] _copy_to_iter
   1.93%  [kernel]          [k] atime_needs_update
   1.74%  [kernel]          [k] __fsnotify_parent
   1.29%  read1_processes   [.] testcase
[/snip]
[note running perf along with the bench changes throughput to some extent]
So yes, atomics remain expensive on x86-64 even on a very modern uarch, and their impact is measurable in a syscall like read.
Consequently eliding this mutex trip would be most welcome.
Happy to rant,