On Thu, Aug 03, 2023 at 11:53:11AM +0200, Mateusz Guzik wrote:
On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
I really hate making the traditional unix single-threaded file descriptor case take that lock.
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
So I got curious what the impact is and checked on quite a modern CPU (Sapphire Rapid), so nobody can claim it's some old yeller and atomics are nowhere near as expensive on modern uarchs.
I used read1_processes from will-it-scale -- it is doing 4KB reads at a time over a 1MB file and dodges refing the file, but it does not dodge the lock with the patch at hand.
In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
The kernel was patched with a toggle to force or elide the proposed mandatory locking, like so: @@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd) struct file *file = (struct file *)(v & ~3);
if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
v |= FDPUT_POS_UNLOCK;
mutex_lock(&file->f_pos_lock);
if (file_count(file) > 1 || fdget_pos_mutex) {
v |= FDPUT_POS_UNLOCK;
mutex_lock(&file->f_pos_lock);
} } return v;
}
I got rather unstable single-threaded perf, going up and down several % between runs, I don't know yet what's that about. But toggling back and
We've had the whole lkp intel performance testsuite run on this for a long time to see whether there any performance regressions that aren't in the noise. This includes all of will-it-scale. Also with a focus on single-thread performance. So I'm a little skeptical about the reliability of manual performance runs in comparison.
If Linus thinks it matters then I think we should only take the f_pos_lock unconditionally on directory fds as this is where it really matters. For read/write we're only doing it because of posix anyway and using pidfd_getfd() is squarely out of posix anyway. So we can document that using pidfd_getfd() on a regular file descriptor outside of a seccomp stopped task is not recommended.