On Sun, Jul 05, 2020 at 01:44:54PM +0200, Greg KH wrote:
On Sun, Jul 05, 2020 at 01:07:14AM -0700, Vito Caputo wrote:
On Sun, Jul 05, 2020 at 04:27:32AM +0100, Matthew Wilcox wrote:
On Sun, Jul 05, 2020 at 05:18:58AM +0200, Jan Ziak wrote:
On Sun, Jul 5, 2020 at 5:12 AM Matthew Wilcox <willy@infradead.org> wrote:
You should probably take a look at io_uring. That has the level of complexity of this proposal and supports open/read/close along with many other opcodes.
Then glibc can implement readfile using io_uring and there is no need for a new single-file readfile syscall.
It could, sure. But there's also a value in having a simple interface to accomplish a simple task. Your proposed API added a very complex interface to satisfy needs that clearly aren't part of the problem space that Greg is looking to address.
I disagree re: "aren't part of the problem space".
Reading small files from procfs was specifically called out in the rationale for the syscall.
In my experience you're rarely monitoring a single proc file in any situation where you care about the syscall overhead. You're monitoring many of them, and any serious effort to do this efficiently in a repeatedly sampled situation has cached the open fds and already uses pread() to simply restart from 0 on every sample and not repeatedly pay for the name lookup.
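A minimal sketch of the cached-fd approach described above: open each proc file once, then take every sample with a single pread() from offset 0. The names sampler_open() and sampler_read() are hypothetical and are not taken from any tool mentioned in this thread; the sketch also assumes the whole file fits in one read.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Setup, done once per monitored file: the only place the path is resolved. */
static int sampler_open(const char *path)
{
	return open(path, O_RDONLY);
}

/*
 * Per-sample work: a single pread() from offset 0 on the cached fd.
 * Returns the number of bytes read (buf is NUL-terminated), or -1 on error.
 * A file larger than len - 1 bytes is cut off at what one call returns.
 */
static ssize_t sampler_read(int fd, char *buf, size_t len)
{
	ssize_t n;

	if (len == 0) {
		errno = EINVAL;
		return -1;
	}
	do {
		n = pread(fd, buf, len - 1, 0);
	} while (n < 0 && errno == EINTR);
	if (n >= 0)
		buf[n] = '\0';
	return n;
}
```

A monitor would call sampler_open() when a process comes into scope, sampler_read() on every tick, and close() when the process goes away.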
That's your use case, but many other use cases are just "read a bunch of sysfs files in one shot". Examples of that are tools that monitor uevents and lots of hardware-information gathering tools.
Also, not all tools seem to be as smart as you think they are; look at util-linux for loads of the "open/read/close" pattern across lots of files. I had a half-baked patch to convert it to use readfile, which I need to polish off and post with the next series to show how this can be used both to make userspace simpler and to use less CPU time.
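For reference, the "open/read/close" pattern mentioned above looks roughly like the following sketch. The helper name slurp_small_file() is hypothetical and is not taken from util-linux; the point is only that each call costs three syscalls plus a path lookup.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read a small file in one shot into buf (NUL-terminated).
 * Returns the number of bytes read, or -1 with errno set.
 */
static ssize_t slurp_small_file(const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd, saved_errno;

	if (len == 0) {
		errno = EINVAL;
		return -1;
	}

	fd = open(path, O_RDONLY);	/* syscall 1: includes the name lookup */
	if (fd < 0)
		return -1;

	n = read(fd, buf, len - 1);	/* syscall 2 */
	saved_errno = errno;
	close(fd);			/* syscall 3 */
	errno = saved_errno;		/* report read()'s error, not close()'s */

	if (n >= 0)
		buf[n] = '\0';
	return n;
}
```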
Basically anything optimally using the existing interfaces for sampling proc files needs a way to read multiple open file descriptors in a single syscall to move the needle.
Is psutils using this type of interface, or do they constantly open different files?
When I last checked, psutils was not an optimal example, nor did I suggest it was.
What about fun tools like bashtop: https://github.com/aristocratos/bashtop.git which thankfully now relies on python's psutil package to parse proc in semi-sane ways, but that package does loads of constant open/read/close of proc files all the time from what I can tell.
And lots of people rely on python's psutil, right?
If python's psutil is constantly reopening the same files in /proc, this is an argument to go improve python's psutil, especially if it's popular.
Your proposed syscall doesn't magically make everything suboptimally sampling proc more efficient. It still requires going out and modifying everything to use the new syscall.
To actually realize a gain with your new syscall comparable to what can already be done using existing interfaces, code that wasn't already reusing the open fd still needs a refactor to do so, to eliminate the directory lookup on every sample.
At the end of the day, if you did all this work, you'd have code that only works on kernels with the new syscall, didn't enjoy a significant performance gain over what could have been achieved using the existing interfaces, and still required basically the same amount of work as optimizing for the existing interfaces would have. For what gain?
This syscall doesn't provide that. It doesn't really give any advantage over what we can achieve already. It seems basically pointless to me, from a monitoring proc files perspective.
What "good" monitoring programs do you suggest follow the pattern you recommend?
"Good" is not generally a word I'd use to describe software, surely that's not me you're quoting... but I assume you mean "optimal".
I'm sure sysprof is at least reusing open files when sampling proc, because we discussed the issue when Christian took over maintenance.
It appears he's currently using the lseek()->read() sequence:
https://gitlab.gnome.org/GNOME/sysprof/-/blob/master/src/libsysprof/sysprof-...
https://gitlab.gnome.org/GNOME/sysprof/-/blob/master/src/libsysprof/sysprof-...
https://gitlab.gnome.org/GNOME/sysprof/-/blob/master/src/libsysprof/sysprof-...
It'd be more efficient to just use pread() and lose the lseek(), at which point it'd be just a single pread() call per sample per proc file. That is nothing your proposed syscall would improve upon, not that it'd be eligible for software that wants to work on existing kernels from distros like Debian and CentOS/RHEL anyways.
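To make the difference concrete, here is a sketch of the two variants side by side. The function names are hypothetical and this is not sysprof's actual code; both return the same bytes, but the second does it in one syscall instead of two.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

/* Two syscalls per sample: rewind the shared file offset, then read. */
static ssize_t sample_lseek_read(int fd, char *buf, size_t len)
{
	if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
		return -1;
	return read(fd, buf, len);
}

/* One syscall per sample: read at an explicit offset, no rewind needed. */
static ssize_t sample_pread(int fd, char *buf, size_t len)
{
	return pread(fd, buf, len, 0);
}
```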
If this were a conversation about providing something like a better scatter-gather interface akin to p{read,write}v but with the fd in the iovec, then we'd be talking about something very valuable for proc sampling. But like you've said elsewhere in this thread, io_uring may suffice as an alternative solution in that vein.
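To illustrate the shape of interface being described, here is a purely hypothetical sketch. No such syscall exists; struct fd_iovec and preadv_fds_emulated() are invented names, and this userspace emulation still issues one pread() per entry, so it saves nothing. A real kernel implementation would service the whole array in a single syscall.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* HYPOTHETICAL: like struct iovec, but each entry names its own fd. */
struct fd_iovec {
	int	fd;	/* already-open file to read from */
	void	*base;	/* destination buffer */
	size_t	len;	/* size of the destination buffer */
	off_t	offset;	/* file offset to read at (0 for proc sampling) */
	ssize_t	result;	/* out: bytes read, or -1 on error */
};

/*
 * Userspace emulation only: fills in result for every entry and returns
 * the number of entries that were read successfully.
 */
static int preadv_fds_emulated(struct fd_iovec *iov, int iovcnt)
{
	int ok = 0;

	for (int i = 0; i < iovcnt; i++) {
		iov[i].result = pread(iov[i].fd, iov[i].base,
				      iov[i].len, iov[i].offset);
		if (iov[i].result >= 0)
			ok++;
	}
	return ok;
}
```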
My personal interest in this topic stems from an experimental window manager I made, and still use, which monitors every descendant process for the X session at frequencies up to 60Hz. The code opens a bunch of proc files for every process, and keeps them open until the process goes away or falls out of scope. See the attachment for some idea of what /proc/$(pidof wm)/fd looks like. All those proc files are read at up to 60Hz continuously.
All top-like tools are really no different, and already shouldn't be reopening things on every sample. They should be fixed if not - with or without your syscall, it's equal effort, but the existing interfaces... exist.
Regards, Vito Caputo