Hi Peter,
On Monday, September 16th, 2024 at 07:13, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 16, 2024 at 05:08:49AM +0000, Michael Pratt wrote:
> > From userspace, spawning a new process with, for example, posix_spawn(), only allows the user to work with the scheduling priority value defined by POSIX in the sched_param struct.
> > However, sched_setparam() and similar syscalls lead to __sched_setscheduler() which rejects any new value for the priority other than 0 for non-RT schedule classes, a behavior kept since Linux 2.6 or earlier.
> Right, and the current behaviour is entirely in-line with the POSIX specs.
I'm just mentioning this for context. In this case, "in line with POSIX specs" has nothing to do with whether or not the feature works. POSIX says nothing about which policies should accept which values, or about how those values are processed. Like many things, it is simply implementation-specific.
The current behavior is that it doesn't work, and I would like it to work.
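For reference, the rejection can be reproduced from userspace with a minimal sketch like the one below. It assumes the calling thread runs under SCHED_OTHER on a kernel without this patch; try_set_priority() is a hypothetical helper name used only for illustration.

```c
#include <errno.h>
#include <sched.h>

/*
 * Hypothetical helper: ask the kernel to set sched_priority for the
 * calling thread (pid 0) without changing its scheduling policy.
 * Returns 0 on success, otherwise the errno value of the failed call.
 */
static int try_set_priority(int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    if (sched_setparam(0, &sp) == -1)
        return errno;
    return 0;
}
```

Under those assumptions, try_set_priority(0) is expected to succeed, while any other value is expected to come back as EINVAL.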
> I realize this might be a pain, but why should we change this spec conforming and very long standing behavior?
The fact that the overall behavior is "very long standing" is a coincidence. The code here conforms to the specs both before and after the patch, and the difference is functionality.
In fact, I am not aiming to change the exact behavior of "reject every priority value other than 0" but rather to work around it by translating the value to niceness, so long as the priority passed by the user is in a valid range. This approach is not only about keeping the rule that the priority must be 0. I found it necessary because, if the syscall were allowed to change the static priority directly, a later change to the "niceness" value could theoretically push the priority into the RT range for a non-RT policy.
> Worse, you're proposing a nice ABI that is entirely different from the normal [-20,19] range.
Please take a closer look... The resulting niceness value is exactly that range. PRIO_TO_NICE([MAX_RT_PRIO,MAX_PRIO-1]) = [-20,19]
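To make the arithmetic concrete, here is a small standalone sketch. The macro values mirror the definitions in include/linux/sched/prio.h as I understand them; prio_to_nice_checked() is a hypothetical helper for illustration only and is not the code from the patch.

```c
#include <stdbool.h>

/* Values assumed to mirror include/linux/sched/prio.h. */
#define MAX_NICE        19
#define MIN_NICE        (-20)
#define NICE_WIDTH      (MAX_NICE - MIN_NICE + 1)           /* 40  */
#define MAX_RT_PRIO     100
#define MAX_PRIO        (MAX_RT_PRIO + NICE_WIDTH)          /* 140 */
#define DEFAULT_PRIO    (MAX_RT_PRIO + NICE_WIDTH / 2)      /* 120 */
#define NICE_TO_PRIO(nice)  ((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)  ((prio) - DEFAULT_PRIO)

/*
 * Hypothetical helper, not the patch itself: translate a priority in
 * [MAX_RT_PRIO, MAX_PRIO - 1] to a nice value, rejecting anything else.
 */
static bool prio_to_nice_checked(int prio, int *nice)
{
    if (prio < MAX_RT_PRIO || prio > MAX_PRIO - 1)
        return false;

    *nice = PRIO_TO_NICE(prio);
    return true;
}
```

With these definitions, the lowest accepted priority (100) maps to nice -20 and the highest (139) maps to nice 19, so the result never leaves the usual nice range.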
The intent is not that the user should treat the value passed as "priority" as if it were the "niceness" value. Rather, the user passes the "priority" they actually want, and the kernel arrives at that priority by adjusting the "niceness" instead, since that is the user-facing way to effectively do the same thing.
The "niceness" value has no meaning in the world of POSIX; it only means something in the world of Linux. Just like the translation from the sched_param struct to the sched_attr struct, this is the place where we would translate priority to niceness. Everything outside the internals of the kernel should be understood as the "actual" priority, because POSIX is a userspace standard that doesn't acknowledge or understand the kernel's ABIs, not the other way around.
Otherwise, we have a confusing conflation between the meaning of the two values, where a value of 19 makes sense for niceness, but is obviously invalid for priority for SCHED_NORMAL, and a negative value makes sense for niceness, but is obviously invalid for priority in any policy.
Implementations of the posix_spawn functions ask for the "priority", and POSIX states that the value passed in the sched_param struct is the "priority" and that its usage is implementation-specific. It is not the other way around, where the meaning of the value would be implementation-specific but its usage clearly defined. I'm trying to stay in line with the semantics as well.
> Why do you feel this is the best way forward? Would not adding POSIX_SPAWN_SETSCHEDATTR be a more future proof mechanism?
New flags don't change the fact that the value will be rejected in the kernel, unless I am misunderstanding what you mean...
I believe this is the simplest and smallest possible change that conforms to both POSIX and the kernel's style and that makes posix_spawnattr_setschedparam() actually work instead of _just_ being "conforming and compliant", which, as I said, is a low bar that "just reject all values" already meets.
Flags like POSIX_SPAWN_SETSCHEDATTR would be used at the library level and we have no problems at the library level, except for Linux-only libraries that have not implemented posix_spawnattr_setschedparam() because it currently fails. Notably, the musl C library is an example of this, but that might change if we finally add support for this.
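For illustration, the library-level path in question looks roughly like the sketch below. It uses only the standard posix_spawn interfaces; spawn_with_priority() is a hypothetical helper name, and the failure described afterwards assumes glibc on a kernel without this patch.

```c
#include <sched.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Hypothetical helper: spawn `path` with the requested sched_priority.
 * Returns 0 on success, otherwise the error number reported by the
 * posix_spawn*() call that failed.
 */
static int spawn_with_priority(const char *path, int prio)
{
    posix_spawnattr_t attr;
    struct sched_param sp = { .sched_priority = prio };
    char *const argv[] = { (char *)path, NULL };
    pid_t pid;
    int err;

    err = posix_spawnattr_init(&attr);
    if (err)
        return err;

    /* Store the priority and ask posix_spawn() to apply it in the child. */
    err = posix_spawnattr_setschedparam(&attr, &sp);
    if (!err)
        err = posix_spawnattr_setflags(&attr, POSIX_SPAWN_SETSCHEDPARAM);
    if (!err)
        err = posix_spawn(&pid, path, NULL, &attr, argv, environ);

    posix_spawnattr_destroy(&attr);

    if (!err)
        waitpid(pid, NULL, 0);  /* reap the child so it does not linger */
    return err;
}
```

Under those assumptions, a call such as spawn_with_priority("/bin/true", 100) from a SCHED_OTHER process is expected to fail, because the child's attempt to set the priority is rejected by the kernel and the library reports that error back to the caller.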
It would be nice if POSIX added a flag specifically to cater to Linux. However, that would likely require them to add the sched_attr struct definition or replace the sched_param struct and, as we know, things usually work the other way around.
Thanks for your time.
-- MCP