From userspace, spawning a new process with, for example, posix_spawn(), only allows the user to work with the scheduling priority value defined by POSIX in the sched_param struct.
However, sched_setparam() and similar syscalls lead to __sched_setscheduler() which rejects any new value for the priority other than 0 for non-RT schedule classes, a behavior kept since Linux 2.6 or earlier.
Linux translates the usage of the sched_param struct into it's own internal sched_attr struct during the syscall, but the user has no way to manage the other values within the sched_attr struct using only POSIX functions.
The only other way to adjust niceness while using posix_spawn() would be to set the value after the process has started, but this introduces the risk of the process being dead before the next syscall can set the priority after the fact.
To resolve this, allow the use of the priority value originally from the POSIX sched_param struct in order to set the niceness value instead of rejecting the priority value.
Edit the sched_get_priority_*() POSIX syscalls in order to reflect the range of values accepted.
Cc: stable@vger.kernel.org # Apply to kernel/sched/core.c Signed-off-by: Michael Pratt mcpratt@pm.me --- kernel/sched/syscalls.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index ae1b42775ef9..52c02b80f037 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -853,6 +853,19 @@ static int _sched_setscheduler(struct task_struct *p, int policy, attr.sched_policy = policy; }
+ if (attr.sched_priority > MAX_PRIO-1) + return -EINVAL; + + /* + * If priority is set for SCHED_NORMAL or SCHED_BATCH, + * set the niceness instead, but only for user calls. + */ + if (check && attr.sched_priority > MAX_RT_PRIO-1 && + ((policy != SETPARAM_POLICY && fair_policy(policy)) || fair_policy(p->policy))) { + attr.sched_nice = PRIO_TO_NICE(attr.sched_priority); + attr.sched_priority = 0; + } + return __sched_setscheduler(p, &attr, check, true); } /** @@ -1598,9 +1611,11 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy) case SCHED_RR: ret = MAX_RT_PRIO-1; break; - case SCHED_DEADLINE: case SCHED_NORMAL: case SCHED_BATCH: + ret = MAX_PRIO-1; + break; + case SCHED_DEADLINE: case SCHED_IDLE: ret = 0; break; @@ -1625,9 +1640,11 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy) case SCHED_RR: ret = 1; break; - case SCHED_DEADLINE: case SCHED_NORMAL: case SCHED_BATCH: + ret = MAX_RT_PRIO; + break; + case SCHED_DEADLINE: case SCHED_IDLE: ret = 0; }
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
On Mon, Sep 16, 2024 at 05:08:49AM +0000, Michael Pratt wrote:
From userspace, spawning a new process with, for example, posix_spawn(), only allows the user to work with the scheduling priority value defined by POSIX in the sched_param struct.
However, sched_setparam() and similar syscalls lead to __sched_setscheduler() which rejects any new value for the priority other than 0 for non-RT schedule classes, a behavior kept since Linux 2.6 or earlier.
Right, and the current behaviour is entirely in-line with the POSIX specs.
I realize this might be a pain, but why should be change this spec conforming and very long standing behaviour?
Worse, you're proposing a nice ABI that is entirely different from the normal [-20,19] range.
Why do you feel this is the best way forward? Would not adding POSIX_SPAWN_SETSCHEDATTR be a more future proof mechanism?
Hi Peter,
On Monday, September 16th, 2024 at 07:13, Peter Zijlstra peterz@infradead.org wrote:
On Mon, Sep 16, 2024 at 05:08:49AM +0000, Michael Pratt wrote:
From userspace, spawning a new process with, for example, posix_spawn(), only allows the user to work with the scheduling priority value defined by POSIX in the sched_param struct.
However, sched_setparam() and similar syscalls lead to __sched_setscheduler() which rejects any new value for the priority other than 0 for non-RT schedule classes, a behavior kept since Linux 2.6 or earlier.
Right, and the current behaviour is entirely in-line with the POSIX specs.
I'm just mentioning this for context. In this case, "in-line with POSIX specs" has nothing to do with whether or not the feature works. POSIX says nothing about which policies should be accepting which values or not and how they are processed. Like many things, it is simply implementation-specific.
The current behavior is that it doesn't work, and I would like it to work.
I realize this might be a pain, but why should we change this spec conforming and very long standing behavior?
The fact that the overall behavior is "very long standing" is a coincidence. The code here conforms to the specs both before and after the patch, and the difference is functionality.
In fact, I am not aiming to change the exact behavior of "reject every priority value other than 0" but rather work around that by translating it to niceness so long as it is a valid range passed as the priority by the user. This method is not just to maintain that priority must be 0, but I found it necessary, because if the syscall were allowed to change the static priority, then a future change in the "niceness" value would theoretically allow the priority to pass into the RT range for non-RT policies.
Worse, you're proposing a nice ABI that is entirely different from the normal [-20,19] range.
Please take a closer look... The resulting niceness value is exactly that range. PRIO_TO_NICE([MAX_RT_PRIO,MAX_PRIO-1]) = [-20,19]
I am not writing this so that the value passed as a "priority" value should be assumed to be the "niceness" value instead by the user, but rather that the user should pass a value for "priority" that will actually result in that value, but with the "niceness" adjusted instead, as that is the user-specific method to effectively do the same thing.
The "niceness" value has no meaning in the world of POSIX, it only means something in the world of Linux, and just like the translation from sched_param to sched_attr structs, this is the place where we would translate priority to niceness. Everything outside the internals of the kernel should be understood as the "actual" priority, because POSIX is a userspace that doesn't acknowledge or understand the kernel's ABIs, not the other way around.
Otherwise, we have a confusing conflation between the meaning of the two values, where a value of 19 makes sense for niceness, but is obviously invalid for priority for SCHED_NORMAL, and a negative value makes sense for niceness, but is obviously invalid for priority in any policy.
Implementations of posix_spawn functions ask for the "priority", and POSIX states that the value passed in with the sched_param struct should be the "priority" and that the usage is implementation-specific, not the other way around, where the meaning of the value would be implementation-specific, but the usage of the value would be clearly defined instead. I'm trying to stay in-line with the semantics as well.
Why do you feel this is the best way forward? Would not adding POSIX_SPAWN_SETSCHEDATTR be a more future proof mechanism?
New flags don't change the fact that the value will be rejected in the kernel, unless I am misunderstanding what you mean...
I believe this is the simplest and the smallest possible change that is conforming both to POSIX and the kernel's styling in order to make posix_spawnattr_setschedparam() work instead of _just_ being "conforming and compliant", which like I said is a low requirement of "just reject all values".
Flags like POSIX_SPAWN_SETSCHEDATTR would be used at the library level and we have no problems at the library level, except for Linux-only libraries that have not implemented posix_spawnattr_setschedparam() because it currently fails. Notably, the musl C library is an example of this, but that might change if we finally add support for this.
It would be nice if POSIX would add a flag to specifically cater to linux, however, that would likely require them to add the sched_attr struct definition or replace the sched_param struct, and as we know things usually work the other way around.
Thanks for your time.
-- MCP
one more detail I forgot: it actually would not be compliant for niceness as the input...
On Monday, September 16th, 2024 at 15:23, Michael Pratt mcpratt@pm.me wrote:
On Monday, September 16th, 2024 at 07:13, Peter Zijlstra peterz@infradead.org wrote:
Worse, you're proposing a nice ABI that is entirely different from the normal [-20,19] range.
... ... ...
Otherwise, we have a confusing conflation between the meaning of the two values, where a value of 19 makes sense for niceness, but is obviously invalid for priority for SCHED_NORMAL, and a negative value makes sense for niceness, but is obviously invalid for priority in any policy.
POSIX doesn't allow a negative value for the ABI at all:
If successful, the sched_get_priority_max() and sched_get_priority_min() functions return the appropriate maximum or minimum values, respectively. If unsuccessful, they return a value of -1 and set errno to indicate the error.
-- MCP
Hi Peter,
Have I addressed your concerns?
In my opinion, this is a simple change that adds ABI functionality, without affecting any other parts of the syscalls involved, whether breaking or non-breaking.
Like I said in previous replies, POSIX doesn't allow passing a negative number as a successful value, and setting the static priority directly can expose the scheduler to invalid priority values for CFS policies in later syscalls.
As far as I can see, this is the only way for usage of posix_spawnattr_setschedparam() to be both compliant _and_ functional for SCHED_NORMAL and SCHED_BATCH.
-- Thanks for your time, MCP
Hi everyone, again,
I'm still waiting for a serious review of this very short patch after about a month. I'll be getting ready to send this a 3rd time, and to more people, if I don't hear back again.
I believe that I have correctly addressed Peter's concerns in my replies, and that my patch is still appropriate for inclusion into Linux as it is now.
If I'm wrong please let me know why instead of just silence.
-- MCP
linux-stable-mirror@lists.linaro.org