From userspace, spawning a new process with, for example,
posix_spawn() only lets the user set the scheduling priority
value that POSIX defines in the sched_param struct.
However, sched_setparam() and similar syscalls lead to
__sched_setscheduler(), which rejects any new priority value
other than 0 for non-RT scheduling classes,
a behavior kept since Linux 2.6 or earlier.
During the syscall, Linux translates the sched_param struct
into its own internal sched_attr struct,
but the user has no way to manage the other values
within the sched_attr struct using only POSIX functions.
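For reference, an abbreviated sketch of the two structures (fields as
declared in <sched.h> and the uapi <linux/sched/types.h>):

  /* POSIX, <sched.h> */
  struct sched_param {
          int sched_priority;     /* the only field POSIX requires */
  };

  /* Linux-specific, uapi <linux/sched/types.h>, abbreviated */
  struct sched_attr {
          __u32 size;
          __u32 sched_policy;
          __u64 sched_flags;
          __s32 sched_nice;       /* not reachable via sched_param */
          __u32 sched_priority;
          /* ... sched_runtime, sched_deadline, sched_period, ... */
  };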
The only other way to adjust niceness while using posix_spawn()
is to set the value after the process has started,
but that is racy: the process may already have exited
before the follow-up syscall can set the priority.
To resolve this, use the priority value from the POSIX
sched_param struct to set the niceness
instead of rejecting the priority value outright,
and update the sched_get_priority_*() POSIX syscalls
to reflect the range of values now accepted.
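For illustration, a minimal userspace sketch of what this enables (not
part of the patch; with DEFAULT_PRIO being 120, a fair-class priority
of 130 maps to nice 10 via PRIO_TO_NICE()):

  #include <sched.h>
  #include <spawn.h>
  #include <stdio.h>

  extern char **environ;

  int main(void)
  {
          posix_spawnattr_t attr;
          /* with this patch: nice 10; before: EINVAL for any value != 0 */
          struct sched_param sp = { .sched_priority = 130 };
          char *argv[] = { "/bin/true", NULL };
          pid_t pid;

          posix_spawnattr_init(&attr);
          posix_spawnattr_setflags(&attr, POSIX_SPAWN_SETSCHEDULER);
          posix_spawnattr_setschedpolicy(&attr, SCHED_OTHER);
          posix_spawnattr_setschedparam(&attr, &sp);

          if (posix_spawn(&pid, argv[0], NULL, &attr, argv, environ))
                  perror("posix_spawn");

          posix_spawnattr_destroy(&attr);
          return 0;
  }

With the sched_get_priority_*() change below, sched_get_priority_max()
for SCHED_NORMAL/SCHED_BATCH reports MAX_PRIO-1 (139) and
sched_get_priority_min() reports MAX_RT_PRIO (100).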
Cc: stable@vger.kernel.org # Apply to kernel/sched/core.c
Signed-off-by: Michael Pratt <mcpratt@pm.me>
---
kernel/sched/syscalls.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index ae1b42775ef9..52c02b80f037 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -853,6 +853,19 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
attr.sched_policy = policy;
}
+ if (attr.sched_priority > MAX_PRIO-1)
+ return -EINVAL;
+
+ /*
+ * If priority is set for SCHED_NORMAL or SCHED_BATCH,
+ * set the niceness instead, but only for user calls.
+ */
+ if (check && attr.sched_priority > MAX_RT_PRIO-1 &&
+ ((policy != SETPARAM_POLICY && fair_policy(policy)) || fair_policy(p->policy))) {
+ attr.sched_nice = PRIO_TO_NICE(attr.sched_priority);
+ attr.sched_priority = 0;
+ }
+
return __sched_setscheduler(p, &attr, check, true);
}
/**
@@ -1598,9 +1611,11 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
case SCHED_RR:
ret = MAX_RT_PRIO-1;
break;
- case SCHED_DEADLINE:
case SCHED_NORMAL:
case SCHED_BATCH:
+ ret = MAX_PRIO-1;
+ break;
+ case SCHED_DEADLINE:
case SCHED_IDLE:
ret = 0;
break;
@@ -1625,9 +1640,11 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
case SCHED_RR:
ret = 1;
break;
- case SCHED_DEADLINE:
case SCHED_NORMAL:
case SCHED_BATCH:
+ ret = MAX_RT_PRIO;
+ break;
+ case SCHED_DEADLINE:
case SCHED_IDLE:
ret = 0;
}
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
--
2.30.2
Robert Gill reported the below #GP when the dosemu software was
executing the vm86() system call:
general protection fault: 0000 [#1] PREEMPT SMP
CPU: 4 PID: 4610 Comm: dosemu.bin Not tainted 6.6.21-gentoo-x86 #1
Hardware name: Dell Inc. PowerEdge 1950/0H723K, BIOS 2.7.0 10/30/2010
EIP: restore_all_switch_stack+0xbe/0xcf
EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: 00000000 ESP: ff8affdc
DS: 0000 ES: 0000 FS: 0000 GS: 0033 SS: 0068 EFLAGS: 00010046
CR0: 80050033 CR2: 00c2101c CR3: 04b6d000 CR4: 000406d0
Call Trace:
show_regs+0x70/0x78
die_addr+0x29/0x70
exc_general_protection+0x13c/0x348
exc_bounds+0x98/0x98
handle_exception+0x14d/0x14d
exc_bounds+0x98/0x98
restore_all_switch_stack+0xbe/0xcf
exc_bounds+0x98/0x98
restore_all_switch_stack+0xbe/0xcf
This only happens when VERW based mitigations like MDS/RFDS are enabled.
This is because segment registers with an arbitrary user value can result
in #GP when executing VERW. Intel SDM vol. 2C documents the following
behavior for VERW instruction:
#GP(0) - If a memory operand effective address is outside the CS, DS, ES,
FS, or GS segment limit.
The CLEAR_CPU_BUFFERS macro executes the VERW instruction before
returning to user space. Replace CLEAR_CPU_BUFFERS with a safer version
that uses %ss to reference the VERW operand mds_verw_sel. This ensures
VERW will not #GP for an arbitrary user %ds. Also, in the NMI return
path, move VERW to after RESTORE_ALL_NMI, which touches the GPRs.
For clarity, below are the locations where the new CLEAR_CPU_BUFFERS_SAFE
version is being used:
* entry_INT80_32(), entry_SYSENTER_32() and interrupts (via
handle_exception_return) do:
restore_all_switch_stack:
[...]
mov %esi,%esi
verw %ss:0xc0fc92c0 <-------------
iret
* Opportunistic SYSEXIT:
[...]
verw %ss:0xc0fc92c0 <-------------
btrl $0x9,(%esp)
popf
pop %eax
sti
sysexit
* nmi_return and nmi_from_espfix:
mov %esi,%esi
verw %ss:0xc0fc92c0 <-------------
jmp .Lirq_return
Fixes: a0e2dab44d22 ("x86/entry_32: Add VERW just before userspace transition")
Cc: stable@vger.kernel.org # 5.10+
Reported-by: Robert Gill <rtgill82@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218707
Closes: https://lore.kernel.org/all/8c77ccfd-d561-45a1-8ed5-6b75212c7a58@leemhuis.i…
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Brian Gerst <brgerst@gmail.com> # Use %ss
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
v5:
- Simplify the use of ALTERNATIVE construct (Uros/Jiri/Peter).
v4: https://lore.kernel.org/r/20240710-fix-dosemu-vm86-v4-1-aa6464e1de6f@linux.…
- Further simplify the patch by using %ss for all VERW calls in 32-bit mode (Brian).
- In NMI exit path move VERW after RESTORE_ALL_NMI that touches GPRs (Dave).
v3: https://lore.kernel.org/r/20240701-fix-dosemu-vm86-v3-1-b1969532c75a@linux.…
- Simplify CLEAR_CPU_BUFFERS_SAFE by using %ss instead of %ds (Brian).
- Do verw before popf in SYSEXIT path (Jari).
v2: https://lore.kernel.org/r/20240627-fix-dosemu-vm86-v2-1-d5579f698e77@linux.…
- Safeguard against any other system calls like vm86() that might change %ds (Dave).
v1: https://lore.kernel.org/r/20240426-fix-dosemu-vm86-v1-1-88c826a3f378@linux.…
---
arch/x86/entry/entry_32.S | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d3a814efbff6..25c942149fb5 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -253,6 +253,14 @@
.Lend_\@:
.endm
+/*
+ * Safer version of CLEAR_CPU_BUFFERS that uses %ss to reference VERW operand
+ * mds_verw_sel. This ensures VERW will not #GP for an arbitrary user %ds.
+ */
+.macro CLEAR_CPU_BUFFERS_SAFE
+ ALTERNATIVE "", __stringify(verw %ss:_ASM_RIP(mds_verw_sel)), X86_FEATURE_CLEAR_CPU_BUF
+.endm
+
.macro RESTORE_INT_REGS
popl %ebx
popl %ecx
@@ -871,6 +879,8 @@ SYM_FUNC_START(entry_SYSENTER_32)
/* Now ready to switch the cr3 */
SWITCH_TO_USER_CR3 scratch_reg=%eax
+ /* Clobbers ZF */
+ CLEAR_CPU_BUFFERS_SAFE
/*
* Restore all flags except IF. (We restore IF separately because
@@ -881,7 +891,6 @@ SYM_FUNC_START(entry_SYSENTER_32)
BUG_IF_WRONG_CR3 no_user_check=1
popfl
popl %eax
- CLEAR_CPU_BUFFERS
/*
* Return back to the vDSO, which will pop ecx and edx.
@@ -951,7 +960,7 @@ restore_all_switch_stack:
/* Restore user state */
RESTORE_REGS pop=4 # skip orig_eax/error_code
- CLEAR_CPU_BUFFERS
+ CLEAR_CPU_BUFFERS_SAFE
.Lirq_return:
/*
* ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
@@ -1144,7 +1153,6 @@ SYM_CODE_START(asm_exc_nmi)
/* Not on SYSENTER stack. */
call exc_nmi
- CLEAR_CPU_BUFFERS
jmp .Lnmi_return
.Lnmi_from_sysenter_stack:
@@ -1165,6 +1173,7 @@ SYM_CODE_START(asm_exc_nmi)
CHECK_AND_APPLY_ESPFIX
RESTORE_ALL_NMI cr3_reg=%edi pop=4
+ CLEAR_CPU_BUFFERS_SAFE
jmp .Lirq_return
#ifdef CONFIG_X86_ESPFIX32
@@ -1206,6 +1215,7 @@ SYM_CODE_START(asm_exc_nmi)
* 1 - orig_ax
*/
lss (1+5+6)*4(%esp), %esp # back to espfix stack
+ CLEAR_CPU_BUFFERS_SAFE
jmp .Lirq_return
#endif
SYM_CODE_END(asm_exc_nmi)
---
base-commit: f2661062f16b2de5d7b6a5c42a9a5c96326b8454
change-id: 20240426-fix-dosemu-vm86-dd111a01737e
From: Simon Arlott <simon@octiron.net>
The mcp251x_hw_wake() function is called with the mcp_lock mutex held and
disables the interrupt handler so that no interrupts can be processed while
waking the device. If an interrupt has already occurred, then waiting for
the interrupt handler to complete will deadlock because the handler will be
trying to acquire the same mutex.
CPU0 CPU1
---- ----
mcp251x_open()
mutex_lock(&priv->mcp_lock)
request_threaded_irq()
<interrupt>
mcp251x_can_ist()
mutex_lock(&priv->mcp_lock)
mcp251x_hw_wake()
disable_irq() <-- deadlock
Use disable_irq_nosync() instead, because the interrupt handler does
everything while holding the mutex, so it doesn't matter if it's still
running.
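A minimal sketch of the hazard (hypothetical names, not the driver's
actual code): disable_irq() synchronously waits for a running handler
to finish, while disable_irq_nosync() only marks the line disabled and
returns immediately.

  #include <linux/interrupt.h>
  #include <linux/mutex.h>

  static DEFINE_MUTEX(lock);

  /* Threaded handler: takes the same mutex as the wake path. */
  static irqreturn_t ist(int irq, void *dev)
  {
          mutex_lock(&lock);      /* blocks while wake() holds the lock */
          /* ... service the device ... */
          mutex_unlock(&lock);
          return IRQ_HANDLED;
  }

  static void wake(int irq)
  {
          mutex_lock(&lock);
          /*
           * disable_irq(irq) would wait here for ist() to return,
           * but ist() is blocked on the mutex we hold: deadlock.
           * disable_irq_nosync() only disables the line; a handler
           * that is already running finishes once we drop the mutex.
           */
          disable_irq_nosync(irq);
          /* ... wake the device, re-enable the interrupt later ... */
          mutex_unlock(&lock);
  }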
Fixes: 8ce8c0abcba3 ("can: mcp251x: only reset hardware as required")
Signed-off-by: Simon Arlott <simon@octiron.net>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/4fc08687-1d80-43fe-9f0d-8ef8475e75f6@0882a8b5-c…
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
---
drivers/net/can/spi/mcp251x.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/can/spi/mcp251x.c b/drivers/net/can/spi/mcp251x.c
index 3b8736ff0345..ec5c64006a16 100644
--- a/drivers/net/can/spi/mcp251x.c
+++ b/drivers/net/can/spi/mcp251x.c
@@ -752,7 +752,7 @@ static int mcp251x_hw_wake(struct spi_device *spi)
int ret;
/* Force wakeup interrupt to wake device, but don't execute IST */
- disable_irq(spi->irq);
+ disable_irq_nosync(spi->irq);
mcp251x_write_2regs(spi, CANINTE, CANINTE_WAKIE, CANINTF_WAKIF);
/* Wait for oscillator startup timer after wake up */
--
2.45.2