From: Nick Desaulniers <ndesaulniers(a)google.com>
Basically, consider .text.{hot|unlikely|unknown}.* part of .text, too.
When compiling with profiling information (collected via PGO
instrumentations or AutoFDO sampling), Clang will separate code into
.text.hot, .text.unlikely, or .text.unknown sections based on profiling
information. After D79600 (clang-11), these sections will have a
trailing `.` suffix, ie. .text.hot., .text.unlikely., .text.unknown..
When using -ffunction-sections together with profiling infomation,
either explicitly (FGKASLR) or implicitly (LTO), code may be placed in
sections following the convention:
.text.hot.<foo>, .text.unlikely.<bar>, .text.unknown.<baz>
where <foo>, <bar>, and <baz> are functions. (This produces one section
per function; we generally try to merge these all back via linker script
so that we don't have 50k sections).
For the above cases, we need to teach our linker scripts that such
sections might exist and that we'd explicitly like them grouped
together, otherwise we can wind up with code outside of the
_stext/_etext boundaries that might not be mapped properly for some
architectures, resulting in boot failures.
If the linker script is not told about possible input sections, then
where the section is placed as output is a heuristic-laiden mess that's
non-portable between linkers (ie. BFD and LLD), and has resulted in many
hard to debug bugs. Kees Cook is working on cleaning this up by adding
--orphan-handling=warn linker flag used in ARCH=powerpc to additional
architectures. In the case of linker scripts, borrowing from the Zen of
Python: explicit is better than implicit.
Also, ld.bfd's internal linker script considers .text.hot AND
.text.hot.* to be part of .text, as well as .text.unlikely and
.text.unlikely.*. I didn't see support for .text.unknown.*, and didn't
see Clang producing such code in our kernel builds, but I see code in
LLVM that can produce such section names if profiling information is
missing. That may point to a larger issue with generating or collecting
profiles, but I would much rather be safe and explicit than have to
debug yet another issue related to orphan section placement.
Reported-by: Jian Cai <jiancai(a)google.com>
Suggested-by: Fāng-ruì Sòng <maskray(a)google.com>
Tested-by: Luis Lozano <llozano(a)google.com>
Tested-by: Manoj Gupta <manojgupta(a)google.com>
Acked-by: Kees Cook <keescook(a)chromium.org>
Cc: stable(a)vger.kernel.org
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=add44f8d5c5c0…
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=1de778ed23ce7…
Link: https://reviews.llvm.org/D79600
Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1084760
Debugged-by: Luis Lozano <llozano(a)google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers(a)google.com>
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
include/asm-generic/vmlinux.lds.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 2593957f6e8b..af5211ca857c 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -561,7 +561,10 @@
*/
#define TEXT_TEXT \
ALIGN_FUNCTION(); \
- *(.text.hot TEXT_MAIN .text.fixup .text.unlikely) \
+ *(.text.hot .text.hot.*) \
+ *(TEXT_MAIN .text.fixup) \
+ *(.text.unlikely .text.unlikely.*) \
+ *(.text.unknown .text.unknown.*) \
NOINSTR_TEXT \
*(.text..refcount) \
*(.ref.text) \
--
2.25.1
For support of long running hypercalls xen_maybe_preempt_hcall() is
calling cond_resched() in case a hypercall marked as preemptible has
been interrupted.
Normally this is no problem, as only hypercalls done via some ioctl()s
are marked to be preemptible. In rare cases when during such a
preemptible hypercall an interrupt occurs and any softirq action is
started from irq_exit(), a further hypercall issued by the softirq
handler will be regarded to be preemptible, too. This might lead to
rescheduling in spite of the softirq handler potentially having set
preempt_disable(), leading to splats like:
BUG: sleeping function called from invalid context at drivers/xen/preempt.c:37
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 20775, name: xl
INFO: lockdep is turned off.
CPU: 1 PID: 20775 Comm: xl Tainted: G D W 5.4.46-1_prgmr_debug.el7.x86_64 #1
Call Trace:
<IRQ>
dump_stack+0x8f/0xd0
___might_sleep.cold.76+0xb2/0x103
xen_maybe_preempt_hcall+0x48/0x70
xen_do_hypervisor_callback+0x37/0x40
RIP: e030:xen_hypercall_xen_version+0xa/0x20
Code: ...
RSP: e02b:ffffc900400dcc30 EFLAGS: 00000246
RAX: 000000000004000d RBX: 0000000000000200 RCX: ffffffff8100122a
RDX: ffff88812e788000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffffff83ee3ad0 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: ffff8881824aa0b0
R13: 0000000865496000 R14: 0000000865496000 R15: ffff88815d040000
? xen_hypercall_xen_version+0xa/0x20
? xen_force_evtchn_callback+0x9/0x10
? check_events+0x12/0x20
? xen_restore_fl_direct+0x1f/0x20
? _raw_spin_unlock_irqrestore+0x53/0x60
? debug_dma_sync_single_for_cpu+0x91/0xc0
? _raw_spin_unlock_irqrestore+0x53/0x60
? xen_swiotlb_sync_single_for_cpu+0x3d/0x140
? mlx4_en_process_rx_cq+0x6b6/0x1110 [mlx4_en]
? mlx4_en_poll_rx_cq+0x64/0x100 [mlx4_en]
? net_rx_action+0x151/0x4a0
? __do_softirq+0xed/0x55b
? irq_exit+0xea/0x100
? xen_evtchn_do_upcall+0x2c/0x40
? xen_do_hypervisor_callback+0x29/0x40
</IRQ>
? xen_hypercall_domctl+0xa/0x20
? xen_hypercall_domctl+0x8/0x20
? privcmd_ioctl+0x221/0x990 [xen_privcmd]
? do_vfs_ioctl+0xa5/0x6f0
? ksys_ioctl+0x60/0x90
? trace_hardirqs_off_thunk+0x1a/0x20
? __x64_sys_ioctl+0x16/0x20
? do_syscall_64+0x62/0x250
? entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix that by testing preempt_count() before calling cond_resched().
In kernel 5.8 this can't happen any more due to the entry code rework.
Reported-by: Sarah Newman <srn(a)prgmr.com>
Fixes: 0fa2f5cb2b0ecd8 ("sched/preempt, xen: Use need_resched() instead of should_resched()")
Cc: Sarah Newman <srn(a)prgmr.com>
Signed-off-by: Juergen Gross <jgross(a)suse.com>
---
drivers/xen/preempt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/xen/preempt.c b/drivers/xen/preempt.c
index 17240c5325a3..6ad87b5c95ed 100644
--- a/drivers/xen/preempt.c
+++ b/drivers/xen/preempt.c
@@ -27,7 +27,7 @@ EXPORT_SYMBOL_GPL(xen_in_preemptible_hcall);
asmlinkage __visible void xen_maybe_preempt_hcall(void)
{
if (unlikely(__this_cpu_read(xen_in_preemptible_hcall)
- && need_resched())) {
+ && need_resched() && !preempt_count())) {
/*
* Clear flag as we may be rescheduled on a different
* cpu.
--
2.26.2
If the value of the link register is not correct (tail call from asm
that didn't set it, stack corruption, memory no longer mapped), then
using it for an address calculation may trigger an exception. Without a
fixup handler, this will lead to a panic, which will unwind, which will
trigger the fault repeatedly in an infinite loop.
We don't observe such failures currently, but we have. Just to be safe,
add a fixup handler here so that at least we don't have an infinite
loop.
Cc: stable(a)vger.kernel.org
Fixes: commit 6dc5fd93b2f1 ("ARM: 8900/1: UNWINDER_FRAME_POINTER implementation for Clang")
Reported-by: Miles Chen <miles.chen(a)mediatek.com>
Signed-off-by: Nick Desaulniers <ndesaulniers(a)google.com>
---
arch/arm/lib/backtrace-clang.S | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/arm/lib/backtrace-clang.S b/arch/arm/lib/backtrace-clang.S
index 5388ac664c12..40eb2215eaf4 100644
--- a/arch/arm/lib/backtrace-clang.S
+++ b/arch/arm/lib/backtrace-clang.S
@@ -146,7 +146,7 @@ for_each_frame: tst frame, mask @ Check for address exceptions
tst sv_lr, #0 @ If there's no previous lr,
beq finished_setup @ we're done.
- ldr r0, [sv_lr, #-4] @ get call instruction
+prev_call: ldr r0, [sv_lr, #-4] @ get call instruction
ldr r3, .Lopcode+4
and r2, r3, r0 @ is this a bl call
teq r2, r3
@@ -206,6 +206,13 @@ finished_setup:
mov r2, frame
bl printk
no_frame: ldmfd sp!, {r4 - r9, fp, pc}
+/*
+ * Accessing the address pointed to by the link register triggered an
+ * exception, don't try to unwind through it.
+ */
+bad_lr: mov sv_fp, #0
+ mov sv_lr, #0
+ b finished_setup
ENDPROC(c_backtrace)
.pushsection __ex_table,"a"
.align 3
@@ -214,6 +221,7 @@ ENDPROC(c_backtrace)
.long 1003b, 1006b
.long 1004b, 1006b
.long 1005b, 1006b
+ .long prev_call, bad_lr
.popsection
.Lbad: .asciz "%sBacktrace aborted due to bad frame pointer <%p>\n"
--
2.28.0.163.g6104cc2f0b6-goog
Adjust the calls to `user_regset_copyout' and `user_regset_copyin' in
`riscv_fpr_get' and `riscv_fpr_set' respectively so as to use @start_pos
and @end_pos according to API documentation in <linux/regset.h>, that is
to point at the beginning and the end respectively of the data chunk to
be copied. Update @data accordingly, also for the first call, to make
it clear which structure member is accessed.
We currently have @start_pos fixed at 0 across all calls, which works as
a result of the implementation, in particular because we have no padding
between the FP general registers and the FP control and status register,
but appears not to have been the intent of the API and is not what other
ports do, requiring one to study the copy handlers to understand what is
going on here.
Signed-off-by: Maciej W. Rozycki <macro(a)wdc.com>
Fixes: b8c8a9590e4f ("RISC-V: Add FP register ptrace support for gdb.")
Cc: stable(a)vger.kernel.org # 4.20+
---
arch/riscv/kernel/ptrace.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
linux-riscv-ptrace-fcsr.diff
Index: linux-hv/arch/riscv/kernel/ptrace.c
===================================================================
--- linux-hv.orig/arch/riscv/kernel/ptrace.c
+++ linux-hv/arch/riscv/kernel/ptrace.c
@@ -61,10 +61,13 @@ static int riscv_fpr_get(struct task_str
int ret;
struct __riscv_d_ext_state *fstate = &target->thread.fstate;
- ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, fstate, 0,
+ ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, &fstate->f, 0,
offsetof(struct __riscv_d_ext_state, fcsr));
if (!ret) {
- ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, fstate, 0,
+ ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
+ &fstate->fcsr,
+ offsetof(struct __riscv_d_ext_state,
+ fcsr),
offsetof(struct __riscv_d_ext_state, fcsr) +
sizeof(fstate->fcsr));
}
@@ -80,10 +83,13 @@ static int riscv_fpr_set(struct task_str
int ret;
struct __riscv_d_ext_state *fstate = &target->thread.fstate;
- ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, fstate, 0,
+ ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &fstate->f, 0,
offsetof(struct __riscv_d_ext_state, fcsr));
if (!ret) {
- ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, fstate, 0,
+ ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
+ &fstate->fcsr,
+ offsetof(struct __riscv_d_ext_state,
+ fcsr),
offsetof(struct __riscv_d_ext_state, fcsr) +
sizeof(fstate->fcsr));
}
The v4.4 stable kernel lacks this bugfix:
commit 327868212381 ("make skb_copy_datagram_msg() et.al. preserve ->msg_iter on error").
As a result, the v4.4 kernel can deliver corrupt data to the application
when a corrupt UDP packet is closely followed by a valid UDP packet: the
same invocation of the recvmsg() syscall can deliver the corrupt packet's
UDP payload to the application with the UDP payload length and the
"from IP/Port" of the valid packet.
Details:
For a UDP packet longer than 76 bytes (see the v5.8-rc6 kernel's
include/linux/skbuff.h:3951), Linux delays the UDP checksum verification
until the application invokes the syscall recvmsg().
In the recvmsg() syscall handler, while Linux is copying the UDP payload
to the application's memory, it calculates the UDP checksum. If the
calculated checksum doesn't match the received checksum, Linux drops the
corrupt UDP packet, and then starts to process the next packet (if any),
and if the next packet is valid (i.e. the checksum is correct), Linux
will copy the valid UDP packet's payload to the application's receiver
buffer.
The bug is: before Linux starts to copy the valid UDP packet, the data
structure used to track how many more bytes should be copied to the
application memory is not reset to what it was when the application just
entered the kernel by the syscall! Consequently, only a small portion or
none of the valid packet's payload is copied to the application's
receive buffer, and later when the application exits from the kernel,
actually most of the application's receive buffer contains the payload
of the corrupt packet while recvmsg() returns the length of the UDP
payload of the valid packet.
For the mainline kernel, the bug was fixed in commit 327868212381,
but unluckily the bugfix is only backported to v4.9+. It turns out
backporting 327868212381 to v4.4 means that some supporting patches
must be backported first, so the overall changes seem too big, so the
alternative is performs the csum validation earlier and drops the
corrupt packets earlier.
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Signed-off-by: Dexuan Cui <decui(a)microsoft.com>
---
net/ipv4/udp.c | 3 +--
net/ipv6/udp.c | 6 ++----
2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index bb30699..49ab587 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1589,8 +1589,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
}
}
- if (rcu_access_pointer(sk->sk_filter) &&
- udp_lib_checksum_complete(skb))
+ if (udp_lib_checksum_complete(skb))
goto csum_error;
if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 73f1112..2d6703d 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -686,10 +686,8 @@ int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
}
}
- if (rcu_access_pointer(sk->sk_filter)) {
- if (udp_lib_checksum_complete(skb))
- goto csum_error;
- }
+ if (udp_lib_checksum_complete(skb))
+ goto csum_error;
if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {
UDP6_INC_STATS_BH(sock_net(sk),
--
1.8.3.1
If the link register was zeroed out, do not attempt to use it for
address calculations for which there are currently no fixup handlers,
which can lead to a panic during unwind. Since panicking triggers
another unwind, this can lead to an infinite loop. If this occurs
during start_kernel(), this can prevent a kernel from booting.
commit 59b6359dd92d ("ARM: 8702/1: head-common.S: Clear lr before jumping to start_kernel()")
intentionally zeros out the link register in __mmap_switched which tail
calls into start kernel. Test for this condition so that we can stop
unwinding when initiated within start_kernel() correctly.
Cc: stable(a)vger.kernel.org
Fixes: commit 6dc5fd93b2f1 ("ARM: 8900/1: UNWINDER_FRAME_POINTER implementation for Clang")
Reported-by: Miles Chen <miles.chen(a)mediatek.com>
Signed-off-by: Nick Desaulniers <ndesaulniers(a)google.com>
---
arch/arm/lib/backtrace-clang.S | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm/lib/backtrace-clang.S b/arch/arm/lib/backtrace-clang.S
index 6174c45f53a5..5388ac664c12 100644
--- a/arch/arm/lib/backtrace-clang.S
+++ b/arch/arm/lib/backtrace-clang.S
@@ -144,6 +144,8 @@ for_each_frame: tst frame, mask @ Check for address exceptions
*/
1003: ldr sv_lr, [sv_fp, #4] @ get saved lr from next frame
+ tst sv_lr, #0 @ If there's no previous lr,
+ beq finished_setup @ we're done.
ldr r0, [sv_lr, #-4] @ get call instruction
ldr r3, .Lopcode+4
and r2, r3, r0 @ is this a bl call
--
2.28.0.163.g6104cc2f0b6-goog