From: He Ying <heying24(a)huawei.com>
[ Upstream commit a1b29ba2f2c171b9bea73be993bfdf0a62d37d15 ]
The following KASAN warning was reported in our kernel.
BUG: KASAN: stack-out-of-bounds in get_wchan+0x188/0x250
Read of size 4 at addr d216f958 by task ps/14437
CPU: 3 PID: 14437 Comm: ps Tainted: G O 5.10.0 #1
Call Trace:
[daa63858] [c0654348] dump_stack+0x9c/0xe4 (unreliable)
[daa63888] [c035cf0c] print_address_description.constprop.3+0x8c/0x570
[daa63908] [c035d6bc] kasan_report+0x1ac/0x218
[daa63948] [c00496e8] get_wchan+0x188/0x250
[daa63978] [c0461ec8] do_task_stat+0xce8/0xe60
[daa63b98] [c0455ac8] proc_single_show+0x98/0x170
[daa63bc8] [c03cab8c] seq_read_iter+0x1ec/0x900
[daa63c38] [c03cb47c] seq_read+0x1dc/0x290
[daa63d68] [c037fc94] vfs_read+0x164/0x510
[daa63ea8] [c03808e4] ksys_read+0x144/0x1d0
[daa63f38] [c005b1dc] ret_from_syscall+0x0/0x38
--- interrupt: c00 at 0x8fa8f4
LR = 0x8fa8cc
The buggy address belongs to the page:
page:98ebcdd2 refcount:0 mapcount:0 mapping:00000000 index:0x2 pfn:0x1216f
flags: 0x0()
raw: 00000000 00000000 01010122 00000000 00000002 00000000 ffffffff 00000000
raw: 00000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
d216f800: 00 00 00 00 00 f1 f1 f1 f1 00 00 00 00 00 00 00
d216f880: f2 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>d216f900: 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00
^
d216f980: f2 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00 00
d216fa00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
After looking into this issue, I find the buggy address belongs
to the task stack region. It seems KASAN has something wrong.
I look into the code of __get_wchan in x86 architecture and
find the same issue has been resolved by the commit
f7d27c35ddff ("x86/mm, kasan: Silence KASAN warnings in get_wchan()").
The solution could be applied to powerpc architecture too.
As Andrey Ryabinin said, get_wchan() is racy by design, it may
access volatile stack of running task, thus it may access
redzone in a stack frame and cause KASAN to warn about this.
Use READ_ONCE_NOCHECK() to silence these warnings.
Reported-by: Wanming Hu <huwanming(a)huaweil.com>
Signed-off-by: He Ying <heying24(a)huawei.com>
Signed-off-by: Chen Jingwen <chenjingwen6(a)huawei.com>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220121014418.155675-1-heying24@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/powerpc/kernel/process.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 02b69a68139c..56c33285b1df 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2017,12 +2017,12 @@ unsigned long get_wchan(struct task_struct *p)
return 0;
do {
- sp = *(unsigned long *)sp;
+ sp = READ_ONCE_NOCHECK(*(unsigned long *)sp);
if (!validate_sp(sp, p, STACK_FRAME_OVERHEAD) ||
p->state == TASK_RUNNING)
return 0;
if (count > 0) {
- ip = ((unsigned long *)sp)[STACK_FRAME_LR_SAVE];
+ ip = READ_ONCE_NOCHECK(((unsigned long *)sp)[STACK_FRAME_LR_SAVE]);
if (!in_sched_functions(ip))
return ip;
}
--
2.35.1
From: He Ying <heying24(a)huawei.com>
[ Upstream commit a1b29ba2f2c171b9bea73be993bfdf0a62d37d15 ]
The following KASAN warning was reported in our kernel.
BUG: KASAN: stack-out-of-bounds in get_wchan+0x188/0x250
Read of size 4 at addr d216f958 by task ps/14437
CPU: 3 PID: 14437 Comm: ps Tainted: G O 5.10.0 #1
Call Trace:
[daa63858] [c0654348] dump_stack+0x9c/0xe4 (unreliable)
[daa63888] [c035cf0c] print_address_description.constprop.3+0x8c/0x570
[daa63908] [c035d6bc] kasan_report+0x1ac/0x218
[daa63948] [c00496e8] get_wchan+0x188/0x250
[daa63978] [c0461ec8] do_task_stat+0xce8/0xe60
[daa63b98] [c0455ac8] proc_single_show+0x98/0x170
[daa63bc8] [c03cab8c] seq_read_iter+0x1ec/0x900
[daa63c38] [c03cb47c] seq_read+0x1dc/0x290
[daa63d68] [c037fc94] vfs_read+0x164/0x510
[daa63ea8] [c03808e4] ksys_read+0x144/0x1d0
[daa63f38] [c005b1dc] ret_from_syscall+0x0/0x38
--- interrupt: c00 at 0x8fa8f4
LR = 0x8fa8cc
The buggy address belongs to the page:
page:98ebcdd2 refcount:0 mapcount:0 mapping:00000000 index:0x2 pfn:0x1216f
flags: 0x0()
raw: 00000000 00000000 01010122 00000000 00000002 00000000 ffffffff 00000000
raw: 00000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
d216f800: 00 00 00 00 00 f1 f1 f1 f1 00 00 00 00 00 00 00
d216f880: f2 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>d216f900: 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00
^
d216f980: f2 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00 00
d216fa00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
After looking into this issue, I find the buggy address belongs
to the task stack region. It seems KASAN has something wrong.
I look into the code of __get_wchan in x86 architecture and
find the same issue has been resolved by the commit
f7d27c35ddff ("x86/mm, kasan: Silence KASAN warnings in get_wchan()").
The solution could be applied to powerpc architecture too.
As Andrey Ryabinin said, get_wchan() is racy by design, it may
access volatile stack of running task, thus it may access
redzone in a stack frame and cause KASAN to warn about this.
Use READ_ONCE_NOCHECK() to silence these warnings.
Reported-by: Wanming Hu <huwanming(a)huaweil.com>
Signed-off-by: He Ying <heying24(a)huawei.com>
Signed-off-by: Chen Jingwen <chenjingwen6(a)huawei.com>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220121014418.155675-1-heying24@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/powerpc/kernel/process.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index c94bba9142e7..832663f21422 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2001,12 +2001,12 @@ static unsigned long __get_wchan(struct task_struct *p)
return 0;
do {
- sp = *(unsigned long *)sp;
+ sp = READ_ONCE_NOCHECK(*(unsigned long *)sp);
if (!validate_sp(sp, p, STACK_FRAME_OVERHEAD) ||
p->state == TASK_RUNNING)
return 0;
if (count > 0) {
- ip = ((unsigned long *)sp)[STACK_FRAME_LR_SAVE];
+ ip = READ_ONCE_NOCHECK(((unsigned long *)sp)[STACK_FRAME_LR_SAVE]);
if (!in_sched_functions(ip))
return ip;
}
--
2.35.1
From: He Ying <heying24(a)huawei.com>
[ Upstream commit a1b29ba2f2c171b9bea73be993bfdf0a62d37d15 ]
The following KASAN warning was reported in our kernel.
BUG: KASAN: stack-out-of-bounds in get_wchan+0x188/0x250
Read of size 4 at addr d216f958 by task ps/14437
CPU: 3 PID: 14437 Comm: ps Tainted: G O 5.10.0 #1
Call Trace:
[daa63858] [c0654348] dump_stack+0x9c/0xe4 (unreliable)
[daa63888] [c035cf0c] print_address_description.constprop.3+0x8c/0x570
[daa63908] [c035d6bc] kasan_report+0x1ac/0x218
[daa63948] [c00496e8] get_wchan+0x188/0x250
[daa63978] [c0461ec8] do_task_stat+0xce8/0xe60
[daa63b98] [c0455ac8] proc_single_show+0x98/0x170
[daa63bc8] [c03cab8c] seq_read_iter+0x1ec/0x900
[daa63c38] [c03cb47c] seq_read+0x1dc/0x290
[daa63d68] [c037fc94] vfs_read+0x164/0x510
[daa63ea8] [c03808e4] ksys_read+0x144/0x1d0
[daa63f38] [c005b1dc] ret_from_syscall+0x0/0x38
--- interrupt: c00 at 0x8fa8f4
LR = 0x8fa8cc
The buggy address belongs to the page:
page:98ebcdd2 refcount:0 mapcount:0 mapping:00000000 index:0x2 pfn:0x1216f
flags: 0x0()
raw: 00000000 00000000 01010122 00000000 00000002 00000000 ffffffff 00000000
raw: 00000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
d216f800: 00 00 00 00 00 f1 f1 f1 f1 00 00 00 00 00 00 00
d216f880: f2 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>d216f900: 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00
^
d216f980: f2 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00 00
d216fa00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
After looking into this issue, I find the buggy address belongs
to the task stack region. It seems KASAN has something wrong.
I look into the code of __get_wchan in x86 architecture and
find the same issue has been resolved by the commit
f7d27c35ddff ("x86/mm, kasan: Silence KASAN warnings in get_wchan()").
The solution could be applied to powerpc architecture too.
As Andrey Ryabinin said, get_wchan() is racy by design, it may
access volatile stack of running task, thus it may access
redzone in a stack frame and cause KASAN to warn about this.
Use READ_ONCE_NOCHECK() to silence these warnings.
Reported-by: Wanming Hu <huwanming(a)huaweil.com>
Signed-off-by: He Ying <heying24(a)huawei.com>
Signed-off-by: Chen Jingwen <chenjingwen6(a)huawei.com>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220121014418.155675-1-heying24@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/powerpc/kernel/process.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 3064694afea1..cfb8fd76afb4 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2108,12 +2108,12 @@ static unsigned long __get_wchan(struct task_struct *p)
return 0;
do {
- sp = *(unsigned long *)sp;
+ sp = READ_ONCE_NOCHECK(*(unsigned long *)sp);
if (!validate_sp(sp, p, STACK_FRAME_OVERHEAD) ||
p->state == TASK_RUNNING)
return 0;
if (count > 0) {
- ip = ((unsigned long *)sp)[STACK_FRAME_LR_SAVE];
+ ip = READ_ONCE_NOCHECK(((unsigned long *)sp)[STACK_FRAME_LR_SAVE]);
if (!in_sched_functions(ip))
return ip;
}
--
2.35.1
From: He Ying <heying24(a)huawei.com>
[ Upstream commit a1b29ba2f2c171b9bea73be993bfdf0a62d37d15 ]
The following KASAN warning was reported in our kernel.
BUG: KASAN: stack-out-of-bounds in get_wchan+0x188/0x250
Read of size 4 at addr d216f958 by task ps/14437
CPU: 3 PID: 14437 Comm: ps Tainted: G O 5.10.0 #1
Call Trace:
[daa63858] [c0654348] dump_stack+0x9c/0xe4 (unreliable)
[daa63888] [c035cf0c] print_address_description.constprop.3+0x8c/0x570
[daa63908] [c035d6bc] kasan_report+0x1ac/0x218
[daa63948] [c00496e8] get_wchan+0x188/0x250
[daa63978] [c0461ec8] do_task_stat+0xce8/0xe60
[daa63b98] [c0455ac8] proc_single_show+0x98/0x170
[daa63bc8] [c03cab8c] seq_read_iter+0x1ec/0x900
[daa63c38] [c03cb47c] seq_read+0x1dc/0x290
[daa63d68] [c037fc94] vfs_read+0x164/0x510
[daa63ea8] [c03808e4] ksys_read+0x144/0x1d0
[daa63f38] [c005b1dc] ret_from_syscall+0x0/0x38
--- interrupt: c00 at 0x8fa8f4
LR = 0x8fa8cc
The buggy address belongs to the page:
page:98ebcdd2 refcount:0 mapcount:0 mapping:00000000 index:0x2 pfn:0x1216f
flags: 0x0()
raw: 00000000 00000000 01010122 00000000 00000002 00000000 ffffffff 00000000
raw: 00000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
d216f800: 00 00 00 00 00 f1 f1 f1 f1 00 00 00 00 00 00 00
d216f880: f2 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>d216f900: 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00
^
d216f980: f2 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00 00
d216fa00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
After looking into this issue, I find the buggy address belongs
to the task stack region. It seems KASAN has something wrong.
I look into the code of __get_wchan in x86 architecture and
find the same issue has been resolved by the commit
f7d27c35ddff ("x86/mm, kasan: Silence KASAN warnings in get_wchan()").
The solution could be applied to powerpc architecture too.
As Andrey Ryabinin said, get_wchan() is racy by design, it may
access volatile stack of running task, thus it may access
redzone in a stack frame and cause KASAN to warn about this.
Use READ_ONCE_NOCHECK() to silence these warnings.
Reported-by: Wanming Hu <huwanming(a)huaweil.com>
Signed-off-by: He Ying <heying24(a)huawei.com>
Signed-off-by: Chen Jingwen <chenjingwen6(a)huawei.com>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220121014418.155675-1-heying24@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/powerpc/kernel/process.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 50436b52c213..39a0a13a3a27 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2124,12 +2124,12 @@ static unsigned long __get_wchan(struct task_struct *p)
return 0;
do {
- sp = *(unsigned long *)sp;
+ sp = READ_ONCE_NOCHECK(*(unsigned long *)sp);
if (!validate_sp(sp, p, STACK_FRAME_OVERHEAD) ||
task_is_running(p))
return 0;
if (count > 0) {
- ip = ((unsigned long *)sp)[STACK_FRAME_LR_SAVE];
+ ip = READ_ONCE_NOCHECK(((unsigned long *)sp)[STACK_FRAME_LR_SAVE]);
if (!in_sched_functions(ip))
return ip;
}
--
2.35.1
From: He Ying <heying24(a)huawei.com>
[ Upstream commit a1b29ba2f2c171b9bea73be993bfdf0a62d37d15 ]
The following KASAN warning was reported in our kernel.
BUG: KASAN: stack-out-of-bounds in get_wchan+0x188/0x250
Read of size 4 at addr d216f958 by task ps/14437
CPU: 3 PID: 14437 Comm: ps Tainted: G O 5.10.0 #1
Call Trace:
[daa63858] [c0654348] dump_stack+0x9c/0xe4 (unreliable)
[daa63888] [c035cf0c] print_address_description.constprop.3+0x8c/0x570
[daa63908] [c035d6bc] kasan_report+0x1ac/0x218
[daa63948] [c00496e8] get_wchan+0x188/0x250
[daa63978] [c0461ec8] do_task_stat+0xce8/0xe60
[daa63b98] [c0455ac8] proc_single_show+0x98/0x170
[daa63bc8] [c03cab8c] seq_read_iter+0x1ec/0x900
[daa63c38] [c03cb47c] seq_read+0x1dc/0x290
[daa63d68] [c037fc94] vfs_read+0x164/0x510
[daa63ea8] [c03808e4] ksys_read+0x144/0x1d0
[daa63f38] [c005b1dc] ret_from_syscall+0x0/0x38
--- interrupt: c00 at 0x8fa8f4
LR = 0x8fa8cc
The buggy address belongs to the page:
page:98ebcdd2 refcount:0 mapcount:0 mapping:00000000 index:0x2 pfn:0x1216f
flags: 0x0()
raw: 00000000 00000000 01010122 00000000 00000002 00000000 ffffffff 00000000
raw: 00000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
d216f800: 00 00 00 00 00 f1 f1 f1 f1 00 00 00 00 00 00 00
d216f880: f2 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>d216f900: 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00
^
d216f980: f2 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00 00
d216fa00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
After looking into this issue, I find the buggy address belongs
to the task stack region. It seems KASAN has something wrong.
I look into the code of __get_wchan in x86 architecture and
find the same issue has been resolved by the commit
f7d27c35ddff ("x86/mm, kasan: Silence KASAN warnings in get_wchan()").
The solution could be applied to powerpc architecture too.
As Andrey Ryabinin said, get_wchan() is racy by design, it may
access volatile stack of running task, thus it may access
redzone in a stack frame and cause KASAN to warn about this.
Use READ_ONCE_NOCHECK() to silence these warnings.
Reported-by: Wanming Hu <huwanming(a)huaweil.com>
Signed-off-by: He Ying <heying24(a)huawei.com>
Signed-off-by: Chen Jingwen <chenjingwen6(a)huawei.com>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220121014418.155675-1-heying24@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/powerpc/kernel/process.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 984813a4d5dc..a75d20f23dac 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2160,12 +2160,12 @@ static unsigned long ___get_wchan(struct task_struct *p)
return 0;
do {
- sp = *(unsigned long *)sp;
+ sp = READ_ONCE_NOCHECK(*(unsigned long *)sp);
if (!validate_sp(sp, p, STACK_FRAME_OVERHEAD) ||
task_is_running(p))
return 0;
if (count > 0) {
- ip = ((unsigned long *)sp)[STACK_FRAME_LR_SAVE];
+ ip = READ_ONCE_NOCHECK(((unsigned long *)sp)[STACK_FRAME_LR_SAVE]);
if (!in_sched_functions(ip))
return ip;
}
--
2.35.1
commit 8e1278444446fc97778a5e5c99bca1ce0bbc5ec9 upstream.
The ptrace PEEKUSR/POKEUSR (aka PEEKUSER/POKEUSER) API allows a process
to read/write registers of another process.
To get/set a register, the API takes an index into an imaginary address
space called the "USER area", where the registers of the process are
laid out in some fashion.
The kernel then maps that index to a particular register in its own data
structures and gets/sets the value.
The API only allows a single machine-word to be read/written at a time.
So 4 bytes on 32-bit kernels and 8 bytes on 64-bit kernels.
The way floating point registers (FPRs) are addressed is somewhat
complicated, because double precision float values are 64-bit even on
32-bit CPUs. That means on 32-bit kernels each FPR occupies two
word-sized locations in the USER area. On 64-bit kernels each FPR
occupies one word-sized location in the USER area.
Internally the kernel stores the FPRs in an array of u64s, or if VSX is
enabled, an array of pairs of u64s where one half of each pair stores
the FPR. Which half of the pair stores the FPR depends on the kernel's
endianness.
To handle the different layouts of the FPRs depending on VSX/no-VSX and
big/little endian, the TS_FPR() macro was introduced.
Unfortunately the TS_FPR() macro does not take into account the fact
that the addressing of each FPR differs between 32-bit and 64-bit
kernels. It just takes the index into the "USER area" passed from
userspace and indexes into the fp_state.fpr array.
On 32-bit there are 64 indexes that address FPRs, but only 32 entries in
the fp_state.fpr array, meaning the user can read/write 256 bytes past
the end of the array. Because the fp_state sits in the middle of the
thread_struct there are various fields than can be overwritten,
including some pointers. As such it may be exploitable.
It has also been observed to cause systems to hang or otherwise
misbehave when using gdbserver, and is probably the root cause of this
report which could not be easily reproduced:
https://lore.kernel.org/linuxppc-dev/dc38afe9-6b78-f3f5-666b-986939e40fc6@k…
Rather than trying to make the TS_FPR() macro even more complicated to
fix the bug, or add more macros, instead add a special-case for 32-bit
kernels. This is more obvious and hopefully avoids a similar bug
happening again in future.
Note that because 32-bit kernels never have VSX enabled the code doesn't
need to consider TS_FPRWIDTH/OFFSET at all.
Fixes: 87fec0514f61 ("powerpc: PTRACE_PEEKUSR/PTRACE_POKEUSER of FPR registers in little endian builds")
Cc: stable(a)vger.kernel.org # v3.13+
Reported-by: Ariel Miculas <ariel.miculas(a)belden.com>
Tested-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220609133245.573565-1-mpe@ellerman.id.au
---
arch/powerpc/kernel/ptrace.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 4f2829634d79..88947f5fd778 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -2938,8 +2938,13 @@ long arch_ptrace(struct task_struct *child, long request,
flush_fp_to_thread(child);
if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&tmp, &child->thread.TS_FPR(fpidx),
- sizeof(long));
+ if (IS_ENABLED(CONFIG_PPC32)) {
+ // On 32-bit the index we are passed refers to 32-bit words
+ tmp = ((u32 *)child->thread.fp_state.fpr)[fpidx];
+ } else {
+ memcpy(&tmp, &child->thread.TS_FPR(fpidx),
+ sizeof(long));
+ }
else
tmp = child->thread.fp_state.fpscr;
}
@@ -2971,8 +2976,13 @@ long arch_ptrace(struct task_struct *child, long request,
flush_fp_to_thread(child);
if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&child->thread.TS_FPR(fpidx), &data,
- sizeof(long));
+ if (IS_ENABLED(CONFIG_PPC32)) {
+ // On 32-bit the index we are passed refers to 32-bit words
+ ((u32 *)child->thread.fp_state.fpr)[fpidx] = data;
+ } else {
+ memcpy(&child->thread.TS_FPR(fpidx), &data,
+ sizeof(long));
+ }
else
child->thread.fp_state.fpscr = data;
ret = 0;
--
2.35.3
Dearest beloved in the Lord,
I am Ms. Agnes George, a 75 year old British woman. I was born an
orphan and GOD blessed me abundantly with riches but no children nor
husband which makes me an unhappy woman. Now I am affected with cancer
of the lung and breast with a partial stroke which has affected my
speech. I can no longer talk well and half of my body is paralyzed, I
sent this email to you with the help of my private female nurse.
My condition is really deteriorating day by day and it is really
giving me lots to think about. This has prompted my decision to
donate all I have for charity; I have made numerous donations all over
the world. After going through your profile, I decided to make my last
donation of Ten Million Five Hundred Thousand United Kingdom Pounds
(UK£10.500, 000, 00) to you as my investment manager. I want you to
build an Orphanage home with my name ( Agnes George ) in your
country.
If you are willing and able to do this task for the sake of humanity
then send me below information for more details to receive the funds.
1. Name...................................................
2. Phone number...............................
3. Address.............................................
4. Country of Origin and residence
Ms. Agnes George.
hallo Greg
5.18.4-rc1
compiles (not without warnings), boots and runs here on x86_64
(Intel i5-11400, Fedora 36)
Tested-by: Ronald Warsow <rwarsow(a)gmx.de
Thanks
Ronald
From: Phil Elwell <phil(a)raspberrypi.org>
The dmas property is used to hold the dmaengine channel used for audio
output.
Older device trees were missing that property, so if it's not there we
disable the audio output entirely.
However, some overlays have set an empty value to that property, mostly
to workaround the fact that overlays cannot remove a property. Let's add
a test for that case and if it's empty, let's disable it as well.
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Phil Elwell <phil(a)raspberrypi.org>
Signed-off-by: Maxime Ripard <maxime(a)cerno.tech>
---
drivers/gpu/drm/vc4/vc4_hdmi.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/vc4/vc4_hdmi.c b/drivers/gpu/drm/vc4/vc4_hdmi.c
index 6aadb65eb640..c8571e17afa8 100644
--- a/drivers/gpu/drm/vc4/vc4_hdmi.c
+++ b/drivers/gpu/drm/vc4/vc4_hdmi.c
@@ -2034,12 +2034,12 @@ static int vc4_hdmi_audio_init(struct vc4_hdmi *vc4_hdmi)
struct device *dev = &vc4_hdmi->pdev->dev;
struct platform_device *codec_pdev;
const __be32 *addr;
- int index;
+ int index, len;
int ret;
- if (!of_find_property(dev->of_node, "dmas", NULL)) {
+ if (!of_find_property(dev->of_node, "dmas", &len) || !len) {
dev_warn(dev,
- "'dmas' DT property is missing, no HDMI audio\n");
+ "'dmas' DT property is missing or empty, no HDMI audio\n");
return 0;
}
--
2.36.1
From: Dave Stevenson <dave.stevenson(a)raspberrypi.com>
vc4_drv isn't necessarily under the /soc node in DT as it is a
virtual device, but it is the one that does the allocations.
The DMA addresses are consumed by primarily the HVS or V3D, and
those require VideoCore cache alias address mapping, and so will be
under /soc.
During probe find the a suitable device node for HVS or V3D,
and adopt the DMA configuration of that node.
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Dave Stevenson <dave.stevenson(a)raspberrypi.com>
Signed-off-by: Maxime Ripard <maxime(a)cerno.tech>
---
drivers/gpu/drm/vc4/vc4_drv.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/gpu/drm/vc4/vc4_drv.c b/drivers/gpu/drm/vc4/vc4_drv.c
index 162bc18e7497..14a7d529144d 100644
--- a/drivers/gpu/drm/vc4/vc4_drv.c
+++ b/drivers/gpu/drm/vc4/vc4_drv.c
@@ -209,6 +209,15 @@ static void vc4_match_add_drivers(struct device *dev,
}
}
+const struct of_device_id vc4_dma_range_matches[] = {
+ { .compatible = "brcm,bcm2711-hvs" },
+ { .compatible = "brcm,bcm2835-hvs" },
+ { .compatible = "brcm,bcm2835-v3d" },
+ { .compatible = "brcm,cygnus-v3d" },
+ { .compatible = "brcm,vc4-v3d" },
+ {}
+};
+
static int vc4_drm_bind(struct device *dev)
{
struct platform_device *pdev = to_platform_device(dev);
@@ -227,6 +236,16 @@ static int vc4_drm_bind(struct device *dev)
vc4_drm_driver.driver_features &= ~DRIVER_RENDER;
of_node_put(node);
+ node = of_find_matching_node_and_match(NULL, vc4_dma_range_matches,
+ NULL);
+ if (node) {
+ ret = of_dma_configure(dev, node, true);
+ of_node_put(node);
+
+ if (ret)
+ return ret;
+ }
+
vc4 = devm_drm_dev_alloc(dev, &vc4_drm_driver, struct vc4_dev, base);
if (IS_ERR(vc4))
return PTR_ERR(vc4);
--
2.36.1
Hi kernel stable team,
I want to report a regression in v4.19.245 / v4.19.246 in case someone else
also hits this: strongswan 4.6 errors out with an assertion:
Jun 13 08:55:02 mis1 pluto[4096]: "C1"[1] 10.2.0.1:10954 #2: ASSERTION FAILED at ipsec_doi.c:2852: case 3 unexpected
(the source line number is not relevant due to extra patches)
-> Our automated distro testsuite had IPSec related VPN test failures.
A fix for the issue is queued for v4.19.247 already:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tre…
'Revert "net: af_key: add check for pfkey_broadcast in function pfkey_process"'
From the commit message of the revert:
**********************************
"One visible effect is that racoon daemon fails to find encryption
algorithms like aes and refuses to start."
**********************************
For the older strongwan 4.6 "pluto" daemon the problem showed itself
at a later stage during phase 2 of an IPSec tunnel setup.
Thanks for the great work on the stable kernel!
Regressions are a rare thing these days.
Best regards,
Thomas Jarosch
When calling setattr_prepare() to determine the validity of the
attributes the ia_{g,u}id fields contain the value that will be written
to inode->i_{g,u}id. This is exactly the same for idmapped and
non-idmapped mounts and allows callers to pass in the values they want
to see written to inode->i_{g,u}id.
When group ownership is changed a caller whose fsuid owns the inode can
change the group of the inode to any group they are a member of. When
searching through the caller's groups we need to use the gid mapped
according to the idmapped mount otherwise we will fail to change
ownership for unprivileged users.
Consider a caller running with fsuid and fsgid 1000 using an idmapped
mount that maps id 65534 to 1000 and 65535 to 1001. Consequently, a file
owned by 65534:65535 in the filesystem will be owned by 1000:1001 in the
idmapped mount.
The caller now requests the gid of the file to be changed to 1000 going
through the idmapped mount. In the vfs we will immediately map the
requested gid to the value that will need to be written to inode->i_gid
and place it in attr->ia_gid. Since this idmapped mount maps 65534 to
1000 we place 65534 in attr->ia_gid.
When we check whether the caller is allowed to change group ownership we
first validate that their fsuid matches the inode's uid. The
inode->i_uid is 65534 which is mapped to uid 1000 in the idmapped mount.
Since the caller's fsuid is 1000 we pass the check.
We now check whether the caller is allowed to change inode->i_gid to the
requested gid by calling in_group_p(). This will compare the passed in
gid to the caller's fsgid and search the caller's additional groups.
Since we're dealing with an idmapped mount we need to pass in the gid
mapped according to the idmapped mount. This is akin to checking whether
a caller is privileged over the future group the inode is owned by. And
that needs to take the idmapped mount into account. Note, all helpers
are nops without idmapped mounts.
New regression test sent to xfstests.
Link: https://github.com/lxc/lxd/issues/10537
Fixes: 2f221d6f7b88 ("attr: handle idmapped mounts")
Cc: Seth Forshee <seth.forshee(a)digitalocean.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: stable(a)vger.kernel.org # 5.15+
CC: linux-fsdevel(a)vger.kernel.org
Signed-off-by: Christian Brauner (Microsoft) <brauner(a)kernel.org>
---
Hey,
Detected while working on additional tests and also reported by a user
and brain on my part. I plan to work on patches to disentangle the
codepaths here a bit and make them easier to understand going forward.
Passes xfstests including new tests and LTP test suite.
Thanks!
Christian
---
fs/attr.c | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)
diff --git a/fs/attr.c b/fs/attr.c
index 66899b6e9bd8..dbe996b0dedf 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -61,9 +61,15 @@ static bool chgrp_ok(struct user_namespace *mnt_userns,
const struct inode *inode, kgid_t gid)
{
kgid_t kgid = i_gid_into_mnt(mnt_userns, inode);
- if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode)) &&
- (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
- return true;
+ if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode))) {
+ kgid_t mapped_gid;
+
+ if (gid_eq(gid, inode->i_gid))
+ return true;
+ mapped_gid = mapped_kgid_fs(mnt_userns, i_user_ns(inode), gid);
+ if (in_group_p(mapped_gid))
+ return true;
+ }
if (capable_wrt_inode_uidgid(mnt_userns, inode, CAP_CHOWN))
return true;
if (gid_eq(kgid, INVALID_GID) &&
@@ -123,12 +129,20 @@ int setattr_prepare(struct user_namespace *mnt_userns, struct dentry *dentry,
/* Make sure a caller can chmod. */
if (ia_valid & ATTR_MODE) {
+ kgid_t mapped_gid;
+
if (!inode_owner_or_capable(mnt_userns, inode))
return -EPERM;
+
+ if (ia_valid & ATTR_GID)
+ mapped_gid = mapped_kgid_fs(mnt_userns,
+ i_user_ns(inode), attr->ia_gid);
+ else
+ mapped_gid = i_gid_into_mnt(mnt_userns, inode);
+
/* Also check the setgid bit! */
- if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
- i_gid_into_mnt(mnt_userns, inode)) &&
- !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
+ if (!in_group_p(mapped_gid) &&
+ !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
attr->ia_mode &= ~S_ISGID;
}
base-commit: f2906aa863381afb0015a9eb7fefad885d4e5a56
--
2.34.1
Commit 787af64d05cd ("mm: page_alloc: validate buddy before check its migratetype.")
added buddy check code. But unfortunately, this fix isn't backported to
linux-5.17.y and the former stable branches. The reason is it added wrong
fixes message:
Fixes: 1dd214b8f21c ("mm: page_alloc: avoid merging non-fallbackable
pageblocks with others")
Actually, this issue is involved by commit:
commit d9dddbf55667 ("mm/page_alloc: prevent merging between isolated and other pageblocks")
For RISC-V arch, the first 2M is reserved for sbi, so the start PFN is 512,
but it got buddy PFN 0 for PFN 0x2000:
0 = 0x2000 ^ (1 << 12)
With the illegal buddy PFN 0, it got an illegal buddy page, which caused
crash in __get_pfnblock_flags_mask().
With the patch, it can avoid the calling of get_pageblock_migratetype() if
it isn't buddy page.
Fixes: d9dddbf55667 ("mm/page_alloc: prevent merging between isolated and other pageblocks")
Cc: stable(a)vger.kernel.org
Reported-by: zjb194813(a)alibaba-inc.com
Reported-by: tianhu.hh(a)alibaba-inc.com
Signed-off-by: Xianting Tian <xianting.tian(a)linux.alibaba.com>
---
mm/page_alloc.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b1caa1c6c887..5b423caa68fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1129,6 +1129,9 @@ static inline void __free_one_page(struct page *page,
buddy_pfn = __find_buddy_pfn(pfn, order);
buddy = page + (buddy_pfn - pfn);
+
+ if (!page_is_buddy(page, buddy, order))
+ goto done_merging;
buddy_mt = get_pageblock_migratetype(buddy);
if (migratetype != buddy_mt
--
2.17.1
In building 5.15.46 & 5.10.121 with CRYPTO_LIB_CURVE25519=m I get the
following. My workaround is to leave it as CRYPTO_LIB_CURVE25519=n
for now.
CONFIG_OR1K_1200=y
CONFIG_OPENRISC_BUILTIN_DTB="or1ksim"
sed 's/\.ko$/\.o/' modules.order | scripts/mod/modpost -o
modules-only.symvers -i vmlinux.symvers -T - ERROR: modpost:
"__crypto_memneq" [lib/crypto/libcurve25519.ko] undefined! make[1]:
*** [scripts/Makefile.modpost:134: modules-only.symvers] Error 1
make[1]: *** Deleting file 'modules-only.symvers' make: ***
[Makefile:1783: modules] Error 2
The patch below does not apply to the 5.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a2a513be7139b279f1b5b2cee59c6c4950c34346 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Date: Thu, 2 Jun 2022 23:16:57 +0900
Subject: [PATCH] zonefs: fix handling of explicit_open option on mount
Ignoring the explicit_open mount option on mount for devices that do not
have a limit on the number of open zones must be done after the mount
options are parsed and set in s_mount_opts. Move the check to ignore
the explicit_open option after the call to zonefs_parse_options() in
zonefs_fill_super().
Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index bcb21aea990a..ecce84909ca1 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -1760,12 +1760,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
atomic_set(&sbi->s_wro_seq_files, 0);
sbi->s_max_wro_seq_files = bdev_max_open_zones(sb->s_bdev);
- if (!sbi->s_max_wro_seq_files &&
- sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) {
- zonefs_info(sb, "No open zones limit. Ignoring explicit_open mount option\n");
- sbi->s_mount_opts &= ~ZONEFS_MNTOPT_EXPLICIT_OPEN;
- }
-
atomic_set(&sbi->s_active_seq_files, 0);
sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev);
@@ -1790,6 +1784,12 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
zonefs_info(sb, "Mounting %u zones",
blkdev_nr_zones(sb->s_bdev->bd_disk));
+ if (!sbi->s_max_wro_seq_files &&
+ sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) {
+ zonefs_info(sb, "No open zones limit. Ignoring explicit_open mount option\n");
+ sbi->s_mount_opts &= ~ZONEFS_MNTOPT_EXPLICIT_OPEN;
+ }
+
/* Create root directory inode */
ret = -ENOMEM;
inode = new_inode(sb);
hello,
Ilya reports bad TCP throughput when GSO packets hit an OVS rule that does
tc MTU policing. According to his observations [1], the problem is fixed
by upstream commit 4ddc844eb81d ("net/sched: act_police: more accurate MTU
policing"). Can we queue this commit for inclusion in stable trees?
thanks!
--
davide
[1] https://mail.openvswitch.org/pipermail/ovs-dev/2022-June/394802.html
commit 8e1278444446fc97778a5e5c99bca1ce0bbc5ec9 upstream.
The ptrace PEEKUSR/POKEUSR (aka PEEKUSER/POKEUSER) API allows a process
to read/write registers of another process.
To get/set a register, the API takes an index into an imaginary address
space called the "USER area", where the registers of the process are
laid out in some fashion.
The kernel then maps that index to a particular register in its own data
structures and gets/sets the value.
The API only allows a single machine-word to be read/written at a time.
So 4 bytes on 32-bit kernels and 8 bytes on 64-bit kernels.
The way floating point registers (FPRs) are addressed is somewhat
complicated, because double precision float values are 64-bit even on
32-bit CPUs. That means on 32-bit kernels each FPR occupies two
word-sized locations in the USER area. On 64-bit kernels each FPR
occupies one word-sized location in the USER area.
Internally the kernel stores the FPRs in an array of u64s, or if VSX is
enabled, an array of pairs of u64s where one half of each pair stores
the FPR. Which half of the pair stores the FPR depends on the kernel's
endianness.
To handle the different layouts of the FPRs depending on VSX/no-VSX and
big/little endian, the TS_FPR() macro was introduced.
Unfortunately the TS_FPR() macro does not take into account the fact
that the addressing of each FPR differs between 32-bit and 64-bit
kernels. It just takes the index into the "USER area" passed from
userspace and indexes into the fp_state.fpr array.
On 32-bit there are 64 indexes that address FPRs, but only 32 entries in
the fp_state.fpr array, meaning the user can read/write 256 bytes past
the end of the array. Because the fp_state sits in the middle of the
thread_struct there are various fields than can be overwritten,
including some pointers. As such it may be exploitable.
It has also been observed to cause systems to hang or otherwise
misbehave when using gdbserver, and is probably the root cause of this
report which could not be easily reproduced:
https://lore.kernel.org/linuxppc-dev/dc38afe9-6b78-f3f5-666b-986939e40fc6@k…
Rather than trying to make the TS_FPR() macro even more complicated to
fix the bug, or add more macros, instead add a special-case for 32-bit
kernels. This is more obvious and hopefully avoids a similar bug
happening again in future.
Note that because 32-bit kernels never have VSX enabled the code doesn't
need to consider TS_FPRWIDTH/OFFSET at all.
Fixes: 87fec0514f61 ("powerpc: PTRACE_PEEKUSR/PTRACE_POKEUSER of FPR registers in little endian builds")
Cc: stable(a)vger.kernel.org # v3.13+
Reported-by: Ariel Miculas <ariel.miculas(a)belden.com>
Tested-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220609133245.573565-1-mpe@ellerman.id.au
---
arch/powerpc/kernel/ptrace.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index bfc5f59d9f1b..ef5875f83692 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -2920,8 +2920,13 @@ long arch_ptrace(struct task_struct *child, long request,
flush_fp_to_thread(child);
if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&tmp, &child->thread.TS_FPR(fpidx),
- sizeof(long));
+ if (IS_ENABLED(CONFIG_PPC32)) {
+ // On 32-bit the index we are passed refers to 32-bit words
+ tmp = ((u32 *)child->thread.fp_state.fpr)[fpidx];
+ } else {
+ memcpy(&tmp, &child->thread.TS_FPR(fpidx),
+ sizeof(long));
+ }
else
tmp = child->thread.fp_state.fpscr;
}
@@ -2953,8 +2958,13 @@ long arch_ptrace(struct task_struct *child, long request,
flush_fp_to_thread(child);
if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&child->thread.TS_FPR(fpidx), &data,
- sizeof(long));
+ if (IS_ENABLED(CONFIG_PPC32)) {
+ // On 32-bit the index we are passed refers to 32-bit words
+ ((u32 *)child->thread.fp_state.fpr)[fpidx] = data;
+ } else {
+ memcpy(&child->thread.TS_FPR(fpidx), &data,
+ sizeof(long));
+ }
else
child->thread.fp_state.fpscr = data;
ret = 0;
--
2.35.3
Hi,
all stable release queues from v4.9.y up to v5.4.y have boot stall
problems. The culprit is the backport of commit d7ea0d9df2a6 ("net:
remove two BUG() from skb_checksum_help()"), specifically the following
code.
diff --git a/net/core/dev.c b/net/core/dev.c
index 47468fc5d0c9..d725ca4d4455 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2518,11 +2518,15 @@ int skb_checksum_help(struct sk_buff *skb)
...
- BUG_ON(offset >= skb_headlen(skb));
+ ret = -EINVAL;
^^^^^^^^^^^^^^
+ if (WARN_ON_ONCE(offset >= skb_headlen(skb)))
+ goto out;
+
While that works fine in the upstream kernel since ret is subsequently
always overwritten, that is not the case in older kernels. In those,
the function now always returns -EINVAL.
Guenter
This is a preparation patch for the S29GL064N buffer writes fix. There
is no functional change.
Link: https://lore.kernel.org/r/b687c259-6413-26c9-d4c9-b3afa69ea124@pengutronix.…
Fixes: dfeae1073583("mtd: cfi_cmdset_0002: Change write buffer to check correct value")
Signed-off-by: Tokunori Ikegami <ikegami.t(a)gmail.com>
Cc: stable(a)vger.kernel.org
Acked-by: Vignesh Raghavendra <vigneshr(a)ti.com>
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
Link: https://lore.kernel.org/linux-mtd/20220323170458.5608-2-ikegami.t@gmail.com
---
drivers/mtd/chips/cfi_cmdset_0002.c | 77 ++++++++++++-----------------
1 file changed, 32 insertions(+), 45 deletions(-)
diff --git a/drivers/mtd/chips/cfi_cmdset_0002.c b/drivers/mtd/chips/cfi_cmdset_0002.c
index 3c4819a05bf0..bcc2ce67c996 100644
--- a/drivers/mtd/chips/cfi_cmdset_0002.c
+++ b/drivers/mtd/chips/cfi_cmdset_0002.c
@@ -726,50 +726,34 @@ static struct mtd_info *cfi_amdstd_setup(struct mtd_info *mtd)
}
/*
- * Return true if the chip is ready.
+ * Return true if the chip is ready and has the correct value.
*
* Ready is one of: read mode, query mode, erase-suspend-read mode (in any
* non-suspended sector) and is indicated by no toggle bits toggling.
*
+ * Error are indicated by toggling bits or bits held with the wrong value,
+ * or with bits toggling.
+ *
* Note that anything more complicated than checking if no bits are toggling
* (including checking DQ5 for an error status) is tricky to get working
* correctly and is therefore not done (particularly with interleaved chips
* as each chip must be checked independently of the others).
*/
-static int __xipram chip_ready(struct map_info *map, unsigned long addr)
+static int __xipram chip_ready(struct map_info *map, unsigned long addr,
+ map_word *expected)
{
map_word d, t;
+ int ret;
d = map_read(map, addr);
t = map_read(map, addr);
- return map_word_equal(map, d, t);
-}
+ ret = map_word_equal(map, d, t);
-/*
- * Return true if the chip is ready and has the correct value.
- *
- * Ready is one of: read mode, query mode, erase-suspend-read mode (in any
- * non-suspended sector) and it is indicated by no bits toggling.
- *
- * Error are indicated by toggling bits or bits held with the wrong value,
- * or with bits toggling.
- *
- * Note that anything more complicated than checking if no bits are toggling
- * (including checking DQ5 for an error status) is tricky to get working
- * correctly and is therefore not done (particularly with interleaved chips
- * as each chip must be checked independently of the others).
- *
- */
-static int __xipram chip_good(struct map_info *map, unsigned long addr, map_word expected)
-{
- map_word oldd, curd;
-
- oldd = map_read(map, addr);
- curd = map_read(map, addr);
+ if (!ret || !expected)
+ return ret;
- return map_word_equal(map, oldd, curd) &&
- map_word_equal(map, curd, expected);
+ return map_word_equal(map, t, *expected);
}
static int get_chip(struct map_info *map, struct flchip *chip, unsigned long adr, int mode)
@@ -786,7 +770,7 @@ static int get_chip(struct map_info *map, struct flchip *chip, unsigned long adr
case FL_STATUS:
for (;;) {
- if (chip_ready(map, adr))
+ if (chip_ready(map, adr, NULL))
break;
if (time_after(jiffies, timeo)) {
@@ -824,7 +808,7 @@ static int get_chip(struct map_info *map, struct flchip *chip, unsigned long adr
chip->state = FL_ERASE_SUSPENDING;
chip->erase_suspended = 1;
for (;;) {
- if (chip_ready(map, adr))
+ if (chip_ready(map, adr, NULL))
break;
if (time_after(jiffies, timeo)) {
@@ -1357,7 +1341,7 @@ static int do_otp_lock(struct map_info *map, struct flchip *chip, loff_t adr,
/* wait for chip to become ready */
timeo = jiffies + msecs_to_jiffies(2);
for (;;) {
- if (chip_ready(map, adr))
+ if (chip_ready(map, adr, NULL))
break;
if (time_after(jiffies, timeo)) {
@@ -1624,10 +1608,11 @@ static int __xipram do_write_oneword(struct map_info *map, struct flchip *chip,
}
/*
- * We check "time_after" and "!chip_good" before checking
- * "chip_good" to avoid the failure due to scheduling.
+ * We check "time_after" and "!chip_ready" before checking
+ * "chip_ready" to avoid the failure due to scheduling.
*/
- if (time_after(jiffies, timeo) && !chip_good(map, adr, datum)) {
+ if (time_after(jiffies, timeo) &&
+ !chip_ready(map, adr, &datum)) {
xip_enable(map, chip, adr);
printk(KERN_WARNING "MTD %s(): software timeout\n", __func__);
xip_disable(map, chip, adr);
@@ -1635,7 +1620,7 @@ static int __xipram do_write_oneword(struct map_info *map, struct flchip *chip,
break;
}
- if (chip_good(map, adr, datum))
+ if (chip_ready(map, adr, &datum))
break;
/* Latency issues. Drop the lock, wait a while and retry */
@@ -1879,13 +1864,13 @@ static int __xipram do_write_buffer(struct map_info *map, struct flchip *chip,
}
/*
- * We check "time_after" and "!chip_good" before checking "chip_good" to avoid
- * the failure due to scheduling.
+ * We check "time_after" and "!chip_ready" before checking
+ * "chip_ready" to avoid the failure due to scheduling.
*/
- if (time_after(jiffies, timeo) && !chip_good(map, adr, datum))
+ if (time_after(jiffies, timeo) && !chip_ready(map, adr, &datum))
break;
- if (chip_good(map, adr, datum)) {
+ if (chip_ready(map, adr, &datum)) {
xip_enable(map, chip, adr);
goto op_done;
}
@@ -2019,7 +2004,7 @@ static int cfi_amdstd_panic_wait(struct map_info *map, struct flchip *chip,
* If the driver thinks the chip is idle, and no toggle bits
* are changing, then the chip is actually idle for sure.
*/
- if (chip->state == FL_READY && chip_ready(map, adr))
+ if (chip->state == FL_READY && chip_ready(map, adr, NULL))
return 0;
/*
@@ -2036,7 +2021,7 @@ static int cfi_amdstd_panic_wait(struct map_info *map, struct flchip *chip,
/* wait for the chip to become ready */
for (i = 0; i < jiffies_to_usecs(timeo); i++) {
- if (chip_ready(map, adr))
+ if (chip_ready(map, adr, NULL))
return 0;
udelay(1);
@@ -2100,13 +2085,13 @@ static int do_panic_write_oneword(struct map_info *map, struct flchip *chip,
map_write(map, datum, adr);
for (i = 0; i < jiffies_to_usecs(uWriteTimeout); i++) {
- if (chip_ready(map, adr))
+ if (chip_ready(map, adr, NULL))
break;
udelay(1);
}
- if (!chip_good(map, adr, datum)) {
+ if (!chip_ready(map, adr, &datum)) {
/* reset on all failures. */
map_write(map, CMD(0xF0), chip->start);
/* FIXME - should have reset delay before continuing */
@@ -2247,6 +2232,7 @@ static int __xipram do_erase_chip(struct map_info *map, struct flchip *chip)
DECLARE_WAITQUEUE(wait, current);
int ret = 0;
int retry_cnt = 0;
+ map_word datum = map_word_ff(map);
adr = cfi->addr_unlock1;
@@ -2301,7 +2287,7 @@ static int __xipram do_erase_chip(struct map_info *map, struct flchip *chip)
chip->erase_suspended = 0;
}
- if (chip_good(map, adr, map_word_ff(map)))
+ if (chip_ready(map, adr, &datum))
break;
if (time_after(jiffies, timeo)) {
@@ -2343,6 +2329,7 @@ static int __xipram do_erase_oneblock(struct map_info *map, struct flchip *chip,
DECLARE_WAITQUEUE(wait, current);
int ret = 0;
int retry_cnt = 0;
+ map_word datum = map_word_ff(map);
adr += chip->start;
@@ -2397,7 +2384,7 @@ static int __xipram do_erase_oneblock(struct map_info *map, struct flchip *chip,
chip->erase_suspended = 0;
}
- if (chip_good(map, adr, map_word_ff(map))) {
+ if (chip_ready(map, adr, &datum)) {
xip_enable(map, chip, adr);
break;
}
@@ -2612,7 +2599,7 @@ static int __maybe_unused do_ppb_xxlock(struct map_info *map,
*/
timeo = jiffies + msecs_to_jiffies(2000); /* 2s max (un)locking */
for (;;) {
- if (chip_ready(map, adr))
+ if (chip_ready(map, adr, NULL))
break;
if (time_after(jiffies, timeo)) {
--
2.34.1
Historically we did distinguish between a flag that surpressed partition
scanning, and a combinations of the minors variable and another flag if
any partitions were supported. This was generally confusing and doesn't
make much sense, but some corner case uses of the loop driver actually
do want to support manually added partitions on a device that does not
actively scan for partitions. To make things worsee the loop driver
also wants to dynamically toggle the scanning for partitions on a live
gendisk, which makes the disk->flags updates non-atomic.
Introduce a new GD_SUPPRESS_PART_SCAN bit in disk->state that disables
just scanning for partitions, and toggle that instead of GENHD_FL_NO_PART
in the loop driver.
Fixes: 1ebe2e5f9d68 ("block: remove GENHD_FL_EXT_DEVT")
Reported-by: Ming Lei <ming.lei(a)redhat.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Ming Lei <ming.lei(a)redhat.com>
Link: https://lore.kernel.org/r/20220527055806.1972352-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
(cherry picked from commit b9684a71fca793213378dd410cd11675d973eaa1)
---
block/genhd.c | 2 ++
drivers/block/loop.c | 8 ++++----
include/linux/genhd.h | 1 +
3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/block/genhd.c b/block/genhd.c
index 9d9d702d077873..c284c1cf339672 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -380,6 +380,8 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
if (disk->flags & (GENHD_FL_NO_PART | GENHD_FL_HIDDEN))
return -EINVAL;
+ if (test_bit(GD_SUPPRESS_PART_SCAN, &disk->state))
+ return -EINVAL;
if (disk->open_partitions)
return -EBUSY;
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d46a3d5d0c2ec9..3411d3c0a5b0fc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1067,7 +1067,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
lo->lo_flags |= LO_FLAGS_PARTSCAN;
partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
if (partscan)
- lo->lo_disk->flags &= ~GENHD_FL_NO_PART;
+ clear_bit(GD_SUPPRESS_PART_SCAN, &lo->lo_disk->state);
loop_global_unlock(lo, is_loop);
if (partscan)
@@ -1186,7 +1186,7 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
*/
lo->lo_flags = 0;
if (!part_shift)
- lo->lo_disk->flags |= GENHD_FL_NO_PART;
+ set_bit(GD_SUPPRESS_PART_SCAN, &lo->lo_disk->state);
mutex_lock(&lo->lo_mutex);
lo->lo_state = Lo_unbound;
mutex_unlock(&lo->lo_mutex);
@@ -1296,7 +1296,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
if (!err && (lo->lo_flags & LO_FLAGS_PARTSCAN) &&
!(prev_lo_flags & LO_FLAGS_PARTSCAN)) {
- lo->lo_disk->flags &= ~GENHD_FL_NO_PART;
+ clear_bit(GD_SUPPRESS_PART_SCAN, &lo->lo_disk->state);
partscan = true;
}
out_unlock:
@@ -2028,7 +2028,7 @@ static int loop_add(int i)
* userspace tools. Parameters like this in general should be avoided.
*/
if (!part_shift)
- disk->flags |= GENHD_FL_NO_PART;
+ set_bit(GD_SUPPRESS_PART_SCAN, &disk->state);
atomic_set(&lo->lo_refcnt, 0);
mutex_init(&lo->lo_mutex);
lo->lo_number = i;
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 6906a45bc761a4..2cb105f120282e 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -110,6 +110,7 @@ struct gendisk {
#define GD_READ_ONLY 1
#define GD_DEAD 2
#define GD_NATIVE_CAPACITY 3
+#define GD_SUPPRESS_PART_SCAN 5
struct mutex open_mutex; /* open/close mutex */
unsigned open_partitions; /* number of open partitions */
--
2.30.2
Please apply
Upstream commit ea23994edc4169bd90d7a9b5908c6ccefd82fa40
to kernel versions 4.14, 4.19, 5.4 and above.
Reason:
Commits c84a1372df929033cb1a0441fb57bd3932f39ac9 "md/raid0: avoid RAID0
data corruption due to layout confusion." and
33f2c35a54dfd75ad0e7e86918dcbe4de799a56c "md: add feature flag
MD_FEATURE_RAID0_LAYOUT" added handling of original and alternate
layouts of RAID0 arrays with members of different sizes. However they
introduced a regression: assembly of such RAID0 array fails if the
per-array or default layout is not defined even when the layout is
irrelevant and can be safely ignored. One common case is when the RAID0
array is composed of two members of different sizes because the disk or
partition sizes are slightly different. This patch aims to fix this
regression.
Newer versions of mdadm can set a per-array RAID0 layout but some stable
distributions such as Debian 10 ship an older version of mdadm which
does not handle RAID0 layouts and a kernel series (4.19.y) which now
includes the backported commits. As a result, assembly fails after the
kernel upgrade unless the default layout is defined with a kernel parameter.
Related Debian bug reports :
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944676https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=954816
Hi Greg,
These three patches from 5.19-rc2 failed to automatically apply. The
following series should work okay.
Note that these are already part of the 4.9, 4.14, 4.19, 5.4 backport I
did for later this week.
Jason
Jason A. Donenfeld (3):
random: avoid checking crng_ready() twice in random_init()
random: mark bootloader randomness code as __init
random: account for arch randomness in bits
drivers/char/random.c | 15 +++++++--------
include/linux/random.h | 2 +-
2 files changed, 8 insertions(+), 9 deletions(-)
--
2.35.1
Hi Greg,
These three patches from 5.19-rc2 failed to automatically apply. The
following series should work okay.
Note that these are already part of the 4.9, 4.14, 4.19, 5.4 backport I
did for later this week.
Jason
Jason A. Donenfeld (3):
random: avoid checking crng_ready() twice in random_init()
random: mark bootloader randomness code as __init
random: account for arch randomness in bits
drivers/char/random.c | 15 +++++++--------
include/linux/random.h | 2 +-
2 files changed, 8 insertions(+), 9 deletions(-)
--
2.35.1
These patches were back-ported to v5.15.x, but we might as well
include them in v5.10.x too.
-Alex
Stephen Boyd (2):
interconnect: qcom: sc7180: Drop IP0 interconnects
interconnect: Restore sync state by ignoring ipa-virt in provider
count
drivers/interconnect/core.c | 7 ++++++-
drivers/interconnect/qcom/sc7180.c | 21 ---------------------
2 files changed, 6 insertions(+), 22 deletions(-)
--
2.34.1
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 228432551bd8783211e494ab35f42a4344580502 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang(a)redhat.com>
Date: Wed, 8 Jun 2022 14:14:22 +0800
Subject: [PATCH] virtio-rng: make device ready before making request
Current virtio-rng does a entropy request before DRIVER_OK, this
violates the spec:
virtio spec requires that all drivers set DRIVER_OK
before using devices.
Further, kernel will ignore the interrupt after commit
8b4ec69d7e09 ("virtio: harden vring IRQ").
Fixing this by making device ready before the request.
Cc: stable(a)vger.kernel.org
Fixes: 8b4ec69d7e09 ("virtio: harden vring IRQ")
Fixes: f7f510ec1957 ("virtio: An entropy device, as suggested by hpa.")
Reported-and-tested-by: syzbot+5b59d6d459306a556f54(a)syzkaller.appspotmail.com
Signed-off-by: Jason Wang <jasowang(a)redhat.com>
Message-Id: <20220608061422.38437-1-jasowang(a)redhat.com>
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
Reviewed-by: Laurent Vivier <lvivier(a)redhat.com>
diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
index e856df7e285c..a6f3a8a2aca6 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -159,6 +159,8 @@ static int probe_common(struct virtio_device *vdev)
goto err_find;
}
+ virtio_device_ready(vdev);
+
/* we always have a pending entropy request */
request_entropy(vi);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 228432551bd8783211e494ab35f42a4344580502 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang(a)redhat.com>
Date: Wed, 8 Jun 2022 14:14:22 +0800
Subject: [PATCH] virtio-rng: make device ready before making request
Current virtio-rng does a entropy request before DRIVER_OK, this
violates the spec:
virtio spec requires that all drivers set DRIVER_OK
before using devices.
Further, kernel will ignore the interrupt after commit
8b4ec69d7e09 ("virtio: harden vring IRQ").
Fixing this by making device ready before the request.
Cc: stable(a)vger.kernel.org
Fixes: 8b4ec69d7e09 ("virtio: harden vring IRQ")
Fixes: f7f510ec1957 ("virtio: An entropy device, as suggested by hpa.")
Reported-and-tested-by: syzbot+5b59d6d459306a556f54(a)syzkaller.appspotmail.com
Signed-off-by: Jason Wang <jasowang(a)redhat.com>
Message-Id: <20220608061422.38437-1-jasowang(a)redhat.com>
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
Reviewed-by: Laurent Vivier <lvivier(a)redhat.com>
diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
index e856df7e285c..a6f3a8a2aca6 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -159,6 +159,8 @@ static int probe_common(struct virtio_device *vdev)
goto err_find;
}
+ virtio_device_ready(vdev);
+
/* we always have a pending entropy request */
request_entropy(vi);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 228432551bd8783211e494ab35f42a4344580502 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang(a)redhat.com>
Date: Wed, 8 Jun 2022 14:14:22 +0800
Subject: [PATCH] virtio-rng: make device ready before making request
Current virtio-rng does a entropy request before DRIVER_OK, this
violates the spec:
virtio spec requires that all drivers set DRIVER_OK
before using devices.
Further, kernel will ignore the interrupt after commit
8b4ec69d7e09 ("virtio: harden vring IRQ").
Fixing this by making device ready before the request.
Cc: stable(a)vger.kernel.org
Fixes: 8b4ec69d7e09 ("virtio: harden vring IRQ")
Fixes: f7f510ec1957 ("virtio: An entropy device, as suggested by hpa.")
Reported-and-tested-by: syzbot+5b59d6d459306a556f54(a)syzkaller.appspotmail.com
Signed-off-by: Jason Wang <jasowang(a)redhat.com>
Message-Id: <20220608061422.38437-1-jasowang(a)redhat.com>
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
Reviewed-by: Laurent Vivier <lvivier(a)redhat.com>
diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
index e856df7e285c..a6f3a8a2aca6 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -159,6 +159,8 @@ static int probe_common(struct virtio_device *vdev)
goto err_find;
}
+ virtio_device_ready(vdev);
+
/* we always have a pending entropy request */
request_entropy(vi);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 228432551bd8783211e494ab35f42a4344580502 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang(a)redhat.com>
Date: Wed, 8 Jun 2022 14:14:22 +0800
Subject: [PATCH] virtio-rng: make device ready before making request
Current virtio-rng does a entropy request before DRIVER_OK, this
violates the spec:
virtio spec requires that all drivers set DRIVER_OK
before using devices.
Further, kernel will ignore the interrupt after commit
8b4ec69d7e09 ("virtio: harden vring IRQ").
Fixing this by making device ready before the request.
Cc: stable(a)vger.kernel.org
Fixes: 8b4ec69d7e09 ("virtio: harden vring IRQ")
Fixes: f7f510ec1957 ("virtio: An entropy device, as suggested by hpa.")
Reported-and-tested-by: syzbot+5b59d6d459306a556f54(a)syzkaller.appspotmail.com
Signed-off-by: Jason Wang <jasowang(a)redhat.com>
Message-Id: <20220608061422.38437-1-jasowang(a)redhat.com>
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
Reviewed-by: Laurent Vivier <lvivier(a)redhat.com>
diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
index e856df7e285c..a6f3a8a2aca6 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -159,6 +159,8 @@ static int probe_common(struct virtio_device *vdev)
goto err_find;
}
+ virtio_device_ready(vdev);
+
/* we always have a pending entropy request */
request_entropy(vi);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 228432551bd8783211e494ab35f42a4344580502 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang(a)redhat.com>
Date: Wed, 8 Jun 2022 14:14:22 +0800
Subject: [PATCH] virtio-rng: make device ready before making request
Current virtio-rng does a entropy request before DRIVER_OK, this
violates the spec:
virtio spec requires that all drivers set DRIVER_OK
before using devices.
Further, kernel will ignore the interrupt after commit
8b4ec69d7e09 ("virtio: harden vring IRQ").
Fixing this by making device ready before the request.
Cc: stable(a)vger.kernel.org
Fixes: 8b4ec69d7e09 ("virtio: harden vring IRQ")
Fixes: f7f510ec1957 ("virtio: An entropy device, as suggested by hpa.")
Reported-and-tested-by: syzbot+5b59d6d459306a556f54(a)syzkaller.appspotmail.com
Signed-off-by: Jason Wang <jasowang(a)redhat.com>
Message-Id: <20220608061422.38437-1-jasowang(a)redhat.com>
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
Reviewed-by: Laurent Vivier <lvivier(a)redhat.com>
diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
index e856df7e285c..a6f3a8a2aca6 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -159,6 +159,8 @@ static int probe_common(struct virtio_device *vdev)
goto err_find;
}
+ virtio_device_ready(vdev);
+
/* we always have a pending entropy request */
request_entropy(vi);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 228432551bd8783211e494ab35f42a4344580502 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang(a)redhat.com>
Date: Wed, 8 Jun 2022 14:14:22 +0800
Subject: [PATCH] virtio-rng: make device ready before making request
Current virtio-rng does a entropy request before DRIVER_OK, this
violates the spec:
virtio spec requires that all drivers set DRIVER_OK
before using devices.
Further, kernel will ignore the interrupt after commit
8b4ec69d7e09 ("virtio: harden vring IRQ").
Fixing this by making device ready before the request.
Cc: stable(a)vger.kernel.org
Fixes: 8b4ec69d7e09 ("virtio: harden vring IRQ")
Fixes: f7f510ec1957 ("virtio: An entropy device, as suggested by hpa.")
Reported-and-tested-by: syzbot+5b59d6d459306a556f54(a)syzkaller.appspotmail.com
Signed-off-by: Jason Wang <jasowang(a)redhat.com>
Message-Id: <20220608061422.38437-1-jasowang(a)redhat.com>
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
Reviewed-by: Laurent Vivier <lvivier(a)redhat.com>
diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
index e856df7e285c..a6f3a8a2aca6 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -159,6 +159,8 @@ static int probe_common(struct virtio_device *vdev)
goto err_find;
}
+ virtio_device_ready(vdev);
+
/* we always have a pending entropy request */
request_entropy(vi);
The patch below does not apply to the 5.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 371017309a9f1725bfd3283afe61efa4ac34d30c Mon Sep 17 00:00:00 2001
From: Sunil Khatri <sunil.khatri(a)amd.com>
Date: Mon, 30 May 2022 23:24:09 +0530
Subject: [PATCH] drm/amdgpu: enable tmz by default for GC 10.3.7
Add IP GC 10.3.7 in the list of target to have
tmz enabled by default.
Signed-off-by: Sunil Khatri <sunil.khatri(a)amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org # 5.18.x
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 798c56214a23..aebc384531ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -518,6 +518,8 @@ void amdgpu_gmc_tmz_set(struct amdgpu_device *adev)
case IP_VERSION(9, 1, 0):
/* RENOIR looks like RAVEN */
case IP_VERSION(9, 3, 0):
+ /* GC 10.3.7 */
+ case IP_VERSION(10, 3, 7):
if (amdgpu_tmz == 0) {
adev->gmc.tmz_enabled = false;
dev_info(adev->dev,
@@ -540,8 +542,6 @@ void amdgpu_gmc_tmz_set(struct amdgpu_device *adev)
case IP_VERSION(10, 3, 1):
/* YELLOW_CARP*/
case IP_VERSION(10, 3, 3):
- /* GC 10.3.7 */
- case IP_VERSION(10, 3, 7):
/* Don't enable it by default yet.
*/
if (amdgpu_tmz < 1) {
The patch below does not apply to the 5.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 88467db6e2f46a2e79b1b67ce6873c284e4cf417 Mon Sep 17 00:00:00 2001
From: Philip Yang <Philip.Yang(a)amd.com>
Date: Fri, 3 Jun 2022 09:19:34 -0400
Subject: [PATCH] drm/amdkfd: Fix partial migration bugs
Migration range from system memory to VRAM, if system page can not be
locked or unmapped, we do partial migration and leave some pages in
system memory. Several bugs found to copy pages and update GPU mapping
for this situation:
1. copy to vram should use migrate->npage which is total pages of range
as npages, not migrate->cpages which is number of pages can be migrated.
2. After partial copy, set VRAM res cursor as j + 1, j is number of
system pages copied plus 1 page to skip copy.
3. copy to ram, should collect all continuous VRAM pages and copy
together.
4. Call amdgpu_vm_update_range, should pass in offset as bytes, not
as number of pages.
Signed-off-by: Philip Yang <Philip.Yang(a)amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 997650d597ec..e44376c2ecdc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -296,7 +296,7 @@ svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
struct migrate_vma *migrate, struct dma_fence **mfence,
dma_addr_t *scratch)
{
- uint64_t npages = migrate->cpages;
+ uint64_t npages = migrate->npages;
struct device *dev = adev->dev;
struct amdgpu_res_cursor cursor;
dma_addr_t *src;
@@ -344,7 +344,7 @@ svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
mfence);
if (r)
goto out_free_vram_pages;
- amdgpu_res_next(&cursor, j << PAGE_SHIFT);
+ amdgpu_res_next(&cursor, (j + 1) << PAGE_SHIFT);
j = 0;
} else {
amdgpu_res_next(&cursor, PAGE_SIZE);
@@ -590,7 +590,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
continue;
}
src[i] = svm_migrate_addr(adev, spage);
- if (i > 0 && src[i] != src[i - 1] + PAGE_SIZE) {
+ if (j > 0 && src[i] != src[i - 1] + PAGE_SIZE) {
r = svm_migrate_copy_memory_gart(adev, dst + i - j,
src + i - j, j,
FROM_VRAM_TO_RAM,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 3bd0f1a670bb..7b332246eda3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1295,7 +1295,7 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
r = amdgpu_vm_update_range(adev, vm, false, false, flush_tlb, NULL,
last_start, prange->start + i,
pte_flags,
- last_start - prange->start,
+ (last_start - prange->start) << PAGE_SHIFT,
bo_adev ? bo_adev->vm_manager.vram_base_offset : 0,
NULL, dma_addr, &vm->last_update);
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 88467db6e2f46a2e79b1b67ce6873c284e4cf417 Mon Sep 17 00:00:00 2001
From: Philip Yang <Philip.Yang(a)amd.com>
Date: Fri, 3 Jun 2022 09:19:34 -0400
Subject: [PATCH] drm/amdkfd: Fix partial migration bugs
Migration range from system memory to VRAM, if system page can not be
locked or unmapped, we do partial migration and leave some pages in
system memory. Several bugs found to copy pages and update GPU mapping
for this situation:
1. copy to vram should use migrate->npage which is total pages of range
as npages, not migrate->cpages which is number of pages can be migrated.
2. After partial copy, set VRAM res cursor as j + 1, j is number of
system pages copied plus 1 page to skip copy.
3. copy to ram, should collect all continuous VRAM pages and copy
together.
4. Call amdgpu_vm_update_range, should pass in offset as bytes, not
as number of pages.
Signed-off-by: Philip Yang <Philip.Yang(a)amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 997650d597ec..e44376c2ecdc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -296,7 +296,7 @@ svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
struct migrate_vma *migrate, struct dma_fence **mfence,
dma_addr_t *scratch)
{
- uint64_t npages = migrate->cpages;
+ uint64_t npages = migrate->npages;
struct device *dev = adev->dev;
struct amdgpu_res_cursor cursor;
dma_addr_t *src;
@@ -344,7 +344,7 @@ svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
mfence);
if (r)
goto out_free_vram_pages;
- amdgpu_res_next(&cursor, j << PAGE_SHIFT);
+ amdgpu_res_next(&cursor, (j + 1) << PAGE_SHIFT);
j = 0;
} else {
amdgpu_res_next(&cursor, PAGE_SIZE);
@@ -590,7 +590,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
continue;
}
src[i] = svm_migrate_addr(adev, spage);
- if (i > 0 && src[i] != src[i - 1] + PAGE_SIZE) {
+ if (j > 0 && src[i] != src[i - 1] + PAGE_SIZE) {
r = svm_migrate_copy_memory_gart(adev, dst + i - j,
src + i - j, j,
FROM_VRAM_TO_RAM,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 3bd0f1a670bb..7b332246eda3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1295,7 +1295,7 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
r = amdgpu_vm_update_range(adev, vm, false, false, flush_tlb, NULL,
last_start, prange->start + i,
pte_flags,
- last_start - prange->start,
+ (last_start - prange->start) << PAGE_SHIFT,
bo_adev ? bo_adev->vm_manager.vram_base_offset : 0,
NULL, dma_addr, &vm->last_update);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e54a4424925a27ed94dff046db3ce5caf4b1e748 Mon Sep 17 00:00:00 2001
From: Brian Norris <briannorris(a)chromium.org>
Date: Mon, 28 Feb 2022 12:25:32 -0800
Subject: [PATCH] drm/atomic: Force bridge self-refresh-exit on CRTC switch
It's possible to change which CRTC is in use for a given
connector/encoder/bridge while we're in self-refresh without fully
disabling the connector/encoder/bridge along the way. This can confuse
the bridge encoder/bridge, because
(a) it needs to track the SR state (trying to perform "active"
operations while the panel is still in SR can be Bad(TM)); and
(b) it tracks the SR state via the CRTC state (and after the switch, the
previous SR state is lost).
Thus, we need to either somehow carry the self-refresh state over to the
new CRTC, or else force an encoder/bridge self-refresh transition during
such a switch.
I choose the latter, so we disable the encoder (and exit PSR) before
attaching it to the new CRTC (where we can continue to assume a clean
(non-self-refresh) state).
This fixes PSR issues seen on Rockchip RK3399 systems with
drivers/gpu/drm/bridge/analogix/analogix_dp_core.c.
Change in v2:
- Drop "->enable" condition; this could possibly be "->active" to
reflect the intended hardware state, but it also is a little
over-specific. We want to make a transition through "disabled" any
time we're exiting PSR at the same time as a CRTC switch.
(Thanks Liu Ying)
Cc: Liu Ying <victor.liu(a)oss.nxp.com>
Cc: <stable(a)vger.kernel.org>
Fixes: 1452c25b0e60 ("drm: Add helpers to kick off self refresh mode in drivers")
Signed-off-by: Brian Norris <briannorris(a)chromium.org>
Reviewed-by: Sean Paul <seanpaul(a)chromium.org>
Signed-off-by: Douglas Anderson <dianders(a)chromium.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20220228122522.v2.2.Ic15a2ef6…
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 9603193d2fa1..987e4b212e9f 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1011,9 +1011,19 @@ crtc_needs_disable(struct drm_crtc_state *old_state,
return drm_atomic_crtc_effectively_active(old_state);
/*
- * We need to run through the crtc_funcs->disable() function if the CRTC
- * is currently on, if it's transitioning to self refresh mode, or if
- * it's in self refresh mode and needs to be fully disabled.
+ * We need to disable bridge(s) and CRTC if we're transitioning out of
+ * self-refresh and changing CRTCs at the same time, because the
+ * bridge tracks self-refresh status via CRTC state.
+ */
+ if (old_state->self_refresh_active &&
+ old_state->crtc != new_state->crtc)
+ return true;
+
+ /*
+ * We also need to run through the crtc_funcs->disable() function if
+ * the CRTC is currently on, if it's transitioning to self refresh
+ * mode, or if it's in self refresh mode and needs to be fully
+ * disabled.
*/
return old_state->active ||
(old_state->self_refresh_active && !new_state->active) ||
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ca871659ec1606d33b1e76de8d4cf924cf627e34 Mon Sep 17 00:00:00 2001
From: Brian Norris <briannorris(a)chromium.org>
Date: Mon, 28 Feb 2022 12:25:31 -0800
Subject: [PATCH] drm/bridge: analogix_dp: Support PSR-exit to disable
transition
Most eDP panel functions only work correctly when the panel is not in
self-refresh. In particular, analogix_dp_bridge_disable() tends to hit
AUX channel errors if the panel is in self-refresh.
Given the above, it appears that so far, this driver assumes that we are
never in self-refresh when it comes time to fully disable the bridge.
Prior to commit 846c7dfc1193 ("drm/atomic: Try to preserve the crtc
enabled state in drm_atomic_remove_fb, v2."), this tended to be true,
because we would automatically disable the pipe when framebuffers were
removed, and so we'd typically disable the bridge shortly after the last
display activity.
However, that is not guaranteed: an idle (self-refresh) display pipe may
be disabled, e.g., when switching CRTCs. We need to exit PSR first.
Stable notes: this is definitely a bugfix, and the bug has likely
existed in some form for quite a while. It may predate the "PSR helpers"
refactor, but the code looked very different before that, and it's
probably not worth rewriting the fix.
Cc: <stable(a)vger.kernel.org>
Fixes: 6c836d965bad ("drm/rockchip: Use the helpers for PSR")
Signed-off-by: Brian Norris <briannorris(a)chromium.org>
Reviewed-by: Sean Paul <seanpaul(a)chromium.org>
Signed-off-by: Douglas Anderson <dianders(a)chromium.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20220228122522.v2.1.I161904be…
diff --git a/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c b/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c
index eb590fb8e8d0..0300f670a4fd 100644
--- a/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c
+++ b/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c
@@ -1268,6 +1268,25 @@ static int analogix_dp_bridge_attach(struct drm_bridge *bridge,
return 0;
}
+static
+struct drm_crtc *analogix_dp_get_old_crtc(struct analogix_dp_device *dp,
+ struct drm_atomic_state *state)
+{
+ struct drm_encoder *encoder = dp->encoder;
+ struct drm_connector *connector;
+ struct drm_connector_state *conn_state;
+
+ connector = drm_atomic_get_old_connector_for_encoder(state, encoder);
+ if (!connector)
+ return NULL;
+
+ conn_state = drm_atomic_get_old_connector_state(state, connector);
+ if (!conn_state)
+ return NULL;
+
+ return conn_state->crtc;
+}
+
static
struct drm_crtc *analogix_dp_get_new_crtc(struct analogix_dp_device *dp,
struct drm_atomic_state *state)
@@ -1448,14 +1467,16 @@ analogix_dp_bridge_atomic_disable(struct drm_bridge *bridge,
{
struct drm_atomic_state *old_state = old_bridge_state->base.state;
struct analogix_dp_device *dp = bridge->driver_private;
- struct drm_crtc *crtc;
+ struct drm_crtc *old_crtc, *new_crtc;
+ struct drm_crtc_state *old_crtc_state = NULL;
struct drm_crtc_state *new_crtc_state = NULL;
+ int ret;
- crtc = analogix_dp_get_new_crtc(dp, old_state);
- if (!crtc)
+ new_crtc = analogix_dp_get_new_crtc(dp, old_state);
+ if (!new_crtc)
goto out;
- new_crtc_state = drm_atomic_get_new_crtc_state(old_state, crtc);
+ new_crtc_state = drm_atomic_get_new_crtc_state(old_state, new_crtc);
if (!new_crtc_state)
goto out;
@@ -1464,6 +1485,19 @@ analogix_dp_bridge_atomic_disable(struct drm_bridge *bridge,
return;
out:
+ old_crtc = analogix_dp_get_old_crtc(dp, old_state);
+ if (old_crtc) {
+ old_crtc_state = drm_atomic_get_old_crtc_state(old_state,
+ old_crtc);
+
+ /* When moving from PSR to fully disabled, exit PSR first. */
+ if (old_crtc_state && old_crtc_state->self_refresh_active) {
+ ret = analogix_dp_disable_psr(dp);
+ if (ret)
+ DRM_ERROR("Failed to disable psr (%d)\n", ret);
+ }
+ }
+
analogix_dp_bridge_disable(bridge);
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a2a513be7139b279f1b5b2cee59c6c4950c34346 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Date: Thu, 2 Jun 2022 23:16:57 +0900
Subject: [PATCH] zonefs: fix handling of explicit_open option on mount
Ignoring the explicit_open mount option on mount for devices that do not
have a limit on the number of open zones must be done after the mount
options are parsed and set in s_mount_opts. Move the check to ignore
the explicit_open option after the call to zonefs_parse_options() in
zonefs_fill_super().
Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index bcb21aea990a..ecce84909ca1 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -1760,12 +1760,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
atomic_set(&sbi->s_wro_seq_files, 0);
sbi->s_max_wro_seq_files = bdev_max_open_zones(sb->s_bdev);
- if (!sbi->s_max_wro_seq_files &&
- sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) {
- zonefs_info(sb, "No open zones limit. Ignoring explicit_open mount option\n");
- sbi->s_mount_opts &= ~ZONEFS_MNTOPT_EXPLICIT_OPEN;
- }
-
atomic_set(&sbi->s_active_seq_files, 0);
sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev);
@@ -1790,6 +1784,12 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
zonefs_info(sb, "Mounting %u zones",
blkdev_nr_zones(sb->s_bdev->bd_disk));
+ if (!sbi->s_max_wro_seq_files &&
+ sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) {
+ zonefs_info(sb, "No open zones limit. Ignoring explicit_open mount option\n");
+ sbi->s_mount_opts &= ~ZONEFS_MNTOPT_EXPLICIT_OPEN;
+ }
+
/* Create root directory inode */
ret = -ENOMEM;
inode = new_inode(sb);
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a2a513be7139b279f1b5b2cee59c6c4950c34346 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Date: Thu, 2 Jun 2022 23:16:57 +0900
Subject: [PATCH] zonefs: fix handling of explicit_open option on mount
Ignoring the explicit_open mount option on mount for devices that do not
have a limit on the number of open zones must be done after the mount
options are parsed and set in s_mount_opts. Move the check to ignore
the explicit_open option after the call to zonefs_parse_options() in
zonefs_fill_super().
Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index bcb21aea990a..ecce84909ca1 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -1760,12 +1760,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
atomic_set(&sbi->s_wro_seq_files, 0);
sbi->s_max_wro_seq_files = bdev_max_open_zones(sb->s_bdev);
- if (!sbi->s_max_wro_seq_files &&
- sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) {
- zonefs_info(sb, "No open zones limit. Ignoring explicit_open mount option\n");
- sbi->s_mount_opts &= ~ZONEFS_MNTOPT_EXPLICIT_OPEN;
- }
-
atomic_set(&sbi->s_active_seq_files, 0);
sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev);
@@ -1790,6 +1784,12 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
zonefs_info(sb, "Mounting %u zones",
blkdev_nr_zones(sb->s_bdev->bd_disk));
+ if (!sbi->s_max_wro_seq_files &&
+ sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) {
+ zonefs_info(sb, "No open zones limit. Ignoring explicit_open mount option\n");
+ sbi->s_mount_opts &= ~ZONEFS_MNTOPT_EXPLICIT_OPEN;
+ }
+
/* Create root directory inode */
ret = -ENOMEM;
inode = new_inode(sb);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a2a513be7139b279f1b5b2cee59c6c4950c34346 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Date: Thu, 2 Jun 2022 23:16:57 +0900
Subject: [PATCH] zonefs: fix handling of explicit_open option on mount
Ignoring the explicit_open mount option on mount for devices that do not
have a limit on the number of open zones must be done after the mount
options are parsed and set in s_mount_opts. Move the check to ignore
the explicit_open option after the call to zonefs_parse_options() in
zonefs_fill_super().
Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index bcb21aea990a..ecce84909ca1 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -1760,12 +1760,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
atomic_set(&sbi->s_wro_seq_files, 0);
sbi->s_max_wro_seq_files = bdev_max_open_zones(sb->s_bdev);
- if (!sbi->s_max_wro_seq_files &&
- sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) {
- zonefs_info(sb, "No open zones limit. Ignoring explicit_open mount option\n");
- sbi->s_mount_opts &= ~ZONEFS_MNTOPT_EXPLICIT_OPEN;
- }
-
atomic_set(&sbi->s_active_seq_files, 0);
sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev);
@@ -1790,6 +1784,12 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
zonefs_info(sb, "Mounting %u zones",
blkdev_nr_zones(sb->s_bdev->bd_disk));
+ if (!sbi->s_max_wro_seq_files &&
+ sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) {
+ zonefs_info(sb, "No open zones limit. Ignoring explicit_open mount option\n");
+ sbi->s_mount_opts &= ~ZONEFS_MNTOPT_EXPLICIT_OPEN;
+ }
+
/* Create root directory inode */
ret = -ENOMEM;
inode = new_inode(sb);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7bb0fb7c63df95d6027dc50d6af3bc3bbbc25483 Mon Sep 17 00:00:00 2001
From: Olivier Matz <olivier.matz(a)6wind.com>
Date: Wed, 6 Apr 2022 11:52:52 +0200
Subject: [PATCH] ixgbe: fix unexpected VLAN Rx in promisc mode on VF
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When the promiscuous mode is enabled on a VF, the IXGBE_VMOLR_VPE
bit (VLAN Promiscuous Enable) is set. This means that the VF will
receive packets whose VLAN is not the same than the VLAN of the VF.
For instance, in this situation:
┌────────┐ ┌────────┐ ┌────────┐
│ │ │ │ │ │
│ │ │ │ │ │
│ VF0├────┤VF1 VF2├────┤VF3 │
│ │ │ │ │ │
└────────┘ └────────┘ └────────┘
VM1 VM2 VM3
vf 0: vlan 1000
vf 1: vlan 1000
vf 2: vlan 1001
vf 3: vlan 1001
If we tcpdump on VF3, we see all the packets, even those transmitted
on vlan 1000.
This behavior prevents to bridge VF1 and VF2 in VM2, because it will
create a loop: packets transmitted on VF1 will be received by VF2 and
vice-versa, and bridged again through the software bridge.
This patch remove the activation of VLAN Promiscuous when a VF enables
the promiscuous mode. However, the IXGBE_VMOLR_UPE bit (Unicast
Promiscuous) is kept, so that a VF receives all packets that has the
same VLAN, whatever the destination MAC address.
Fixes: 8443c1a4b192 ("ixgbe, ixgbevf: Add new mbox API xcast mode")
Cc: stable(a)vger.kernel.org
Cc: Nicolas Dichtel <nicolas.dichtel(a)6wind.com>
Signed-off-by: Olivier Matz <olivier.matz(a)6wind.com>
Tested-by: Konrad Jankowski <konrad0.jankowski(a)intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen(a)intel.com>
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 8d108a78941b..d4e63f0644c3 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -1208,9 +1208,9 @@ static int ixgbe_update_vf_xcast_mode(struct ixgbe_adapter *adapter,
return -EPERM;
}
- disable = 0;
+ disable = IXGBE_VMOLR_VPE;
enable = IXGBE_VMOLR_BAM | IXGBE_VMOLR_ROMPE |
- IXGBE_VMOLR_MPE | IXGBE_VMOLR_UPE | IXGBE_VMOLR_VPE;
+ IXGBE_VMOLR_MPE | IXGBE_VMOLR_UPE;
break;
default:
return -EOPNOTSUPP;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 803e9895ea2b0fe80bc85980ae2d7a7e44037914 Mon Sep 17 00:00:00 2001
From: Olivier Matz <olivier.matz(a)6wind.com>
Date: Wed, 6 Apr 2022 11:52:51 +0200
Subject: [PATCH] ixgbe: fix bcast packets Rx on VF after promisc removal
After a VF requested to remove the promiscuous flag on an interface, the
broadcast packets are not received anymore. This breaks some protocols
like ARP.
In ixgbe_update_vf_xcast_mode(), we should keep the IXGBE_VMOLR_BAM
bit (Broadcast Accept) on promiscuous removal.
This flag is already set by default in ixgbe_set_vmolr() on VF reset.
Fixes: 8443c1a4b192 ("ixgbe, ixgbevf: Add new mbox API xcast mode")
Cc: stable(a)vger.kernel.org
Cc: Nicolas Dichtel <nicolas.dichtel(a)6wind.com>
Signed-off-by: Olivier Matz <olivier.matz(a)6wind.com>
Tested-by: Konrad Jankowski <konrad0.jankowski(a)intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen(a)intel.com>
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 7f11c0a8e7a9..8d108a78941b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -1184,9 +1184,9 @@ static int ixgbe_update_vf_xcast_mode(struct ixgbe_adapter *adapter,
switch (xcast_mode) {
case IXGBEVF_XCAST_MODE_NONE:
- disable = IXGBE_VMOLR_BAM | IXGBE_VMOLR_ROMPE |
+ disable = IXGBE_VMOLR_ROMPE |
IXGBE_VMOLR_MPE | IXGBE_VMOLR_UPE | IXGBE_VMOLR_VPE;
- enable = 0;
+ enable = IXGBE_VMOLR_BAM;
break;
case IXGBEVF_XCAST_MODE_MULTI:
disable = IXGBE_VMOLR_MPE | IXGBE_VMOLR_UPE | IXGBE_VMOLR_VPE;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f2e19b36593caed4c977c2f55aeba7408aeb2132 Mon Sep 17 00:00:00 2001
From: Martin Faltesek <mfaltesek(a)google.com>
Date: Mon, 6 Jun 2022 21:57:29 -0500
Subject: [PATCH] nfc: st21nfca: fix incorrect sizing calculations in
EVT_TRANSACTION
The transaction buffer is allocated by using the size of the packet buf,
and subtracting two which seem intended to remove the two tags which are
not present in the target structure. This calculation leads to under
counting memory because of differences between the packet contents and the
target structure. The aid_len field is a u8 in the packet, but a u32 in
the structure, resulting in at least 3 bytes always being under counted.
Further, the aid data is a variable length field in the packet, but fixed
in the structure, so if this field is less than the max, the difference is
added to the under counting.
The last validation check for transaction->params_len is also incorrect
since it employs the same accounting error.
To fix, perform validation checks progressively to safely reach the
next field, to determine the size of both buffers and verify both tags.
Once all validation checks pass, allocate the buffer and copy the data.
This eliminates freeing memory on the error path, as those checks are
moved ahead of memory allocation.
Fixes: 26fc6c7f02cb ("NFC: st21nfca: Add HCI transaction event support")
Fixes: 4fbcc1a4cb20 ("nfc: st21nfca: Fix potential buffer overflows in EVT_TRANSACTION")
Cc: stable(a)vger.kernel.org
Signed-off-by: Martin Faltesek <mfaltesek(a)google.com>
Reviewed-by: Guenter Roeck <groeck(a)chromium.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org>
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/drivers/nfc/st21nfca/se.c b/drivers/nfc/st21nfca/se.c
index 8e1113ce139b..df8d27cf2956 100644
--- a/drivers/nfc/st21nfca/se.c
+++ b/drivers/nfc/st21nfca/se.c
@@ -300,6 +300,8 @@ int st21nfca_connectivity_event_received(struct nfc_hci_dev *hdev, u8 host,
int r = 0;
struct device *dev = &hdev->ndev->dev;
struct nfc_evt_transaction *transaction;
+ u32 aid_len;
+ u8 params_len;
pr_debug("connectivity gate event: %x\n", event);
@@ -308,50 +310,48 @@ int st21nfca_connectivity_event_received(struct nfc_hci_dev *hdev, u8 host,
r = nfc_se_connectivity(hdev->ndev, host);
break;
case ST21NFCA_EVT_TRANSACTION:
- /*
- * According to specification etsi 102 622
+ /* According to specification etsi 102 622
* 11.2.2.4 EVT_TRANSACTION Table 52
* Description Tag Length
* AID 81 5 to 16
* PARAMETERS 82 0 to 255
+ *
+ * The key differences are aid storage length is variably sized
+ * in the packet, but fixed in nfc_evt_transaction, and that the aid_len
+ * is u8 in the packet, but u32 in the structure, and the tags in
+ * the packet are not included in nfc_evt_transaction.
+ *
+ * size in bytes: 1 1 5-16 1 1 0-255
+ * offset: 0 1 2 aid_len + 2 aid_len + 3 aid_len + 4
+ * member name: aid_tag(M) aid_len aid params_tag(M) params_len params
+ * example: 0x81 5-16 X 0x82 0-255 X
*/
- if (skb->len < NFC_MIN_AID_LENGTH + 2 ||
- skb->data[0] != NFC_EVT_TRANSACTION_AID_TAG)
+ if (skb->len < 2 || skb->data[0] != NFC_EVT_TRANSACTION_AID_TAG)
return -EPROTO;
- transaction = devm_kzalloc(dev, skb->len - 2, GFP_KERNEL);
- if (!transaction)
- return -ENOMEM;
-
- transaction->aid_len = skb->data[1];
+ aid_len = skb->data[1];
- /* Checking if the length of the AID is valid */
- if (transaction->aid_len > sizeof(transaction->aid)) {
- devm_kfree(dev, transaction);
- return -EINVAL;
- }
+ if (skb->len < aid_len + 4 || aid_len > sizeof(transaction->aid))
+ return -EPROTO;
- memcpy(transaction->aid, &skb->data[2],
- transaction->aid_len);
+ params_len = skb->data[aid_len + 3];
- /* Check next byte is PARAMETERS tag (82) */
- if (skb->data[transaction->aid_len + 2] !=
- NFC_EVT_TRANSACTION_PARAMS_TAG) {
- devm_kfree(dev, transaction);
+ /* Verify PARAMETERS tag is (82), and final check that there is enough
+ * space in the packet to read everything.
+ */
+ if ((skb->data[aid_len + 2] != NFC_EVT_TRANSACTION_PARAMS_TAG) ||
+ (skb->len < aid_len + 4 + params_len))
return -EPROTO;
- }
- transaction->params_len = skb->data[transaction->aid_len + 3];
+ transaction = devm_kzalloc(dev, sizeof(*transaction) + params_len, GFP_KERNEL);
+ if (!transaction)
+ return -ENOMEM;
- /* Total size is allocated (skb->len - 2) minus fixed array members */
- if (transaction->params_len > ((skb->len - 2) -
- sizeof(struct nfc_evt_transaction))) {
- devm_kfree(dev, transaction);
- return -EINVAL;
- }
+ transaction->aid_len = aid_len;
+ transaction->params_len = params_len;
- memcpy(transaction->params, skb->data +
- transaction->aid_len + 4, transaction->params_len);
+ memcpy(transaction->aid, &skb->data[2], aid_len);
+ memcpy(transaction->params, &skb->data[aid_len + 4], params_len);
r = nfc_se_transaction(hdev->ndev, host, transaction);
break;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f2e19b36593caed4c977c2f55aeba7408aeb2132 Mon Sep 17 00:00:00 2001
From: Martin Faltesek <mfaltesek(a)google.com>
Date: Mon, 6 Jun 2022 21:57:29 -0500
Subject: [PATCH] nfc: st21nfca: fix incorrect sizing calculations in
EVT_TRANSACTION
The transaction buffer is allocated by using the size of the packet buf,
and subtracting two which seem intended to remove the two tags which are
not present in the target structure. This calculation leads to under
counting memory because of differences between the packet contents and the
target structure. The aid_len field is a u8 in the packet, but a u32 in
the structure, resulting in at least 3 bytes always being under counted.
Further, the aid data is a variable length field in the packet, but fixed
in the structure, so if this field is less than the max, the difference is
added to the under counting.
The last validation check for transaction->params_len is also incorrect
since it employs the same accounting error.
To fix, perform validation checks progressively to safely reach the
next field, to determine the size of both buffers and verify both tags.
Once all validation checks pass, allocate the buffer and copy the data.
This eliminates freeing memory on the error path, as those checks are
moved ahead of memory allocation.
Fixes: 26fc6c7f02cb ("NFC: st21nfca: Add HCI transaction event support")
Fixes: 4fbcc1a4cb20 ("nfc: st21nfca: Fix potential buffer overflows in EVT_TRANSACTION")
Cc: stable(a)vger.kernel.org
Signed-off-by: Martin Faltesek <mfaltesek(a)google.com>
Reviewed-by: Guenter Roeck <groeck(a)chromium.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org>
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/drivers/nfc/st21nfca/se.c b/drivers/nfc/st21nfca/se.c
index 8e1113ce139b..df8d27cf2956 100644
--- a/drivers/nfc/st21nfca/se.c
+++ b/drivers/nfc/st21nfca/se.c
@@ -300,6 +300,8 @@ int st21nfca_connectivity_event_received(struct nfc_hci_dev *hdev, u8 host,
int r = 0;
struct device *dev = &hdev->ndev->dev;
struct nfc_evt_transaction *transaction;
+ u32 aid_len;
+ u8 params_len;
pr_debug("connectivity gate event: %x\n", event);
@@ -308,50 +310,48 @@ int st21nfca_connectivity_event_received(struct nfc_hci_dev *hdev, u8 host,
r = nfc_se_connectivity(hdev->ndev, host);
break;
case ST21NFCA_EVT_TRANSACTION:
- /*
- * According to specification etsi 102 622
+ /* According to specification etsi 102 622
* 11.2.2.4 EVT_TRANSACTION Table 52
* Description Tag Length
* AID 81 5 to 16
* PARAMETERS 82 0 to 255
+ *
+ * The key differences are aid storage length is variably sized
+ * in the packet, but fixed in nfc_evt_transaction, and that the aid_len
+ * is u8 in the packet, but u32 in the structure, and the tags in
+ * the packet are not included in nfc_evt_transaction.
+ *
+ * size in bytes: 1 1 5-16 1 1 0-255
+ * offset: 0 1 2 aid_len + 2 aid_len + 3 aid_len + 4
+ * member name: aid_tag(M) aid_len aid params_tag(M) params_len params
+ * example: 0x81 5-16 X 0x82 0-255 X
*/
- if (skb->len < NFC_MIN_AID_LENGTH + 2 ||
- skb->data[0] != NFC_EVT_TRANSACTION_AID_TAG)
+ if (skb->len < 2 || skb->data[0] != NFC_EVT_TRANSACTION_AID_TAG)
return -EPROTO;
- transaction = devm_kzalloc(dev, skb->len - 2, GFP_KERNEL);
- if (!transaction)
- return -ENOMEM;
-
- transaction->aid_len = skb->data[1];
+ aid_len = skb->data[1];
- /* Checking if the length of the AID is valid */
- if (transaction->aid_len > sizeof(transaction->aid)) {
- devm_kfree(dev, transaction);
- return -EINVAL;
- }
+ if (skb->len < aid_len + 4 || aid_len > sizeof(transaction->aid))
+ return -EPROTO;
- memcpy(transaction->aid, &skb->data[2],
- transaction->aid_len);
+ params_len = skb->data[aid_len + 3];
- /* Check next byte is PARAMETERS tag (82) */
- if (skb->data[transaction->aid_len + 2] !=
- NFC_EVT_TRANSACTION_PARAMS_TAG) {
- devm_kfree(dev, transaction);
+ /* Verify PARAMETERS tag is (82), and final check that there is enough
+ * space in the packet to read everything.
+ */
+ if ((skb->data[aid_len + 2] != NFC_EVT_TRANSACTION_PARAMS_TAG) ||
+ (skb->len < aid_len + 4 + params_len))
return -EPROTO;
- }
- transaction->params_len = skb->data[transaction->aid_len + 3];
+ transaction = devm_kzalloc(dev, sizeof(*transaction) + params_len, GFP_KERNEL);
+ if (!transaction)
+ return -ENOMEM;
- /* Total size is allocated (skb->len - 2) minus fixed array members */
- if (transaction->params_len > ((skb->len - 2) -
- sizeof(struct nfc_evt_transaction))) {
- devm_kfree(dev, transaction);
- return -EINVAL;
- }
+ transaction->aid_len = aid_len;
+ transaction->params_len = params_len;
- memcpy(transaction->params, skb->data +
- transaction->aid_len + 4, transaction->params_len);
+ memcpy(transaction->aid, &skb->data[2], aid_len);
+ memcpy(transaction->params, &skb->data[aid_len + 4], params_len);
r = nfc_se_transaction(hdev->ndev, host, transaction);
break;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 10e14073107dd0b6d97d9516a02845a8e501c2c9 Mon Sep 17 00:00:00 2001
From: Jchao Sun <sunjunchao2870(a)gmail.com>
Date: Tue, 24 May 2022 08:05:40 -0700
Subject: [PATCH] writeback: Fix inode->i_io_list not be protected by
inode->i_lock error
Commit b35250c0816c ("writeback: Protect inode->i_io_list with
inode->i_lock") made inode->i_io_list not only protected by
wb->list_lock but also inode->i_lock, but inode_io_list_move_locked()
was missed. Add lock there and also update comment describing
things protected by inode->i_lock. This also fixes a race where
__mark_inode_dirty() could move inode under flush worker's hands
and thus sync(2) could miss writing some inodes.
Fixes: b35250c0816c ("writeback: Protect inode->i_io_list with inode->i_lock")
Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com
CC: stable(a)vger.kernel.org
Signed-off-by: Jchao Sun <sunjunchao2870(a)gmail.com>
Signed-off-by: Jan Kara <jack(a)suse.cz>
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a21d8f1a56d1..05221366a16d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -120,6 +120,7 @@ static bool inode_io_list_move_locked(struct inode *inode,
struct list_head *head)
{
assert_spin_locked(&wb->list_lock);
+ assert_spin_locked(&inode->i_lock);
list_move(&inode->i_io_list, head);
@@ -1365,9 +1366,9 @@ static int move_expired_inodes(struct list_head *delaying_queue,
inode = wb_inode(delaying_queue->prev);
if (inode_dirtied_after(inode, dirtied_before))
break;
+ spin_lock(&inode->i_lock);
list_move(&inode->i_io_list, &tmp);
moved++;
- spin_lock(&inode->i_lock);
inode->i_state |= I_SYNC_QUEUED;
spin_unlock(&inode->i_lock);
if (sb_is_blkdev_sb(inode->i_sb))
@@ -1383,7 +1384,12 @@ static int move_expired_inodes(struct list_head *delaying_queue,
goto out;
}
- /* Move inodes from one superblock together */
+ /*
+ * Although inode's i_io_list is moved from 'tmp' to 'dispatch_queue',
+ * we don't take inode->i_lock here because it is just a pointless overhead.
+ * Inode is already marked as I_SYNC_QUEUED so writeback list handling is
+ * fully under our control.
+ */
while (!list_empty(&tmp)) {
sb = wb_inode(tmp.prev)->i_sb;
list_for_each_prev_safe(pos, node, &tmp) {
@@ -1826,8 +1832,8 @@ static long writeback_sb_inodes(struct super_block *sb,
* We'll have another go at writing back this inode
* when we completed a full scan of b_io.
*/
- spin_unlock(&inode->i_lock);
requeue_io(inode, wb);
+ spin_unlock(&inode->i_lock);
trace_writeback_sb_inodes_requeue(inode);
continue;
}
@@ -2358,6 +2364,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
{
struct super_block *sb = inode->i_sb;
int dirtytime = 0;
+ struct bdi_writeback *wb = NULL;
trace_writeback_mark_inode_dirty(inode, flags);
@@ -2409,6 +2416,17 @@ void __mark_inode_dirty(struct inode *inode, int flags)
inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;
+ /*
+ * Grab inode's wb early because it requires dropping i_lock and we
+ * need to make sure following checks happen atomically with dirty
+ * list handling so that we don't move inodes under flush worker's
+ * hands.
+ */
+ if (!was_dirty) {
+ wb = locked_inode_to_wb_and_lock_list(inode);
+ spin_lock(&inode->i_lock);
+ }
+
/*
* If the inode is queued for writeback by flush worker, just
* update its dirty state. Once the flush worker is done with
@@ -2416,7 +2434,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* list, based upon its state.
*/
if (inode->i_state & I_SYNC_QUEUED)
- goto out_unlock_inode;
+ goto out_unlock;
/*
* Only add valid (hashed) inodes to the superblock's
@@ -2424,22 +2442,19 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
if (!S_ISBLK(inode->i_mode)) {
if (inode_unhashed(inode))
- goto out_unlock_inode;
+ goto out_unlock;
}
if (inode->i_state & I_FREEING)
- goto out_unlock_inode;
+ goto out_unlock;
/*
* If the inode was already on b_dirty/b_io/b_more_io, don't
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
- struct bdi_writeback *wb;
struct list_head *dirty_list;
bool wakeup_bdi = false;
- wb = locked_inode_to_wb_and_lock_list(inode);
-
inode->dirtied_when = jiffies;
if (dirtytime)
inode->dirtied_time_when = jiffies;
@@ -2453,6 +2468,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
dirty_list);
spin_unlock(&wb->list_lock);
+ spin_unlock(&inode->i_lock);
trace_writeback_dirty_inode_enqueue(inode);
/*
@@ -2467,6 +2483,9 @@ void __mark_inode_dirty(struct inode *inode, int flags)
return;
}
}
+out_unlock:
+ if (wb)
+ spin_unlock(&wb->list_lock);
out_unlock_inode:
spin_unlock(&inode->i_lock);
}
diff --git a/fs/inode.c b/fs/inode.c
index 9d9b422504d1..bd4da9c5207e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -27,7 +27,7 @@
* Inode locking rules:
*
* inode->i_lock protects:
- * inode->i_state, inode->i_hash, __iget()
+ * inode->i_state, inode->i_hash, __iget(), inode->i_io_list
* Inode LRU list locks protect:
* inode->i_sb->s_inode_lru, inode->i_lru
* inode->i_sb->s_inode_list_lock protects:
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 10e14073107dd0b6d97d9516a02845a8e501c2c9 Mon Sep 17 00:00:00 2001
From: Jchao Sun <sunjunchao2870(a)gmail.com>
Date: Tue, 24 May 2022 08:05:40 -0700
Subject: [PATCH] writeback: Fix inode->i_io_list not be protected by
inode->i_lock error
Commit b35250c0816c ("writeback: Protect inode->i_io_list with
inode->i_lock") made inode->i_io_list not only protected by
wb->list_lock but also inode->i_lock, but inode_io_list_move_locked()
was missed. Add lock there and also update comment describing
things protected by inode->i_lock. This also fixes a race where
__mark_inode_dirty() could move inode under flush worker's hands
and thus sync(2) could miss writing some inodes.
Fixes: b35250c0816c ("writeback: Protect inode->i_io_list with inode->i_lock")
Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com
CC: stable(a)vger.kernel.org
Signed-off-by: Jchao Sun <sunjunchao2870(a)gmail.com>
Signed-off-by: Jan Kara <jack(a)suse.cz>
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a21d8f1a56d1..05221366a16d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -120,6 +120,7 @@ static bool inode_io_list_move_locked(struct inode *inode,
struct list_head *head)
{
assert_spin_locked(&wb->list_lock);
+ assert_spin_locked(&inode->i_lock);
list_move(&inode->i_io_list, head);
@@ -1365,9 +1366,9 @@ static int move_expired_inodes(struct list_head *delaying_queue,
inode = wb_inode(delaying_queue->prev);
if (inode_dirtied_after(inode, dirtied_before))
break;
+ spin_lock(&inode->i_lock);
list_move(&inode->i_io_list, &tmp);
moved++;
- spin_lock(&inode->i_lock);
inode->i_state |= I_SYNC_QUEUED;
spin_unlock(&inode->i_lock);
if (sb_is_blkdev_sb(inode->i_sb))
@@ -1383,7 +1384,12 @@ static int move_expired_inodes(struct list_head *delaying_queue,
goto out;
}
- /* Move inodes from one superblock together */
+ /*
+ * Although inode's i_io_list is moved from 'tmp' to 'dispatch_queue',
+ * we don't take inode->i_lock here because it is just a pointless overhead.
+ * Inode is already marked as I_SYNC_QUEUED so writeback list handling is
+ * fully under our control.
+ */
while (!list_empty(&tmp)) {
sb = wb_inode(tmp.prev)->i_sb;
list_for_each_prev_safe(pos, node, &tmp) {
@@ -1826,8 +1832,8 @@ static long writeback_sb_inodes(struct super_block *sb,
* We'll have another go at writing back this inode
* when we completed a full scan of b_io.
*/
- spin_unlock(&inode->i_lock);
requeue_io(inode, wb);
+ spin_unlock(&inode->i_lock);
trace_writeback_sb_inodes_requeue(inode);
continue;
}
@@ -2358,6 +2364,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
{
struct super_block *sb = inode->i_sb;
int dirtytime = 0;
+ struct bdi_writeback *wb = NULL;
trace_writeback_mark_inode_dirty(inode, flags);
@@ -2409,6 +2416,17 @@ void __mark_inode_dirty(struct inode *inode, int flags)
inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;
+ /*
+ * Grab inode's wb early because it requires dropping i_lock and we
+ * need to make sure following checks happen atomically with dirty
+ * list handling so that we don't move inodes under flush worker's
+ * hands.
+ */
+ if (!was_dirty) {
+ wb = locked_inode_to_wb_and_lock_list(inode);
+ spin_lock(&inode->i_lock);
+ }
+
/*
* If the inode is queued for writeback by flush worker, just
* update its dirty state. Once the flush worker is done with
@@ -2416,7 +2434,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* list, based upon its state.
*/
if (inode->i_state & I_SYNC_QUEUED)
- goto out_unlock_inode;
+ goto out_unlock;
/*
* Only add valid (hashed) inodes to the superblock's
@@ -2424,22 +2442,19 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
if (!S_ISBLK(inode->i_mode)) {
if (inode_unhashed(inode))
- goto out_unlock_inode;
+ goto out_unlock;
}
if (inode->i_state & I_FREEING)
- goto out_unlock_inode;
+ goto out_unlock;
/*
* If the inode was already on b_dirty/b_io/b_more_io, don't
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
- struct bdi_writeback *wb;
struct list_head *dirty_list;
bool wakeup_bdi = false;
- wb = locked_inode_to_wb_and_lock_list(inode);
-
inode->dirtied_when = jiffies;
if (dirtytime)
inode->dirtied_time_when = jiffies;
@@ -2453,6 +2468,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
dirty_list);
spin_unlock(&wb->list_lock);
+ spin_unlock(&inode->i_lock);
trace_writeback_dirty_inode_enqueue(inode);
/*
@@ -2467,6 +2483,9 @@ void __mark_inode_dirty(struct inode *inode, int flags)
return;
}
}
+out_unlock:
+ if (wb)
+ spin_unlock(&wb->list_lock);
out_unlock_inode:
spin_unlock(&inode->i_lock);
}
diff --git a/fs/inode.c b/fs/inode.c
index 9d9b422504d1..bd4da9c5207e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -27,7 +27,7 @@
* Inode locking rules:
*
* inode->i_lock protects:
- * inode->i_state, inode->i_hash, __iget()
+ * inode->i_state, inode->i_hash, __iget(), inode->i_io_list
* Inode LRU list locks protect:
* inode->i_sb->s_inode_lru, inode->i_lru
* inode->i_sb->s_inode_list_lock protects:
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 10e14073107dd0b6d97d9516a02845a8e501c2c9 Mon Sep 17 00:00:00 2001
From: Jchao Sun <sunjunchao2870(a)gmail.com>
Date: Tue, 24 May 2022 08:05:40 -0700
Subject: [PATCH] writeback: Fix inode->i_io_list not be protected by
inode->i_lock error
Commit b35250c0816c ("writeback: Protect inode->i_io_list with
inode->i_lock") made inode->i_io_list not only protected by
wb->list_lock but also inode->i_lock, but inode_io_list_move_locked()
was missed. Add lock there and also update comment describing
things protected by inode->i_lock. This also fixes a race where
__mark_inode_dirty() could move inode under flush worker's hands
and thus sync(2) could miss writing some inodes.
Fixes: b35250c0816c ("writeback: Protect inode->i_io_list with inode->i_lock")
Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com
CC: stable(a)vger.kernel.org
Signed-off-by: Jchao Sun <sunjunchao2870(a)gmail.com>
Signed-off-by: Jan Kara <jack(a)suse.cz>
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a21d8f1a56d1..05221366a16d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -120,6 +120,7 @@ static bool inode_io_list_move_locked(struct inode *inode,
struct list_head *head)
{
assert_spin_locked(&wb->list_lock);
+ assert_spin_locked(&inode->i_lock);
list_move(&inode->i_io_list, head);
@@ -1365,9 +1366,9 @@ static int move_expired_inodes(struct list_head *delaying_queue,
inode = wb_inode(delaying_queue->prev);
if (inode_dirtied_after(inode, dirtied_before))
break;
+ spin_lock(&inode->i_lock);
list_move(&inode->i_io_list, &tmp);
moved++;
- spin_lock(&inode->i_lock);
inode->i_state |= I_SYNC_QUEUED;
spin_unlock(&inode->i_lock);
if (sb_is_blkdev_sb(inode->i_sb))
@@ -1383,7 +1384,12 @@ static int move_expired_inodes(struct list_head *delaying_queue,
goto out;
}
- /* Move inodes from one superblock together */
+ /*
+ * Although inode's i_io_list is moved from 'tmp' to 'dispatch_queue',
+ * we don't take inode->i_lock here because it is just a pointless overhead.
+ * Inode is already marked as I_SYNC_QUEUED so writeback list handling is
+ * fully under our control.
+ */
while (!list_empty(&tmp)) {
sb = wb_inode(tmp.prev)->i_sb;
list_for_each_prev_safe(pos, node, &tmp) {
@@ -1826,8 +1832,8 @@ static long writeback_sb_inodes(struct super_block *sb,
* We'll have another go at writing back this inode
* when we completed a full scan of b_io.
*/
- spin_unlock(&inode->i_lock);
requeue_io(inode, wb);
+ spin_unlock(&inode->i_lock);
trace_writeback_sb_inodes_requeue(inode);
continue;
}
@@ -2358,6 +2364,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
{
struct super_block *sb = inode->i_sb;
int dirtytime = 0;
+ struct bdi_writeback *wb = NULL;
trace_writeback_mark_inode_dirty(inode, flags);
@@ -2409,6 +2416,17 @@ void __mark_inode_dirty(struct inode *inode, int flags)
inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;
+ /*
+ * Grab inode's wb early because it requires dropping i_lock and we
+ * need to make sure following checks happen atomically with dirty
+ * list handling so that we don't move inodes under flush worker's
+ * hands.
+ */
+ if (!was_dirty) {
+ wb = locked_inode_to_wb_and_lock_list(inode);
+ spin_lock(&inode->i_lock);
+ }
+
/*
* If the inode is queued for writeback by flush worker, just
* update its dirty state. Once the flush worker is done with
@@ -2416,7 +2434,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* list, based upon its state.
*/
if (inode->i_state & I_SYNC_QUEUED)
- goto out_unlock_inode;
+ goto out_unlock;
/*
* Only add valid (hashed) inodes to the superblock's
@@ -2424,22 +2442,19 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
if (!S_ISBLK(inode->i_mode)) {
if (inode_unhashed(inode))
- goto out_unlock_inode;
+ goto out_unlock;
}
if (inode->i_state & I_FREEING)
- goto out_unlock_inode;
+ goto out_unlock;
/*
* If the inode was already on b_dirty/b_io/b_more_io, don't
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
- struct bdi_writeback *wb;
struct list_head *dirty_list;
bool wakeup_bdi = false;
- wb = locked_inode_to_wb_and_lock_list(inode);
-
inode->dirtied_when = jiffies;
if (dirtytime)
inode->dirtied_time_when = jiffies;
@@ -2453,6 +2468,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
dirty_list);
spin_unlock(&wb->list_lock);
+ spin_unlock(&inode->i_lock);
trace_writeback_dirty_inode_enqueue(inode);
/*
@@ -2467,6 +2483,9 @@ void __mark_inode_dirty(struct inode *inode, int flags)
return;
}
}
+out_unlock:
+ if (wb)
+ spin_unlock(&wb->list_lock);
out_unlock_inode:
spin_unlock(&inode->i_lock);
}
diff --git a/fs/inode.c b/fs/inode.c
index 9d9b422504d1..bd4da9c5207e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -27,7 +27,7 @@
* Inode locking rules:
*
* inode->i_lock protects:
- * inode->i_state, inode->i_hash, __iget()
+ * inode->i_state, inode->i_hash, __iget(), inode->i_io_list
* Inode LRU list locks protect:
* inode->i_sb->s_inode_lru, inode->i_lru
* inode->i_sb->s_inode_list_lock protects:
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 10e14073107dd0b6d97d9516a02845a8e501c2c9 Mon Sep 17 00:00:00 2001
From: Jchao Sun <sunjunchao2870(a)gmail.com>
Date: Tue, 24 May 2022 08:05:40 -0700
Subject: [PATCH] writeback: Fix inode->i_io_list not be protected by
inode->i_lock error
Commit b35250c0816c ("writeback: Protect inode->i_io_list with
inode->i_lock") made inode->i_io_list not only protected by
wb->list_lock but also inode->i_lock, but inode_io_list_move_locked()
was missed. Add lock there and also update comment describing
things protected by inode->i_lock. This also fixes a race where
__mark_inode_dirty() could move inode under flush worker's hands
and thus sync(2) could miss writing some inodes.
Fixes: b35250c0816c ("writeback: Protect inode->i_io_list with inode->i_lock")
Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com
CC: stable(a)vger.kernel.org
Signed-off-by: Jchao Sun <sunjunchao2870(a)gmail.com>
Signed-off-by: Jan Kara <jack(a)suse.cz>
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a21d8f1a56d1..05221366a16d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -120,6 +120,7 @@ static bool inode_io_list_move_locked(struct inode *inode,
struct list_head *head)
{
assert_spin_locked(&wb->list_lock);
+ assert_spin_locked(&inode->i_lock);
list_move(&inode->i_io_list, head);
@@ -1365,9 +1366,9 @@ static int move_expired_inodes(struct list_head *delaying_queue,
inode = wb_inode(delaying_queue->prev);
if (inode_dirtied_after(inode, dirtied_before))
break;
+ spin_lock(&inode->i_lock);
list_move(&inode->i_io_list, &tmp);
moved++;
- spin_lock(&inode->i_lock);
inode->i_state |= I_SYNC_QUEUED;
spin_unlock(&inode->i_lock);
if (sb_is_blkdev_sb(inode->i_sb))
@@ -1383,7 +1384,12 @@ static int move_expired_inodes(struct list_head *delaying_queue,
goto out;
}
- /* Move inodes from one superblock together */
+ /*
+ * Although inode's i_io_list is moved from 'tmp' to 'dispatch_queue',
+ * we don't take inode->i_lock here because it is just a pointless overhead.
+ * Inode is already marked as I_SYNC_QUEUED so writeback list handling is
+ * fully under our control.
+ */
while (!list_empty(&tmp)) {
sb = wb_inode(tmp.prev)->i_sb;
list_for_each_prev_safe(pos, node, &tmp) {
@@ -1826,8 +1832,8 @@ static long writeback_sb_inodes(struct super_block *sb,
* We'll have another go at writing back this inode
* when we completed a full scan of b_io.
*/
- spin_unlock(&inode->i_lock);
requeue_io(inode, wb);
+ spin_unlock(&inode->i_lock);
trace_writeback_sb_inodes_requeue(inode);
continue;
}
@@ -2358,6 +2364,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
{
struct super_block *sb = inode->i_sb;
int dirtytime = 0;
+ struct bdi_writeback *wb = NULL;
trace_writeback_mark_inode_dirty(inode, flags);
@@ -2409,6 +2416,17 @@ void __mark_inode_dirty(struct inode *inode, int flags)
inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;
+ /*
+ * Grab inode's wb early because it requires dropping i_lock and we
+ * need to make sure following checks happen atomically with dirty
+ * list handling so that we don't move inodes under flush worker's
+ * hands.
+ */
+ if (!was_dirty) {
+ wb = locked_inode_to_wb_and_lock_list(inode);
+ spin_lock(&inode->i_lock);
+ }
+
/*
* If the inode is queued for writeback by flush worker, just
* update its dirty state. Once the flush worker is done with
@@ -2416,7 +2434,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* list, based upon its state.
*/
if (inode->i_state & I_SYNC_QUEUED)
- goto out_unlock_inode;
+ goto out_unlock;
/*
* Only add valid (hashed) inodes to the superblock's
@@ -2424,22 +2442,19 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
if (!S_ISBLK(inode->i_mode)) {
if (inode_unhashed(inode))
- goto out_unlock_inode;
+ goto out_unlock;
}
if (inode->i_state & I_FREEING)
- goto out_unlock_inode;
+ goto out_unlock;
/*
* If the inode was already on b_dirty/b_io/b_more_io, don't
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
- struct bdi_writeback *wb;
struct list_head *dirty_list;
bool wakeup_bdi = false;
- wb = locked_inode_to_wb_and_lock_list(inode);
-
inode->dirtied_when = jiffies;
if (dirtytime)
inode->dirtied_time_when = jiffies;
@@ -2453,6 +2468,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
dirty_list);
spin_unlock(&wb->list_lock);
+ spin_unlock(&inode->i_lock);
trace_writeback_dirty_inode_enqueue(inode);
/*
@@ -2467,6 +2483,9 @@ void __mark_inode_dirty(struct inode *inode, int flags)
return;
}
}
+out_unlock:
+ if (wb)
+ spin_unlock(&wb->list_lock);
out_unlock_inode:
spin_unlock(&inode->i_lock);
}
diff --git a/fs/inode.c b/fs/inode.c
index 9d9b422504d1..bd4da9c5207e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -27,7 +27,7 @@
* Inode locking rules:
*
* inode->i_lock protects:
- * inode->i_state, inode->i_hash, __iget()
+ * inode->i_state, inode->i_hash, __iget(), inode->i_io_list
* Inode LRU list locks protect:
* inode->i_sb->s_inode_lru, inode->i_lru
* inode->i_sb->s_inode_list_lock protects:
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 10e14073107dd0b6d97d9516a02845a8e501c2c9 Mon Sep 17 00:00:00 2001
From: Jchao Sun <sunjunchao2870(a)gmail.com>
Date: Tue, 24 May 2022 08:05:40 -0700
Subject: [PATCH] writeback: Fix inode->i_io_list not be protected by
inode->i_lock error
Commit b35250c0816c ("writeback: Protect inode->i_io_list with
inode->i_lock") made inode->i_io_list not only protected by
wb->list_lock but also inode->i_lock, but inode_io_list_move_locked()
was missed. Add lock there and also update comment describing
things protected by inode->i_lock. This also fixes a race where
__mark_inode_dirty() could move inode under flush worker's hands
and thus sync(2) could miss writing some inodes.
Fixes: b35250c0816c ("writeback: Protect inode->i_io_list with inode->i_lock")
Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com
CC: stable(a)vger.kernel.org
Signed-off-by: Jchao Sun <sunjunchao2870(a)gmail.com>
Signed-off-by: Jan Kara <jack(a)suse.cz>
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a21d8f1a56d1..05221366a16d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -120,6 +120,7 @@ static bool inode_io_list_move_locked(struct inode *inode,
struct list_head *head)
{
assert_spin_locked(&wb->list_lock);
+ assert_spin_locked(&inode->i_lock);
list_move(&inode->i_io_list, head);
@@ -1365,9 +1366,9 @@ static int move_expired_inodes(struct list_head *delaying_queue,
inode = wb_inode(delaying_queue->prev);
if (inode_dirtied_after(inode, dirtied_before))
break;
+ spin_lock(&inode->i_lock);
list_move(&inode->i_io_list, &tmp);
moved++;
- spin_lock(&inode->i_lock);
inode->i_state |= I_SYNC_QUEUED;
spin_unlock(&inode->i_lock);
if (sb_is_blkdev_sb(inode->i_sb))
@@ -1383,7 +1384,12 @@ static int move_expired_inodes(struct list_head *delaying_queue,
goto out;
}
- /* Move inodes from one superblock together */
+ /*
+ * Although inode's i_io_list is moved from 'tmp' to 'dispatch_queue',
+ * we don't take inode->i_lock here because it is just a pointless overhead.
+ * Inode is already marked as I_SYNC_QUEUED so writeback list handling is
+ * fully under our control.
+ */
while (!list_empty(&tmp)) {
sb = wb_inode(tmp.prev)->i_sb;
list_for_each_prev_safe(pos, node, &tmp) {
@@ -1826,8 +1832,8 @@ static long writeback_sb_inodes(struct super_block *sb,
* We'll have another go at writing back this inode
* when we completed a full scan of b_io.
*/
- spin_unlock(&inode->i_lock);
requeue_io(inode, wb);
+ spin_unlock(&inode->i_lock);
trace_writeback_sb_inodes_requeue(inode);
continue;
}
@@ -2358,6 +2364,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
{
struct super_block *sb = inode->i_sb;
int dirtytime = 0;
+ struct bdi_writeback *wb = NULL;
trace_writeback_mark_inode_dirty(inode, flags);
@@ -2409,6 +2416,17 @@ void __mark_inode_dirty(struct inode *inode, int flags)
inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;
+ /*
+ * Grab inode's wb early because it requires dropping i_lock and we
+ * need to make sure following checks happen atomically with dirty
+ * list handling so that we don't move inodes under flush worker's
+ * hands.
+ */
+ if (!was_dirty) {
+ wb = locked_inode_to_wb_and_lock_list(inode);
+ spin_lock(&inode->i_lock);
+ }
+
/*
* If the inode is queued for writeback by flush worker, just
* update its dirty state. Once the flush worker is done with
@@ -2416,7 +2434,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* list, based upon its state.
*/
if (inode->i_state & I_SYNC_QUEUED)
- goto out_unlock_inode;
+ goto out_unlock;
/*
* Only add valid (hashed) inodes to the superblock's
@@ -2424,22 +2442,19 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
if (!S_ISBLK(inode->i_mode)) {
if (inode_unhashed(inode))
- goto out_unlock_inode;
+ goto out_unlock;
}
if (inode->i_state & I_FREEING)
- goto out_unlock_inode;
+ goto out_unlock;
/*
* If the inode was already on b_dirty/b_io/b_more_io, don't
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
- struct bdi_writeback *wb;
struct list_head *dirty_list;
bool wakeup_bdi = false;
- wb = locked_inode_to_wb_and_lock_list(inode);
-
inode->dirtied_when = jiffies;
if (dirtytime)
inode->dirtied_time_when = jiffies;
@@ -2453,6 +2468,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
dirty_list);
spin_unlock(&wb->list_lock);
+ spin_unlock(&inode->i_lock);
trace_writeback_dirty_inode_enqueue(inode);
/*
@@ -2467,6 +2483,9 @@ void __mark_inode_dirty(struct inode *inode, int flags)
return;
}
}
+out_unlock:
+ if (wb)
+ spin_unlock(&wb->list_lock);
out_unlock_inode:
spin_unlock(&inode->i_lock);
}
diff --git a/fs/inode.c b/fs/inode.c
index 9d9b422504d1..bd4da9c5207e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -27,7 +27,7 @@
* Inode locking rules:
*
* inode->i_lock protects:
- * inode->i_state, inode->i_hash, __iget()
+ * inode->i_state, inode->i_hash, __iget(), inode->i_io_list
* Inode LRU list locks protect:
* inode->i_sb->s_inode_lru, inode->i_lru
* inode->i_sb->s_inode_list_lock protects:
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2061ecfdf2350994e5b61c43e50e98a7a70e95ee Mon Sep 17 00:00:00 2001
From: Ilya Maximets <i.maximets(a)ovn.org>
Date: Tue, 7 Jun 2022 00:11:40 +0200
Subject: [PATCH] net: openvswitch: fix misuse of the cached connection on
tuple changes
If packet headers changed, the cached nfct is no longer relevant
for the packet and attempt to re-use it leads to the incorrect packet
classification.
This issue is causing broken connectivity in OpenStack deployments
with OVS/OVN due to hairpin traffic being unexpectedly dropped.
The setup has datapath flows with several conntrack actions and tuple
changes between them:
actions:ct(commit,zone=8,mark=0/0x1,nat(src)),
set(eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:06)),
set(ipv4(src=172.18.2.10,dst=192.168.100.6,ttl=62)),
ct(zone=8),recirc(0x4)
After the first ct() action the packet headers are almost fully
re-written. The next ct() tries to re-use the existing nfct entry
and marks the packet as invalid, so it gets dropped later in the
pipeline.
Clearing the cached conntrack entry whenever packet tuple is changed
to avoid the issue.
The flow key should not be cleared though, because we should still
be able to match on the ct_state if the recirculation happens after
the tuple change but before the next ct() action.
Cc: stable(a)vger.kernel.org
Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Reported-by: Frode Nordahl <frode.nordahl(a)canonical.com>
Link: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-May/051829.html
Link: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856
Signed-off-by: Ilya Maximets <i.maximets(a)ovn.org>
Link: https://lore.kernel.org/r/20220606221140.488984-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 1b5d73079dc9..868db4669a29 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -373,6 +373,7 @@ static void set_ip_addr(struct sk_buff *skb, struct iphdr *nh,
update_ip_l4_checksum(skb, nh, *addr, new_addr);
csum_replace4(&nh->check, *addr, new_addr);
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
*addr = new_addr;
}
@@ -420,6 +421,7 @@ static void set_ipv6_addr(struct sk_buff *skb, u8 l4_proto,
update_ipv6_checksum(skb, l4_proto, addr, new_addr);
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
memcpy(addr, new_addr, sizeof(__be32[4]));
}
@@ -660,6 +662,7 @@ static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
static void set_tp_port(struct sk_buff *skb, __be16 *port,
__be16 new_port, __sum16 *check)
{
+ ovs_ct_clear(skb, NULL);
inet_proto_csum_replace2(check, skb, *port, new_port, false);
*port = new_port;
}
@@ -699,6 +702,7 @@ static int set_udp(struct sk_buff *skb, struct sw_flow_key *flow_key,
uh->dest = dst;
flow_key->tp.src = src;
flow_key->tp.dst = dst;
+ ovs_ct_clear(skb, NULL);
}
skb_clear_hash(skb);
@@ -761,6 +765,8 @@ static int set_sctp(struct sk_buff *skb, struct sw_flow_key *flow_key,
sh->checksum = old_csum ^ old_correct_csum ^ new_csum;
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
+
flow_key->tp.src = sh->source;
flow_key->tp.dst = sh->dest;
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 4a947c13c813..4e70df91d0f2 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1342,7 +1342,9 @@ int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
nf_ct_put(ct);
nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
- ovs_ct_fill_key(skb, key, false);
+
+ if (key)
+ ovs_ct_fill_key(skb, key, false);
return 0;
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2061ecfdf2350994e5b61c43e50e98a7a70e95ee Mon Sep 17 00:00:00 2001
From: Ilya Maximets <i.maximets(a)ovn.org>
Date: Tue, 7 Jun 2022 00:11:40 +0200
Subject: [PATCH] net: openvswitch: fix misuse of the cached connection on
tuple changes
If packet headers changed, the cached nfct is no longer relevant
for the packet and attempt to re-use it leads to the incorrect packet
classification.
This issue is causing broken connectivity in OpenStack deployments
with OVS/OVN due to hairpin traffic being unexpectedly dropped.
The setup has datapath flows with several conntrack actions and tuple
changes between them:
actions:ct(commit,zone=8,mark=0/0x1,nat(src)),
set(eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:06)),
set(ipv4(src=172.18.2.10,dst=192.168.100.6,ttl=62)),
ct(zone=8),recirc(0x4)
After the first ct() action the packet headers are almost fully
re-written. The next ct() tries to re-use the existing nfct entry
and marks the packet as invalid, so it gets dropped later in the
pipeline.
Clearing the cached conntrack entry whenever packet tuple is changed
to avoid the issue.
The flow key should not be cleared though, because we should still
be able to match on the ct_state if the recirculation happens after
the tuple change but before the next ct() action.
Cc: stable(a)vger.kernel.org
Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Reported-by: Frode Nordahl <frode.nordahl(a)canonical.com>
Link: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-May/051829.html
Link: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856
Signed-off-by: Ilya Maximets <i.maximets(a)ovn.org>
Link: https://lore.kernel.org/r/20220606221140.488984-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 1b5d73079dc9..868db4669a29 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -373,6 +373,7 @@ static void set_ip_addr(struct sk_buff *skb, struct iphdr *nh,
update_ip_l4_checksum(skb, nh, *addr, new_addr);
csum_replace4(&nh->check, *addr, new_addr);
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
*addr = new_addr;
}
@@ -420,6 +421,7 @@ static void set_ipv6_addr(struct sk_buff *skb, u8 l4_proto,
update_ipv6_checksum(skb, l4_proto, addr, new_addr);
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
memcpy(addr, new_addr, sizeof(__be32[4]));
}
@@ -660,6 +662,7 @@ static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
static void set_tp_port(struct sk_buff *skb, __be16 *port,
__be16 new_port, __sum16 *check)
{
+ ovs_ct_clear(skb, NULL);
inet_proto_csum_replace2(check, skb, *port, new_port, false);
*port = new_port;
}
@@ -699,6 +702,7 @@ static int set_udp(struct sk_buff *skb, struct sw_flow_key *flow_key,
uh->dest = dst;
flow_key->tp.src = src;
flow_key->tp.dst = dst;
+ ovs_ct_clear(skb, NULL);
}
skb_clear_hash(skb);
@@ -761,6 +765,8 @@ static int set_sctp(struct sk_buff *skb, struct sw_flow_key *flow_key,
sh->checksum = old_csum ^ old_correct_csum ^ new_csum;
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
+
flow_key->tp.src = sh->source;
flow_key->tp.dst = sh->dest;
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 4a947c13c813..4e70df91d0f2 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1342,7 +1342,9 @@ int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
nf_ct_put(ct);
nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
- ovs_ct_fill_key(skb, key, false);
+
+ if (key)
+ ovs_ct_fill_key(skb, key, false);
return 0;
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2061ecfdf2350994e5b61c43e50e98a7a70e95ee Mon Sep 17 00:00:00 2001
From: Ilya Maximets <i.maximets(a)ovn.org>
Date: Tue, 7 Jun 2022 00:11:40 +0200
Subject: [PATCH] net: openvswitch: fix misuse of the cached connection on
tuple changes
If packet headers changed, the cached nfct is no longer relevant
for the packet and attempt to re-use it leads to the incorrect packet
classification.
This issue is causing broken connectivity in OpenStack deployments
with OVS/OVN due to hairpin traffic being unexpectedly dropped.
The setup has datapath flows with several conntrack actions and tuple
changes between them:
actions:ct(commit,zone=8,mark=0/0x1,nat(src)),
set(eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:06)),
set(ipv4(src=172.18.2.10,dst=192.168.100.6,ttl=62)),
ct(zone=8),recirc(0x4)
After the first ct() action the packet headers are almost fully
re-written. The next ct() tries to re-use the existing nfct entry
and marks the packet as invalid, so it gets dropped later in the
pipeline.
Clearing the cached conntrack entry whenever packet tuple is changed
to avoid the issue.
The flow key should not be cleared though, because we should still
be able to match on the ct_state if the recirculation happens after
the tuple change but before the next ct() action.
Cc: stable(a)vger.kernel.org
Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Reported-by: Frode Nordahl <frode.nordahl(a)canonical.com>
Link: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-May/051829.html
Link: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856
Signed-off-by: Ilya Maximets <i.maximets(a)ovn.org>
Link: https://lore.kernel.org/r/20220606221140.488984-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 1b5d73079dc9..868db4669a29 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -373,6 +373,7 @@ static void set_ip_addr(struct sk_buff *skb, struct iphdr *nh,
update_ip_l4_checksum(skb, nh, *addr, new_addr);
csum_replace4(&nh->check, *addr, new_addr);
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
*addr = new_addr;
}
@@ -420,6 +421,7 @@ static void set_ipv6_addr(struct sk_buff *skb, u8 l4_proto,
update_ipv6_checksum(skb, l4_proto, addr, new_addr);
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
memcpy(addr, new_addr, sizeof(__be32[4]));
}
@@ -660,6 +662,7 @@ static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
static void set_tp_port(struct sk_buff *skb, __be16 *port,
__be16 new_port, __sum16 *check)
{
+ ovs_ct_clear(skb, NULL);
inet_proto_csum_replace2(check, skb, *port, new_port, false);
*port = new_port;
}
@@ -699,6 +702,7 @@ static int set_udp(struct sk_buff *skb, struct sw_flow_key *flow_key,
uh->dest = dst;
flow_key->tp.src = src;
flow_key->tp.dst = dst;
+ ovs_ct_clear(skb, NULL);
}
skb_clear_hash(skb);
@@ -761,6 +765,8 @@ static int set_sctp(struct sk_buff *skb, struct sw_flow_key *flow_key,
sh->checksum = old_csum ^ old_correct_csum ^ new_csum;
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
+
flow_key->tp.src = sh->source;
flow_key->tp.dst = sh->dest;
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 4a947c13c813..4e70df91d0f2 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1342,7 +1342,9 @@ int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
nf_ct_put(ct);
nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
- ovs_ct_fill_key(skb, key, false);
+
+ if (key)
+ ovs_ct_fill_key(skb, key, false);
return 0;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2061ecfdf2350994e5b61c43e50e98a7a70e95ee Mon Sep 17 00:00:00 2001
From: Ilya Maximets <i.maximets(a)ovn.org>
Date: Tue, 7 Jun 2022 00:11:40 +0200
Subject: [PATCH] net: openvswitch: fix misuse of the cached connection on
tuple changes
If packet headers changed, the cached nfct is no longer relevant
for the packet and attempt to re-use it leads to the incorrect packet
classification.
This issue is causing broken connectivity in OpenStack deployments
with OVS/OVN due to hairpin traffic being unexpectedly dropped.
The setup has datapath flows with several conntrack actions and tuple
changes between them:
actions:ct(commit,zone=8,mark=0/0x1,nat(src)),
set(eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:06)),
set(ipv4(src=172.18.2.10,dst=192.168.100.6,ttl=62)),
ct(zone=8),recirc(0x4)
After the first ct() action the packet headers are almost fully
re-written. The next ct() tries to re-use the existing nfct entry
and marks the packet as invalid, so it gets dropped later in the
pipeline.
Clearing the cached conntrack entry whenever packet tuple is changed
to avoid the issue.
The flow key should not be cleared though, because we should still
be able to match on the ct_state if the recirculation happens after
the tuple change but before the next ct() action.
Cc: stable(a)vger.kernel.org
Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Reported-by: Frode Nordahl <frode.nordahl(a)canonical.com>
Link: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-May/051829.html
Link: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856
Signed-off-by: Ilya Maximets <i.maximets(a)ovn.org>
Link: https://lore.kernel.org/r/20220606221140.488984-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 1b5d73079dc9..868db4669a29 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -373,6 +373,7 @@ static void set_ip_addr(struct sk_buff *skb, struct iphdr *nh,
update_ip_l4_checksum(skb, nh, *addr, new_addr);
csum_replace4(&nh->check, *addr, new_addr);
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
*addr = new_addr;
}
@@ -420,6 +421,7 @@ static void set_ipv6_addr(struct sk_buff *skb, u8 l4_proto,
update_ipv6_checksum(skb, l4_proto, addr, new_addr);
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
memcpy(addr, new_addr, sizeof(__be32[4]));
}
@@ -660,6 +662,7 @@ static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
static void set_tp_port(struct sk_buff *skb, __be16 *port,
__be16 new_port, __sum16 *check)
{
+ ovs_ct_clear(skb, NULL);
inet_proto_csum_replace2(check, skb, *port, new_port, false);
*port = new_port;
}
@@ -699,6 +702,7 @@ static int set_udp(struct sk_buff *skb, struct sw_flow_key *flow_key,
uh->dest = dst;
flow_key->tp.src = src;
flow_key->tp.dst = dst;
+ ovs_ct_clear(skb, NULL);
}
skb_clear_hash(skb);
@@ -761,6 +765,8 @@ static int set_sctp(struct sk_buff *skb, struct sw_flow_key *flow_key,
sh->checksum = old_csum ^ old_correct_csum ^ new_csum;
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
+
flow_key->tp.src = sh->source;
flow_key->tp.dst = sh->dest;
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 4a947c13c813..4e70df91d0f2 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1342,7 +1342,9 @@ int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
nf_ct_put(ct);
nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
- ovs_ct_fill_key(skb, key, false);
+
+ if (key)
+ ovs_ct_fill_key(skb, key, false);
return 0;
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2061ecfdf2350994e5b61c43e50e98a7a70e95ee Mon Sep 17 00:00:00 2001
From: Ilya Maximets <i.maximets(a)ovn.org>
Date: Tue, 7 Jun 2022 00:11:40 +0200
Subject: [PATCH] net: openvswitch: fix misuse of the cached connection on
tuple changes
If packet headers changed, the cached nfct is no longer relevant
for the packet and attempt to re-use it leads to the incorrect packet
classification.
This issue is causing broken connectivity in OpenStack deployments
with OVS/OVN due to hairpin traffic being unexpectedly dropped.
The setup has datapath flows with several conntrack actions and tuple
changes between them:
actions:ct(commit,zone=8,mark=0/0x1,nat(src)),
set(eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:06)),
set(ipv4(src=172.18.2.10,dst=192.168.100.6,ttl=62)),
ct(zone=8),recirc(0x4)
After the first ct() action the packet headers are almost fully
re-written. The next ct() tries to re-use the existing nfct entry
and marks the packet as invalid, so it gets dropped later in the
pipeline.
Clearing the cached conntrack entry whenever packet tuple is changed
to avoid the issue.
The flow key should not be cleared though, because we should still
be able to match on the ct_state if the recirculation happens after
the tuple change but before the next ct() action.
Cc: stable(a)vger.kernel.org
Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Reported-by: Frode Nordahl <frode.nordahl(a)canonical.com>
Link: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-May/051829.html
Link: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856
Signed-off-by: Ilya Maximets <i.maximets(a)ovn.org>
Link: https://lore.kernel.org/r/20220606221140.488984-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 1b5d73079dc9..868db4669a29 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -373,6 +373,7 @@ static void set_ip_addr(struct sk_buff *skb, struct iphdr *nh,
update_ip_l4_checksum(skb, nh, *addr, new_addr);
csum_replace4(&nh->check, *addr, new_addr);
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
*addr = new_addr;
}
@@ -420,6 +421,7 @@ static void set_ipv6_addr(struct sk_buff *skb, u8 l4_proto,
update_ipv6_checksum(skb, l4_proto, addr, new_addr);
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
memcpy(addr, new_addr, sizeof(__be32[4]));
}
@@ -660,6 +662,7 @@ static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
static void set_tp_port(struct sk_buff *skb, __be16 *port,
__be16 new_port, __sum16 *check)
{
+ ovs_ct_clear(skb, NULL);
inet_proto_csum_replace2(check, skb, *port, new_port, false);
*port = new_port;
}
@@ -699,6 +702,7 @@ static int set_udp(struct sk_buff *skb, struct sw_flow_key *flow_key,
uh->dest = dst;
flow_key->tp.src = src;
flow_key->tp.dst = dst;
+ ovs_ct_clear(skb, NULL);
}
skb_clear_hash(skb);
@@ -761,6 +765,8 @@ static int set_sctp(struct sk_buff *skb, struct sw_flow_key *flow_key,
sh->checksum = old_csum ^ old_correct_csum ^ new_csum;
skb_clear_hash(skb);
+ ovs_ct_clear(skb, NULL);
+
flow_key->tp.src = sh->source;
flow_key->tp.dst = sh->dest;
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 4a947c13c813..4e70df91d0f2 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1342,7 +1342,9 @@ int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
nf_ct_put(ct);
nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
- ovs_ct_fill_key(skb, key, false);
+
+ if (key)
+ ovs_ct_fill_key(skb, key, false);
return 0;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c76acfb7e19dcc3a0964e0563770b1d11b8d4540 Mon Sep 17 00:00:00 2001
From: Tan Tee Min <tee.min.tan(a)linux.intel.com>
Date: Thu, 26 May 2022 17:03:47 +0800
Subject: [PATCH] net: phy: dp83867: retrigger SGMII AN when link change
There is a limitation in TI DP83867 PHY device where SGMII AN is only
triggered once after the device is booted up. Even after the PHY TPI is
down and up again, SGMII AN is not triggered and hence no new in-band
message from PHY to MAC side SGMII.
This could cause an issue during power up, when PHY is up prior to MAC.
At this condition, once MAC side SGMII is up, MAC side SGMII wouldn`t
receive new in-band message from TI PHY with correct link status, speed
and duplex info.
As suggested by TI, implemented a SW solution here to retrigger SGMII
Auto-Neg whenever there is a link change.
v2: Add Fixes tag in commit message.
Fixes: 2a10154abcb7 ("net: phy: dp83867: Add TI dp83867 phy")
Cc: <stable(a)vger.kernel.org> # 5.4.x
Signed-off-by: Sit, Michael Wei Hong <michael.wei.hong.sit(a)intel.com>
Reviewed-by: Voon Weifeng <weifeng.voon(a)intel.com>
Signed-off-by: Tan Tee Min <tee.min.tan(a)linux.intel.com>
Reviewed-by: Andrew Lunn <andrew(a)lunn.ch>
Link: https://lore.kernel.org/r/20220526090347.128742-1-tee.min.tan@linux.intel.c…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
index 8561f2d4443b..13dafe7a29bd 100644
--- a/drivers/net/phy/dp83867.c
+++ b/drivers/net/phy/dp83867.c
@@ -137,6 +137,7 @@
#define DP83867_DOWNSHIFT_2_COUNT 2
#define DP83867_DOWNSHIFT_4_COUNT 4
#define DP83867_DOWNSHIFT_8_COUNT 8
+#define DP83867_SGMII_AUTONEG_EN BIT(7)
/* CFG3 bits */
#define DP83867_CFG3_INT_OE BIT(7)
@@ -855,6 +856,32 @@ static int dp83867_phy_reset(struct phy_device *phydev)
DP83867_PHYCR_FORCE_LINK_GOOD, 0);
}
+static void dp83867_link_change_notify(struct phy_device *phydev)
+{
+ /* There is a limitation in DP83867 PHY device where SGMII AN is
+ * only triggered once after the device is booted up. Even after the
+ * PHY TPI is down and up again, SGMII AN is not triggered and
+ * hence no new in-band message from PHY to MAC side SGMII.
+ * This could cause an issue during power up, when PHY is up prior
+ * to MAC. At this condition, once MAC side SGMII is up, MAC side
+ * SGMII wouldn`t receive new in-band message from TI PHY with
+ * correct link status, speed and duplex info.
+ * Thus, implemented a SW solution here to retrigger SGMII Auto-Neg
+ * whenever there is a link change.
+ */
+ if (phydev->interface == PHY_INTERFACE_MODE_SGMII) {
+ int val = 0;
+
+ val = phy_clear_bits(phydev, DP83867_CFG2,
+ DP83867_SGMII_AUTONEG_EN);
+ if (val < 0)
+ return;
+
+ phy_set_bits(phydev, DP83867_CFG2,
+ DP83867_SGMII_AUTONEG_EN);
+ }
+}
+
static struct phy_driver dp83867_driver[] = {
{
.phy_id = DP83867_PHY_ID,
@@ -879,6 +906,8 @@ static struct phy_driver dp83867_driver[] = {
.suspend = genphy_suspend,
.resume = genphy_resume,
+
+ .link_change_notify = dp83867_link_change_notify,
},
};
module_phy_driver(dp83867_driver);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c76acfb7e19dcc3a0964e0563770b1d11b8d4540 Mon Sep 17 00:00:00 2001
From: Tan Tee Min <tee.min.tan(a)linux.intel.com>
Date: Thu, 26 May 2022 17:03:47 +0800
Subject: [PATCH] net: phy: dp83867: retrigger SGMII AN when link change
There is a limitation in TI DP83867 PHY device where SGMII AN is only
triggered once after the device is booted up. Even after the PHY TPI is
down and up again, SGMII AN is not triggered and hence no new in-band
message from PHY to MAC side SGMII.
This could cause an issue during power up, when PHY is up prior to MAC.
At this condition, once MAC side SGMII is up, MAC side SGMII wouldn`t
receive new in-band message from TI PHY with correct link status, speed
and duplex info.
As suggested by TI, implemented a SW solution here to retrigger SGMII
Auto-Neg whenever there is a link change.
v2: Add Fixes tag in commit message.
Fixes: 2a10154abcb7 ("net: phy: dp83867: Add TI dp83867 phy")
Cc: <stable(a)vger.kernel.org> # 5.4.x
Signed-off-by: Sit, Michael Wei Hong <michael.wei.hong.sit(a)intel.com>
Reviewed-by: Voon Weifeng <weifeng.voon(a)intel.com>
Signed-off-by: Tan Tee Min <tee.min.tan(a)linux.intel.com>
Reviewed-by: Andrew Lunn <andrew(a)lunn.ch>
Link: https://lore.kernel.org/r/20220526090347.128742-1-tee.min.tan@linux.intel.c…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
index 8561f2d4443b..13dafe7a29bd 100644
--- a/drivers/net/phy/dp83867.c
+++ b/drivers/net/phy/dp83867.c
@@ -137,6 +137,7 @@
#define DP83867_DOWNSHIFT_2_COUNT 2
#define DP83867_DOWNSHIFT_4_COUNT 4
#define DP83867_DOWNSHIFT_8_COUNT 8
+#define DP83867_SGMII_AUTONEG_EN BIT(7)
/* CFG3 bits */
#define DP83867_CFG3_INT_OE BIT(7)
@@ -855,6 +856,32 @@ static int dp83867_phy_reset(struct phy_device *phydev)
DP83867_PHYCR_FORCE_LINK_GOOD, 0);
}
+static void dp83867_link_change_notify(struct phy_device *phydev)
+{
+ /* There is a limitation in DP83867 PHY device where SGMII AN is
+ * only triggered once after the device is booted up. Even after the
+ * PHY TPI is down and up again, SGMII AN is not triggered and
+ * hence no new in-band message from PHY to MAC side SGMII.
+ * This could cause an issue during power up, when PHY is up prior
+ * to MAC. At this condition, once MAC side SGMII is up, MAC side
+ * SGMII wouldn`t receive new in-band message from TI PHY with
+ * correct link status, speed and duplex info.
+ * Thus, implemented a SW solution here to retrigger SGMII Auto-Neg
+ * whenever there is a link change.
+ */
+ if (phydev->interface == PHY_INTERFACE_MODE_SGMII) {
+ int val = 0;
+
+ val = phy_clear_bits(phydev, DP83867_CFG2,
+ DP83867_SGMII_AUTONEG_EN);
+ if (val < 0)
+ return;
+
+ phy_set_bits(phydev, DP83867_CFG2,
+ DP83867_SGMII_AUTONEG_EN);
+ }
+}
+
static struct phy_driver dp83867_driver[] = {
{
.phy_id = DP83867_PHY_ID,
@@ -879,6 +906,8 @@ static struct phy_driver dp83867_driver[] = {
.suspend = genphy_suspend,
.resume = genphy_resume,
+
+ .link_change_notify = dp83867_link_change_notify,
},
};
module_phy_driver(dp83867_driver);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c76acfb7e19dcc3a0964e0563770b1d11b8d4540 Mon Sep 17 00:00:00 2001
From: Tan Tee Min <tee.min.tan(a)linux.intel.com>
Date: Thu, 26 May 2022 17:03:47 +0800
Subject: [PATCH] net: phy: dp83867: retrigger SGMII AN when link change
There is a limitation in TI DP83867 PHY device where SGMII AN is only
triggered once after the device is booted up. Even after the PHY TPI is
down and up again, SGMII AN is not triggered and hence no new in-band
message from PHY to MAC side SGMII.
This could cause an issue during power up, when PHY is up prior to MAC.
At this condition, once MAC side SGMII is up, MAC side SGMII wouldn`t
receive new in-band message from TI PHY with correct link status, speed
and duplex info.
As suggested by TI, implemented a SW solution here to retrigger SGMII
Auto-Neg whenever there is a link change.
v2: Add Fixes tag in commit message.
Fixes: 2a10154abcb7 ("net: phy: dp83867: Add TI dp83867 phy")
Cc: <stable(a)vger.kernel.org> # 5.4.x
Signed-off-by: Sit, Michael Wei Hong <michael.wei.hong.sit(a)intel.com>
Reviewed-by: Voon Weifeng <weifeng.voon(a)intel.com>
Signed-off-by: Tan Tee Min <tee.min.tan(a)linux.intel.com>
Reviewed-by: Andrew Lunn <andrew(a)lunn.ch>
Link: https://lore.kernel.org/r/20220526090347.128742-1-tee.min.tan@linux.intel.c…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
index 8561f2d4443b..13dafe7a29bd 100644
--- a/drivers/net/phy/dp83867.c
+++ b/drivers/net/phy/dp83867.c
@@ -137,6 +137,7 @@
#define DP83867_DOWNSHIFT_2_COUNT 2
#define DP83867_DOWNSHIFT_4_COUNT 4
#define DP83867_DOWNSHIFT_8_COUNT 8
+#define DP83867_SGMII_AUTONEG_EN BIT(7)
/* CFG3 bits */
#define DP83867_CFG3_INT_OE BIT(7)
@@ -855,6 +856,32 @@ static int dp83867_phy_reset(struct phy_device *phydev)
DP83867_PHYCR_FORCE_LINK_GOOD, 0);
}
+static void dp83867_link_change_notify(struct phy_device *phydev)
+{
+ /* There is a limitation in DP83867 PHY device where SGMII AN is
+ * only triggered once after the device is booted up. Even after the
+ * PHY TPI is down and up again, SGMII AN is not triggered and
+ * hence no new in-band message from PHY to MAC side SGMII.
+ * This could cause an issue during power up, when PHY is up prior
+ * to MAC. At this condition, once MAC side SGMII is up, MAC side
+ * SGMII wouldn`t receive new in-band message from TI PHY with
+ * correct link status, speed and duplex info.
+ * Thus, implemented a SW solution here to retrigger SGMII Auto-Neg
+ * whenever there is a link change.
+ */
+ if (phydev->interface == PHY_INTERFACE_MODE_SGMII) {
+ int val = 0;
+
+ val = phy_clear_bits(phydev, DP83867_CFG2,
+ DP83867_SGMII_AUTONEG_EN);
+ if (val < 0)
+ return;
+
+ phy_set_bits(phydev, DP83867_CFG2,
+ DP83867_SGMII_AUTONEG_EN);
+ }
+}
+
static struct phy_driver dp83867_driver[] = {
{
.phy_id = DP83867_PHY_ID,
@@ -879,6 +906,8 @@ static struct phy_driver dp83867_driver[] = {
.suspend = genphy_suspend,
.resume = genphy_resume,
+
+ .link_change_notify = dp83867_link_change_notify,
},
};
module_phy_driver(dp83867_driver);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c76acfb7e19dcc3a0964e0563770b1d11b8d4540 Mon Sep 17 00:00:00 2001
From: Tan Tee Min <tee.min.tan(a)linux.intel.com>
Date: Thu, 26 May 2022 17:03:47 +0800
Subject: [PATCH] net: phy: dp83867: retrigger SGMII AN when link change
There is a limitation in TI DP83867 PHY device where SGMII AN is only
triggered once after the device is booted up. Even after the PHY TPI is
down and up again, SGMII AN is not triggered and hence no new in-band
message from PHY to MAC side SGMII.
This could cause an issue during power up, when PHY is up prior to MAC.
At this condition, once MAC side SGMII is up, MAC side SGMII wouldn`t
receive new in-band message from TI PHY with correct link status, speed
and duplex info.
As suggested by TI, implemented a SW solution here to retrigger SGMII
Auto-Neg whenever there is a link change.
v2: Add Fixes tag in commit message.
Fixes: 2a10154abcb7 ("net: phy: dp83867: Add TI dp83867 phy")
Cc: <stable(a)vger.kernel.org> # 5.4.x
Signed-off-by: Sit, Michael Wei Hong <michael.wei.hong.sit(a)intel.com>
Reviewed-by: Voon Weifeng <weifeng.voon(a)intel.com>
Signed-off-by: Tan Tee Min <tee.min.tan(a)linux.intel.com>
Reviewed-by: Andrew Lunn <andrew(a)lunn.ch>
Link: https://lore.kernel.org/r/20220526090347.128742-1-tee.min.tan@linux.intel.c…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
index 8561f2d4443b..13dafe7a29bd 100644
--- a/drivers/net/phy/dp83867.c
+++ b/drivers/net/phy/dp83867.c
@@ -137,6 +137,7 @@
#define DP83867_DOWNSHIFT_2_COUNT 2
#define DP83867_DOWNSHIFT_4_COUNT 4
#define DP83867_DOWNSHIFT_8_COUNT 8
+#define DP83867_SGMII_AUTONEG_EN BIT(7)
/* CFG3 bits */
#define DP83867_CFG3_INT_OE BIT(7)
@@ -855,6 +856,32 @@ static int dp83867_phy_reset(struct phy_device *phydev)
DP83867_PHYCR_FORCE_LINK_GOOD, 0);
}
+static void dp83867_link_change_notify(struct phy_device *phydev)
+{
+ /* There is a limitation in DP83867 PHY device where SGMII AN is
+ * only triggered once after the device is booted up. Even after the
+ * PHY TPI is down and up again, SGMII AN is not triggered and
+ * hence no new in-band message from PHY to MAC side SGMII.
+ * This could cause an issue during power up, when PHY is up prior
+ * to MAC. At this condition, once MAC side SGMII is up, MAC side
+ * SGMII wouldn`t receive new in-band message from TI PHY with
+ * correct link status, speed and duplex info.
+ * Thus, implemented a SW solution here to retrigger SGMII Auto-Neg
+ * whenever there is a link change.
+ */
+ if (phydev->interface == PHY_INTERFACE_MODE_SGMII) {
+ int val = 0;
+
+ val = phy_clear_bits(phydev, DP83867_CFG2,
+ DP83867_SGMII_AUTONEG_EN);
+ if (val < 0)
+ return;
+
+ phy_set_bits(phydev, DP83867_CFG2,
+ DP83867_SGMII_AUTONEG_EN);
+ }
+}
+
static struct phy_driver dp83867_driver[] = {
{
.phy_id = DP83867_PHY_ID,
@@ -879,6 +906,8 @@ static struct phy_driver dp83867_driver[] = {
.suspend = genphy_suspend,
.resume = genphy_resume,
+
+ .link_change_notify = dp83867_link_change_notify,
},
};
module_phy_driver(dp83867_driver);
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 11d39e8cc43e1c6737af19ca9372e590061b5ad2 Mon Sep 17 00:00:00 2001
From: Maxim Levitsky <mlevitsk(a)redhat.com>
Date: Mon, 6 Jun 2022 21:11:49 +0300
Subject: [PATCH] KVM: SVM: fix tsc scaling cache logic
SVM uses a per-cpu variable to cache the current value of the
tsc scaling multiplier msr on each cpu.
Commit 1ab9287add5e2
("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
broke this caching logic.
Refactor the code so that all TSC scaling multiplier writes go through
a single function which checks and updates the cache.
This fixes the following scenario:
1. A CPU runs a guest with some tsc scaling ratio.
2. New guest with different tsc scaling ratio starts on this CPU
and terminates almost immediately.
This ensures that the short running guest had set the tsc scaling ratio just
once when it was set via KVM_SET_TSC_KHZ. Due to the bug,
the per-cpu cache is not updated.
3. The original guest continues to run, it doesn't restore the msr
value back to its own value, because the cache matches,
and thus continues to run with a wrong tsc scaling ratio.
Fixes: 1ab9287add5e2 ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
Message-Id: <20220606181149.103072-1-mlevitsk(a)redhat.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bed5e1692cef..3361258640a2 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -982,7 +982,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
if (svm->tsc_ratio_msr != kvm_default_tsc_scaling_ratio) {
WARN_ON(!svm->tsc_scaling_enabled);
vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
- svm_write_tsc_multiplier(vcpu, vcpu->arch.tsc_scaling_ratio);
+ __svm_write_tsc_multiplier(vcpu->arch.tsc_scaling_ratio);
}
svm->nested.ctl.nested_cr3 = 0;
@@ -1387,7 +1387,7 @@ void nested_svm_update_tsc_ratio_msr(struct kvm_vcpu *vcpu)
vcpu->arch.tsc_scaling_ratio =
kvm_calc_nested_tsc_multiplier(vcpu->arch.l1_tsc_scaling_ratio,
svm->tsc_ratio_msr);
- svm_write_tsc_multiplier(vcpu, vcpu->arch.tsc_scaling_ratio);
+ __svm_write_tsc_multiplier(vcpu->arch.tsc_scaling_ratio);
}
/* Inverse operation of nested_copy_vmcb_control_to_cache(). asid is copied too. */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 63880b33ce37..478e6ee81d88 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -465,11 +465,24 @@ static int has_svm(void)
return 1;
}
+void __svm_write_tsc_multiplier(u64 multiplier)
+{
+ preempt_disable();
+
+ if (multiplier == __this_cpu_read(current_tsc_ratio))
+ goto out;
+
+ wrmsrl(MSR_AMD64_TSC_RATIO, multiplier);
+ __this_cpu_write(current_tsc_ratio, multiplier);
+out:
+ preempt_enable();
+}
+
static void svm_hardware_disable(void)
{
/* Make sure we clean up behind us */
if (tsc_scaling)
- wrmsrl(MSR_AMD64_TSC_RATIO, SVM_TSC_RATIO_DEFAULT);
+ __svm_write_tsc_multiplier(SVM_TSC_RATIO_DEFAULT);
cpu_svm_disable();
@@ -515,8 +528,7 @@ static int svm_hardware_enable(void)
* Set the default value, even if we don't use TSC scaling
* to avoid having stale value in the msr
*/
- wrmsrl(MSR_AMD64_TSC_RATIO, SVM_TSC_RATIO_DEFAULT);
- __this_cpu_write(current_tsc_ratio, SVM_TSC_RATIO_DEFAULT);
+ __svm_write_tsc_multiplier(SVM_TSC_RATIO_DEFAULT);
}
@@ -999,11 +1011,12 @@ static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
}
-void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
+static void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
{
- wrmsrl(MSR_AMD64_TSC_RATIO, multiplier);
+ __svm_write_tsc_multiplier(multiplier);
}
+
/* Evaluate instruction intercepts that depend on guest CPUID features. */
static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu,
struct vcpu_svm *svm)
@@ -1363,13 +1376,8 @@ static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
sev_es_prepare_switch_to_guest(hostsa);
}
- if (tsc_scaling) {
- u64 tsc_ratio = vcpu->arch.tsc_scaling_ratio;
- if (tsc_ratio != __this_cpu_read(current_tsc_ratio)) {
- __this_cpu_write(current_tsc_ratio, tsc_ratio);
- wrmsrl(MSR_AMD64_TSC_RATIO, tsc_ratio);
- }
- }
+ if (tsc_scaling)
+ __svm_write_tsc_multiplier(vcpu->arch.tsc_scaling_ratio);
if (likely(tsc_aux_uret_slot >= 0))
kvm_set_user_return_msr(tsc_aux_uret_slot, svm->tsc_aux, -1ull);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 45a87b2a8b3c..29d6fd205a49 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -590,7 +590,7 @@ int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
bool has_error_code, u32 error_code);
int nested_svm_exit_special(struct vcpu_svm *svm);
void nested_svm_update_tsc_ratio_msr(struct kvm_vcpu *vcpu);
-void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier);
+void __svm_write_tsc_multiplier(u64 multiplier);
void nested_copy_vmcb_control_to_cache(struct vcpu_svm *svm,
struct vmcb_control_area *control);
void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 11d39e8cc43e1c6737af19ca9372e590061b5ad2 Mon Sep 17 00:00:00 2001
From: Maxim Levitsky <mlevitsk(a)redhat.com>
Date: Mon, 6 Jun 2022 21:11:49 +0300
Subject: [PATCH] KVM: SVM: fix tsc scaling cache logic
SVM uses a per-cpu variable to cache the current value of the
tsc scaling multiplier msr on each cpu.
Commit 1ab9287add5e2
("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
broke this caching logic.
Refactor the code so that all TSC scaling multiplier writes go through
a single function which checks and updates the cache.
This fixes the following scenario:
1. A CPU runs a guest with some tsc scaling ratio.
2. New guest with different tsc scaling ratio starts on this CPU
and terminates almost immediately.
This ensures that the short running guest had set the tsc scaling ratio just
once when it was set via KVM_SET_TSC_KHZ. Due to the bug,
the per-cpu cache is not updated.
3. The original guest continues to run, it doesn't restore the msr
value back to its own value, because the cache matches,
and thus continues to run with a wrong tsc scaling ratio.
Fixes: 1ab9287add5e2 ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
Message-Id: <20220606181149.103072-1-mlevitsk(a)redhat.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bed5e1692cef..3361258640a2 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -982,7 +982,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
if (svm->tsc_ratio_msr != kvm_default_tsc_scaling_ratio) {
WARN_ON(!svm->tsc_scaling_enabled);
vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
- svm_write_tsc_multiplier(vcpu, vcpu->arch.tsc_scaling_ratio);
+ __svm_write_tsc_multiplier(vcpu->arch.tsc_scaling_ratio);
}
svm->nested.ctl.nested_cr3 = 0;
@@ -1387,7 +1387,7 @@ void nested_svm_update_tsc_ratio_msr(struct kvm_vcpu *vcpu)
vcpu->arch.tsc_scaling_ratio =
kvm_calc_nested_tsc_multiplier(vcpu->arch.l1_tsc_scaling_ratio,
svm->tsc_ratio_msr);
- svm_write_tsc_multiplier(vcpu, vcpu->arch.tsc_scaling_ratio);
+ __svm_write_tsc_multiplier(vcpu->arch.tsc_scaling_ratio);
}
/* Inverse operation of nested_copy_vmcb_control_to_cache(). asid is copied too. */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 63880b33ce37..478e6ee81d88 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -465,11 +465,24 @@ static int has_svm(void)
return 1;
}
+void __svm_write_tsc_multiplier(u64 multiplier)
+{
+ preempt_disable();
+
+ if (multiplier == __this_cpu_read(current_tsc_ratio))
+ goto out;
+
+ wrmsrl(MSR_AMD64_TSC_RATIO, multiplier);
+ __this_cpu_write(current_tsc_ratio, multiplier);
+out:
+ preempt_enable();
+}
+
static void svm_hardware_disable(void)
{
/* Make sure we clean up behind us */
if (tsc_scaling)
- wrmsrl(MSR_AMD64_TSC_RATIO, SVM_TSC_RATIO_DEFAULT);
+ __svm_write_tsc_multiplier(SVM_TSC_RATIO_DEFAULT);
cpu_svm_disable();
@@ -515,8 +528,7 @@ static int svm_hardware_enable(void)
* Set the default value, even if we don't use TSC scaling
* to avoid having stale value in the msr
*/
- wrmsrl(MSR_AMD64_TSC_RATIO, SVM_TSC_RATIO_DEFAULT);
- __this_cpu_write(current_tsc_ratio, SVM_TSC_RATIO_DEFAULT);
+ __svm_write_tsc_multiplier(SVM_TSC_RATIO_DEFAULT);
}
@@ -999,11 +1011,12 @@ static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
}
-void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
+static void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
{
- wrmsrl(MSR_AMD64_TSC_RATIO, multiplier);
+ __svm_write_tsc_multiplier(multiplier);
}
+
/* Evaluate instruction intercepts that depend on guest CPUID features. */
static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu,
struct vcpu_svm *svm)
@@ -1363,13 +1376,8 @@ static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
sev_es_prepare_switch_to_guest(hostsa);
}
- if (tsc_scaling) {
- u64 tsc_ratio = vcpu->arch.tsc_scaling_ratio;
- if (tsc_ratio != __this_cpu_read(current_tsc_ratio)) {
- __this_cpu_write(current_tsc_ratio, tsc_ratio);
- wrmsrl(MSR_AMD64_TSC_RATIO, tsc_ratio);
- }
- }
+ if (tsc_scaling)
+ __svm_write_tsc_multiplier(vcpu->arch.tsc_scaling_ratio);
if (likely(tsc_aux_uret_slot >= 0))
kvm_set_user_return_msr(tsc_aux_uret_slot, svm->tsc_aux, -1ull);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 45a87b2a8b3c..29d6fd205a49 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -590,7 +590,7 @@ int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
bool has_error_code, u32 error_code);
int nested_svm_exit_special(struct vcpu_svm *svm);
void nested_svm_update_tsc_ratio_msr(struct kvm_vcpu *vcpu);
-void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier);
+void __svm_write_tsc_multiplier(u64 multiplier);
void nested_copy_vmcb_control_to_cache(struct vcpu_svm *svm,
struct vmcb_control_area *control);
void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
The patch below does not apply to the 5.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 77fc95f8c0dc9e1f8e620ec14d2fb65028fb7adc Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 17:04:38 +0200
Subject: [PATCH] random: account for arch randomness in bits
Rather than accounting in bytes and multiplying (shifting), we can just
account in bits and avoid the shift. The main motivation for this is
there are other patches in flux that expand this code a bit, and
avoiding the duplication of "* 8" everywhere makes things a bit clearer.
Cc: stable(a)vger.kernel.org
Fixes: 12e45a2a6308 ("random: credit architectural init the exact amount")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0d6fb3eaf609..60d701a2516b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -776,7 +776,7 @@ static struct notifier_block pm_notifier = { .notifier_call = random_pm_notifica
int __init random_init(const char *command_line)
{
ktime_t now = ktime_get_real();
- unsigned int i, arch_bytes;
+ unsigned int i, arch_bits;
unsigned long entropy;
#if defined(LATENT_ENTROPY_PLUGIN)
@@ -784,12 +784,12 @@ int __init random_init(const char *command_line)
_mix_pool_bytes(compiletime_seed, sizeof(compiletime_seed));
#endif
- for (i = 0, arch_bytes = BLAKE2S_BLOCK_SIZE;
+ for (i = 0, arch_bits = BLAKE2S_BLOCK_SIZE * 8;
i < BLAKE2S_BLOCK_SIZE; i += sizeof(entropy)) {
if (!arch_get_random_seed_long_early(&entropy) &&
!arch_get_random_long_early(&entropy)) {
entropy = random_get_entropy();
- arch_bytes -= sizeof(entropy);
+ arch_bits -= sizeof(entropy) * 8;
}
_mix_pool_bytes(&entropy, sizeof(entropy));
}
@@ -801,8 +801,8 @@ int __init random_init(const char *command_line)
if (crng_ready())
crng_reseed();
else if (trust_cpu)
- _credit_init_bits(arch_bytes * 8);
- used_arch_random = arch_bytes * 8 >= POOL_READY_BITS;
+ _credit_init_bits(arch_bits);
+ used_arch_random = arch_bits >= POOL_READY_BITS;
WARN_ON(register_pm_notifier(&pm_notifier));
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 77fc95f8c0dc9e1f8e620ec14d2fb65028fb7adc Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 17:04:38 +0200
Subject: [PATCH] random: account for arch randomness in bits
Rather than accounting in bytes and multiplying (shifting), we can just
account in bits and avoid the shift. The main motivation for this is
there are other patches in flux that expand this code a bit, and
avoiding the duplication of "* 8" everywhere makes things a bit clearer.
Cc: stable(a)vger.kernel.org
Fixes: 12e45a2a6308 ("random: credit architectural init the exact amount")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0d6fb3eaf609..60d701a2516b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -776,7 +776,7 @@ static struct notifier_block pm_notifier = { .notifier_call = random_pm_notifica
int __init random_init(const char *command_line)
{
ktime_t now = ktime_get_real();
- unsigned int i, arch_bytes;
+ unsigned int i, arch_bits;
unsigned long entropy;
#if defined(LATENT_ENTROPY_PLUGIN)
@@ -784,12 +784,12 @@ int __init random_init(const char *command_line)
_mix_pool_bytes(compiletime_seed, sizeof(compiletime_seed));
#endif
- for (i = 0, arch_bytes = BLAKE2S_BLOCK_SIZE;
+ for (i = 0, arch_bits = BLAKE2S_BLOCK_SIZE * 8;
i < BLAKE2S_BLOCK_SIZE; i += sizeof(entropy)) {
if (!arch_get_random_seed_long_early(&entropy) &&
!arch_get_random_long_early(&entropy)) {
entropy = random_get_entropy();
- arch_bytes -= sizeof(entropy);
+ arch_bits -= sizeof(entropy) * 8;
}
_mix_pool_bytes(&entropy, sizeof(entropy));
}
@@ -801,8 +801,8 @@ int __init random_init(const char *command_line)
if (crng_ready())
crng_reseed();
else if (trust_cpu)
- _credit_init_bits(arch_bytes * 8);
- used_arch_random = arch_bytes * 8 >= POOL_READY_BITS;
+ _credit_init_bits(arch_bits);
+ used_arch_random = arch_bits >= POOL_READY_BITS;
WARN_ON(register_pm_notifier(&pm_notifier));
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 77fc95f8c0dc9e1f8e620ec14d2fb65028fb7adc Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 17:04:38 +0200
Subject: [PATCH] random: account for arch randomness in bits
Rather than accounting in bytes and multiplying (shifting), we can just
account in bits and avoid the shift. The main motivation for this is
there are other patches in flux that expand this code a bit, and
avoiding the duplication of "* 8" everywhere makes things a bit clearer.
Cc: stable(a)vger.kernel.org
Fixes: 12e45a2a6308 ("random: credit architectural init the exact amount")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0d6fb3eaf609..60d701a2516b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -776,7 +776,7 @@ static struct notifier_block pm_notifier = { .notifier_call = random_pm_notifica
int __init random_init(const char *command_line)
{
ktime_t now = ktime_get_real();
- unsigned int i, arch_bytes;
+ unsigned int i, arch_bits;
unsigned long entropy;
#if defined(LATENT_ENTROPY_PLUGIN)
@@ -784,12 +784,12 @@ int __init random_init(const char *command_line)
_mix_pool_bytes(compiletime_seed, sizeof(compiletime_seed));
#endif
- for (i = 0, arch_bytes = BLAKE2S_BLOCK_SIZE;
+ for (i = 0, arch_bits = BLAKE2S_BLOCK_SIZE * 8;
i < BLAKE2S_BLOCK_SIZE; i += sizeof(entropy)) {
if (!arch_get_random_seed_long_early(&entropy) &&
!arch_get_random_long_early(&entropy)) {
entropy = random_get_entropy();
- arch_bytes -= sizeof(entropy);
+ arch_bits -= sizeof(entropy) * 8;
}
_mix_pool_bytes(&entropy, sizeof(entropy));
}
@@ -801,8 +801,8 @@ int __init random_init(const char *command_line)
if (crng_ready())
crng_reseed();
else if (trust_cpu)
- _credit_init_bits(arch_bytes * 8);
- used_arch_random = arch_bytes * 8 >= POOL_READY_BITS;
+ _credit_init_bits(arch_bits);
+ used_arch_random = arch_bits >= POOL_READY_BITS;
WARN_ON(register_pm_notifier(&pm_notifier));
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 77fc95f8c0dc9e1f8e620ec14d2fb65028fb7adc Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 17:04:38 +0200
Subject: [PATCH] random: account for arch randomness in bits
Rather than accounting in bytes and multiplying (shifting), we can just
account in bits and avoid the shift. The main motivation for this is
there are other patches in flux that expand this code a bit, and
avoiding the duplication of "* 8" everywhere makes things a bit clearer.
Cc: stable(a)vger.kernel.org
Fixes: 12e45a2a6308 ("random: credit architectural init the exact amount")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0d6fb3eaf609..60d701a2516b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -776,7 +776,7 @@ static struct notifier_block pm_notifier = { .notifier_call = random_pm_notifica
int __init random_init(const char *command_line)
{
ktime_t now = ktime_get_real();
- unsigned int i, arch_bytes;
+ unsigned int i, arch_bits;
unsigned long entropy;
#if defined(LATENT_ENTROPY_PLUGIN)
@@ -784,12 +784,12 @@ int __init random_init(const char *command_line)
_mix_pool_bytes(compiletime_seed, sizeof(compiletime_seed));
#endif
- for (i = 0, arch_bytes = BLAKE2S_BLOCK_SIZE;
+ for (i = 0, arch_bits = BLAKE2S_BLOCK_SIZE * 8;
i < BLAKE2S_BLOCK_SIZE; i += sizeof(entropy)) {
if (!arch_get_random_seed_long_early(&entropy) &&
!arch_get_random_long_early(&entropy)) {
entropy = random_get_entropy();
- arch_bytes -= sizeof(entropy);
+ arch_bits -= sizeof(entropy) * 8;
}
_mix_pool_bytes(&entropy, sizeof(entropy));
}
@@ -801,8 +801,8 @@ int __init random_init(const char *command_line)
if (crng_ready())
crng_reseed();
else if (trust_cpu)
- _credit_init_bits(arch_bytes * 8);
- used_arch_random = arch_bytes * 8 >= POOL_READY_BITS;
+ _credit_init_bits(arch_bits);
+ used_arch_random = arch_bits >= POOL_READY_BITS;
WARN_ON(register_pm_notifier(&pm_notifier));
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 77fc95f8c0dc9e1f8e620ec14d2fb65028fb7adc Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 17:04:38 +0200
Subject: [PATCH] random: account for arch randomness in bits
Rather than accounting in bytes and multiplying (shifting), we can just
account in bits and avoid the shift. The main motivation for this is
there are other patches in flux that expand this code a bit, and
avoiding the duplication of "* 8" everywhere makes things a bit clearer.
Cc: stable(a)vger.kernel.org
Fixes: 12e45a2a6308 ("random: credit architectural init the exact amount")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0d6fb3eaf609..60d701a2516b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -776,7 +776,7 @@ static struct notifier_block pm_notifier = { .notifier_call = random_pm_notifica
int __init random_init(const char *command_line)
{
ktime_t now = ktime_get_real();
- unsigned int i, arch_bytes;
+ unsigned int i, arch_bits;
unsigned long entropy;
#if defined(LATENT_ENTROPY_PLUGIN)
@@ -784,12 +784,12 @@ int __init random_init(const char *command_line)
_mix_pool_bytes(compiletime_seed, sizeof(compiletime_seed));
#endif
- for (i = 0, arch_bytes = BLAKE2S_BLOCK_SIZE;
+ for (i = 0, arch_bits = BLAKE2S_BLOCK_SIZE * 8;
i < BLAKE2S_BLOCK_SIZE; i += sizeof(entropy)) {
if (!arch_get_random_seed_long_early(&entropy) &&
!arch_get_random_long_early(&entropy)) {
entropy = random_get_entropy();
- arch_bytes -= sizeof(entropy);
+ arch_bits -= sizeof(entropy) * 8;
}
_mix_pool_bytes(&entropy, sizeof(entropy));
}
@@ -801,8 +801,8 @@ int __init random_init(const char *command_line)
if (crng_ready())
crng_reseed();
else if (trust_cpu)
- _credit_init_bits(arch_bytes * 8);
- used_arch_random = arch_bytes * 8 >= POOL_READY_BITS;
+ _credit_init_bits(arch_bits);
+ used_arch_random = arch_bits >= POOL_READY_BITS;
WARN_ON(register_pm_notifier(&pm_notifier));
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9b29b6b20376ab64e1b043df6301d8a92378e631 Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 09:44:07 +0200
Subject: [PATCH] random: avoid checking crng_ready() twice in random_init()
The current flow expands to:
if (crng_ready())
...
else if (...)
if (!crng_ready())
...
The second crng_ready() call is redundant, but can't so easily be
optimized out by the compiler.
This commit simplifies that to:
if (crng_ready()
...
else if (...)
...
Fixes: 560181c27b58 ("random: move initialization functions out of hot pages")
Cc: stable(a)vger.kernel.org
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index b691b9d59503..4862d4d3ec49 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -801,7 +801,7 @@ int __init random_init(const char *command_line)
if (crng_ready())
crng_reseed();
else if (trust_cpu)
- credit_init_bits(arch_bytes * 8);
+ _credit_init_bits(arch_bytes * 8);
used_arch_random = arch_bytes * 8 >= POOL_READY_BITS;
WARN_ON(register_pm_notifier(&pm_notifier));
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9b29b6b20376ab64e1b043df6301d8a92378e631 Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 09:44:07 +0200
Subject: [PATCH] random: avoid checking crng_ready() twice in random_init()
The current flow expands to:
if (crng_ready())
...
else if (...)
if (!crng_ready())
...
The second crng_ready() call is redundant, but can't so easily be
optimized out by the compiler.
This commit simplifies that to:
if (crng_ready()
...
else if (...)
...
Fixes: 560181c27b58 ("random: move initialization functions out of hot pages")
Cc: stable(a)vger.kernel.org
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index b691b9d59503..4862d4d3ec49 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -801,7 +801,7 @@ int __init random_init(const char *command_line)
if (crng_ready())
crng_reseed();
else if (trust_cpu)
- credit_init_bits(arch_bytes * 8);
+ _credit_init_bits(arch_bytes * 8);
used_arch_random = arch_bytes * 8 >= POOL_READY_BITS;
WARN_ON(register_pm_notifier(&pm_notifier));
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9b29b6b20376ab64e1b043df6301d8a92378e631 Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 09:44:07 +0200
Subject: [PATCH] random: avoid checking crng_ready() twice in random_init()
The current flow expands to:
if (crng_ready())
...
else if (...)
if (!crng_ready())
...
The second crng_ready() call is redundant, but can't so easily be
optimized out by the compiler.
This commit simplifies that to:
if (crng_ready()
...
else if (...)
...
Fixes: 560181c27b58 ("random: move initialization functions out of hot pages")
Cc: stable(a)vger.kernel.org
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index b691b9d59503..4862d4d3ec49 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -801,7 +801,7 @@ int __init random_init(const char *command_line)
if (crng_ready())
crng_reseed();
else if (trust_cpu)
- credit_init_bits(arch_bytes * 8);
+ _credit_init_bits(arch_bytes * 8);
used_arch_random = arch_bytes * 8 >= POOL_READY_BITS;
WARN_ON(register_pm_notifier(&pm_notifier));
The patch below does not apply to the 5.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9b29b6b20376ab64e1b043df6301d8a92378e631 Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 09:44:07 +0200
Subject: [PATCH] random: avoid checking crng_ready() twice in random_init()
The current flow expands to:
if (crng_ready())
...
else if (...)
if (!crng_ready())
...
The second crng_ready() call is redundant, but can't so easily be
optimized out by the compiler.
This commit simplifies that to:
if (crng_ready()
...
else if (...)
...
Fixes: 560181c27b58 ("random: move initialization functions out of hot pages")
Cc: stable(a)vger.kernel.org
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index b691b9d59503..4862d4d3ec49 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -801,7 +801,7 @@ int __init random_init(const char *command_line)
if (crng_ready())
crng_reseed();
else if (trust_cpu)
- credit_init_bits(arch_bytes * 8);
+ _credit_init_bits(arch_bytes * 8);
used_arch_random = arch_bytes * 8 >= POOL_READY_BITS;
WARN_ON(register_pm_notifier(&pm_notifier));
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 39e0f991a62ed5efabd20711a7b6e7da92603170 Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 17:00:16 +0200
Subject: [PATCH] random: mark bootloader randomness code as __init
add_bootloader_randomness() and the variables it touches are only used
during __init and not after, so mark these as __init. At the same time,
unexport this, since it's only called by other __init code that's
built-in.
Cc: stable(a)vger.kernel.org
Fixes: 428826f5358c ("fdt: add support for rng-seed")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 4862d4d3ec49..0d6fb3eaf609 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -725,8 +725,8 @@ static void __cold _credit_init_bits(size_t bits)
**********************************************************************/
static bool used_arch_random;
-static bool trust_cpu __ro_after_init = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU);
-static bool trust_bootloader __ro_after_init = IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER);
+static bool trust_cpu __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU);
+static bool trust_bootloader __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER);
static int __init parse_trust_cpu(char *arg)
{
return kstrtobool(arg, &trust_cpu);
@@ -865,13 +865,12 @@ EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
* Handle random seed passed by bootloader, and credit it if
* CONFIG_RANDOM_TRUST_BOOTLOADER is set.
*/
-void __cold add_bootloader_randomness(const void *buf, size_t len)
+void __init add_bootloader_randomness(const void *buf, size_t len)
{
mix_pool_bytes(buf, len);
if (trust_bootloader)
credit_init_bits(len * 8);
}
-EXPORT_SYMBOL_GPL(add_bootloader_randomness);
#if IS_ENABLED(CONFIG_VMGENID)
static BLOCKING_NOTIFIER_HEAD(vmfork_chain);
diff --git a/include/linux/random.h b/include/linux/random.h
index fae0c84027fd..223b4bd584e7 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -13,7 +13,7 @@
struct notifier_block;
void add_device_randomness(const void *buf, size_t len);
-void add_bootloader_randomness(const void *buf, size_t len);
+void __init add_bootloader_randomness(const void *buf, size_t len);
void add_input_randomness(unsigned int type, unsigned int code,
unsigned int value) __latent_entropy;
void add_interrupt_randomness(int irq) __latent_entropy;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 39e0f991a62ed5efabd20711a7b6e7da92603170 Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 17:00:16 +0200
Subject: [PATCH] random: mark bootloader randomness code as __init
add_bootloader_randomness() and the variables it touches are only used
during __init and not after, so mark these as __init. At the same time,
unexport this, since it's only called by other __init code that's
built-in.
Cc: stable(a)vger.kernel.org
Fixes: 428826f5358c ("fdt: add support for rng-seed")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 4862d4d3ec49..0d6fb3eaf609 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -725,8 +725,8 @@ static void __cold _credit_init_bits(size_t bits)
**********************************************************************/
static bool used_arch_random;
-static bool trust_cpu __ro_after_init = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU);
-static bool trust_bootloader __ro_after_init = IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER);
+static bool trust_cpu __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU);
+static bool trust_bootloader __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER);
static int __init parse_trust_cpu(char *arg)
{
return kstrtobool(arg, &trust_cpu);
@@ -865,13 +865,12 @@ EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
* Handle random seed passed by bootloader, and credit it if
* CONFIG_RANDOM_TRUST_BOOTLOADER is set.
*/
-void __cold add_bootloader_randomness(const void *buf, size_t len)
+void __init add_bootloader_randomness(const void *buf, size_t len)
{
mix_pool_bytes(buf, len);
if (trust_bootloader)
credit_init_bits(len * 8);
}
-EXPORT_SYMBOL_GPL(add_bootloader_randomness);
#if IS_ENABLED(CONFIG_VMGENID)
static BLOCKING_NOTIFIER_HEAD(vmfork_chain);
diff --git a/include/linux/random.h b/include/linux/random.h
index fae0c84027fd..223b4bd584e7 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -13,7 +13,7 @@
struct notifier_block;
void add_device_randomness(const void *buf, size_t len);
-void add_bootloader_randomness(const void *buf, size_t len);
+void __init add_bootloader_randomness(const void *buf, size_t len);
void add_input_randomness(unsigned int type, unsigned int code,
unsigned int value) __latent_entropy;
void add_interrupt_randomness(int irq) __latent_entropy;
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 39e0f991a62ed5efabd20711a7b6e7da92603170 Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 17:00:16 +0200
Subject: [PATCH] random: mark bootloader randomness code as __init
add_bootloader_randomness() and the variables it touches are only used
during __init and not after, so mark these as __init. At the same time,
unexport this, since it's only called by other __init code that's
built-in.
Cc: stable(a)vger.kernel.org
Fixes: 428826f5358c ("fdt: add support for rng-seed")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 4862d4d3ec49..0d6fb3eaf609 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -725,8 +725,8 @@ static void __cold _credit_init_bits(size_t bits)
**********************************************************************/
static bool used_arch_random;
-static bool trust_cpu __ro_after_init = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU);
-static bool trust_bootloader __ro_after_init = IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER);
+static bool trust_cpu __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU);
+static bool trust_bootloader __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER);
static int __init parse_trust_cpu(char *arg)
{
return kstrtobool(arg, &trust_cpu);
@@ -865,13 +865,12 @@ EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
* Handle random seed passed by bootloader, and credit it if
* CONFIG_RANDOM_TRUST_BOOTLOADER is set.
*/
-void __cold add_bootloader_randomness(const void *buf, size_t len)
+void __init add_bootloader_randomness(const void *buf, size_t len)
{
mix_pool_bytes(buf, len);
if (trust_bootloader)
credit_init_bits(len * 8);
}
-EXPORT_SYMBOL_GPL(add_bootloader_randomness);
#if IS_ENABLED(CONFIG_VMGENID)
static BLOCKING_NOTIFIER_HEAD(vmfork_chain);
diff --git a/include/linux/random.h b/include/linux/random.h
index fae0c84027fd..223b4bd584e7 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -13,7 +13,7 @@
struct notifier_block;
void add_device_randomness(const void *buf, size_t len);
-void add_bootloader_randomness(const void *buf, size_t len);
+void __init add_bootloader_randomness(const void *buf, size_t len);
void add_input_randomness(unsigned int type, unsigned int code,
unsigned int value) __latent_entropy;
void add_interrupt_randomness(int irq) __latent_entropy;
The patch below does not apply to the 5.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 39e0f991a62ed5efabd20711a7b6e7da92603170 Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Tue, 7 Jun 2022 17:00:16 +0200
Subject: [PATCH] random: mark bootloader randomness code as __init
add_bootloader_randomness() and the variables it touches are only used
during __init and not after, so mark these as __init. At the same time,
unexport this, since it's only called by other __init code that's
built-in.
Cc: stable(a)vger.kernel.org
Fixes: 428826f5358c ("fdt: add support for rng-seed")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 4862d4d3ec49..0d6fb3eaf609 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -725,8 +725,8 @@ static void __cold _credit_init_bits(size_t bits)
**********************************************************************/
static bool used_arch_random;
-static bool trust_cpu __ro_after_init = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU);
-static bool trust_bootloader __ro_after_init = IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER);
+static bool trust_cpu __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU);
+static bool trust_bootloader __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER);
static int __init parse_trust_cpu(char *arg)
{
return kstrtobool(arg, &trust_cpu);
@@ -865,13 +865,12 @@ EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
* Handle random seed passed by bootloader, and credit it if
* CONFIG_RANDOM_TRUST_BOOTLOADER is set.
*/
-void __cold add_bootloader_randomness(const void *buf, size_t len)
+void __init add_bootloader_randomness(const void *buf, size_t len)
{
mix_pool_bytes(buf, len);
if (trust_bootloader)
credit_init_bits(len * 8);
}
-EXPORT_SYMBOL_GPL(add_bootloader_randomness);
#if IS_ENABLED(CONFIG_VMGENID)
static BLOCKING_NOTIFIER_HEAD(vmfork_chain);
diff --git a/include/linux/random.h b/include/linux/random.h
index fae0c84027fd..223b4bd584e7 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -13,7 +13,7 @@
struct notifier_block;
void add_device_randomness(const void *buf, size_t len);
-void add_bootloader_randomness(const void *buf, size_t len);
+void __init add_bootloader_randomness(const void *buf, size_t len);
void add_input_randomness(unsigned int type, unsigned int code,
unsigned int value) __latent_entropy;
void add_interrupt_randomness(int irq) __latent_entropy;
SANTANDER BANK COMPENSATION UNIT, IN AFFILIATION WITH THE UNITED NATION.
Your compensation fund of 6 million dollars is ready for payment
contact me for more details.
Thanks
Greetings,
I sent this mail praying it will find you in a good condition, since I
myself am in a very critical health condition in which I sleep every
night without knowing if I may be alive to see the next day. I am
Mrs. Maria Roland, a widow suffering from a long time illness. I have
some funds I inherited from my late husband, the sum of
($11,000,000.00) my Doctor told me recently that I have serious
sickness which is a cancer problem. What disturbs me most is my stroke
sickness. Having known my condition, I decided to donate this fund to
a good person that will utilize it the way I am going to instruct
herein. I need a very honest God.
fearing a person who can claim this money and use it for Charity
works, for orphanages, widows and also build schools for less
privileges that will be named after my late husband if possible and to
promote the word of God and the effort that the house of God is
maintained. I do not want a situation where this money will be used in
an ungodly manner. That's why I' making this decision. I'm not afraid
of death so I know where I'm going. I accept this decision because I
do not have any child who will inherit this money after I die. Please
I want your sincere and urgent answer to know if you will be able to
execute this project, and I will give you more information on how the
fund will be transferred to your bank account. I am waiting for your reply,
May God Bless you,
Mrs. Maria Roland,
From: Chuang Wang <nashuiliang(a)gmail.com>
In aggrprobe scenes, if arm_kprobe() returns an error(e.g. livepatch and
kprobe are using the same function X), kprobe flags, while has been
modified to ~KPROBE_FLAG_DISABLED, is not rollled back.
Then, __disable_kprobe() will be failed in __unregister_kprobe_top(),
the kprobe list will be not removed from aggrprobe, memory leaks or
illegal pointers will be caused.
WARN disarm_kprobe:
Failed to disarm kprobe-ftrace at 00000000c729fdbc (-2)
RIP: 0010:disarm_kprobe+0xcc/0x110
Call Trace:
__disable_kprobe+0x78/0x90
__unregister_kprobe_top+0x13/0x1b0
? _cond_resched+0x15/0x30
unregister_kprobes+0x32/0x80
unregister_kprobe+0x1a/0x20
Illegal Pointers:
BUG: unable to handle kernel paging request at 0000000000656369
RIP: 0010:__get_valid_kprobe+0x69/0x90
Call Trace:
register_kprobe+0x30/0x60
__register_trace_kprobe.part.7+0x8b/0xc0
create_local_trace_kprobe+0xd2/0x130
perf_kprobe_init+0x83/0xd0
Fixes: 12310e343755 ("kprobes: Propagate error from arm_kprobe_ftrace()")
Signed-off-by: Chuang Wang <nashuiliang(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Jingren Zhou <zhoujingren(a)didiglobal.com>
---
v1->v2:
- Supplement commit information: fixline, Cc stable
kernel/kprobes.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index f214f8c088ed..c11c79e05a4c 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2422,8 +2422,11 @@ int enable_kprobe(struct kprobe *kp)
if (!kprobes_all_disarmed && kprobe_disabled(p)) {
p->flags &= ~KPROBE_FLAG_DISABLED;
ret = arm_kprobe(p);
- if (ret)
+ if (ret) {
p->flags |= KPROBE_FLAG_DISABLED;
+ if (p != kp)
+ kp->flags |= KPROBE_FLAG_DISABLED;
+ }
}
out:
mutex_unlock(&kprobe_mutex);
--
2.34.1
The following changes since commit f2906aa863381afb0015a9eb7fefad885d4e5a56:
Linux 5.19-rc1 (2022-06-05 17:18:54 -0700)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to eacea844594ff338db06437806707313210d4865:
um: virt-pci: set device ready in probe() (2022-06-10 20:38:06 -0400)
----------------------------------------------------------------
virtio,vdpa: fixes
Fixes all over the place, most notably fixes for latent
bugs in drivers that got exposed by suppressing
interrupts before DRIVER_OK, which in turn has been
done by 8b4ec69d7e09 ("virtio: harden vring IRQ").
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
----------------------------------------------------------------
Bo Liu (1):
virtio: Fix all occurences of the "the the" typo
Dan Carpenter (2):
vdpa/mlx5: fix error code for deleting vlan
vdpa/mlx5: clean up indenting in handle_ctrl_vlan()
Jason Wang (2):
virtio-rng: make device ready before making request
vdpa: make get_vq_group and set_group_asid optional
Vincent Whitchurch (1):
um: virt-pci: set device ready in probe()
Xiang wangx (1):
vdpa/mlx5: Fix syntax errors in comments
Xie Yongji (2):
vringh: Fix loop descriptors check in the indirect cases
vduse: Fix NULL pointer dereference on sysfs access
chengkaitao (1):
virtio-mmio: fix missing put_device() when vm_cmdline_parent registration failed
arch/um/drivers/virt-pci.c | 7 ++++++-
drivers/char/hw_random/virtio-rng.c | 2 ++
drivers/vdpa/mlx5/net/mlx5_vnet.c | 9 +++++----
drivers/vdpa/vdpa_user/vduse_dev.c | 7 +++----
drivers/vhost/vdpa.c | 2 ++
drivers/vhost/vringh.c | 10 ++++++++--
drivers/virtio/virtio_mmio.c | 3 ++-
drivers/virtio/virtio_pci_modern_dev.c | 2 +-
include/linux/vdpa.h | 5 +++--
9 files changed, 32 insertions(+), 15 deletions(-)
The patch titled
Subject: mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-userfaultfd-fix-uffdio_continue-on-fallocated-shmem-pages.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Axel Rasmussen <axelrasmussen(a)google.com>
Subject: mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
Date: Fri, 10 Jun 2022 10:38:12 -0700
When fallocate() is used on a shmem file, the pages we allocate can end up
with !PageUptodate.
Since UFFDIO_CONTINUE tries to find the existing page the user wants to
map with SGP_READ, we would fail to find such a page, since
shmem_getpage_gfp returns with a "NULL" pagep for SGP_READ if it discovers
!PageUptodate. As a result, UFFDIO_CONTINUE returns -EFAULT, as it would
do if the page wasn't found in the page cache at all.
This isn't the intended behavior. UFFDIO_CONTINUE is just trying to find
if a page exists, and doesn't care whether it still needs to be cleared or
not. So, instead of SGP_READ, pass in SGP_NOALLOC. This is the same,
except for one critical difference: in the !PageUptodate case, SGP_NOALLOC
will clear the page and then return it. With this change, UFFDIO_CONTINUE
works properly (succeeds) on a shmem file which has been fallocated, but
otherwise not modified.
Link: https://lkml.kernel.org/r/20220610173812.1768919-1-axelrasmussen@google.com
Fixes: 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
Signed-off-by: Axel Rasmussen <axelrasmussen(a)google.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/userfaultfd.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/mm/userfaultfd.c~mm-userfaultfd-fix-uffdio_continue-on-fallocated-shmem-pages
+++ a/mm/userfaultfd.c
@@ -246,7 +246,10 @@ static int mcontinue_atomic_pte(struct m
struct page *page;
int ret;
- ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
+ ret = shmem_getpage(inode, pgoff, &page, SGP_NOALLOC);
+ /* Our caller expects us to return -EFAULT if we failed to find page. */
+ if (ret == -ENOENT)
+ ret = -EFAULT;
if (ret)
goto out;
if (!page) {
_
Patches currently in -mm which might be from axelrasmussen(a)google.com are
mm-userfaultfd-fix-uffdio_continue-on-fallocated-shmem-pages.patch
The quilt patch titled
Subject: mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
has been removed from the -mm tree. Its filename was
mm-userfaultfd-fix-uffdio_continue-on-fallocated-shmem-pages.patch
This patch was dropped because an updated version will be merged
------------------------------------------------------
From: Axel Rasmussen <axelrasmussen(a)google.com>
Subject: mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
Date: Fri, 3 Jun 2022 13:57:41 -0700
When fallocate() is used on a shmem file, the pages we allocate can end up
with !PageUptodate.
Since UFFDIO_CONTINUE tries to find the existing page the user wants to
map with SGP_READ, we would fail to find such a page, since
shmem_getpage_gfp returns with a "NULL" pagep for SGP_READ if it discovers
!PageUptodate. As a result, UFFDIO_CONTINUE returns -EFAULT, as it would
do if the page wasn't found in the page cache at all.
This isn't the intended behavior. UFFDIO_CONTINUE is just trying to find
if a page exists, and doesn't care whether it still needs to be cleared or
not. So, instead of SGP_READ, pass in SGP_NOALLOC. This is the same,
except for one critical difference: in the !PageUptodate case, SGP_NOALLOC
will clear the page and then return it. With this change, UFFDIO_CONTINUE
works properly (succeeds) on a shmem file which has been fallocated, but
otherwise not modified.
Link: https://lkml.kernel.org/r/20220603205741.12888-1-axelrasmussen@google.com
Fixes: 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
Signed-off-by: Axel Rasmussen <axelrasmussen(a)google.com>
Acked-by: Hugh Dickins <hughd(a)google.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/userfaultfd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/userfaultfd.c~mm-userfaultfd-fix-uffdio_continue-on-fallocated-shmem-pages
+++ a/mm/userfaultfd.c
@@ -246,7 +246,7 @@ static int mcontinue_atomic_pte(struct m
struct page *page;
int ret;
- ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
+ ret = shmem_getpage(inode, pgoff, &page, SGP_NOALLOC);
if (ret)
goto out;
if (!page) {
_
Patches currently in -mm which might be from axelrasmussen(a)google.com are
Hello [lvuugu]
Only we have the lowest prices for E-mail database [kmkzqq]
Want now new.bussines.year(a)gmail.com [nsozzk]
To unsubscribe, send us an email with the subject 'unsubscribe' [gnualis]
The platform's RNG must be available before random_init() in order to be
useful for initial seeding, which in turn means that it needs to be
called from setup_arch(), rather than from an init call. Fortunately,
each platform already has a setup_arch function pointer, which means
it's easy to wire this up. This commit also removes some noisy log
messages that don't add much.
Cc: stable(a)vger.kernel.org
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Fixes: a489043f4626 ("powerpc/pseries: Implement arch_get_random_long() based on H_RANDOM")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
---
arch/powerpc/platforms/pseries/pseries.h | 2 ++
arch/powerpc/platforms/pseries/rng.c | 11 +++--------
arch/powerpc/platforms/pseries/setup.c | 1 +
3 files changed, 6 insertions(+), 8 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index f5c916c839c9..1d75b7742ef0 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -122,4 +122,6 @@ void pseries_lpar_read_hblkrm_characteristics(void);
static inline void pseries_lpar_read_hblkrm_characteristics(void) { }
#endif
+void pseries_rng_init(void);
+
#endif /* _PSERIES_PSERIES_H */
diff --git a/arch/powerpc/platforms/pseries/rng.c b/arch/powerpc/platforms/pseries/rng.c
index 6268545947b8..6ddfdeaace9e 100644
--- a/arch/powerpc/platforms/pseries/rng.c
+++ b/arch/powerpc/platforms/pseries/rng.c
@@ -10,6 +10,7 @@
#include <asm/archrandom.h>
#include <asm/machdep.h>
#include <asm/plpar_wrappers.h>
+#include "pseries.h"
static int pseries_get_random_long(unsigned long *v)
@@ -24,19 +25,13 @@ static int pseries_get_random_long(unsigned long *v)
return 0;
}
-static __init int rng_init(void)
+void __init pseries_rng_init(void)
{
struct device_node *dn;
dn = of_find_compatible_node(NULL, NULL, "ibm,random");
if (!dn)
- return -ENODEV;
-
- pr_info("Registering arch random hook.\n");
-
+ return;
ppc_md.get_random_seed = pseries_get_random_long;
-
of_node_put(dn);
- return 0;
}
-machine_subsys_initcall(pseries, rng_init);
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index afb074269b42..ee4f1db49515 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -839,6 +839,7 @@ static void __init pSeries_setup_arch(void)
}
ppc_md.pcibios_root_bridge_prepare = pseries_root_bridge_prepare;
+ pseries_rng_init();
}
static void pseries_panic(char *str)
--
2.35.1
Before commit 9998943f6dfc ("media: rkvdec: Stop overclocking the
decoder"), the rkvdec driver was forcing the VDU clock rate. After that
commit, we rely on the default clock rate. That rate works OK on many
boards, with the default PLL settings (CPLL is 800MHz, VDU dividers
leave it at 400MHz); but some boards change PLL settings.
Assign the expected default clock rate explicitly, so that the rate is
consistent, regardless of PLL configuration.
This was particularly broken on RK3399 Gru Scarlet systems, where the
rk3399-gru-scarlet.dtsi assigns PLL_CPLL to 1.6 GHz, and so the VDU
clock ends up at 800 MHz (twice the expected rate), and causes video
artifacts and other issues.
Note: I assign the clock rate in the clock controller instead of the
vdec node, because there are multiple nodes that use this clock, and per
the clock.yaml specification:
Configuring a clock's parent and rate through the device node that
consumes the clock can be done only for clocks that have a single
user. Specifying conflicting parent or rate configuration in multiple
consumer nodes for a shared clock is forbidden.
Configuration of common clocks, which affect multiple consumer devices
can be similarly specified in the clock provider node.
Fixes: 9998943f6dfc ("media: rkvdec: Stop overclocking the decoder")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Brian Norris <briannorris(a)chromium.org>
---
This is a candidate for 5.19 IMO, since commit 9998943f6dfc landed in
5.19-rc1 and is being queued up for -stable as we speak.
arch/arm64/boot/dts/rockchip/rk3399-gru-scarlet.dtsi | 4 +++-
arch/arm64/boot/dts/rockchip/rk3399.dtsi | 6 ++++--
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/boot/dts/rockchip/rk3399-gru-scarlet.dtsi b/arch/arm64/boot/dts/rockchip/rk3399-gru-scarlet.dtsi
index 913d845eb51a..1977103a5ef4 100644
--- a/arch/arm64/boot/dts/rockchip/rk3399-gru-scarlet.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3399-gru-scarlet.dtsi
@@ -376,7 +376,8 @@ &cru {
<&cru ACLK_VIO>,
<&cru ACLK_GIC_PRE>,
<&cru PCLK_DDR>,
- <&cru ACLK_HDCP>;
+ <&cru ACLK_HDCP>,
+ <&cru ACLK_VDU>;
assigned-clock-rates =
<600000000>, <1600000000>,
<1000000000>,
@@ -388,6 +389,7 @@ &cru {
<400000000>,
<200000000>,
<200000000>,
+ <400000000>,
<400000000>;
};
diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
index fbd0346624e6..9d5b0e8c9cca 100644
--- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
@@ -1462,7 +1462,8 @@ cru: clock-controller@ff760000 {
<&cru HCLK_PERILP1>, <&cru PCLK_PERILP1>,
<&cru ACLK_VIO>, <&cru ACLK_HDCP>,
<&cru ACLK_GIC_PRE>,
- <&cru PCLK_DDR>;
+ <&cru PCLK_DDR>,
+ <&cru ACLK_VDU>;
assigned-clock-rates =
<594000000>, <800000000>,
<1000000000>,
@@ -1473,7 +1474,8 @@ cru: clock-controller@ff760000 {
<100000000>, <50000000>,
<400000000>, <400000000>,
<200000000>,
- <200000000>;
+ <200000000>,
+ <400000000>;
};
grf: syscon@ff770000 {
--
2.36.1.255.ge46751e96f-goog
The platform's RNG must be available before random_init() in order to be
useful for initial seeding, which in turn means that it needs to be
called from setup_arch(), rather than from an init call. Fortunately,
each platform already has a setup_arch function pointer, which means
it's easy to wire this up for each of the three platforms that have an
RNG. This commit also removes some noisy log messages that don't add
much.
Cc: stable(a)vger.kernel.org
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Fixes: c25769fddaec ("powerpc/microwatt: Add support for hardware random number generator")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
---
arch/powerpc/platforms/microwatt/rng.c | 9 ++-------
arch/powerpc/platforms/microwatt/setup.c | 8 ++++++++
2 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/arch/powerpc/platforms/microwatt/rng.c b/arch/powerpc/platforms/microwatt/rng.c
index 7bc4d1cbfaf0..d13f656910ad 100644
--- a/arch/powerpc/platforms/microwatt/rng.c
+++ b/arch/powerpc/platforms/microwatt/rng.c
@@ -29,7 +29,7 @@ static int microwatt_get_random_darn(unsigned long *v)
return 1;
}
-static __init int rng_init(void)
+__init void microwatt_rng_init(void)
{
unsigned long val;
int i;
@@ -37,12 +37,7 @@ static __init int rng_init(void)
for (i = 0; i < 10; i++) {
if (microwatt_get_random_darn(&val)) {
ppc_md.get_random_seed = microwatt_get_random_darn;
- return 0;
+ return;
}
}
-
- pr_warn("Unable to use DARN for get_random_seed()\n");
-
- return -EIO;
}
-machine_subsys_initcall(, rng_init);
diff --git a/arch/powerpc/platforms/microwatt/setup.c b/arch/powerpc/platforms/microwatt/setup.c
index 0b02603bdb74..23c996dcc870 100644
--- a/arch/powerpc/platforms/microwatt/setup.c
+++ b/arch/powerpc/platforms/microwatt/setup.c
@@ -32,10 +32,18 @@ static int __init microwatt_populate(void)
}
machine_arch_initcall(microwatt, microwatt_populate);
+__init void microwatt_rng_init(void);
+
+static void __init microwatt_setup_arch(void)
+{
+ microwatt_rng_init();
+}
+
define_machine(microwatt) {
.name = "microwatt",
.probe = microwatt_probe,
.init_IRQ = microwatt_init_IRQ,
+ .setup_arch = microwatt_setup_arch,
.progress = udbg_progress,
.calibrate_decr = generic_calibrate_decr,
};
--
2.35.1
The platform's RNG must be available before random_init() in order to be
useful for initial seeding, which in turn means that it needs to be
called from setup_arch(), rather than from an init call. Fortunately,
each platform already has a setup_arch function pointer, which means
it's easy to wire this up for each of the three platforms that have an
RNG. This commit also removes some noisy log messages that don't add
much.
Cc: stable(a)vger.kernel.org
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Fixes: a489043f4626 ("powerpc/pseries: Implement arch_get_random_long() based on H_RANDOM")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
---
arch/powerpc/platforms/pseries/rng.c | 11 ++---------
arch/powerpc/platforms/pseries/setup.c | 3 +++
2 files changed, 5 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/rng.c b/arch/powerpc/platforms/pseries/rng.c
index 6268545947b8..d39bfce39aa1 100644
--- a/arch/powerpc/platforms/pseries/rng.c
+++ b/arch/powerpc/platforms/pseries/rng.c
@@ -24,19 +24,12 @@ static int pseries_get_random_long(unsigned long *v)
return 0;
}
-static __init int rng_init(void)
+__init void pseries_rng_init(void)
{
struct device_node *dn;
-
dn = of_find_compatible_node(NULL, NULL, "ibm,random");
if (!dn)
- return -ENODEV;
-
- pr_info("Registering arch random hook.\n");
-
+ return;
ppc_md.get_random_seed = pseries_get_random_long;
-
of_node_put(dn);
- return 0;
}
-machine_subsys_initcall(pseries, rng_init);
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index afb074269b42..7f3ee2658163 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -779,6 +779,8 @@ static resource_size_t pseries_pci_iov_resource_alignment(struct pci_dev *pdev,
}
#endif
+__init void pseries_rng_init(void);
+
static void __init pSeries_setup_arch(void)
{
set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
@@ -839,6 +841,7 @@ static void __init pSeries_setup_arch(void)
}
ppc_md.pcibios_root_bridge_prepare = pseries_root_bridge_prepare;
+ pseries_rng_init();
}
static void pseries_panic(char *str)
--
2.35.1
The platform's RNG must be available before random_init() in order to be
useful for initial seeding, which in turn means that it needs to be
called from setup_arch(), rather than from an init call. Fortunately,
each platform already has a setup_arch function pointer, which means
it's easy to wire this up for each of the three platforms that have an
RNG. This commit also removes some noisy log messages that don't add
much.
Cc: stable(a)vger.kernel.org
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Fixes: a4da0d50b2a0 ("powerpc: Implement arch_get_random_long/int() for powernv")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
---
arch/powerpc/platforms/powernv/rng.c | 17 ++++-------------
arch/powerpc/platforms/powernv/setup.c | 4 ++++
2 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/rng.c b/arch/powerpc/platforms/powernv/rng.c
index e3d44b36ae98..ef24e72a1b69 100644
--- a/arch/powerpc/platforms/powernv/rng.c
+++ b/arch/powerpc/platforms/powernv/rng.c
@@ -84,24 +84,20 @@ static int powernv_get_random_darn(unsigned long *v)
return 1;
}
-static int __init initialise_darn(void)
+static void __init initialise_darn(void)
{
unsigned long val;
int i;
if (!cpu_has_feature(CPU_FTR_ARCH_300))
- return -ENODEV;
+ return;
for (i = 0; i < 10; i++) {
if (powernv_get_random_darn(&val)) {
ppc_md.get_random_seed = powernv_get_random_darn;
- return 0;
+ return;
}
}
-
- pr_warn("Unable to use DARN for get_random_seed()\n");
-
- return -EIO;
}
int powernv_get_random_long(unsigned long *v)
@@ -163,14 +159,12 @@ static __init int rng_create(struct device_node *dn)
rng_init_per_cpu(rng, dn);
- pr_info_once("Registering arch random hook.\n");
-
ppc_md.get_random_seed = powernv_get_random_long;
return 0;
}
-static __init int rng_init(void)
+__init void powernv_rng_init(void)
{
struct device_node *dn;
int rc;
@@ -188,7 +182,4 @@ static __init int rng_init(void)
}
initialise_darn();
-
- return 0;
}
-machine_subsys_initcall(powernv, rng_init);
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 824c3ad7a0fa..a0c5217bc5c0 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -184,6 +184,8 @@ static void __init pnv_check_guarded_cores(void)
}
}
+__init void powernv_rng_init(void);
+
static void __init pnv_setup_arch(void)
{
set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
@@ -203,6 +205,8 @@ static void __init pnv_setup_arch(void)
pnv_check_guarded_cores();
/* XXX PMCS */
+
+ powernv_rng_init();
}
static void __init pnv_init(void)
--
2.35.1
Commit 6c64a664e1cff339ec698d803fa8cbb9af5d95ce “xhci: Set HCD flag to defer
primary roothub registration” added for Linux 5.18.3 caused a regression on
some Amlogic S912 devices (original user forum report with an Android TV box
and confirmed with a Khadas VIM2 board). I do not see issues with older S905
(WeTeK Play2) or newer S922X (Odroid N2+) devices running the same kernel.
There are no kernel splats or error messages but lsusb shows no output and
nothing works. Simple revert restores the previous good working behaviour.
dmesg with commit http://ix.io/3ZTv
dmesg with revert http://ix.io/3ZTw
I’m not a coder so will need to be fed instructions to assist with debugging
the issue further if you don't have access to an Amlogic S912 device. I can
share devices online for remote poking around if needed.
Christian
--
Dear Fund Beneficiary,
Refer to my previous emails informing you again and again that your
fund has been released. Please be advised that your long delayed
payment has just been released and loaded onto an ATM Master
Card.Please reconfirm your names and address for delivery.
Looking forward to hearing from you,
Nigel Higgins, (Barclays Group Chairman),
Barclays Bank Plc,
1 Churchill Place, London, ENG E14 5HP,
commit 8e1278444446fc97778a5e5c99bca1ce0bbc5ec9 upstream.
The ptrace PEEKUSR/POKEUSR (aka PEEKUSER/POKEUSER) API allows a process
to read/write registers of another process.
To get/set a register, the API takes an index into an imaginary address
space called the "USER area", where the registers of the process are
laid out in some fashion.
The kernel then maps that index to a particular register in its own data
structures and gets/sets the value.
The API only allows a single machine-word to be read/written at a time.
So 4 bytes on 32-bit kernels and 8 bytes on 64-bit kernels.
The way floating point registers (FPRs) are addressed is somewhat
complicated, because double precision float values are 64-bit even on
32-bit CPUs. That means on 32-bit kernels each FPR occupies two
word-sized locations in the USER area. On 64-bit kernels each FPR
occupies one word-sized location in the USER area.
Internally the kernel stores the FPRs in an array of u64s, or if VSX is
enabled, an array of pairs of u64s where one half of each pair stores
the FPR. Which half of the pair stores the FPR depends on the kernel's
endianness.
To handle the different layouts of the FPRs depending on VSX/no-VSX and
big/little endian, the TS_FPR() macro was introduced.
Unfortunately the TS_FPR() macro does not take into account the fact
that the addressing of each FPR differs between 32-bit and 64-bit
kernels. It just takes the index into the "USER area" passed from
userspace and indexes into the fp_state.fpr array.
On 32-bit there are 64 indexes that address FPRs, but only 32 entries in
the fp_state.fpr array, meaning the user can read/write 256 bytes past
the end of the array. Because the fp_state sits in the middle of the
thread_struct there are various fields than can be overwritten,
including some pointers. As such it may be exploitable.
It has also been observed to cause systems to hang or otherwise
misbehave when using gdbserver, and is probably the root cause of this
report which could not be easily reproduced:
https://lore.kernel.org/linuxppc-dev/dc38afe9-6b78-f3f5-666b-986939e40fc6@k…
Rather than trying to make the TS_FPR() macro even more complicated to
fix the bug, or add more macros, instead add a special-case for 32-bit
kernels. This is more obvious and hopefully avoids a similar bug
happening again in future.
Note that because 32-bit kernels never have VSX enabled the code doesn't
need to consider TS_FPRWIDTH/OFFSET at all. Add a BUILD_BUG_ON() to
ensure that 32-bit && VSX is never enabled.
Fixes: 87fec0514f61 ("powerpc: PTRACE_PEEKUSR/PTRACE_POKEUSER of FPR registers in little endian builds")
Cc: stable(a)vger.kernel.org # v3.13+
Reported-by: Ariel Miculas <ariel.miculas(a)belden.com>
Tested-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220609133245.573565-1-mpe@ellerman.id.au
---
arch/powerpc/kernel/ptrace.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index e08b32ccf1d9..073117ba75fe 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -3007,8 +3007,13 @@ long arch_ptrace(struct task_struct *child, long request,
flush_fp_to_thread(child);
if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&tmp, &child->thread.TS_FPR(fpidx),
- sizeof(long));
+ if (IS_ENABLED(CONFIG_PPC32)) {
+ // On 32-bit the index we are passed refers to 32-bit words
+ tmp = ((u32 *)child->thread.fp_state.fpr)[fpidx];
+ } else {
+ memcpy(&tmp, &child->thread.TS_FPR(fpidx),
+ sizeof(long));
+ }
else
tmp = child->thread.fp_state.fpscr;
}
@@ -3040,8 +3045,13 @@ long arch_ptrace(struct task_struct *child, long request,
flush_fp_to_thread(child);
if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&child->thread.TS_FPR(fpidx), &data,
- sizeof(long));
+ if (IS_ENABLED(CONFIG_PPC32)) {
+ // On 32-bit the index we are passed refers to 32-bit words
+ ((u32 *)child->thread.fp_state.fpr)[fpidx] = data;
+ } else {
+ memcpy(&child->thread.TS_FPR(fpidx), &data,
+ sizeof(long));
+ }
else
child->thread.fp_state.fpscr = data;
ret = 0;
--
2.35.3
commit 8e1278444446fc97778a5e5c99bca1ce0bbc5ec9 upstream.
The ptrace PEEKUSR/POKEUSR (aka PEEKUSER/POKEUSER) API allows a process
to read/write registers of another process.
To get/set a register, the API takes an index into an imaginary address
space called the "USER area", where the registers of the process are
laid out in some fashion.
The kernel then maps that index to a particular register in its own data
structures and gets/sets the value.
The API only allows a single machine-word to be read/written at a time.
So 4 bytes on 32-bit kernels and 8 bytes on 64-bit kernels.
The way floating point registers (FPRs) are addressed is somewhat
complicated, because double precision float values are 64-bit even on
32-bit CPUs. That means on 32-bit kernels each FPR occupies two
word-sized locations in the USER area. On 64-bit kernels each FPR
occupies one word-sized location in the USER area.
Internally the kernel stores the FPRs in an array of u64s, or if VSX is
enabled, an array of pairs of u64s where one half of each pair stores
the FPR. Which half of the pair stores the FPR depends on the kernel's
endianness.
To handle the different layouts of the FPRs depending on VSX/no-VSX and
big/little endian, the TS_FPR() macro was introduced.
Unfortunately the TS_FPR() macro does not take into account the fact
that the addressing of each FPR differs between 32-bit and 64-bit
kernels. It just takes the index into the "USER area" passed from
userspace and indexes into the fp_state.fpr array.
On 32-bit there are 64 indexes that address FPRs, but only 32 entries in
the fp_state.fpr array, meaning the user can read/write 256 bytes past
the end of the array. Because the fp_state sits in the middle of the
thread_struct there are various fields than can be overwritten,
including some pointers. As such it may be exploitable.
It has also been observed to cause systems to hang or otherwise
misbehave when using gdbserver, and is probably the root cause of this
report which could not be easily reproduced:
https://lore.kernel.org/linuxppc-dev/dc38afe9-6b78-f3f5-666b-986939e40fc6@k…
Rather than trying to make the TS_FPR() macro even more complicated to
fix the bug, or add more macros, instead add a special-case for 32-bit
kernels. This is more obvious and hopefully avoids a similar bug
happening again in future.
Note that because 32-bit kernels never have VSX enabled the code doesn't
need to consider TS_FPRWIDTH/OFFSET at all. Add a BUILD_BUG_ON() to
ensure that 32-bit && VSX is never enabled.
Fixes: 87fec0514f61 ("powerpc: PTRACE_PEEKUSR/PTRACE_POKEUSER of FPR registers in little endian builds")
Cc: stable(a)vger.kernel.org # v3.13+
Reported-by: Ariel Miculas <ariel.miculas(a)belden.com>
Tested-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220609133245.573565-1-mpe@ellerman.id.au
---
arch/powerpc/kernel/ptrace.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 8c92febf5f44..63bfc5250b67 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -3014,8 +3014,13 @@ long arch_ptrace(struct task_struct *child, long request,
flush_fp_to_thread(child);
if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&tmp, &child->thread.TS_FPR(fpidx),
- sizeof(long));
+ if (IS_ENABLED(CONFIG_PPC32)) {
+ // On 32-bit the index we are passed refers to 32-bit words
+ tmp = ((u32 *)child->thread.fp_state.fpr)[fpidx];
+ } else {
+ memcpy(&tmp, &child->thread.TS_FPR(fpidx),
+ sizeof(long));
+ }
else
tmp = child->thread.fp_state.fpscr;
}
@@ -3047,8 +3052,13 @@ long arch_ptrace(struct task_struct *child, long request,
flush_fp_to_thread(child);
if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&child->thread.TS_FPR(fpidx), &data,
- sizeof(long));
+ if (IS_ENABLED(CONFIG_PPC32)) {
+ // On 32-bit the index we are passed refers to 32-bit words
+ ((u32 *)child->thread.fp_state.fpr)[fpidx] = data;
+ } else {
+ memcpy(&child->thread.TS_FPR(fpidx), &data,
+ sizeof(long));
+ }
else
child->thread.fp_state.fpscr = data;
ret = 0;
@@ -3398,4 +3408,7 @@ void __init pt_regs_check(void)
offsetof(struct user_pt_regs, result));
BUILD_BUG_ON(sizeof(struct user_pt_regs) > sizeof(struct pt_regs));
+
+ // ptrace_get/put_fpr() rely on PPC32 and VSX being incompatible
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_PPC32) && IS_ENABLED(CONFIG_VSX));
}
--
2.35.3
commit 8e1278444446fc97778a5e5c99bca1ce0bbc5ec9 upstream.
The ptrace PEEKUSR/POKEUSR (aka PEEKUSER/POKEUSER) API allows a process
to read/write registers of another process.
To get/set a register, the API takes an index into an imaginary address
space called the "USER area", where the registers of the process are
laid out in some fashion.
The kernel then maps that index to a particular register in its own data
structures and gets/sets the value.
The API only allows a single machine-word to be read/written at a time.
So 4 bytes on 32-bit kernels and 8 bytes on 64-bit kernels.
The way floating point registers (FPRs) are addressed is somewhat
complicated, because double precision float values are 64-bit even on
32-bit CPUs. That means on 32-bit kernels each FPR occupies two
word-sized locations in the USER area. On 64-bit kernels each FPR
occupies one word-sized location in the USER area.
Internally the kernel stores the FPRs in an array of u64s, or if VSX is
enabled, an array of pairs of u64s where one half of each pair stores
the FPR. Which half of the pair stores the FPR depends on the kernel's
endianness.
To handle the different layouts of the FPRs depending on VSX/no-VSX and
big/little endian, the TS_FPR() macro was introduced.
Unfortunately the TS_FPR() macro does not take into account the fact
that the addressing of each FPR differs between 32-bit and 64-bit
kernels. It just takes the index into the "USER area" passed from
userspace and indexes into the fp_state.fpr array.
On 32-bit there are 64 indexes that address FPRs, but only 32 entries in
the fp_state.fpr array, meaning the user can read/write 256 bytes past
the end of the array. Because the fp_state sits in the middle of the
thread_struct there are various fields than can be overwritten,
including some pointers. As such it may be exploitable.
It has also been observed to cause systems to hang or otherwise
misbehave when using gdbserver, and is probably the root cause of this
report which could not be easily reproduced:
https://lore.kernel.org/linuxppc-dev/dc38afe9-6b78-f3f5-666b-986939e40fc6@k…
Rather than trying to make the TS_FPR() macro even more complicated to
fix the bug, or add more macros, instead add a special-case for 32-bit
kernels. This is more obvious and hopefully avoids a similar bug
happening again in future.
Note that because 32-bit kernels never have VSX enabled the code doesn't
need to consider TS_FPRWIDTH/OFFSET at all. Add a BUILD_BUG_ON() to
ensure that 32-bit && VSX is never enabled.
Fixes: 87fec0514f61 ("powerpc: PTRACE_PEEKUSR/PTRACE_POKEUSER of FPR registers in little endian builds")
Cc: stable(a)vger.kernel.org # v3.13+
Reported-by: Ariel Miculas <ariel.miculas(a)belden.com>
Tested-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220609133245.573565-1-mpe@ellerman.id.au
---
arch/powerpc/kernel/ptrace/ptrace.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
index f6e51be47c6e..9ea9ee513ae1 100644
--- a/arch/powerpc/kernel/ptrace/ptrace.c
+++ b/arch/powerpc/kernel/ptrace/ptrace.c
@@ -75,8 +75,13 @@ long arch_ptrace(struct task_struct *child, long request,
flush_fp_to_thread(child);
if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&tmp, &child->thread.TS_FPR(fpidx),
- sizeof(long));
+ if (IS_ENABLED(CONFIG_PPC32)) {
+ // On 32-bit the index we are passed refers to 32-bit words
+ tmp = ((u32 *)child->thread.fp_state.fpr)[fpidx];
+ } else {
+ memcpy(&tmp, &child->thread.TS_FPR(fpidx),
+ sizeof(long));
+ }
else
tmp = child->thread.fp_state.fpscr;
}
@@ -108,8 +113,13 @@ long arch_ptrace(struct task_struct *child, long request,
flush_fp_to_thread(child);
if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&child->thread.TS_FPR(fpidx), &data,
- sizeof(long));
+ if (IS_ENABLED(CONFIG_PPC32)) {
+ // On 32-bit the index we are passed refers to 32-bit words
+ ((u32 *)child->thread.fp_state.fpr)[fpidx] = data;
+ } else {
+ memcpy(&child->thread.TS_FPR(fpidx), &data,
+ sizeof(long));
+ }
else
child->thread.fp_state.fpscr = data;
ret = 0;
@@ -478,4 +488,7 @@ void __init pt_regs_check(void)
* real registers.
*/
BUILD_BUG_ON(PT_DSCR < sizeof(struct user_pt_regs) / sizeof(unsigned long));
+
+ // ptrace_get/put_fpr() rely on PPC32 and VSX being incompatible
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_PPC32) && IS_ENABLED(CONFIG_VSX));
}
--
2.35.3
commit 8e1278444446fc97778a5e5c99bca1ce0bbc5ec9 upstream.
The ptrace PEEKUSR/POKEUSR (aka PEEKUSER/POKEUSER) API allows a process
to read/write registers of another process.
To get/set a register, the API takes an index into an imaginary address
space called the "USER area", where the registers of the process are
laid out in some fashion.
The kernel then maps that index to a particular register in its own data
structures and gets/sets the value.
The API only allows a single machine-word to be read/written at a time.
So 4 bytes on 32-bit kernels and 8 bytes on 64-bit kernels.
The way floating point registers (FPRs) are addressed is somewhat
complicated, because double precision float values are 64-bit even on
32-bit CPUs. That means on 32-bit kernels each FPR occupies two
word-sized locations in the USER area. On 64-bit kernels each FPR
occupies one word-sized location in the USER area.
Internally the kernel stores the FPRs in an array of u64s, or if VSX is
enabled, an array of pairs of u64s where one half of each pair stores
the FPR. Which half of the pair stores the FPR depends on the kernel's
endianness.
To handle the different layouts of the FPRs depending on VSX/no-VSX and
big/little endian, the TS_FPR() macro was introduced.
Unfortunately the TS_FPR() macro does not take into account the fact
that the addressing of each FPR differs between 32-bit and 64-bit
kernels. It just takes the index into the "USER area" passed from
userspace and indexes into the fp_state.fpr array.
On 32-bit there are 64 indexes that address FPRs, but only 32 entries in
the fp_state.fpr array, meaning the user can read/write 256 bytes past
the end of the array. Because the fp_state sits in the middle of the
thread_struct there are various fields than can be overwritten,
including some pointers. As such it may be exploitable.
It has also been observed to cause systems to hang or otherwise
misbehave when using gdbserver, and is probably the root cause of this
report which could not be easily reproduced:
https://lore.kernel.org/linuxppc-dev/dc38afe9-6b78-f3f5-666b-986939e40fc6@k…
Rather than trying to make the TS_FPR() macro even more complicated to
fix the bug, or add more macros, instead add a special-case for 32-bit
kernels. This is more obvious and hopefully avoids a similar bug
happening again in future.
Note that because 32-bit kernels never have VSX enabled the code doesn't
need to consider TS_FPRWIDTH/OFFSET at all. Add a BUILD_BUG_ON() to
ensure that 32-bit && VSX is never enabled.
Fixes: 87fec0514f61 ("powerpc: PTRACE_PEEKUSR/PTRACE_POKEUSER of FPR registers in little endian builds")
Cc: stable(a)vger.kernel.org # v3.13+
Reported-by: Ariel Miculas <ariel.miculas(a)belden.com>
Tested-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220609133245.573565-1-mpe@ellerman.id.au
---
arch/powerpc/kernel/ptrace/ptrace-fpu.c | 20 ++++++++++++++------
arch/powerpc/kernel/ptrace/ptrace.c | 3 +++
2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/kernel/ptrace/ptrace-fpu.c b/arch/powerpc/kernel/ptrace/ptrace-fpu.c
index 5dca19361316..09c49632bfe5 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-fpu.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-fpu.c
@@ -17,9 +17,13 @@ int ptrace_get_fpr(struct task_struct *child, int index, unsigned long *data)
#ifdef CONFIG_PPC_FPU_REGS
flush_fp_to_thread(child);
- if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(data, &child->thread.TS_FPR(fpidx), sizeof(long));
- else
+ if (fpidx < (PT_FPSCR - PT_FPR0)) {
+ if (IS_ENABLED(CONFIG_PPC32))
+ // On 32-bit the index we are passed refers to 32-bit words
+ *data = ((u32 *)child->thread.fp_state.fpr)[fpidx];
+ else
+ memcpy(data, &child->thread.TS_FPR(fpidx), sizeof(long));
+ } else
*data = child->thread.fp_state.fpscr;
#else
*data = 0;
@@ -39,9 +43,13 @@ int ptrace_put_fpr(struct task_struct *child, int index, unsigned long data)
#ifdef CONFIG_PPC_FPU_REGS
flush_fp_to_thread(child);
- if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&child->thread.TS_FPR(fpidx), &data, sizeof(long));
- else
+ if (fpidx < (PT_FPSCR - PT_FPR0)) {
+ if (IS_ENABLED(CONFIG_PPC32))
+ // On 32-bit the index we are passed refers to 32-bit words
+ ((u32 *)child->thread.fp_state.fpr)[fpidx] = data;
+ else
+ memcpy(&child->thread.TS_FPR(fpidx), &data, sizeof(long));
+ } else
child->thread.fp_state.fpscr = data;
#endif
diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
index 7c7093c17c45..ff5e46dbf7c5 100644
--- a/arch/powerpc/kernel/ptrace/ptrace.c
+++ b/arch/powerpc/kernel/ptrace/ptrace.c
@@ -446,4 +446,7 @@ void __init pt_regs_check(void)
* real registers.
*/
BUILD_BUG_ON(PT_DSCR < sizeof(struct user_pt_regs) / sizeof(unsigned long));
+
+ // ptrace_get/put_fpr() rely on PPC32 and VSX being incompatible
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_PPC32) && IS_ENABLED(CONFIG_VSX));
}
--
2.35.3
commit 8e1278444446fc97778a5e5c99bca1ce0bbc5ec9 upstream.
The ptrace PEEKUSR/POKEUSR (aka PEEKUSER/POKEUSER) API allows a process
to read/write registers of another process.
To get/set a register, the API takes an index into an imaginary address
space called the "USER area", where the registers of the process are
laid out in some fashion.
The kernel then maps that index to a particular register in its own data
structures and gets/sets the value.
The API only allows a single machine-word to be read/written at a time.
So 4 bytes on 32-bit kernels and 8 bytes on 64-bit kernels.
The way floating point registers (FPRs) are addressed is somewhat
complicated, because double precision float values are 64-bit even on
32-bit CPUs. That means on 32-bit kernels each FPR occupies two
word-sized locations in the USER area. On 64-bit kernels each FPR
occupies one word-sized location in the USER area.
Internally the kernel stores the FPRs in an array of u64s, or if VSX is
enabled, an array of pairs of u64s where one half of each pair stores
the FPR. Which half of the pair stores the FPR depends on the kernel's
endianness.
To handle the different layouts of the FPRs depending on VSX/no-VSX and
big/little endian, the TS_FPR() macro was introduced.
Unfortunately the TS_FPR() macro does not take into account the fact
that the addressing of each FPR differs between 32-bit and 64-bit
kernels. It just takes the index into the "USER area" passed from
userspace and indexes into the fp_state.fpr array.
On 32-bit there are 64 indexes that address FPRs, but only 32 entries in
the fp_state.fpr array, meaning the user can read/write 256 bytes past
the end of the array. Because the fp_state sits in the middle of the
thread_struct there are various fields than can be overwritten,
including some pointers. As such it may be exploitable.
It has also been observed to cause systems to hang or otherwise
misbehave when using gdbserver, and is probably the root cause of this
report which could not be easily reproduced:
https://lore.kernel.org/linuxppc-dev/dc38afe9-6b78-f3f5-666b-986939e40fc6@k…
Rather than trying to make the TS_FPR() macro even more complicated to
fix the bug, or add more macros, instead add a special-case for 32-bit
kernels. This is more obvious and hopefully avoids a similar bug
happening again in future.
Note that because 32-bit kernels never have VSX enabled the code doesn't
need to consider TS_FPRWIDTH/OFFSET at all. Add a BUILD_BUG_ON() to
ensure that 32-bit && VSX is never enabled.
Fixes: 87fec0514f61 ("powerpc: PTRACE_PEEKUSR/PTRACE_POKEUSER of FPR registers in little endian builds")
Cc: stable(a)vger.kernel.org # v3.13+
Reported-by: Ariel Miculas <ariel.miculas(a)belden.com>
Tested-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220609133245.573565-1-mpe@ellerman.id.au
---
arch/powerpc/kernel/ptrace/ptrace-fpu.c | 20 ++++++++++++++------
arch/powerpc/kernel/ptrace/ptrace.c | 3 +++
2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/kernel/ptrace/ptrace-fpu.c b/arch/powerpc/kernel/ptrace/ptrace-fpu.c
index 5dca19361316..09c49632bfe5 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-fpu.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-fpu.c
@@ -17,9 +17,13 @@ int ptrace_get_fpr(struct task_struct *child, int index, unsigned long *data)
#ifdef CONFIG_PPC_FPU_REGS
flush_fp_to_thread(child);
- if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(data, &child->thread.TS_FPR(fpidx), sizeof(long));
- else
+ if (fpidx < (PT_FPSCR - PT_FPR0)) {
+ if (IS_ENABLED(CONFIG_PPC32))
+ // On 32-bit the index we are passed refers to 32-bit words
+ *data = ((u32 *)child->thread.fp_state.fpr)[fpidx];
+ else
+ memcpy(data, &child->thread.TS_FPR(fpidx), sizeof(long));
+ } else
*data = child->thread.fp_state.fpscr;
#else
*data = 0;
@@ -39,9 +43,13 @@ int ptrace_put_fpr(struct task_struct *child, int index, unsigned long data)
#ifdef CONFIG_PPC_FPU_REGS
flush_fp_to_thread(child);
- if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&child->thread.TS_FPR(fpidx), &data, sizeof(long));
- else
+ if (fpidx < (PT_FPSCR - PT_FPR0)) {
+ if (IS_ENABLED(CONFIG_PPC32))
+ // On 32-bit the index we are passed refers to 32-bit words
+ ((u32 *)child->thread.fp_state.fpr)[fpidx] = data;
+ else
+ memcpy(&child->thread.TS_FPR(fpidx), &data, sizeof(long));
+ } else
child->thread.fp_state.fpscr = data;
#endif
diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
index c43f77e2ac31..6d45fa288015 100644
--- a/arch/powerpc/kernel/ptrace/ptrace.c
+++ b/arch/powerpc/kernel/ptrace/ptrace.c
@@ -445,4 +445,7 @@ void __init pt_regs_check(void)
* real registers.
*/
BUILD_BUG_ON(PT_DSCR < sizeof(struct user_pt_regs) / sizeof(unsigned long));
+
+ // ptrace_get/put_fpr() rely on PPC32 and VSX being incompatible
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_PPC32) && IS_ENABLED(CONFIG_VSX));
}
--
2.35.3
commit 8e1278444446fc97778a5e5c99bca1ce0bbc5ec9 upstream.
The ptrace PEEKUSR/POKEUSR (aka PEEKUSER/POKEUSER) API allows a process
to read/write registers of another process.
To get/set a register, the API takes an index into an imaginary address
space called the "USER area", where the registers of the process are
laid out in some fashion.
The kernel then maps that index to a particular register in its own data
structures and gets/sets the value.
The API only allows a single machine-word to be read/written at a time.
So 4 bytes on 32-bit kernels and 8 bytes on 64-bit kernels.
The way floating point registers (FPRs) are addressed is somewhat
complicated, because double precision float values are 64-bit even on
32-bit CPUs. That means on 32-bit kernels each FPR occupies two
word-sized locations in the USER area. On 64-bit kernels each FPR
occupies one word-sized location in the USER area.
Internally the kernel stores the FPRs in an array of u64s, or if VSX is
enabled, an array of pairs of u64s where one half of each pair stores
the FPR. Which half of the pair stores the FPR depends on the kernel's
endianness.
To handle the different layouts of the FPRs depending on VSX/no-VSX and
big/little endian, the TS_FPR() macro was introduced.
Unfortunately the TS_FPR() macro does not take into account the fact
that the addressing of each FPR differs between 32-bit and 64-bit
kernels. It just takes the index into the "USER area" passed from
userspace and indexes into the fp_state.fpr array.
On 32-bit there are 64 indexes that address FPRs, but only 32 entries in
the fp_state.fpr array, meaning the user can read/write 256 bytes past
the end of the array. Because the fp_state sits in the middle of the
thread_struct there are various fields than can be overwritten,
including some pointers. As such it may be exploitable.
It has also been observed to cause systems to hang or otherwise
misbehave when using gdbserver, and is probably the root cause of this
report which could not be easily reproduced:
https://lore.kernel.org/linuxppc-dev/dc38afe9-6b78-f3f5-666b-986939e40fc6@k…
Rather than trying to make the TS_FPR() macro even more complicated to
fix the bug, or add more macros, instead add a special-case for 32-bit
kernels. This is more obvious and hopefully avoids a similar bug
happening again in future.
Note that because 32-bit kernels never have VSX enabled the code doesn't
need to consider TS_FPRWIDTH/OFFSET at all. Add a BUILD_BUG_ON() to
ensure that 32-bit && VSX is never enabled.
Fixes: 87fec0514f61 ("powerpc: PTRACE_PEEKUSR/PTRACE_POKEUSER of FPR registers in little endian builds")
Cc: stable(a)vger.kernel.org # v3.13+
Reported-by: Ariel Miculas <ariel.miculas(a)belden.com>
Tested-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220609133245.573565-1-mpe@ellerman.id.au
---
arch/powerpc/kernel/ptrace/ptrace-fpu.c | 20 ++++++++++++++------
arch/powerpc/kernel/ptrace/ptrace.c | 3 +++
2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/kernel/ptrace/ptrace-fpu.c b/arch/powerpc/kernel/ptrace/ptrace-fpu.c
index 5dca19361316..09c49632bfe5 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-fpu.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-fpu.c
@@ -17,9 +17,13 @@ int ptrace_get_fpr(struct task_struct *child, int index, unsigned long *data)
#ifdef CONFIG_PPC_FPU_REGS
flush_fp_to_thread(child);
- if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(data, &child->thread.TS_FPR(fpidx), sizeof(long));
- else
+ if (fpidx < (PT_FPSCR - PT_FPR0)) {
+ if (IS_ENABLED(CONFIG_PPC32))
+ // On 32-bit the index we are passed refers to 32-bit words
+ *data = ((u32 *)child->thread.fp_state.fpr)[fpidx];
+ else
+ memcpy(data, &child->thread.TS_FPR(fpidx), sizeof(long));
+ } else
*data = child->thread.fp_state.fpscr;
#else
*data = 0;
@@ -39,9 +43,13 @@ int ptrace_put_fpr(struct task_struct *child, int index, unsigned long data)
#ifdef CONFIG_PPC_FPU_REGS
flush_fp_to_thread(child);
- if (fpidx < (PT_FPSCR - PT_FPR0))
- memcpy(&child->thread.TS_FPR(fpidx), &data, sizeof(long));
- else
+ if (fpidx < (PT_FPSCR - PT_FPR0)) {
+ if (IS_ENABLED(CONFIG_PPC32))
+ // On 32-bit the index we are passed refers to 32-bit words
+ ((u32 *)child->thread.fp_state.fpr)[fpidx] = data;
+ else
+ memcpy(&child->thread.TS_FPR(fpidx), &data, sizeof(long));
+ } else
child->thread.fp_state.fpscr = data;
#endif
diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
index 6d5026a9db4f..eab6fa59d9d2 100644
--- a/arch/powerpc/kernel/ptrace/ptrace.c
+++ b/arch/powerpc/kernel/ptrace/ptrace.c
@@ -450,4 +450,7 @@ void __init pt_regs_check(void)
#else
BUILD_BUG_ON(IS_ENABLED(CONFIG_HAVE_FUNCTION_DESCRIPTORS));
#endif
+
+ // ptrace_get/put_fpr() rely on PPC32 and VSX being incompatible
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_PPC32) && IS_ENABLED(CONFIG_VSX));
}
--
2.35.3