The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 467dc47ea99c56e966e99d09dae54869850abeeb Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 27 May 2020 11:16:07 +0100
Subject: [PATCH] btrfs: fix space_info bytes_may_use underflow after nocow
buffered write
When doing a buffered write we always try to reserve data space for it,
even when the file has the NOCOW bit set or the write falls into a file
range covered by a prealloc extent. This is done both because it is
expensive to check if we can do a nocow write (checking if an extent is
shared through reflinks or if there's a hole in the range for example),
and because when writeback starts we might actually need to fallback to
COW mode (for example the block group containing the target extents was
turned into RO mode due to a scrub or balance).
When we are unable to reserve data space we check if we can do a nocow
write, and if we can, we proceed with dirtying the pages and setting up
the range for delalloc. In this case the bytes_may_use counter of the
data space_info object is not incremented, unlike in the case where we
are able to reserve data space (done through btrfs_check_data_free_space()
which calls btrfs_alloc_data_chunk_ondemand()).
Later when running delalloc we attempt to start writeback in nocow mode
but we might revert back to cow mode, for example because in the meanwhile
a block group was turned into RO mode by a scrub or relocation. The cow
path after successfully allocating an extent ends up calling
btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of
the data space_info object to have been incremented before - but we did
not do it when the buffered write started, since there was not enough
available data space. So btrfs_add_reserved_bytes() ends up decrementing
the bytes_may_use counter anyway, and when the counter's current value
is smaller then the size of the allocated extent we get a stack trace
like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: writeback wb_workfn (flush-btrfs-1754)
RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
Code: ff ff 48 (...)
RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287
RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410
RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000
R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400
R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800
FS: 0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
find_free_extent+0x4a0/0x16c0 [btrfs]
btrfs_reserve_extent+0x91/0x180 [btrfs]
cow_file_range+0x12d/0x490 [btrfs]
run_delalloc_nocow+0x341/0xa40 [btrfs]
btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
? find_lock_delalloc_range+0x221/0x250 [btrfs]
writepage_delalloc+0xe8/0x150 [btrfs]
__extent_writepage+0xe8/0x4c0 [btrfs]
extent_write_cache_pages+0x237/0x530 [btrfs]
? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs]
extent_writepages+0x44/0xa0 [btrfs]
do_writepages+0x23/0x80
__writeback_single_inode+0x59/0x700
writeback_sb_inodes+0x267/0x5f0
__writeback_inodes_wb+0x87/0xe0
wb_writeback+0x382/0x590
? wb_workfn+0x4a2/0x6c0
wb_workfn+0x4a2/0x6c0
process_one_work+0x26d/0x6a0
worker_thread+0x4f/0x3e0
? process_one_work+0x6a0/0x6a0
kthread+0x103/0x140
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x3a/0x50
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace f9f6ef8ec4cd8ec9 ]---
So to fix this, when falling back into cow mode check if space was not
reserved, by testing for the bit EXTENT_NORESERVE in the respective file
range, and if not, increment the bytes_may_use counter for the data
space_info object. Also clear the EXTENT_NORESERVE bit from the range, so
that if the cow path fails it decrements the bytes_may_use counter when
clearing the delalloc range (through the btrfs_clear_delalloc_extent()
callback).
Fixes: 7ee9e4405f264e ("Btrfs: check if we can nocow if we don't have data space")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1b6cd937f214..486b1da2fc5c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -48,6 +48,7 @@
#include "qgroup.h"
#include "delalloc-space.h"
#include "block-group.h"
+#include "space-info.h"
struct btrfs_iget_args {
u64 ino;
@@ -1354,6 +1355,56 @@ static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
return 1;
}
+static int fallback_to_cow(struct inode *inode, struct page *locked_page,
+ const u64 start, const u64 end,
+ int *page_started, unsigned long *nr_written)
+{
+ struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+ u64 range_start = start;
+ u64 count;
+
+ /*
+ * If EXTENT_NORESERVE is set it means that when the buffered write was
+ * made we had not enough available data space and therefore we did not
+ * reserve data space for it, since we though we could do NOCOW for the
+ * respective file range (either there is prealloc extent or the inode
+ * has the NOCOW bit set).
+ *
+ * However when we need to fallback to COW mode (because for example the
+ * block group for the corresponding extent was turned to RO mode by a
+ * scrub or relocation) we need to do the following:
+ *
+ * 1) We increment the bytes_may_use counter of the data space info.
+ * If COW succeeds, it allocates a new data extent and after doing
+ * that it decrements the space info's bytes_may_use counter and
+ * increments its bytes_reserved counter by the same amount (we do
+ * this at btrfs_add_reserved_bytes()). So we need to increment the
+ * bytes_may_use counter to compensate (when space is reserved at
+ * buffered write time, the bytes_may_use counter is incremented);
+ *
+ * 2) We clear the EXTENT_NORESERVE bit from the range. We do this so
+ * that if the COW path fails for any reason, it decrements (through
+ * extent_clear_unlock_delalloc()) the bytes_may_use counter of the
+ * data space info, which we incremented in the step above.
+ */
+ count = count_range_bits(io_tree, &range_start, end, end + 1 - start,
+ EXTENT_NORESERVE, 0);
+ if (count > 0) {
+ struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+ struct btrfs_space_info *sinfo = fs_info->data_sinfo;
+
+ spin_lock(&sinfo->lock);
+ btrfs_space_info_update_bytes_may_use(fs_info, sinfo, count);
+ spin_unlock(&sinfo->lock);
+
+ clear_extent_bit(io_tree, start, end, EXTENT_NORESERVE, 0, 0,
+ NULL);
+ }
+
+ return cow_file_range(inode, locked_page, start, end, page_started,
+ nr_written, 1);
+}
+
/*
* when nowcow writeback call back. This checks for snapshots or COW copies
* of the extents that exist in the file, and COWs the file as required.
@@ -1601,9 +1652,9 @@ static noinline int run_delalloc_nocow(struct inode *inode,
* NOCOW, following one which needs to be COW'ed
*/
if (cow_start != (u64)-1) {
- ret = cow_file_range(inode, locked_page,
- cow_start, found_key.offset - 1,
- page_started, nr_written, 1);
+ ret = fallback_to_cow(inode, locked_page, cow_start,
+ found_key.offset - 1,
+ page_started, nr_written);
if (ret) {
if (nocow)
btrfs_dec_nocow_writers(fs_info,
@@ -1692,8 +1743,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
if (cow_start != (u64)-1) {
cur_offset = end;
- ret = cow_file_range(inode, locked_page, cow_start, end,
- page_started, nr_written, 1);
+ ret = fallback_to_cow(inode, locked_page, cow_start, end,
+ page_started, nr_written);
if (ret)
goto error;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 467dc47ea99c56e966e99d09dae54869850abeeb Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 27 May 2020 11:16:07 +0100
Subject: [PATCH] btrfs: fix space_info bytes_may_use underflow after nocow
buffered write
When doing a buffered write we always try to reserve data space for it,
even when the file has the NOCOW bit set or the write falls into a file
range covered by a prealloc extent. This is done both because it is
expensive to check if we can do a nocow write (checking if an extent is
shared through reflinks or if there's a hole in the range for example),
and because when writeback starts we might actually need to fallback to
COW mode (for example the block group containing the target extents was
turned into RO mode due to a scrub or balance).
When we are unable to reserve data space we check if we can do a nocow
write, and if we can, we proceed with dirtying the pages and setting up
the range for delalloc. In this case the bytes_may_use counter of the
data space_info object is not incremented, unlike in the case where we
are able to reserve data space (done through btrfs_check_data_free_space()
which calls btrfs_alloc_data_chunk_ondemand()).
Later when running delalloc we attempt to start writeback in nocow mode
but we might revert back to cow mode, for example because in the meanwhile
a block group was turned into RO mode by a scrub or relocation. The cow
path after successfully allocating an extent ends up calling
btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of
the data space_info object to have been incremented before - but we did
not do it when the buffered write started, since there was not enough
available data space. So btrfs_add_reserved_bytes() ends up decrementing
the bytes_may_use counter anyway, and when the counter's current value
is smaller then the size of the allocated extent we get a stack trace
like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: writeback wb_workfn (flush-btrfs-1754)
RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
Code: ff ff 48 (...)
RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287
RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410
RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000
R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400
R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800
FS: 0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
find_free_extent+0x4a0/0x16c0 [btrfs]
btrfs_reserve_extent+0x91/0x180 [btrfs]
cow_file_range+0x12d/0x490 [btrfs]
run_delalloc_nocow+0x341/0xa40 [btrfs]
btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
? find_lock_delalloc_range+0x221/0x250 [btrfs]
writepage_delalloc+0xe8/0x150 [btrfs]
__extent_writepage+0xe8/0x4c0 [btrfs]
extent_write_cache_pages+0x237/0x530 [btrfs]
? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs]
extent_writepages+0x44/0xa0 [btrfs]
do_writepages+0x23/0x80
__writeback_single_inode+0x59/0x700
writeback_sb_inodes+0x267/0x5f0
__writeback_inodes_wb+0x87/0xe0
wb_writeback+0x382/0x590
? wb_workfn+0x4a2/0x6c0
wb_workfn+0x4a2/0x6c0
process_one_work+0x26d/0x6a0
worker_thread+0x4f/0x3e0
? process_one_work+0x6a0/0x6a0
kthread+0x103/0x140
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x3a/0x50
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace f9f6ef8ec4cd8ec9 ]---
So to fix this, when falling back into cow mode check if space was not
reserved, by testing for the bit EXTENT_NORESERVE in the respective file
range, and if not, increment the bytes_may_use counter for the data
space_info object. Also clear the EXTENT_NORESERVE bit from the range, so
that if the cow path fails it decrements the bytes_may_use counter when
clearing the delalloc range (through the btrfs_clear_delalloc_extent()
callback).
Fixes: 7ee9e4405f264e ("Btrfs: check if we can nocow if we don't have data space")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1b6cd937f214..486b1da2fc5c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -48,6 +48,7 @@
#include "qgroup.h"
#include "delalloc-space.h"
#include "block-group.h"
+#include "space-info.h"
struct btrfs_iget_args {
u64 ino;
@@ -1354,6 +1355,56 @@ static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
return 1;
}
+static int fallback_to_cow(struct inode *inode, struct page *locked_page,
+ const u64 start, const u64 end,
+ int *page_started, unsigned long *nr_written)
+{
+ struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+ u64 range_start = start;
+ u64 count;
+
+ /*
+ * If EXTENT_NORESERVE is set it means that when the buffered write was
+ * made we had not enough available data space and therefore we did not
+ * reserve data space for it, since we though we could do NOCOW for the
+ * respective file range (either there is prealloc extent or the inode
+ * has the NOCOW bit set).
+ *
+ * However when we need to fallback to COW mode (because for example the
+ * block group for the corresponding extent was turned to RO mode by a
+ * scrub or relocation) we need to do the following:
+ *
+ * 1) We increment the bytes_may_use counter of the data space info.
+ * If COW succeeds, it allocates a new data extent and after doing
+ * that it decrements the space info's bytes_may_use counter and
+ * increments its bytes_reserved counter by the same amount (we do
+ * this at btrfs_add_reserved_bytes()). So we need to increment the
+ * bytes_may_use counter to compensate (when space is reserved at
+ * buffered write time, the bytes_may_use counter is incremented);
+ *
+ * 2) We clear the EXTENT_NORESERVE bit from the range. We do this so
+ * that if the COW path fails for any reason, it decrements (through
+ * extent_clear_unlock_delalloc()) the bytes_may_use counter of the
+ * data space info, which we incremented in the step above.
+ */
+ count = count_range_bits(io_tree, &range_start, end, end + 1 - start,
+ EXTENT_NORESERVE, 0);
+ if (count > 0) {
+ struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+ struct btrfs_space_info *sinfo = fs_info->data_sinfo;
+
+ spin_lock(&sinfo->lock);
+ btrfs_space_info_update_bytes_may_use(fs_info, sinfo, count);
+ spin_unlock(&sinfo->lock);
+
+ clear_extent_bit(io_tree, start, end, EXTENT_NORESERVE, 0, 0,
+ NULL);
+ }
+
+ return cow_file_range(inode, locked_page, start, end, page_started,
+ nr_written, 1);
+}
+
/*
* when nowcow writeback call back. This checks for snapshots or COW copies
* of the extents that exist in the file, and COWs the file as required.
@@ -1601,9 +1652,9 @@ static noinline int run_delalloc_nocow(struct inode *inode,
* NOCOW, following one which needs to be COW'ed
*/
if (cow_start != (u64)-1) {
- ret = cow_file_range(inode, locked_page,
- cow_start, found_key.offset - 1,
- page_started, nr_written, 1);
+ ret = fallback_to_cow(inode, locked_page, cow_start,
+ found_key.offset - 1,
+ page_started, nr_written);
if (ret) {
if (nocow)
btrfs_dec_nocow_writers(fs_info,
@@ -1692,8 +1743,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
if (cow_start != (u64)-1) {
cur_offset = end;
- ret = cow_file_range(inode, locked_page, cow_start, end,
- page_started, nr_written, 1);
+ ret = fallback_to_cow(inode, locked_page, cow_start, end,
+ page_started, nr_written);
if (ret)
goto error;
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 467dc47ea99c56e966e99d09dae54869850abeeb Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 27 May 2020 11:16:07 +0100
Subject: [PATCH] btrfs: fix space_info bytes_may_use underflow after nocow
buffered write
When doing a buffered write we always try to reserve data space for it,
even when the file has the NOCOW bit set or the write falls into a file
range covered by a prealloc extent. This is done both because it is
expensive to check if we can do a nocow write (checking if an extent is
shared through reflinks or if there's a hole in the range for example),
and because when writeback starts we might actually need to fallback to
COW mode (for example the block group containing the target extents was
turned into RO mode due to a scrub or balance).
When we are unable to reserve data space we check if we can do a nocow
write, and if we can, we proceed with dirtying the pages and setting up
the range for delalloc. In this case the bytes_may_use counter of the
data space_info object is not incremented, unlike in the case where we
are able to reserve data space (done through btrfs_check_data_free_space()
which calls btrfs_alloc_data_chunk_ondemand()).
Later when running delalloc we attempt to start writeback in nocow mode
but we might revert back to cow mode, for example because in the meanwhile
a block group was turned into RO mode by a scrub or relocation. The cow
path after successfully allocating an extent ends up calling
btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of
the data space_info object to have been incremented before - but we did
not do it when the buffered write started, since there was not enough
available data space. So btrfs_add_reserved_bytes() ends up decrementing
the bytes_may_use counter anyway, and when the counter's current value
is smaller then the size of the allocated extent we get a stack trace
like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: writeback wb_workfn (flush-btrfs-1754)
RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
Code: ff ff 48 (...)
RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287
RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410
RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000
R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400
R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800
FS: 0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
find_free_extent+0x4a0/0x16c0 [btrfs]
btrfs_reserve_extent+0x91/0x180 [btrfs]
cow_file_range+0x12d/0x490 [btrfs]
run_delalloc_nocow+0x341/0xa40 [btrfs]
btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
? find_lock_delalloc_range+0x221/0x250 [btrfs]
writepage_delalloc+0xe8/0x150 [btrfs]
__extent_writepage+0xe8/0x4c0 [btrfs]
extent_write_cache_pages+0x237/0x530 [btrfs]
? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs]
extent_writepages+0x44/0xa0 [btrfs]
do_writepages+0x23/0x80
__writeback_single_inode+0x59/0x700
writeback_sb_inodes+0x267/0x5f0
__writeback_inodes_wb+0x87/0xe0
wb_writeback+0x382/0x590
? wb_workfn+0x4a2/0x6c0
wb_workfn+0x4a2/0x6c0
process_one_work+0x26d/0x6a0
worker_thread+0x4f/0x3e0
? process_one_work+0x6a0/0x6a0
kthread+0x103/0x140
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x3a/0x50
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace f9f6ef8ec4cd8ec9 ]---
So to fix this, when falling back into cow mode check if space was not
reserved, by testing for the bit EXTENT_NORESERVE in the respective file
range, and if not, increment the bytes_may_use counter for the data
space_info object. Also clear the EXTENT_NORESERVE bit from the range, so
that if the cow path fails it decrements the bytes_may_use counter when
clearing the delalloc range (through the btrfs_clear_delalloc_extent()
callback).
Fixes: 7ee9e4405f264e ("Btrfs: check if we can nocow if we don't have data space")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1b6cd937f214..486b1da2fc5c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -48,6 +48,7 @@
#include "qgroup.h"
#include "delalloc-space.h"
#include "block-group.h"
+#include "space-info.h"
struct btrfs_iget_args {
u64 ino;
@@ -1354,6 +1355,56 @@ static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
return 1;
}
+static int fallback_to_cow(struct inode *inode, struct page *locked_page,
+ const u64 start, const u64 end,
+ int *page_started, unsigned long *nr_written)
+{
+ struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+ u64 range_start = start;
+ u64 count;
+
+ /*
+ * If EXTENT_NORESERVE is set it means that when the buffered write was
+ * made we had not enough available data space and therefore we did not
+ * reserve data space for it, since we though we could do NOCOW for the
+ * respective file range (either there is prealloc extent or the inode
+ * has the NOCOW bit set).
+ *
+ * However when we need to fallback to COW mode (because for example the
+ * block group for the corresponding extent was turned to RO mode by a
+ * scrub or relocation) we need to do the following:
+ *
+ * 1) We increment the bytes_may_use counter of the data space info.
+ * If COW succeeds, it allocates a new data extent and after doing
+ * that it decrements the space info's bytes_may_use counter and
+ * increments its bytes_reserved counter by the same amount (we do
+ * this at btrfs_add_reserved_bytes()). So we need to increment the
+ * bytes_may_use counter to compensate (when space is reserved at
+ * buffered write time, the bytes_may_use counter is incremented);
+ *
+ * 2) We clear the EXTENT_NORESERVE bit from the range. We do this so
+ * that if the COW path fails for any reason, it decrements (through
+ * extent_clear_unlock_delalloc()) the bytes_may_use counter of the
+ * data space info, which we incremented in the step above.
+ */
+ count = count_range_bits(io_tree, &range_start, end, end + 1 - start,
+ EXTENT_NORESERVE, 0);
+ if (count > 0) {
+ struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+ struct btrfs_space_info *sinfo = fs_info->data_sinfo;
+
+ spin_lock(&sinfo->lock);
+ btrfs_space_info_update_bytes_may_use(fs_info, sinfo, count);
+ spin_unlock(&sinfo->lock);
+
+ clear_extent_bit(io_tree, start, end, EXTENT_NORESERVE, 0, 0,
+ NULL);
+ }
+
+ return cow_file_range(inode, locked_page, start, end, page_started,
+ nr_written, 1);
+}
+
/*
* when nowcow writeback call back. This checks for snapshots or COW copies
* of the extents that exist in the file, and COWs the file as required.
@@ -1601,9 +1652,9 @@ static noinline int run_delalloc_nocow(struct inode *inode,
* NOCOW, following one which needs to be COW'ed
*/
if (cow_start != (u64)-1) {
- ret = cow_file_range(inode, locked_page,
- cow_start, found_key.offset - 1,
- page_started, nr_written, 1);
+ ret = fallback_to_cow(inode, locked_page, cow_start,
+ found_key.offset - 1,
+ page_started, nr_written);
if (ret) {
if (nocow)
btrfs_dec_nocow_writers(fs_info,
@@ -1692,8 +1743,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
if (cow_start != (u64)-1) {
cur_offset = end;
- ret = cow_file_range(inode, locked_page, cow_start, end,
- page_started, nr_written, 1);
+ ret = fallback_to_cow(inode, locked_page, cow_start, end,
+ page_started, nr_written);
if (ret)
goto error;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e2c8e92d1140754073ad3799eb6620c76bab2078 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 27 May 2020 11:15:53 +0100
Subject: [PATCH] btrfs: fix wrong file range cleanup after an error filling
dealloc range
If an error happens while running dellaloc in COW mode for a range, we can
end up calling extent_clear_unlock_delalloc() for a range that goes beyond
our range's end offset by 1 byte, which affects 1 extra page. This results
in clearing bits and doing page operations (such as a page unlock) outside
our target range.
Fix that by calling extent_clear_unlock_delalloc() with an inclusive end
offset, instead of an exclusive end offset, at cow_file_range().
Fixes: a315e68f6e8b30 ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
CC: stable(a)vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4583f0763571..1b6cd937f214 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1141,7 +1141,7 @@ static noinline int cow_file_range(struct inode *inode,
*/
if (extent_reserved) {
extent_clear_unlock_delalloc(inode, start,
- start + cur_alloc_size,
+ start + cur_alloc_size - 1,
locked_page,
clear_bits,
page_ops);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e2c8e92d1140754073ad3799eb6620c76bab2078 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 27 May 2020 11:15:53 +0100
Subject: [PATCH] btrfs: fix wrong file range cleanup after an error filling
dealloc range
If an error happens while running dellaloc in COW mode for a range, we can
end up calling extent_clear_unlock_delalloc() for a range that goes beyond
our range's end offset by 1 byte, which affects 1 extra page. This results
in clearing bits and doing page operations (such as a page unlock) outside
our target range.
Fix that by calling extent_clear_unlock_delalloc() with an inclusive end
offset, instead of an exclusive end offset, at cow_file_range().
Fixes: a315e68f6e8b30 ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
CC: stable(a)vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4583f0763571..1b6cd937f214 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1141,7 +1141,7 @@ static noinline int cow_file_range(struct inode *inode,
*/
if (extent_reserved) {
extent_clear_unlock_delalloc(inode, start,
- start + cur_alloc_size,
+ start + cur_alloc_size - 1,
locked_page,
clear_bits,
page_ops);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e289f03ea79bbc6574b78ac25682555423a91cbb Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 18 May 2020 12:14:50 +0100
Subject: [PATCH] btrfs: fix corrupt log due to concurrent fsync of inodes with
shared extents
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.
For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:
1) Task A starts an fsync on inode A;
2) Task B starts an fsync on inode B;
3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;
4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());
5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;
6) Task B releases the path;
7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;
8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;
9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at an of offset X + 64KiB, so it leaves that item
untouched.
Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.
So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.
Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the chekcsums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:
27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree"
Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.
This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:
BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---
In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.
Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:
BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5afeb17a3f1a..30ce7039bc27 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1167,6 +1167,9 @@ struct btrfs_root {
/* Record pairs of swapped blocks for qgroup */
struct btrfs_qgroup_swapped_blocks swapped_blocks;
+ /* Used only by log trees, when logging csum items */
+ struct extent_io_tree log_csum_range;
+
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
u64 alloc_bytenr;
#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f2f2864f5978..f8ec2d8606fd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1133,9 +1133,12 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
root->log_transid = 0;
root->log_transid_committed = -1;
root->last_log_commit = 0;
- if (!dummy)
+ if (!dummy) {
extent_io_tree_init(fs_info, &root->dirty_log_pages,
IO_TREE_ROOT_DIRTY_LOG_PAGES, NULL);
+ extent_io_tree_init(fs_info, &root->log_csum_range,
+ IO_TREE_LOG_CSUM_RANGE, NULL);
+ }
memset(&root->root_key, 0, sizeof(root->root_key));
memset(&root->root_item, 0, sizeof(root->root_item));
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index b4a7bad3e82e..b6561455b3c4 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -44,6 +44,7 @@ enum {
IO_TREE_TRANS_DIRTY_PAGES,
IO_TREE_ROOT_DIRTY_LOG_PAGES,
IO_TREE_INODE_FILE_EXTENT,
+ IO_TREE_LOG_CSUM_RANGE,
IO_TREE_SELFTEST,
};
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 67fa7087f707..920cee312f4e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3290,6 +3290,7 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
+ extent_io_tree_release(&log->log_csum_range);
btrfs_put_root(log);
}
@@ -3903,8 +3904,20 @@ static int log_csums(struct btrfs_trans_handle *trans,
struct btrfs_root *log_root,
struct btrfs_ordered_sum *sums)
{
+ const u64 lock_end = sums->bytenr + sums->len - 1;
+ struct extent_state *cached_state = NULL;
int ret;
+ /*
+ * Serialize logging for checksums. This is to avoid racing with the
+ * same checksum being logged by another task that is logging another
+ * file which happens to refer to the same extent as well. Such races
+ * can leave checksum items in the log with overlapping ranges.
+ */
+ ret = lock_extent_bits(&log_root->log_csum_range, sums->bytenr,
+ lock_end, &cached_state);
+ if (ret)
+ return ret;
/*
* Due to extent cloning, we might have logged a csum item that covers a
* subrange of a cloned extent, and later we can end up logging a csum
@@ -3915,10 +3928,13 @@ static int log_csums(struct btrfs_trans_handle *trans,
* trim and adjust) any existing csum items in the log for this range.
*/
ret = btrfs_del_csums(trans, log_root, sums->bytenr, sums->len);
- if (ret)
- return ret;
+ if (!ret)
+ ret = btrfs_csum_file_blocks(trans, log_root, sums);
- return btrfs_csum_file_blocks(trans, log_root, sums);
+ unlock_extent_cached(&log_root->log_csum_range, sums->bytenr, lock_end,
+ &cached_state);
+
+ return ret;
}
static noinline int copy_items(struct btrfs_trans_handle *trans,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index bcbc763b8814..360b0f9d2220 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -89,6 +89,7 @@ TRACE_DEFINE_ENUM(COMMIT_TRANS);
{ IO_TREE_TRANS_DIRTY_PAGES, "TRANS_DIRTY_PAGES" }, \
{ IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES" }, \
{ IO_TREE_INODE_FILE_EXTENT, "INODE_FILE_EXTENT" }, \
+ { IO_TREE_LOG_CSUM_RANGE, "LOG_CSUM_RANGE" }, \
{ IO_TREE_SELFTEST, "SELFTEST" })
#define BTRFS_GROUP_FLAGS \
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e289f03ea79bbc6574b78ac25682555423a91cbb Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 18 May 2020 12:14:50 +0100
Subject: [PATCH] btrfs: fix corrupt log due to concurrent fsync of inodes with
shared extents
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.
For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:
1) Task A starts an fsync on inode A;
2) Task B starts an fsync on inode B;
3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;
4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());
5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;
6) Task B releases the path;
7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;
8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;
9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at an of offset X + 64KiB, so it leaves that item
untouched.
Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.
So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.
Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the chekcsums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:
27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree"
Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.
This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:
BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---
In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.
Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:
BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5afeb17a3f1a..30ce7039bc27 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1167,6 +1167,9 @@ struct btrfs_root {
/* Record pairs of swapped blocks for qgroup */
struct btrfs_qgroup_swapped_blocks swapped_blocks;
+ /* Used only by log trees, when logging csum items */
+ struct extent_io_tree log_csum_range;
+
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
u64 alloc_bytenr;
#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f2f2864f5978..f8ec2d8606fd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1133,9 +1133,12 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
root->log_transid = 0;
root->log_transid_committed = -1;
root->last_log_commit = 0;
- if (!dummy)
+ if (!dummy) {
extent_io_tree_init(fs_info, &root->dirty_log_pages,
IO_TREE_ROOT_DIRTY_LOG_PAGES, NULL);
+ extent_io_tree_init(fs_info, &root->log_csum_range,
+ IO_TREE_LOG_CSUM_RANGE, NULL);
+ }
memset(&root->root_key, 0, sizeof(root->root_key));
memset(&root->root_item, 0, sizeof(root->root_item));
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index b4a7bad3e82e..b6561455b3c4 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -44,6 +44,7 @@ enum {
IO_TREE_TRANS_DIRTY_PAGES,
IO_TREE_ROOT_DIRTY_LOG_PAGES,
IO_TREE_INODE_FILE_EXTENT,
+ IO_TREE_LOG_CSUM_RANGE,
IO_TREE_SELFTEST,
};
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 67fa7087f707..920cee312f4e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3290,6 +3290,7 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
+ extent_io_tree_release(&log->log_csum_range);
btrfs_put_root(log);
}
@@ -3903,8 +3904,20 @@ static int log_csums(struct btrfs_trans_handle *trans,
struct btrfs_root *log_root,
struct btrfs_ordered_sum *sums)
{
+ const u64 lock_end = sums->bytenr + sums->len - 1;
+ struct extent_state *cached_state = NULL;
int ret;
+ /*
+ * Serialize logging for checksums. This is to avoid racing with the
+ * same checksum being logged by another task that is logging another
+ * file which happens to refer to the same extent as well. Such races
+ * can leave checksum items in the log with overlapping ranges.
+ */
+ ret = lock_extent_bits(&log_root->log_csum_range, sums->bytenr,
+ lock_end, &cached_state);
+ if (ret)
+ return ret;
/*
* Due to extent cloning, we might have logged a csum item that covers a
* subrange of a cloned extent, and later we can end up logging a csum
@@ -3915,10 +3928,13 @@ static int log_csums(struct btrfs_trans_handle *trans,
* trim and adjust) any existing csum items in the log for this range.
*/
ret = btrfs_del_csums(trans, log_root, sums->bytenr, sums->len);
- if (ret)
- return ret;
+ if (!ret)
+ ret = btrfs_csum_file_blocks(trans, log_root, sums);
- return btrfs_csum_file_blocks(trans, log_root, sums);
+ unlock_extent_cached(&log_root->log_csum_range, sums->bytenr, lock_end,
+ &cached_state);
+
+ return ret;
}
static noinline int copy_items(struct btrfs_trans_handle *trans,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index bcbc763b8814..360b0f9d2220 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -89,6 +89,7 @@ TRACE_DEFINE_ENUM(COMMIT_TRANS);
{ IO_TREE_TRANS_DIRTY_PAGES, "TRANS_DIRTY_PAGES" }, \
{ IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES" }, \
{ IO_TREE_INODE_FILE_EXTENT, "INODE_FILE_EXTENT" }, \
+ { IO_TREE_LOG_CSUM_RANGE, "LOG_CSUM_RANGE" }, \
{ IO_TREE_SELFTEST, "SELFTEST" })
#define BTRFS_GROUP_FLAGS \
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e289f03ea79bbc6574b78ac25682555423a91cbb Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 18 May 2020 12:14:50 +0100
Subject: [PATCH] btrfs: fix corrupt log due to concurrent fsync of inodes with
shared extents
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.
For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:
1) Task A starts an fsync on inode A;
2) Task B starts an fsync on inode B;
3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;
4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());
5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;
6) Task B releases the path;
7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;
8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;
9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at an of offset X + 64KiB, so it leaves that item
untouched.
Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.
So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.
Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the chekcsums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:
27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree"
Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.
This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:
BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---
In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.
Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:
BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5afeb17a3f1a..30ce7039bc27 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1167,6 +1167,9 @@ struct btrfs_root {
/* Record pairs of swapped blocks for qgroup */
struct btrfs_qgroup_swapped_blocks swapped_blocks;
+ /* Used only by log trees, when logging csum items */
+ struct extent_io_tree log_csum_range;
+
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
u64 alloc_bytenr;
#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f2f2864f5978..f8ec2d8606fd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1133,9 +1133,12 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
root->log_transid = 0;
root->log_transid_committed = -1;
root->last_log_commit = 0;
- if (!dummy)
+ if (!dummy) {
extent_io_tree_init(fs_info, &root->dirty_log_pages,
IO_TREE_ROOT_DIRTY_LOG_PAGES, NULL);
+ extent_io_tree_init(fs_info, &root->log_csum_range,
+ IO_TREE_LOG_CSUM_RANGE, NULL);
+ }
memset(&root->root_key, 0, sizeof(root->root_key));
memset(&root->root_item, 0, sizeof(root->root_item));
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index b4a7bad3e82e..b6561455b3c4 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -44,6 +44,7 @@ enum {
IO_TREE_TRANS_DIRTY_PAGES,
IO_TREE_ROOT_DIRTY_LOG_PAGES,
IO_TREE_INODE_FILE_EXTENT,
+ IO_TREE_LOG_CSUM_RANGE,
IO_TREE_SELFTEST,
};
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 67fa7087f707..920cee312f4e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3290,6 +3290,7 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
+ extent_io_tree_release(&log->log_csum_range);
btrfs_put_root(log);
}
@@ -3903,8 +3904,20 @@ static int log_csums(struct btrfs_trans_handle *trans,
struct btrfs_root *log_root,
struct btrfs_ordered_sum *sums)
{
+ const u64 lock_end = sums->bytenr + sums->len - 1;
+ struct extent_state *cached_state = NULL;
int ret;
+ /*
+ * Serialize logging for checksums. This is to avoid racing with the
+ * same checksum being logged by another task that is logging another
+ * file which happens to refer to the same extent as well. Such races
+ * can leave checksum items in the log with overlapping ranges.
+ */
+ ret = lock_extent_bits(&log_root->log_csum_range, sums->bytenr,
+ lock_end, &cached_state);
+ if (ret)
+ return ret;
/*
* Due to extent cloning, we might have logged a csum item that covers a
* subrange of a cloned extent, and later we can end up logging a csum
@@ -3915,10 +3928,13 @@ static int log_csums(struct btrfs_trans_handle *trans,
* trim and adjust) any existing csum items in the log for this range.
*/
ret = btrfs_del_csums(trans, log_root, sums->bytenr, sums->len);
- if (ret)
- return ret;
+ if (!ret)
+ ret = btrfs_csum_file_blocks(trans, log_root, sums);
- return btrfs_csum_file_blocks(trans, log_root, sums);
+ unlock_extent_cached(&log_root->log_csum_range, sums->bytenr, lock_end,
+ &cached_state);
+
+ return ret;
}
static noinline int copy_items(struct btrfs_trans_handle *trans,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index bcbc763b8814..360b0f9d2220 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -89,6 +89,7 @@ TRACE_DEFINE_ENUM(COMMIT_TRANS);
{ IO_TREE_TRANS_DIRTY_PAGES, "TRANS_DIRTY_PAGES" }, \
{ IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES" }, \
{ IO_TREE_INODE_FILE_EXTENT, "INODE_FILE_EXTENT" }, \
+ { IO_TREE_LOG_CSUM_RANGE, "LOG_CSUM_RANGE" }, \
{ IO_TREE_SELFTEST, "SELFTEST" })
#define BTRFS_GROUP_FLAGS \
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e289f03ea79bbc6574b78ac25682555423a91cbb Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 18 May 2020 12:14:50 +0100
Subject: [PATCH] btrfs: fix corrupt log due to concurrent fsync of inodes with
shared extents
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.
For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:
1) Task A starts an fsync on inode A;
2) Task B starts an fsync on inode B;
3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;
4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());
5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;
6) Task B releases the path;
7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;
8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;
9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at an of offset X + 64KiB, so it leaves that item
untouched.
Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.
So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.
Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the chekcsums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:
27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree"
Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.
This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:
BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---
In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.
Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:
BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5afeb17a3f1a..30ce7039bc27 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1167,6 +1167,9 @@ struct btrfs_root {
/* Record pairs of swapped blocks for qgroup */
struct btrfs_qgroup_swapped_blocks swapped_blocks;
+ /* Used only by log trees, when logging csum items */
+ struct extent_io_tree log_csum_range;
+
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
u64 alloc_bytenr;
#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f2f2864f5978..f8ec2d8606fd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1133,9 +1133,12 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
root->log_transid = 0;
root->log_transid_committed = -1;
root->last_log_commit = 0;
- if (!dummy)
+ if (!dummy) {
extent_io_tree_init(fs_info, &root->dirty_log_pages,
IO_TREE_ROOT_DIRTY_LOG_PAGES, NULL);
+ extent_io_tree_init(fs_info, &root->log_csum_range,
+ IO_TREE_LOG_CSUM_RANGE, NULL);
+ }
memset(&root->root_key, 0, sizeof(root->root_key));
memset(&root->root_item, 0, sizeof(root->root_item));
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index b4a7bad3e82e..b6561455b3c4 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -44,6 +44,7 @@ enum {
IO_TREE_TRANS_DIRTY_PAGES,
IO_TREE_ROOT_DIRTY_LOG_PAGES,
IO_TREE_INODE_FILE_EXTENT,
+ IO_TREE_LOG_CSUM_RANGE,
IO_TREE_SELFTEST,
};
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 67fa7087f707..920cee312f4e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3290,6 +3290,7 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
+ extent_io_tree_release(&log->log_csum_range);
btrfs_put_root(log);
}
@@ -3903,8 +3904,20 @@ static int log_csums(struct btrfs_trans_handle *trans,
struct btrfs_root *log_root,
struct btrfs_ordered_sum *sums)
{
+ const u64 lock_end = sums->bytenr + sums->len - 1;
+ struct extent_state *cached_state = NULL;
int ret;
+ /*
+ * Serialize logging for checksums. This is to avoid racing with the
+ * same checksum being logged by another task that is logging another
+ * file which happens to refer to the same extent as well. Such races
+ * can leave checksum items in the log with overlapping ranges.
+ */
+ ret = lock_extent_bits(&log_root->log_csum_range, sums->bytenr,
+ lock_end, &cached_state);
+ if (ret)
+ return ret;
/*
* Due to extent cloning, we might have logged a csum item that covers a
* subrange of a cloned extent, and later we can end up logging a csum
@@ -3915,10 +3928,13 @@ static int log_csums(struct btrfs_trans_handle *trans,
* trim and adjust) any existing csum items in the log for this range.
*/
ret = btrfs_del_csums(trans, log_root, sums->bytenr, sums->len);
- if (ret)
- return ret;
+ if (!ret)
+ ret = btrfs_csum_file_blocks(trans, log_root, sums);
- return btrfs_csum_file_blocks(trans, log_root, sums);
+ unlock_extent_cached(&log_root->log_csum_range, sums->bytenr, lock_end,
+ &cached_state);
+
+ return ret;
}
static noinline int copy_items(struct btrfs_trans_handle *trans,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index bcbc763b8814..360b0f9d2220 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -89,6 +89,7 @@ TRACE_DEFINE_ENUM(COMMIT_TRANS);
{ IO_TREE_TRANS_DIRTY_PAGES, "TRANS_DIRTY_PAGES" }, \
{ IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES" }, \
{ IO_TREE_INODE_FILE_EXTENT, "INODE_FILE_EXTENT" }, \
+ { IO_TREE_LOG_CSUM_RANGE, "LOG_CSUM_RANGE" }, \
{ IO_TREE_SELFTEST, "SELFTEST" })
#define BTRFS_GROUP_FLAGS \
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e289f03ea79bbc6574b78ac25682555423a91cbb Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 18 May 2020 12:14:50 +0100
Subject: [PATCH] btrfs: fix corrupt log due to concurrent fsync of inodes with
shared extents
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.
For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:
1) Task A starts an fsync on inode A;
2) Task B starts an fsync on inode B;
3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;
4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());
5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;
6) Task B releases the path;
7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;
8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;
9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at an of offset X + 64KiB, so it leaves that item
untouched.
Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.
So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.
Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the chekcsums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:
27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree"
Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.
This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:
BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---
In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.
Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:
BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5afeb17a3f1a..30ce7039bc27 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1167,6 +1167,9 @@ struct btrfs_root {
/* Record pairs of swapped blocks for qgroup */
struct btrfs_qgroup_swapped_blocks swapped_blocks;
+ /* Used only by log trees, when logging csum items */
+ struct extent_io_tree log_csum_range;
+
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
u64 alloc_bytenr;
#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f2f2864f5978..f8ec2d8606fd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1133,9 +1133,12 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
root->log_transid = 0;
root->log_transid_committed = -1;
root->last_log_commit = 0;
- if (!dummy)
+ if (!dummy) {
extent_io_tree_init(fs_info, &root->dirty_log_pages,
IO_TREE_ROOT_DIRTY_LOG_PAGES, NULL);
+ extent_io_tree_init(fs_info, &root->log_csum_range,
+ IO_TREE_LOG_CSUM_RANGE, NULL);
+ }
memset(&root->root_key, 0, sizeof(root->root_key));
memset(&root->root_item, 0, sizeof(root->root_item));
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index b4a7bad3e82e..b6561455b3c4 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -44,6 +44,7 @@ enum {
IO_TREE_TRANS_DIRTY_PAGES,
IO_TREE_ROOT_DIRTY_LOG_PAGES,
IO_TREE_INODE_FILE_EXTENT,
+ IO_TREE_LOG_CSUM_RANGE,
IO_TREE_SELFTEST,
};
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 67fa7087f707..920cee312f4e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3290,6 +3290,7 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
+ extent_io_tree_release(&log->log_csum_range);
btrfs_put_root(log);
}
@@ -3903,8 +3904,20 @@ static int log_csums(struct btrfs_trans_handle *trans,
struct btrfs_root *log_root,
struct btrfs_ordered_sum *sums)
{
+ const u64 lock_end = sums->bytenr + sums->len - 1;
+ struct extent_state *cached_state = NULL;
int ret;
+ /*
+ * Serialize logging for checksums. This is to avoid racing with the
+ * same checksum being logged by another task that is logging another
+ * file which happens to refer to the same extent as well. Such races
+ * can leave checksum items in the log with overlapping ranges.
+ */
+ ret = lock_extent_bits(&log_root->log_csum_range, sums->bytenr,
+ lock_end, &cached_state);
+ if (ret)
+ return ret;
/*
* Due to extent cloning, we might have logged a csum item that covers a
* subrange of a cloned extent, and later we can end up logging a csum
@@ -3915,10 +3928,13 @@ static int log_csums(struct btrfs_trans_handle *trans,
* trim and adjust) any existing csum items in the log for this range.
*/
ret = btrfs_del_csums(trans, log_root, sums->bytenr, sums->len);
- if (ret)
- return ret;
+ if (!ret)
+ ret = btrfs_csum_file_blocks(trans, log_root, sums);
- return btrfs_csum_file_blocks(trans, log_root, sums);
+ unlock_extent_cached(&log_root->log_csum_range, sums->bytenr, lock_end,
+ &cached_state);
+
+ return ret;
}
static noinline int copy_items(struct btrfs_trans_handle *trans,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index bcbc763b8814..360b0f9d2220 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -89,6 +89,7 @@ TRACE_DEFINE_ENUM(COMMIT_TRANS);
{ IO_TREE_TRANS_DIRTY_PAGES, "TRANS_DIRTY_PAGES" }, \
{ IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES" }, \
{ IO_TREE_INODE_FILE_EXTENT, "INODE_FILE_EXTENT" }, \
+ { IO_TREE_LOG_CSUM_RANGE, "LOG_CSUM_RANGE" }, \
{ IO_TREE_SELFTEST, "SELFTEST" })
#define BTRFS_GROUP_FLAGS \
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 6d3113a193e3385c72240096fe397618ecab6e43 Mon Sep 17 00:00:00 2001
From: Omar Sandoval <osandov(a)fb.com>
Date: Thu, 16 Apr 2020 14:46:12 -0700
Subject: [PATCH] btrfs: fix error handling when submitting direct I/O bio
In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
stripe or chunk, we submit orig_bio without cloning it. In this case, we
don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
decrement pending_bios to -1, and we never complete orig_bio. Fix it by
initializing pending_bios to 1 instead of incrementing later.
Fixing this exposes another bug: we put orig_bio prematurely and then
put it again from end_io. Fix it by not putting orig_bio.
After this change, pending_bios is really more of a reference count, but
I'll leave that cleanup separate to keep the fix small.
Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Omar Sandoval <osandov(a)fb.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 259239b33370..b628c319a5b6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7939,7 +7939,6 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
/* bio split */
ASSERT(geom.len <= INT_MAX);
- atomic_inc(&dip->pending_bios);
do {
clone_len = min_t(int, submit_len, geom.len);
@@ -7989,7 +7988,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
if (!status)
return 0;
- bio_put(bio);
+ if (bio != orig_bio)
+ bio_put(bio);
out_err:
dip->errors = 1;
/*
@@ -8030,7 +8030,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
bio->bi_private = dip;
dip->orig_bio = bio;
dip->dio_bio = dio_bio;
- atomic_set(&dip->pending_bios, 0);
+ atomic_set(&dip->pending_bios, 1);
io_bio = btrfs_io_bio(bio);
io_bio->logical = file_offset;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 6d3113a193e3385c72240096fe397618ecab6e43 Mon Sep 17 00:00:00 2001
From: Omar Sandoval <osandov(a)fb.com>
Date: Thu, 16 Apr 2020 14:46:12 -0700
Subject: [PATCH] btrfs: fix error handling when submitting direct I/O bio
In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
stripe or chunk, we submit orig_bio without cloning it. In this case, we
don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
decrement pending_bios to -1, and we never complete orig_bio. Fix it by
initializing pending_bios to 1 instead of incrementing later.
Fixing this exposes another bug: we put orig_bio prematurely and then
put it again from end_io. Fix it by not putting orig_bio.
After this change, pending_bios is really more of a reference count, but
I'll leave that cleanup separate to keep the fix small.
Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Omar Sandoval <osandov(a)fb.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 259239b33370..b628c319a5b6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7939,7 +7939,6 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
/* bio split */
ASSERT(geom.len <= INT_MAX);
- atomic_inc(&dip->pending_bios);
do {
clone_len = min_t(int, submit_len, geom.len);
@@ -7989,7 +7988,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
if (!status)
return 0;
- bio_put(bio);
+ if (bio != orig_bio)
+ bio_put(bio);
out_err:
dip->errors = 1;
/*
@@ -8030,7 +8030,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
bio->bi_private = dip;
dip->orig_bio = bio;
dip->dio_bio = dio_bio;
- atomic_set(&dip->pending_bios, 0);
+ atomic_set(&dip->pending_bios, 1);
io_bio = btrfs_io_bio(bio);
io_bio->logical = file_offset;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 6d3113a193e3385c72240096fe397618ecab6e43 Mon Sep 17 00:00:00 2001
From: Omar Sandoval <osandov(a)fb.com>
Date: Thu, 16 Apr 2020 14:46:12 -0700
Subject: [PATCH] btrfs: fix error handling when submitting direct I/O bio
In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
stripe or chunk, we submit orig_bio without cloning it. In this case, we
don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
decrement pending_bios to -1, and we never complete orig_bio. Fix it by
initializing pending_bios to 1 instead of incrementing later.
Fixing this exposes another bug: we put orig_bio prematurely and then
put it again from end_io. Fix it by not putting orig_bio.
After this change, pending_bios is really more of a reference count, but
I'll leave that cleanup separate to keep the fix small.
Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Omar Sandoval <osandov(a)fb.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 259239b33370..b628c319a5b6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7939,7 +7939,6 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
/* bio split */
ASSERT(geom.len <= INT_MAX);
- atomic_inc(&dip->pending_bios);
do {
clone_len = min_t(int, submit_len, geom.len);
@@ -7989,7 +7988,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
if (!status)
return 0;
- bio_put(bio);
+ if (bio != orig_bio)
+ bio_put(bio);
out_err:
dip->errors = 1;
/*
@@ -8030,7 +8030,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
bio->bi_private = dip;
dip->orig_bio = bio;
dip->dio_bio = dio_bio;
- atomic_set(&dip->pending_bios, 0);
+ atomic_set(&dip->pending_bios, 1);
io_bio = btrfs_io_bio(bio);
io_bio->logical = file_offset;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 6d3113a193e3385c72240096fe397618ecab6e43 Mon Sep 17 00:00:00 2001
From: Omar Sandoval <osandov(a)fb.com>
Date: Thu, 16 Apr 2020 14:46:12 -0700
Subject: [PATCH] btrfs: fix error handling when submitting direct I/O bio
In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
stripe or chunk, we submit orig_bio without cloning it. In this case, we
don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
decrement pending_bios to -1, and we never complete orig_bio. Fix it by
initializing pending_bios to 1 instead of incrementing later.
Fixing this exposes another bug: we put orig_bio prematurely and then
put it again from end_io. Fix it by not putting orig_bio.
After this change, pending_bios is really more of a reference count, but
I'll leave that cleanup separate to keep the fix small.
Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Omar Sandoval <osandov(a)fb.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 259239b33370..b628c319a5b6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7939,7 +7939,6 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
/* bio split */
ASSERT(geom.len <= INT_MAX);
- atomic_inc(&dip->pending_bios);
do {
clone_len = min_t(int, submit_len, geom.len);
@@ -7989,7 +7988,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
if (!status)
return 0;
- bio_put(bio);
+ if (bio != orig_bio)
+ bio_put(bio);
out_err:
dip->errors = 1;
/*
@@ -8030,7 +8030,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
bio->bi_private = dip;
dip->orig_bio = bio;
dip->dio_bio = dio_bio;
- atomic_set(&dip->pending_bios, 0);
+ atomic_set(&dip->pending_bios, 1);
io_bio = btrfs_io_bio(bio);
io_bio->logical = file_offset;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 2473d24f2b77da0ffabcbb916793e58e7f57440b Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 8 May 2020 11:01:10 +0100
Subject: [PATCH] btrfs: fix a race between scrub and block group
removal/allocation
When scrub is verifying the extents of a block group for a device, it is
possible that the corresponding block group gets removed and its logical
address and device extents get used for a new block group allocation.
When this happens scrub incorrectly reports that errors were detected
and, if the the new block group has a different profile then the old one,
deleted block group, we can crash due to a null pointer dereference.
Possibly other unexpected and weird consequences can happen as well.
Consider the following sequence of actions that leads to the null pointer
dereference crash when scrub is running in parallel with balance:
1) Balance sets block group X to read-only mode and starts relocating it.
Block group X is a metadata block group, has a raid1 profile (two
device extents, each one in a different device) and a logical address
of 19424870400;
2) Scrub is running and finds device extent E, which belongs to block
group X. It enters scrub_stripe() to find all extents allocated to
block group X, the search is done using the extent tree;
3) Balance finishes relocating block group X and removes block group X;
4) Balance starts relocating another block group and when trying to
commit the current transaction as part of the preparation step
(prepare_to_relocate()), it blocks because scrub is running;
5) The scrub task finds the metadata extent at the logical address
19425001472 and marks the pages of the extent to be read by a bio
(struct scrub_bio). The extent item's flags, which have the bit
BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
scrub_page). It is these flags in the scrub pages that tells the
bio's end io function (scrub_bio_end_io_worker) which type of extent
it is dealing with. At this point we end up with 4 pages in a bio
which is ready for submission (the metadata extent has a size of
16Kb, so that gives 4 pages on x86);
6) At the next iteration of scrub_stripe(), scrub checks that there is a
pause request from the relocation task trying to commit a transaction,
therefore it submits the pending bio and pauses, waiting for the
transaction commit to complete before resuming;
7) The relocation task commits the transaction. The device extent E, that
was used by our block group X, is now available for allocation, since
the commit root for the device tree was swapped by the transaction
commit;
8) Another task doing a direct IO write allocates a new data block group Y
which ends using device extent E. This new block group Y also ends up
getting the same logical address that block group X had: 19424870400.
This happens because block group X was the block group with the highest
logical address and, when allocating Y, find_next_chunk() returns the
end offset of the current last block group to be used as the logical
address for the new block group, which is
18351128576 + 1073741824 = 19424870400
So our new block group Y has the same logical address and device extent
that block group X had. However Y is a data block group, while X was
a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
9) After allocating block group Y, the direct IO submits a bio to write
to device extent E;
10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
extent E, which now correspond to the data written by the task that
did a direct IO write. Then at the end io function associated with
the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
which calls scrub_checksum(). This later function checks the flags
of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
is set in the flags, so it assumes it has a metadata extent and
then calls scrub_checksum_tree_block(). That functions returns an
error, since interpreting data as a metadata extent causes the
checksum verification to fail.
So this makes scrub_checksum() call scrub_handle_errored_block(),
which determines 'failed_mirror_index' to be 1, since the device
extent E was allocated as the second mirror of block group X.
It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
'sblocks_for_recheck', and all the memory is initialized to zeroes by
kcalloc().
After that it calls scrub_setup_recheck_block(), which is responsible
for filling each of those structures. However, when that function
calls btrfs_map_sblock() against the logical address of the metadata
extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
the current block group Y. However block group Y has a raid0 profile
and not a raid1 profile like X had, so the following call returns 1:
scrub_nr_raid_mirrors(bbio)
And as a result scrub_setup_recheck_block() only initializes the
first (index 0) scrub_block structure in 'sblocks_for_recheck'.
Then scrub_recheck_block() is called by scrub_handle_errored_block()
with the second (index 1) scrub_block structure as the argument,
because 'failed_mirror_index' was previously set to 1.
This scrub_block was not initialized by scrub_setup_recheck_block(),
so it has zero pages, its 'page_count' member is 0 and its 'pagev'
page array has all members pointing to NULL.
Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
we have a NULL pointer dereference when accessing the flags of the first
page, as pavev[0] is NULL:
static void scrub_recheck_block_checksum(struct scrub_block *sblock)
{
(...)
if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
scrub_checksum_data(sblock);
(...)
}
Producing a stack trace like the following:
[542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
[542998.010238] #PF: supervisor read access in kernel mode
[542998.010878] #PF: error_code(0x0000) - not-present page
[542998.011516] PGD 0 P4D 0
[542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1
[542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
[542998.018474] Code: 4c 89 e6 ...
[542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
[542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
[542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
[542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
[542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
[542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
[542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
[542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
[542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[542998.032380] Call Trace:
[542998.032752] scrub_recheck_block+0x162/0x400 [btrfs]
[542998.033500] ? __alloc_pages_nodemask+0x31e/0x460
[542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
[542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs]
[542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs]
[542998.036735] process_one_work+0x26d/0x6a0
[542998.037275] worker_thread+0x4f/0x3e0
[542998.037740] ? process_one_work+0x6a0/0x6a0
[542998.038378] kthread+0x103/0x140
[542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70
[542998.039419] ret_from_fork+0x3a/0x50
[542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
[542998.047288] CR2: 0000000000000028
[542998.047724] ---[ end trace bde186e176c7f96a ]---
This issue has been around for a long time, possibly since scrub exists.
The last time I ran into it was over 2 years ago. After recently fixing
fstests to pass the "--full-balance" command line option to btrfs-progs
when doing balance, several tests started to more heavily exercise balance
with fsstress, scrub and other operations in parallel, and therefore
started to hit this issue again (with btrfs/061 for example).
Fix this by having scrub increment the 'trimming' counter of the block
group, which pins the block group in such a way that it guarantees neither
its logical address nor device extents can be reused by future block group
allocations until we decrement the 'trimming' counter. Also make sure that
on each iteration of scrub_stripe() we stop scrubbing the block group if
it was removed already.
A later patch in the series will rename the block group's 'trimming'
counter and its helpers to a more generic name, since now it is not used
exclusively for pinning while trimming anymore.
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index adaf8ab694d5..7c50ac5b6876 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3046,7 +3046,8 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
struct map_lookup *map,
struct btrfs_device *scrub_dev,
- int num, u64 base, u64 length)
+ int num, u64 base, u64 length,
+ struct btrfs_block_group *cache)
{
struct btrfs_path *path, *ppath;
struct btrfs_fs_info *fs_info = sctx->fs_info;
@@ -3284,6 +3285,20 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
break;
}
+ /*
+ * If our block group was removed in the meanwhile, just
+ * stop scrubbing since there is no point in continuing.
+ * Continuing would prevent reusing its device extents
+ * for new block groups for a long time.
+ */
+ spin_lock(&cache->lock);
+ if (cache->removed) {
+ spin_unlock(&cache->lock);
+ ret = 0;
+ goto out;
+ }
+ spin_unlock(&cache->lock);
+
extent = btrfs_item_ptr(l, slot,
struct btrfs_extent_item);
flags = btrfs_extent_flags(l, extent);
@@ -3457,7 +3472,7 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
map->stripes[i].physical == dev_offset) {
ret = scrub_stripe(sctx, map, scrub_dev, i,
- chunk_offset, length);
+ chunk_offset, length, cache);
if (ret)
goto out;
}
@@ -3554,6 +3569,23 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
if (!cache)
goto skip;
+ /*
+ * Make sure that while we are scrubbing the corresponding block
+ * group doesn't get its logical address and its device extents
+ * reused for another block group, which can possibly be of a
+ * different type and different profile. We do this to prevent
+ * false error detections and crashes due to bogus attempts to
+ * repair extents.
+ */
+ spin_lock(&cache->lock);
+ if (cache->removed) {
+ spin_unlock(&cache->lock);
+ btrfs_put_block_group(cache);
+ goto skip;
+ }
+ btrfs_get_block_group_trimming(cache);
+ spin_unlock(&cache->lock);
+
/*
* we need call btrfs_inc_block_group_ro() with scrubs_paused,
* to avoid deadlock caused by:
@@ -3609,6 +3641,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
} else {
btrfs_warn(fs_info,
"failed setting block group ro: %d", ret);
+ btrfs_put_block_group_trimming(cache);
btrfs_put_block_group(cache);
scrub_pause_off(fs_info);
break;
@@ -3695,6 +3728,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
spin_unlock(&cache->lock);
}
+ btrfs_put_block_group_trimming(cache);
btrfs_put_block_group(cache);
if (ret)
break;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 2473d24f2b77da0ffabcbb916793e58e7f57440b Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 8 May 2020 11:01:10 +0100
Subject: [PATCH] btrfs: fix a race between scrub and block group
removal/allocation
When scrub is verifying the extents of a block group for a device, it is
possible that the corresponding block group gets removed and its logical
address and device extents get used for a new block group allocation.
When this happens scrub incorrectly reports that errors were detected
and, if the the new block group has a different profile then the old one,
deleted block group, we can crash due to a null pointer dereference.
Possibly other unexpected and weird consequences can happen as well.
Consider the following sequence of actions that leads to the null pointer
dereference crash when scrub is running in parallel with balance:
1) Balance sets block group X to read-only mode and starts relocating it.
Block group X is a metadata block group, has a raid1 profile (two
device extents, each one in a different device) and a logical address
of 19424870400;
2) Scrub is running and finds device extent E, which belongs to block
group X. It enters scrub_stripe() to find all extents allocated to
block group X, the search is done using the extent tree;
3) Balance finishes relocating block group X and removes block group X;
4) Balance starts relocating another block group and when trying to
commit the current transaction as part of the preparation step
(prepare_to_relocate()), it blocks because scrub is running;
5) The scrub task finds the metadata extent at the logical address
19425001472 and marks the pages of the extent to be read by a bio
(struct scrub_bio). The extent item's flags, which have the bit
BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
scrub_page). It is these flags in the scrub pages that tells the
bio's end io function (scrub_bio_end_io_worker) which type of extent
it is dealing with. At this point we end up with 4 pages in a bio
which is ready for submission (the metadata extent has a size of
16Kb, so that gives 4 pages on x86);
6) At the next iteration of scrub_stripe(), scrub checks that there is a
pause request from the relocation task trying to commit a transaction,
therefore it submits the pending bio and pauses, waiting for the
transaction commit to complete before resuming;
7) The relocation task commits the transaction. The device extent E, that
was used by our block group X, is now available for allocation, since
the commit root for the device tree was swapped by the transaction
commit;
8) Another task doing a direct IO write allocates a new data block group Y
which ends using device extent E. This new block group Y also ends up
getting the same logical address that block group X had: 19424870400.
This happens because block group X was the block group with the highest
logical address and, when allocating Y, find_next_chunk() returns the
end offset of the current last block group to be used as the logical
address for the new block group, which is
18351128576 + 1073741824 = 19424870400
So our new block group Y has the same logical address and device extent
that block group X had. However Y is a data block group, while X was
a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
9) After allocating block group Y, the direct IO submits a bio to write
to device extent E;
10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
extent E, which now correspond to the data written by the task that
did a direct IO write. Then at the end io function associated with
the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
which calls scrub_checksum(). This later function checks the flags
of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
is set in the flags, so it assumes it has a metadata extent and
then calls scrub_checksum_tree_block(). That functions returns an
error, since interpreting data as a metadata extent causes the
checksum verification to fail.
So this makes scrub_checksum() call scrub_handle_errored_block(),
which determines 'failed_mirror_index' to be 1, since the device
extent E was allocated as the second mirror of block group X.
It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
'sblocks_for_recheck', and all the memory is initialized to zeroes by
kcalloc().
After that it calls scrub_setup_recheck_block(), which is responsible
for filling each of those structures. However, when that function
calls btrfs_map_sblock() against the logical address of the metadata
extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
the current block group Y. However block group Y has a raid0 profile
and not a raid1 profile like X had, so the following call returns 1:
scrub_nr_raid_mirrors(bbio)
And as a result scrub_setup_recheck_block() only initializes the
first (index 0) scrub_block structure in 'sblocks_for_recheck'.
Then scrub_recheck_block() is called by scrub_handle_errored_block()
with the second (index 1) scrub_block structure as the argument,
because 'failed_mirror_index' was previously set to 1.
This scrub_block was not initialized by scrub_setup_recheck_block(),
so it has zero pages, its 'page_count' member is 0 and its 'pagev'
page array has all members pointing to NULL.
Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
we have a NULL pointer dereference when accessing the flags of the first
page, as pavev[0] is NULL:
static void scrub_recheck_block_checksum(struct scrub_block *sblock)
{
(...)
if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
scrub_checksum_data(sblock);
(...)
}
Producing a stack trace like the following:
[542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
[542998.010238] #PF: supervisor read access in kernel mode
[542998.010878] #PF: error_code(0x0000) - not-present page
[542998.011516] PGD 0 P4D 0
[542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1
[542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
[542998.018474] Code: 4c 89 e6 ...
[542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
[542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
[542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
[542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
[542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
[542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
[542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
[542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
[542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[542998.032380] Call Trace:
[542998.032752] scrub_recheck_block+0x162/0x400 [btrfs]
[542998.033500] ? __alloc_pages_nodemask+0x31e/0x460
[542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
[542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs]
[542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs]
[542998.036735] process_one_work+0x26d/0x6a0
[542998.037275] worker_thread+0x4f/0x3e0
[542998.037740] ? process_one_work+0x6a0/0x6a0
[542998.038378] kthread+0x103/0x140
[542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70
[542998.039419] ret_from_fork+0x3a/0x50
[542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
[542998.047288] CR2: 0000000000000028
[542998.047724] ---[ end trace bde186e176c7f96a ]---
This issue has been around for a long time, possibly since scrub exists.
The last time I ran into it was over 2 years ago. After recently fixing
fstests to pass the "--full-balance" command line option to btrfs-progs
when doing balance, several tests started to more heavily exercise balance
with fsstress, scrub and other operations in parallel, and therefore
started to hit this issue again (with btrfs/061 for example).
Fix this by having scrub increment the 'trimming' counter of the block
group, which pins the block group in such a way that it guarantees neither
its logical address nor device extents can be reused by future block group
allocations until we decrement the 'trimming' counter. Also make sure that
on each iteration of scrub_stripe() we stop scrubbing the block group if
it was removed already.
A later patch in the series will rename the block group's 'trimming'
counter and its helpers to a more generic name, since now it is not used
exclusively for pinning while trimming anymore.
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index adaf8ab694d5..7c50ac5b6876 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3046,7 +3046,8 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
struct map_lookup *map,
struct btrfs_device *scrub_dev,
- int num, u64 base, u64 length)
+ int num, u64 base, u64 length,
+ struct btrfs_block_group *cache)
{
struct btrfs_path *path, *ppath;
struct btrfs_fs_info *fs_info = sctx->fs_info;
@@ -3284,6 +3285,20 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
break;
}
+ /*
+ * If our block group was removed in the meanwhile, just
+ * stop scrubbing since there is no point in continuing.
+ * Continuing would prevent reusing its device extents
+ * for new block groups for a long time.
+ */
+ spin_lock(&cache->lock);
+ if (cache->removed) {
+ spin_unlock(&cache->lock);
+ ret = 0;
+ goto out;
+ }
+ spin_unlock(&cache->lock);
+
extent = btrfs_item_ptr(l, slot,
struct btrfs_extent_item);
flags = btrfs_extent_flags(l, extent);
@@ -3457,7 +3472,7 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
map->stripes[i].physical == dev_offset) {
ret = scrub_stripe(sctx, map, scrub_dev, i,
- chunk_offset, length);
+ chunk_offset, length, cache);
if (ret)
goto out;
}
@@ -3554,6 +3569,23 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
if (!cache)
goto skip;
+ /*
+ * Make sure that while we are scrubbing the corresponding block
+ * group doesn't get its logical address and its device extents
+ * reused for another block group, which can possibly be of a
+ * different type and different profile. We do this to prevent
+ * false error detections and crashes due to bogus attempts to
+ * repair extents.
+ */
+ spin_lock(&cache->lock);
+ if (cache->removed) {
+ spin_unlock(&cache->lock);
+ btrfs_put_block_group(cache);
+ goto skip;
+ }
+ btrfs_get_block_group_trimming(cache);
+ spin_unlock(&cache->lock);
+
/*
* we need call btrfs_inc_block_group_ro() with scrubs_paused,
* to avoid deadlock caused by:
@@ -3609,6 +3641,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
} else {
btrfs_warn(fs_info,
"failed setting block group ro: %d", ret);
+ btrfs_put_block_group_trimming(cache);
btrfs_put_block_group(cache);
scrub_pause_off(fs_info);
break;
@@ -3695,6 +3728,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
spin_unlock(&cache->lock);
}
+ btrfs_put_block_group_trimming(cache);
btrfs_put_block_group(cache);
if (ret)
break;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 2473d24f2b77da0ffabcbb916793e58e7f57440b Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 8 May 2020 11:01:10 +0100
Subject: [PATCH] btrfs: fix a race between scrub and block group
removal/allocation
When scrub is verifying the extents of a block group for a device, it is
possible that the corresponding block group gets removed and its logical
address and device extents get used for a new block group allocation.
When this happens scrub incorrectly reports that errors were detected
and, if the the new block group has a different profile then the old one,
deleted block group, we can crash due to a null pointer dereference.
Possibly other unexpected and weird consequences can happen as well.
Consider the following sequence of actions that leads to the null pointer
dereference crash when scrub is running in parallel with balance:
1) Balance sets block group X to read-only mode and starts relocating it.
Block group X is a metadata block group, has a raid1 profile (two
device extents, each one in a different device) and a logical address
of 19424870400;
2) Scrub is running and finds device extent E, which belongs to block
group X. It enters scrub_stripe() to find all extents allocated to
block group X, the search is done using the extent tree;
3) Balance finishes relocating block group X and removes block group X;
4) Balance starts relocating another block group and when trying to
commit the current transaction as part of the preparation step
(prepare_to_relocate()), it blocks because scrub is running;
5) The scrub task finds the metadata extent at the logical address
19425001472 and marks the pages of the extent to be read by a bio
(struct scrub_bio). The extent item's flags, which have the bit
BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
scrub_page). It is these flags in the scrub pages that tells the
bio's end io function (scrub_bio_end_io_worker) which type of extent
it is dealing with. At this point we end up with 4 pages in a bio
which is ready for submission (the metadata extent has a size of
16Kb, so that gives 4 pages on x86);
6) At the next iteration of scrub_stripe(), scrub checks that there is a
pause request from the relocation task trying to commit a transaction,
therefore it submits the pending bio and pauses, waiting for the
transaction commit to complete before resuming;
7) The relocation task commits the transaction. The device extent E, that
was used by our block group X, is now available for allocation, since
the commit root for the device tree was swapped by the transaction
commit;
8) Another task doing a direct IO write allocates a new data block group Y
which ends using device extent E. This new block group Y also ends up
getting the same logical address that block group X had: 19424870400.
This happens because block group X was the block group with the highest
logical address and, when allocating Y, find_next_chunk() returns the
end offset of the current last block group to be used as the logical
address for the new block group, which is
18351128576 + 1073741824 = 19424870400
So our new block group Y has the same logical address and device extent
that block group X had. However Y is a data block group, while X was
a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
9) After allocating block group Y, the direct IO submits a bio to write
to device extent E;
10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
extent E, which now correspond to the data written by the task that
did a direct IO write. Then at the end io function associated with
the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
which calls scrub_checksum(). This later function checks the flags
of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
is set in the flags, so it assumes it has a metadata extent and
then calls scrub_checksum_tree_block(). That functions returns an
error, since interpreting data as a metadata extent causes the
checksum verification to fail.
So this makes scrub_checksum() call scrub_handle_errored_block(),
which determines 'failed_mirror_index' to be 1, since the device
extent E was allocated as the second mirror of block group X.
It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
'sblocks_for_recheck', and all the memory is initialized to zeroes by
kcalloc().
After that it calls scrub_setup_recheck_block(), which is responsible
for filling each of those structures. However, when that function
calls btrfs_map_sblock() against the logical address of the metadata
extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
the current block group Y. However block group Y has a raid0 profile
and not a raid1 profile like X had, so the following call returns 1:
scrub_nr_raid_mirrors(bbio)
And as a result scrub_setup_recheck_block() only initializes the
first (index 0) scrub_block structure in 'sblocks_for_recheck'.
Then scrub_recheck_block() is called by scrub_handle_errored_block()
with the second (index 1) scrub_block structure as the argument,
because 'failed_mirror_index' was previously set to 1.
This scrub_block was not initialized by scrub_setup_recheck_block(),
so it has zero pages, its 'page_count' member is 0 and its 'pagev'
page array has all members pointing to NULL.
Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
we have a NULL pointer dereference when accessing the flags of the first
page, as pavev[0] is NULL:
static void scrub_recheck_block_checksum(struct scrub_block *sblock)
{
(...)
if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
scrub_checksum_data(sblock);
(...)
}
Producing a stack trace like the following:
[542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
[542998.010238] #PF: supervisor read access in kernel mode
[542998.010878] #PF: error_code(0x0000) - not-present page
[542998.011516] PGD 0 P4D 0
[542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1
[542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
[542998.018474] Code: 4c 89 e6 ...
[542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
[542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
[542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
[542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
[542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
[542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
[542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
[542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
[542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[542998.032380] Call Trace:
[542998.032752] scrub_recheck_block+0x162/0x400 [btrfs]
[542998.033500] ? __alloc_pages_nodemask+0x31e/0x460
[542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
[542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs]
[542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs]
[542998.036735] process_one_work+0x26d/0x6a0
[542998.037275] worker_thread+0x4f/0x3e0
[542998.037740] ? process_one_work+0x6a0/0x6a0
[542998.038378] kthread+0x103/0x140
[542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70
[542998.039419] ret_from_fork+0x3a/0x50
[542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
[542998.047288] CR2: 0000000000000028
[542998.047724] ---[ end trace bde186e176c7f96a ]---
This issue has been around for a long time, possibly since scrub exists.
The last time I ran into it was over 2 years ago. After recently fixing
fstests to pass the "--full-balance" command line option to btrfs-progs
when doing balance, several tests started to more heavily exercise balance
with fsstress, scrub and other operations in parallel, and therefore
started to hit this issue again (with btrfs/061 for example).
Fix this by having scrub increment the 'trimming' counter of the block
group, which pins the block group in such a way that it guarantees neither
its logical address nor device extents can be reused by future block group
allocations until we decrement the 'trimming' counter. Also make sure that
on each iteration of scrub_stripe() we stop scrubbing the block group if
it was removed already.
A later patch in the series will rename the block group's 'trimming'
counter and its helpers to a more generic name, since now it is not used
exclusively for pinning while trimming anymore.
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index adaf8ab694d5..7c50ac5b6876 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3046,7 +3046,8 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
struct map_lookup *map,
struct btrfs_device *scrub_dev,
- int num, u64 base, u64 length)
+ int num, u64 base, u64 length,
+ struct btrfs_block_group *cache)
{
struct btrfs_path *path, *ppath;
struct btrfs_fs_info *fs_info = sctx->fs_info;
@@ -3284,6 +3285,20 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
break;
}
+ /*
+ * If our block group was removed in the meanwhile, just
+ * stop scrubbing since there is no point in continuing.
+ * Continuing would prevent reusing its device extents
+ * for new block groups for a long time.
+ */
+ spin_lock(&cache->lock);
+ if (cache->removed) {
+ spin_unlock(&cache->lock);
+ ret = 0;
+ goto out;
+ }
+ spin_unlock(&cache->lock);
+
extent = btrfs_item_ptr(l, slot,
struct btrfs_extent_item);
flags = btrfs_extent_flags(l, extent);
@@ -3457,7 +3472,7 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
map->stripes[i].physical == dev_offset) {
ret = scrub_stripe(sctx, map, scrub_dev, i,
- chunk_offset, length);
+ chunk_offset, length, cache);
if (ret)
goto out;
}
@@ -3554,6 +3569,23 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
if (!cache)
goto skip;
+ /*
+ * Make sure that while we are scrubbing the corresponding block
+ * group doesn't get its logical address and its device extents
+ * reused for another block group, which can possibly be of a
+ * different type and different profile. We do this to prevent
+ * false error detections and crashes due to bogus attempts to
+ * repair extents.
+ */
+ spin_lock(&cache->lock);
+ if (cache->removed) {
+ spin_unlock(&cache->lock);
+ btrfs_put_block_group(cache);
+ goto skip;
+ }
+ btrfs_get_block_group_trimming(cache);
+ spin_unlock(&cache->lock);
+
/*
* we need call btrfs_inc_block_group_ro() with scrubs_paused,
* to avoid deadlock caused by:
@@ -3609,6 +3641,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
} else {
btrfs_warn(fs_info,
"failed setting block group ro: %d", ret);
+ btrfs_put_block_group_trimming(cache);
btrfs_put_block_group(cache);
scrub_pause_off(fs_info);
break;
@@ -3695,6 +3728,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
spin_unlock(&cache->lock);
}
+ btrfs_put_block_group_trimming(cache);
btrfs_put_block_group(cache);
if (ret)
break;
From: Chen Yu <yu.c.chen(a)intel.com>
Currently the system will be woken up via WOL(Wake On LAN) even if the
device wakeup ability has been disabled via sysfs:
cat /sys/devices/pci0000:00/0000:00:1f.6/power/wakeup
disabled
The system should not be woken up if the user has explicitly
disabled the wake up ability for this device.
This patch clears the WOL ability of this network device if the
user has disabled the wake up ability in sysfs.
Fixes: bc7f75fa9788 ("[E1000E]: New pci-express e1000 driver")
Reported-by: "Rafael J. Wysocki" <rafael.j.wysocki(a)intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Chen Yu <yu.c.chen(a)intel.com>
Tested-by: Aaron Brown <aaron.f.brown(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/e1000e/netdev.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index a279f4fa9962..e2ad3f38c75c 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -6611,11 +6611,17 @@ static int __e1000_shutdown(struct pci_dev *pdev, bool runtime)
struct net_device *netdev = pci_get_drvdata(pdev);
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
- u32 ctrl, ctrl_ext, rctl, status;
- /* Runtime suspend should only enable wakeup for link changes */
- u32 wufc = runtime ? E1000_WUFC_LNKC : adapter->wol;
+ u32 ctrl, ctrl_ext, rctl, status, wufc;
int retval = 0;
+ /* Runtime suspend should only enable wakeup for link changes */
+ if (runtime)
+ wufc = E1000_WUFC_LNKC;
+ else if (device_may_wakeup(&pdev->dev))
+ wufc = adapter->wol;
+ else
+ wufc = 0;
+
status = er32(STATUS);
if (status & E1000_STATUS_LU)
wufc &= ~E1000_WUFC_LNKC;
@@ -6672,7 +6678,7 @@ static int __e1000_shutdown(struct pci_dev *pdev, bool runtime)
if (adapter->hw.phy.type == e1000_phy_igp_3) {
e1000e_igp3_phy_powerdown_workaround_ich8lan(&adapter->hw);
} else if (hw->mac.type >= e1000_pch_lpt) {
- if (!(wufc & (E1000_WUFC_EX | E1000_WUFC_MC | E1000_WUFC_BC)))
+ if (wufc && !(wufc & (E1000_WUFC_EX | E1000_WUFC_MC | E1000_WUFC_BC)))
/* ULP does not support wake from unicast, multicast
* or broadcast.
*/
--
2.26.2
This is a note to let you know that I've just added the patch titled
usb: typec: tcpci_rt1711h: avoid screaming irq causing boot hangs
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 302c570bf36e997d55ad0d60628a2feec76954a4 Mon Sep 17 00:00:00 2001
From: Li Jun <jun.li(a)nxp.com>
Date: Thu, 4 Jun 2020 19:21:18 +0800
Subject: usb: typec: tcpci_rt1711h: avoid screaming irq causing boot hangs
John reported screaming irq caused by rt1711h when system boot[1],
this is because irq request is done before tcpci_register_port(),
so the chip->tcpci has not been setup, irq handler is entered but
can't do anything, this patch is to address this by moving the irq
request after tcpci_register_port().
[1] https://lore.kernel.org/linux-usb/20200530040157.31038-1-john.stultz@linaro…
Fixes: ce08eaeb6388 ("staging: typec: rt1711h typec chip driver")
Cc: stable <stable(a)vger.kernel.org> # v4.18+
Cc: John Stultz <john.stultz(a)linaro.org>
Reported-and-tested-by: John Stultz <john.stultz(a)linaro.org>
Reviewed-by: Guenter Roeck <linux(a)roeck-us.net>
Reviewed-by: Heikki Krogerus <heikki.krogerus(a)linux.intel.com>
Signed-off-by: Li Jun <jun.li(a)nxp.com>
Link: https://lore.kernel.org/r/20200604112118.38062-1-jun.li@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/typec/tcpm/tcpci_rt1711h.c | 31 +++++++++-----------------
1 file changed, 10 insertions(+), 21 deletions(-)
diff --git a/drivers/usb/typec/tcpm/tcpci_rt1711h.c b/drivers/usb/typec/tcpm/tcpci_rt1711h.c
index 017389021b96..b56a0880a044 100644
--- a/drivers/usb/typec/tcpm/tcpci_rt1711h.c
+++ b/drivers/usb/typec/tcpm/tcpci_rt1711h.c
@@ -179,26 +179,6 @@ static irqreturn_t rt1711h_irq(int irq, void *dev_id)
return tcpci_irq(chip->tcpci);
}
-static int rt1711h_init_alert(struct rt1711h_chip *chip,
- struct i2c_client *client)
-{
- int ret;
-
- /* Disable chip interrupts before requesting irq */
- ret = rt1711h_write16(chip, TCPC_ALERT_MASK, 0);
- if (ret < 0)
- return ret;
-
- ret = devm_request_threaded_irq(chip->dev, client->irq, NULL,
- rt1711h_irq,
- IRQF_ONESHOT | IRQF_TRIGGER_LOW,
- dev_name(chip->dev), chip);
- if (ret < 0)
- return ret;
- enable_irq_wake(client->irq);
- return 0;
-}
-
static int rt1711h_sw_reset(struct rt1711h_chip *chip)
{
int ret;
@@ -260,7 +240,8 @@ static int rt1711h_probe(struct i2c_client *client,
if (ret < 0)
return ret;
- ret = rt1711h_init_alert(chip, client);
+ /* Disable chip interrupts before requesting irq */
+ ret = rt1711h_write16(chip, TCPC_ALERT_MASK, 0);
if (ret < 0)
return ret;
@@ -271,6 +252,14 @@ static int rt1711h_probe(struct i2c_client *client,
if (IS_ERR_OR_NULL(chip->tcpci))
return PTR_ERR(chip->tcpci);
+ ret = devm_request_threaded_irq(chip->dev, client->irq, NULL,
+ rt1711h_irq,
+ IRQF_ONESHOT | IRQF_TRIGGER_LOW,
+ dev_name(chip->dev), chip);
+ if (ret < 0)
+ return ret;
+ enable_irq_wake(client->irq);
+
return 0;
}
--
2.27.0
This is a note to let you know that I've just added the patch titled
USB: ohci-sm501: Add missed iounmap() in remove
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 07c112fb09c86c0231f6ff0061a000ffe91c8eb9 Mon Sep 17 00:00:00 2001
From: Chuhong Yuan <hslester96(a)gmail.com>
Date: Wed, 10 Jun 2020 10:48:44 +0800
Subject: USB: ohci-sm501: Add missed iounmap() in remove
This driver misses calling iounmap() in remove to undo the ioremap()
called in probe.
Add the missed call to fix it.
Fixes: f54aab6ebcec ("usb: ohci-sm501 driver")
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Chuhong Yuan <hslester96(a)gmail.com>
Acked-by: Alan Stern <stern(a)rowland.harvard.edu>
Link: https://lore.kernel.org/r/20200610024844.3628408-1-hslester96@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/host/ohci-sm501.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/usb/host/ohci-sm501.c b/drivers/usb/host/ohci-sm501.c
index cff965240327..b91d50da6127 100644
--- a/drivers/usb/host/ohci-sm501.c
+++ b/drivers/usb/host/ohci-sm501.c
@@ -191,6 +191,7 @@ static int ohci_hcd_sm501_drv_remove(struct platform_device *pdev)
struct resource *mem;
usb_remove_hcd(hcd);
+ iounmap(hcd->regs);
release_mem_region(hcd->rsrc_start, hcd->rsrc_len);
usb_put_hcd(hcd);
mem = platform_get_resource(pdev, IORESOURCE_MEM, 1);
--
2.27.0
This is a note to let you know that I've just added the patch titled
USB: ehci: reopen solution for Synopsys HC bug
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 1ddcb71a3edf0e1682b6e056158e4c4b00325f66 Mon Sep 17 00:00:00 2001
From: Longfang Liu <liulongfang(a)huawei.com>
Date: Mon, 8 Jun 2020 11:46:59 +0800
Subject: USB: ehci: reopen solution for Synopsys HC bug
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
A Synopsys USB2.0 core used in Huawei Kunpeng920 SoC has a bug which
might cause the host controller not issuing ping.
Bug description:
After indicating an Interrupt on Async Advance, the software uses the
doorbell mechanism to delete the Next Link queue head of the last
executed queue head. At this time, the host controller still references
the removed queue head(the queue head is NULL). NULL reference causes
the host controller to lose the USB device.
Solution:
After deleting the Next Link queue head, when has_synopsys_hc_bug set
to 1,the software can write one of the valid queue head addresses to
the ASYNCLISTADDR register to allow the host controller to get
the valid queue head. in order to solve that problem, this patch set
the flag for Huawei Kunpeng920
There are detailed instructions and solutions in this patch:
commit 2f7ac6c19997 ("USB: ehci: add workaround for Synopsys HC bug")
Signed-off-by: Longfang Liu <liulongfang(a)huawei.com>
Cc: stable <stable(a)vger.kernel.org>
Acked-by: Alan Stern <stern(a)rowland.harvard.edu>
Link: https://lore.kernel.org/r/1591588019-44284-1-git-send-email-liulongfang@hua…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/host/ehci-pci.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/usb/host/ehci-pci.c b/drivers/usb/host/ehci-pci.c
index 3c3820ad9092..af3c1b9b38b2 100644
--- a/drivers/usb/host/ehci-pci.c
+++ b/drivers/usb/host/ehci-pci.c
@@ -216,6 +216,13 @@ static int ehci_pci_setup(struct usb_hcd *hcd)
ehci_info(ehci, "applying MosChip frame-index workaround\n");
ehci->frame_index_bug = 1;
break;
+ case PCI_VENDOR_ID_HUAWEI:
+ /* Synopsys HC bug */
+ if (pdev->device == 0xa239) {
+ ehci_info(ehci, "applying Synopsys HC workaround\n");
+ ehci->has_synopsys_hc_bug = 1;
+ }
+ break;
}
/* optional debug port, normally in the first BAR */
--
2.27.0
This is a note to let you know that I've just added the patch titled
usb: add USB_QUIRK_DELAY_INIT for Logitech C922
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 5d8021923e8a8cc37a421a64e27c7221f0fee33c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tomasz=20Meresi=C5=84ski?= <tomasz(a)meresinski.eu>
Date: Wed, 3 Jun 2020 22:33:46 +0200
Subject: usb: add USB_QUIRK_DELAY_INIT for Logitech C922
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The Logitech C922, just like other Logitech webcams,
needs the USB_QUIRK_DELAY_INIT or it will randomly
not respond after device connection
Signed-off-by: Tomasz Meresiński <tomasz(a)meresinski.eu>
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200603203347.7792-1-tomasz@meresinski.eu
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/core/quirks.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/core/quirks.c b/drivers/usb/core/quirks.c
index 3e8efe759c3e..e0b77674869c 100644
--- a/drivers/usb/core/quirks.c
+++ b/drivers/usb/core/quirks.c
@@ -218,11 +218,12 @@ static const struct usb_device_id usb_quirk_list[] = {
/* Logitech HD Webcam C270 */
{ USB_DEVICE(0x046d, 0x0825), .driver_info = USB_QUIRK_RESET_RESUME },
- /* Logitech HD Pro Webcams C920, C920-C, C925e and C930e */
+ /* Logitech HD Pro Webcams C920, C920-C, C922, C925e and C930e */
{ USB_DEVICE(0x046d, 0x082d), .driver_info = USB_QUIRK_DELAY_INIT },
{ USB_DEVICE(0x046d, 0x0841), .driver_info = USB_QUIRK_DELAY_INIT },
{ USB_DEVICE(0x046d, 0x0843), .driver_info = USB_QUIRK_DELAY_INIT },
{ USB_DEVICE(0x046d, 0x085b), .driver_info = USB_QUIRK_DELAY_INIT },
+ { USB_DEVICE(0x046d, 0x085c), .driver_info = USB_QUIRK_DELAY_INIT },
/* Logitech ConferenceCam CC3000e */
{ USB_DEVICE(0x046d, 0x0847), .driver_info = USB_QUIRK_DELAY_INIT },
--
2.27.0
This is a note to let you know that I've just added the patch titled
usb: dwc2: Postponed gadget registration to the udc class driver
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 207324a321a866401b098cadf19e4a2dd6584622 Mon Sep 17 00:00:00 2001
From: Minas Harutyunyan <Minas.Harutyunyan(a)synopsys.com>
Date: Tue, 9 Jun 2020 12:28:11 +0400
Subject: usb: dwc2: Postponed gadget registration to the udc class driver
During dwc2 driver probe, after gadget registration to the udc class
driver, if exist any builtin function driver it immediately bound to
dwc2 and after init host side (dwc2_hcd_init()) stucked in host mode.
Patch postpone gadget registration after host side initialization done.
Fixes: 117777b2c3bb9 ("usb: dwc2: Move gadget probe function into platform code")
Reported-by: kbuild test robot <lkp(a)intel.com>
Tested-by: Marek Vasut <marex(a)denx.de>
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Minas Harutyunyan <hminas(a)synopsys.com>
Link: https://lore.kernel.org/r/f21cb38fecc72a230b86155d94c7e60c9cb66f58.15916909…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/dwc2/gadget.c | 6 ------
drivers/usb/dwc2/platform.c | 11 +++++++++++
2 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/drivers/usb/dwc2/gadget.c b/drivers/usb/dwc2/gadget.c
index 12b98b466287..7faf5f8c056d 100644
--- a/drivers/usb/dwc2/gadget.c
+++ b/drivers/usb/dwc2/gadget.c
@@ -4920,12 +4920,6 @@ int dwc2_gadget_init(struct dwc2_hsotg *hsotg)
epnum, 0);
}
- ret = usb_add_gadget_udc(dev, &hsotg->gadget);
- if (ret) {
- dwc2_hsotg_ep_free_request(&hsotg->eps_out[0]->ep,
- hsotg->ctrl_req);
- return ret;
- }
dwc2_hsotg_dump(hsotg);
return 0;
diff --git a/drivers/usb/dwc2/platform.c b/drivers/usb/dwc2/platform.c
index e571c8ae65ec..c347d93eae64 100644
--- a/drivers/usb/dwc2/platform.c
+++ b/drivers/usb/dwc2/platform.c
@@ -575,6 +575,17 @@ static int dwc2_driver_probe(struct platform_device *dev)
if (hsotg->dr_mode == USB_DR_MODE_PERIPHERAL)
dwc2_lowlevel_hw_disable(hsotg);
+#if IS_ENABLED(CONFIG_USB_DWC2_PERIPHERAL) || \
+ IS_ENABLED(CONFIG_USB_DWC2_DUAL_ROLE)
+ /* Postponed adding a new gadget to the udc class driver list */
+ if (hsotg->gadget_enabled) {
+ retval = usb_add_gadget_udc(hsotg->dev, &hsotg->gadget);
+ if (retval) {
+ dwc2_hsotg_remove(hsotg);
+ goto error_init;
+ }
+ }
+#endif /* CONFIG_USB_DWC2_PERIPHERAL || CONFIG_USB_DWC2_DUAL_ROLE */
return 0;
error_init:
--
2.27.0
This is a note to let you know that I've just added the patch titled
usb: host: ehci-exynos: Fix error check in exynos_ehci_probe()
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 44ed240d62736ad29943ec01e41e194b96f7c5e9 Mon Sep 17 00:00:00 2001
From: Tang Bin <tangbin(a)cmss.chinamobile.com>
Date: Tue, 2 Jun 2020 19:47:08 +0800
Subject: usb: host: ehci-exynos: Fix error check in exynos_ehci_probe()
If the function platform_get_irq() failed, the negative value
returned will not be detected here. So fix error handling in
exynos_ehci_probe(). And when get irq failed, the function
platform_get_irq() logs an error message, so remove redundant
message here.
Fixes: 1bcc5aa87f04 ("USB: Add initial S5P EHCI driver")
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Zhang Shengju <zhangshengju(a)cmss.chinamobile.com>
Signed-off-by: Tang Bin <tangbin(a)cmss.chinamobile.com>
Link: https://lore.kernel.org/r/20200602114708.28620-1-tangbin@cmss.chinamobile.c…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/host/ehci-exynos.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/host/ehci-exynos.c b/drivers/usb/host/ehci-exynos.c
index a4e9abcbdc4f..1a9b7572e17f 100644
--- a/drivers/usb/host/ehci-exynos.c
+++ b/drivers/usb/host/ehci-exynos.c
@@ -203,9 +203,8 @@ static int exynos_ehci_probe(struct platform_device *pdev)
hcd->rsrc_len = resource_size(res);
irq = platform_get_irq(pdev, 0);
- if (!irq) {
- dev_err(&pdev->dev, "Failed to get IRQ\n");
- err = -ENODEV;
+ if (irq < 0) {
+ err = irq;
goto fail_io;
}
--
2.27.0
This is a note to let you know that I've just added the patch titled
usb: typec: mux: intel_pmc_mux: Fix DP alternate mode entry
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 130206a88683d859f63ed6d4a56ab5c2b4930c8e Mon Sep 17 00:00:00 2001
From: Heikki Krogerus <heikki.krogerus(a)linux.intel.com>
Date: Fri, 29 May 2020 16:17:53 +0300
Subject: usb: typec: mux: intel_pmc_mux: Fix DP alternate mode entry
The PMC needs to be notified separately about HPD (hotplug
detected) signal being high after mode entry. There is a bit
"HPD High" in the Alternate Mode Request that the driver
already sets, but that bit is only valid when the
DisplayPort Alternate Mode is directly entered from
disconnected state.
Fixes: 5c4edcdbcd97 ("usb: typec: mux: intel: Fix DP_HPD_LVL bit field")
Signed-off-by: Heikki Krogerus <heikki.krogerus(a)linux.intel.com>
Cc: stable <stable(a)vger.kernel.org>
Tested-by: Prashant Malani <pmalani(a)chromium.org>
Link: https://lore.kernel.org/r/20200529131753.15587-1-heikki.krogerus@linux.inte…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/typec/mux/intel_pmc_mux.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/typec/mux/intel_pmc_mux.c b/drivers/usb/typec/mux/intel_pmc_mux.c
index 962bc69a6a59..70ddc9d6d49e 100644
--- a/drivers/usb/typec/mux/intel_pmc_mux.c
+++ b/drivers/usb/typec/mux/intel_pmc_mux.c
@@ -148,7 +148,8 @@ pmc_usb_mux_dp_hpd(struct pmc_usb_port *port, struct typec_mux_state *state)
msg[0] = PMC_USB_DP_HPD;
msg[0] |= port->usb3_port << PMC_USB_MSG_USB3_PORT_SHIFT;
- msg[1] = PMC_USB_DP_HPD_IRQ;
+ if (data->status & DP_STATUS_IRQ_HPD)
+ msg[1] = PMC_USB_DP_HPD_IRQ;
if (data->status & DP_STATUS_HPD_STATE)
msg[1] |= PMC_USB_DP_HPD_LVL;
@@ -161,6 +162,7 @@ pmc_usb_mux_dp(struct pmc_usb_port *port, struct typec_mux_state *state)
{
struct typec_displayport_data *data = state->data;
struct altmode_req req = { };
+ int ret;
if (data->status & DP_STATUS_IRQ_HPD)
return pmc_usb_mux_dp_hpd(port, state);
@@ -181,7 +183,14 @@ pmc_usb_mux_dp(struct pmc_usb_port *port, struct typec_mux_state *state)
if (data->status & DP_STATUS_HPD_STATE)
req.mode_data |= PMC_USB_ALTMODE_HPD_HIGH;
- return pmc_usb_command(port, (void *)&req, sizeof(req));
+ ret = pmc_usb_command(port, (void *)&req, sizeof(req));
+ if (ret)
+ return ret;
+
+ if (data->status & DP_STATUS_HPD_STATE)
+ return pmc_usb_mux_dp_hpd(port, state);
+
+ return 0;
}
static int
--
2.27.0
From: Shubhrajyoti Datta <shubhrajyoti.datta(a)xilinx.com>
When serial console has been assigned to ttyPS1 (which is serial1 alias)
console index is not updated property and pointing to index -1 (statically
initialized) which ends up in situation where nothing has been printed on
the port.
The commit 18cc7ac8a28e ("Revert "serial: uartps: Register own uart console
and driver structures"") didn't contain this line which was removed by
accident.
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta(a)xilinx.com>
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Michal Simek <michal.simek(a)xilinx.com>
---
Changes in v2:
- Do better commit description
- Origin subject was "tty: xilinx_uartps: Add the id to the console"
Greg: Would be good if you can take this patch to 5.7 and also to stable
trees.
---
drivers/tty/serial/xilinx_uartps.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/tty/serial/xilinx_uartps.c b/drivers/tty/serial/xilinx_uartps.c
index 672cfa075e28..b9d672af8b65 100644
--- a/drivers/tty/serial/xilinx_uartps.c
+++ b/drivers/tty/serial/xilinx_uartps.c
@@ -1465,6 +1465,7 @@ static int cdns_uart_probe(struct platform_device *pdev)
cdns_uart_uart_driver.nr = CDNS_UART_NR_PORTS;
#ifdef CONFIG_SERIAL_XILINX_PS_UART_CONSOLE
cdns_uart_uart_driver.cons = &cdns_uart_console;
+ cdns_uart_console.index = id;
#endif
rc = uart_register_driver(&cdns_uart_uart_driver);
--
2.26.2
Hello,
We ran automated tests on a recent commit from this kernel tree:
Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Commit: 8d8368a974fe - serial: amba-pl011: Make sure we initialize the port.lock spinlock
The results of these automated tests are provided below.
Overall result: PASSED
Merge: OK
Compile: OK
Tests: OK
All kernel binaries, config files, and logs are available for download here:
https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=dataware…
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
,-. ,-.
( C ) ( K ) Continuous
`-',-.`-' Kernel
( I ) Integration
`-'
______________________________________________________________________________
Compile testing
---------------
We compiled the kernel for 4 architectures:
aarch64:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
ppc64le:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
s390x:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
x86_64:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
Hardware testing
----------------
We booted each kernel and ran the following tests:
aarch64:
Host 1:
✅ Boot test
✅ xfstests - ext4
✅ xfstests - xfs
✅ selinux-policy: serge-testsuite
✅ storage: software RAID testing
🚧 ✅ IPMI driver test
🚧 ✅ IPMItool loop stress test
🚧 ✅ Storage blktests
Host 2:
✅ Boot test
✅ Podman system integration test - as root
✅ Podman system integration test - as user
✅ LTP
✅ Loopdev Sanity
✅ Memory function: memfd_create
✅ AMTU (Abstract Machine Test Utility)
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking socket: fuzz
✅ Networking: igmp conformance test
✅ Networking route: pmtu
✅ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - transport
✅ Networking ipsec: basic netns - tunnel
✅ Libkcapi AF_ALG test
✅ pciutils: update pci ids test
✅ ALSA PCM loopback test
✅ ALSA Control (mixer) Userspace Element test
✅ storage: SCSI VPD
🚧 ✅ CIFS Connectathon
🚧 ✅ POSIX pjd-fstest suites
🚧 ✅ jvm - DaCapo Benchmark Suite
🚧 ✅ jvm - jcstress tests
🚧 ✅ Memory function: kaslr
🚧 ✅ Networking firewall: basic netfilter test
🚧 ✅ audit: audit testsuite test
🚧 ✅ trace: ftrace/tracer
🚧 ✅ kdump - kexec_boot
ppc64le:
Host 1:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
✅ Boot test
✅ Podman system integration test - as root
✅ Podman system integration test - as user
⚡⚡⚡ LTP
⚡⚡⚡ Loopdev Sanity
⚡⚡⚡ Memory function: memfd_create
⚡⚡⚡ AMTU (Abstract Machine Test Utility)
⚡⚡⚡ Networking bridge: sanity
⚡⚡⚡ Ethernet drivers sanity
⚡⚡⚡ Networking socket: fuzz
⚡⚡⚡ Networking route: pmtu
⚡⚡⚡ Networking route_func - local
⚡⚡⚡ Networking route_func - forward
⚡⚡⚡ Networking TCP: keepalive test
⚡⚡⚡ Networking UDP: socket
⚡⚡⚡ Networking tunnel: geneve basic test
⚡⚡⚡ Networking tunnel: gre basic
⚡⚡⚡ L2TP basic test
⚡⚡⚡ Networking tunnel: vxlan basic
⚡⚡⚡ Networking ipsec: basic netns - tunnel
⚡⚡⚡ Libkcapi AF_ALG test
⚡⚡⚡ pciutils: update pci ids test
⚡⚡⚡ ALSA PCM loopback test
⚡⚡⚡ ALSA Control (mixer) Userspace Element test
🚧 ⚡⚡⚡ CIFS Connectathon
🚧 ⚡⚡⚡ POSIX pjd-fstest suites
🚧 ⚡⚡⚡ jvm - DaCapo Benchmark Suite
🚧 ⚡⚡⚡ jvm - jcstress tests
🚧 ⚡⚡⚡ Memory function: kaslr
🚧 ⚡⚡⚡ Networking firewall: basic netfilter test
🚧 ⚡⚡⚡ audit: audit testsuite test
🚧 ⚡⚡⚡ trace: ftrace/tracer
Host 2:
✅ Boot test
🚧 ✅ kdump - sysrq-c
Host 3:
✅ Boot test
✅ xfstests - ext4
✅ xfstests - xfs
✅ selinux-policy: serge-testsuite
✅ storage: software RAID testing
🚧 ✅ IPMI driver test
🚧 ✅ IPMItool loop stress test
🚧 ✅ Storage blktests
Host 4:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
✅ Boot test
✅ Podman system integration test - as root
✅ Podman system integration test - as user
⚡⚡⚡ LTP
✅ Loopdev Sanity
✅ Memory function: memfd_create
✅ AMTU (Abstract Machine Test Utility)
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking socket: fuzz
✅ Networking route: pmtu
✅ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - tunnel
✅ Libkcapi AF_ALG test
✅ pciutils: update pci ids test
✅ ALSA PCM loopback test
✅ ALSA Control (mixer) Userspace Element test
🚧 ✅ CIFS Connectathon
🚧 ✅ POSIX pjd-fstest suites
🚧 ✅ jvm - DaCapo Benchmark Suite
🚧 ✅ jvm - jcstress tests
🚧 ✅ Memory function: kaslr
🚧 ✅ Networking firewall: basic netfilter test
🚧 ✅ audit: audit testsuite test
🚧 ✅ trace: ftrace/tracer
s390x:
Host 1:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
✅ Boot test
✅ Podman system integration test - as root
⚡⚡⚡ Podman system integration test - as user
⚡⚡⚡ LTP
⚡⚡⚡ Loopdev Sanity
⚡⚡⚡ Memory function: memfd_create
⚡⚡⚡ Networking bridge: sanity
⚡⚡⚡ Ethernet drivers sanity
⚡⚡⚡ Networking route: pmtu
⚡⚡⚡ Networking route_func - local
⚡⚡⚡ Networking route_func - forward
⚡⚡⚡ Networking TCP: keepalive test
⚡⚡⚡ Networking UDP: socket
⚡⚡⚡ Networking tunnel: geneve basic test
⚡⚡⚡ Networking tunnel: gre basic
⚡⚡⚡ L2TP basic test
⚡⚡⚡ Networking tunnel: vxlan basic
⚡⚡⚡ Networking ipsec: basic netns - transport
⚡⚡⚡ Networking ipsec: basic netns - tunnel
⚡⚡⚡ Libkcapi AF_ALG test
🚧 ⚡⚡⚡ CIFS Connectathon
🚧 ⚡⚡⚡ POSIX pjd-fstest suites
🚧 ⚡⚡⚡ jvm - DaCapo Benchmark Suite
🚧 ⚡⚡⚡ jvm - jcstress tests
🚧 ⚡⚡⚡ Memory function: kaslr
🚧 ⚡⚡⚡ Networking firewall: basic netfilter test
🚧 ⚡⚡⚡ audit: audit testsuite test
🚧 ⚡⚡⚡ trace: ftrace/tracer
🚧 ⚡⚡⚡ kdump - kexec_boot
Host 2:
✅ Boot test
🚧 ✅ kdump - sysrq-c
Host 3:
✅ Boot test
✅ selinux-policy: serge-testsuite
✅ stress: stress-ng
🚧 ✅ Storage blktests
Host 4:
✅ Boot test
✅ Podman system integration test - as root
✅ Podman system integration test - as user
✅ LTP
✅ Loopdev Sanity
✅ Memory function: memfd_create
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking route: pmtu
✅ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - transport
✅ Networking ipsec: basic netns - tunnel
✅ Libkcapi AF_ALG test
🚧 ✅ CIFS Connectathon
🚧 ✅ POSIX pjd-fstest suites
🚧 ✅ jvm - DaCapo Benchmark Suite
🚧 ✅ jvm - jcstress tests
🚧 ✅ Memory function: kaslr
🚧 ✅ Networking firewall: basic netfilter test
🚧 ❌ audit: audit testsuite test
🚧 ✅ trace: ftrace/tracer
🚧 ✅ kdump - kexec_boot
x86_64:
Host 1:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
🚧 ⚡⚡⚡ kdump - sysrq-c
Host 2:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ Podman system integration test - as root
⚡⚡⚡ Podman system integration test - as user
⚡⚡⚡ LTP
⚡⚡⚡ Loopdev Sanity
⚡⚡⚡ Memory function: memfd_create
⚡⚡⚡ AMTU (Abstract Machine Test Utility)
⚡⚡⚡ Networking bridge: sanity
⚡⚡⚡ Ethernet drivers sanity
⚡⚡⚡ Networking socket: fuzz
⚡⚡⚡ Networking: igmp conformance test
⚡⚡⚡ Networking route: pmtu
⚡⚡⚡ Networking route_func - local
⚡⚡⚡ Networking route_func - forward
⚡⚡⚡ Networking TCP: keepalive test
⚡⚡⚡ Networking UDP: socket
⚡⚡⚡ Networking tunnel: geneve basic test
⚡⚡⚡ Networking tunnel: gre basic
⚡⚡⚡ L2TP basic test
⚡⚡⚡ Networking tunnel: vxlan basic
⚡⚡⚡ Networking ipsec: basic netns - transport
⚡⚡⚡ Networking ipsec: basic netns - tunnel
⚡⚡⚡ Libkcapi AF_ALG test
⚡⚡⚡ pciutils: sanity smoke test
⚡⚡⚡ pciutils: update pci ids test
⚡⚡⚡ ALSA PCM loopback test
⚡⚡⚡ ALSA Control (mixer) Userspace Element test
⚡⚡⚡ storage: SCSI VPD
🚧 ⚡⚡⚡ CIFS Connectathon
🚧 ⚡⚡⚡ POSIX pjd-fstest suites
🚧 ⚡⚡⚡ jvm - DaCapo Benchmark Suite
🚧 ⚡⚡⚡ jvm - jcstress tests
🚧 ⚡⚡⚡ Memory function: kaslr
🚧 ⚡⚡⚡ Networking firewall: basic netfilter test
🚧 ⚡⚡⚡ audit: audit testsuite test
🚧 ⚡⚡⚡ trace: ftrace/tracer
🚧 ⚡⚡⚡ kdump - kexec_boot
Host 3:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ xfstests - ext4
⚡⚡⚡ xfstests - xfs
⚡⚡⚡ selinux-policy: serge-testsuite
⚡⚡⚡ storage: software RAID testing
⚡⚡⚡ stress: stress-ng
🚧 ⚡⚡⚡ CPU: Frequency Driver Test
🚧 ⚡⚡⚡ CPU: Idle Test
🚧 ⚡⚡⚡ IOMMU boot test
🚧 ⚡⚡⚡ IPMI driver test
🚧 ⚡⚡⚡ IPMItool loop stress test
🚧 ⚡⚡⚡ power-management: cpupower/sanity test
🚧 ⚡⚡⚡ Storage blktests
Host 4:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
🚧 ⚡⚡⚡ kdump - sysrq-c
Host 5:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ Podman system integration test - as root
⚡⚡⚡ Podman system integration test - as user
⚡⚡⚡ LTP
⚡⚡⚡ Loopdev Sanity
⚡⚡⚡ Memory function: memfd_create
⚡⚡⚡ AMTU (Abstract Machine Test Utility)
⚡⚡⚡ Networking bridge: sanity
⚡⚡⚡ Ethernet drivers sanity
⚡⚡⚡ Networking socket: fuzz
⚡⚡⚡ Networking: igmp conformance test
⚡⚡⚡ Networking route: pmtu
⚡⚡⚡ Networking route_func - local
⚡⚡⚡ Networking route_func - forward
⚡⚡⚡ Networking TCP: keepalive test
⚡⚡⚡ Networking UDP: socket
⚡⚡⚡ Networking tunnel: geneve basic test
⚡⚡⚡ Networking tunnel: gre basic
⚡⚡⚡ L2TP basic test
⚡⚡⚡ Networking tunnel: vxlan basic
⚡⚡⚡ Networking ipsec: basic netns - transport
⚡⚡⚡ Networking ipsec: basic netns - tunnel
⚡⚡⚡ Libkcapi AF_ALG test
⚡⚡⚡ pciutils: sanity smoke test
⚡⚡⚡ pciutils: update pci ids test
⚡⚡⚡ ALSA PCM loopback test
⚡⚡⚡ ALSA Control (mixer) Userspace Element test
⚡⚡⚡ storage: SCSI VPD
🚧 ⚡⚡⚡ CIFS Connectathon
🚧 ⚡⚡⚡ POSIX pjd-fstest suites
🚧 ⚡⚡⚡ jvm - DaCapo Benchmark Suite
🚧 ⚡⚡⚡ jvm - jcstress tests
🚧 ⚡⚡⚡ Memory function: kaslr
🚧 ⚡⚡⚡ Networking firewall: basic netfilter test
🚧 ⚡⚡⚡ audit: audit testsuite test
🚧 ⚡⚡⚡ trace: ftrace/tracer
🚧 ⚡⚡⚡ kdump - kexec_boot
Test sources: https://github.com/CKI-project/tests-beaker
💚 Pull requests are welcome for new tests or improvements to existing tests!
Aborted tests
-------------
Tests that didn't complete running successfully are marked with ⚡⚡⚡.
If this was caused by an infrastructure issue, we try to mark that
explicitly in the report.
Waived tests
------------
If the test run included waived tests, they are marked with 🚧. Such tests are
executed but their results are not taken into account. Tests are waived when
their results are not reliable enough, e.g. when they're just introduced or are
being fixed.
Testing timeout
---------------
We aim to provide a report within reasonable timeframe. Tests that haven't
finished running yet are marked with ⏱.
From: Enric Balletbo i Serra <enric.balletbo(a)collabora.com>
[ Upstream commit 87c3d579c8ed0eaea6b1567d529a8daa85a2bc6c ]
regmap is a library function that gets selected by drivers that need
it. No driver modules should depend on it. Depending on REGMAP_I2C makes
this driver only build if another driver already selected REGMAP_I2C,
as the symbol can't be selected through the menu kernel configuration.
Fixes: 2219a935963e ("power_supply: Add TI BQ24257 charger driver")
Signed-off-by: Enric Balletbo i Serra <enric.balletbo(a)collabora.com>
Signed-off-by: Sebastian Reichel <sebastian.reichel(a)collabora.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/power/supply/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/power/supply/Kconfig b/drivers/power/supply/Kconfig
index 76806a0be820..0de9a958b29a 100644
--- a/drivers/power/supply/Kconfig
+++ b/drivers/power/supply/Kconfig
@@ -424,7 +424,7 @@ config CHARGER_BQ24257
tristate "TI BQ24250/24251/24257 battery charger driver"
depends on I2C
depends on GPIOLIB || COMPILE_TEST
- depends on REGMAP_I2C
+ select REGMAP_I2C
help
Say Y to enable support for the TI BQ24250, BQ24251, and BQ24257 battery
chargers.
--
2.25.1
The patch titled
Subject: mm, compaction: make capture control handling safe wrt interrupts
has been added to the -mm tree. Its filename is
mm-compaction-make-capture-control-handling-safe-wrt-interrupts.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-make-capture-control…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-compaction-make-capture-control…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Vlastimil Babka <vbabka(a)suse.cz>
Subject: mm, compaction: make capture control handling safe wrt interrupts
Hugh reports:
: While stressing compaction, one run oopsed on NULL capc->cc in
: __free_one_page()'s task_capc(zone): compact_zone_order() had been
: interrupted, and a page was being freed in the return from interrupt.
:
: Though you would not expect it from the source, both gccs I was using (a
: 4.8.1 and a 7.5.0) had chosen to compile compact_zone_order() with the
: ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before callq
: compact_zone - long after the "current->capture_control = &capc". An
: interrupt in between those finds capc->cc NULL (zeroed by an earlier rep
: stos).
:
: This could presumably be fixed by a barrier() before setting
: current->capture_control in compact_zone_order(); but would also need more
: care on return from compact_zone(), in order not to risk leaking a page
: captured by interrupt just before capture_control is reset.
:
: Maybe that is the preferable fix, but I felt safer for task_capc() to
: exclude the rather surprising possibility of capture at interrupt time.
I have checked that gcc10 also behaves the same.
The advantage of fix in compact_zone_order() is that we don't add another
test in the page freeing hot path, and that it might prevent future
problems if we stop exposing pointers to unitialized structures in current
task.
So this patch implements the suggestion for compact_zone_order() with
barrier() (and WRITE_ONCE() to prevent store tearing) for setting
current->capture_control, and prevents page leaking with
WRITE_ONCE/READ_ONCE in the proper order.
Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
Reported-by: Hugh Dickins <hughd(a)google.com>
Suggested-by: Hugh Dickins <hughd(a)google.com>
Cc: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Li Wang <liwang(a)redhat.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: <stable(a)vger.kernel.org> [5.1+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/compaction.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
--- a/mm/compaction.c~mm-compaction-make-capture-control-handling-safe-wrt-interrupts
+++ a/mm/compaction.c
@@ -2316,15 +2316,26 @@ static enum compact_result compact_zone_
.page = NULL,
};
- current->capture_control = &capc;
+ /*
+ * Make sure the structs are really initialized before we expose the
+ * capture control, in case we are interrupted and the interrupt handler
+ * frees a page.
+ */
+ barrier();
+ WRITE_ONCE(current->capture_control, &capc);
ret = compact_zone(&cc, &capc);
VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));
- *capture = capc.page;
- current->capture_control = NULL;
+ /*
+ * Make sure we hide capture control first before we read the captured
+ * page pointer, otherwise an interrupt could free and capture a page
+ * and we would leak it.
+ */
+ WRITE_ONCE(current->capture_control, NULL);
+ *capture = READ_ONCE(capc.page);
return ret;
}
_
Patches currently in -mm which might be from vbabka(a)suse.cz are
mm-compaction-make-capture-control-handling-safe-wrt-interrupts.patch
mm-page_alloc-use-unlikely-in-task_capc.patch
mm-slub-extend-slub_debug-syntax-for-multiple-blocks.patch
mm-slub-make-some-slub_debug-related-attributes-read-only.patch
mm-slub-remove-runtime-allocation-order-changes.patch
mm-slub-make-remaining-slub_debug-related-attributes-read-only.patch
mm-slub-make-reclaim_account-attribute-read-only.patch
mm-slub-introduce-static-key-for-slub_debug.patch
mm-slub-introduce-kmem_cache_debug_flags.patch
mm-slub-extend-checks-guarded-by-slub_debug-static-key.patch
mm-slab-slub-move-and-improve-cache_from_obj.patch
I updated my system with Radeon VII from kernel 5.6 to kernel 5.7, and
following started to happen on each boot:
...
BUG: kernel NULL pointer dereference, address: 0000000000000128
...
CPU: 9 PID: 1940 Comm: modprobe Tainted: G E 5.7.2-200.im0.fc32.x86_64 #1
Hardware name: System manufacturer System Product Name/PRIME X570-P, BIOS 1407 04/02/2020
RIP: 0010:lock_bus+0x42/0x60 [amdgpu]
...
Call Trace:
i2c_smbus_xfer+0x3d/0xf0
i2c_default_probe+0xf3/0x130
i2c_detect.isra.0+0xfe/0x2b0
? kfree+0xa3/0x200
? kobject_uevent_env+0x11f/0x6a0
? i2c_detect.isra.0+0x2b0/0x2b0
__process_new_driver+0x1b/0x20
bus_for_each_dev+0x64/0x90
? 0xffffffffc0f34000
i2c_register_driver+0x73/0xc0
do_one_initcall+0x46/0x200
? _cond_resched+0x16/0x40
? kmem_cache_alloc_trace+0x167/0x220
? do_init_module+0x23/0x260
do_init_module+0x5c/0x260
__do_sys_init_module+0x14f/0x170
do_syscall_64+0x5b/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
...
Error appears when some i2c device driver tries to probe for devices
using adapter registered by `smu_v11_0_i2c_eeprom_control_init()`.
Code supporting this adapter requires `adev->psp.ras.ras` to be not
NULL, which is true only when `amdgpu_ras_init()` detects HW support by
calling `amdgpu_ras_check_supported()`.
Before 9015d60c9ee1, adapter was registered by
-> amdgpu_device_ip_init()
-> amdgpu_ras_recovery_init()
-> amdgpu_ras_eeprom_init()
-> smu_v11_0_i2c_eeprom_control_init()
after verifying that `adev->psp.ras.ras` is not NULL in
`amdgpu_ras_recovery_init()`. Currently it is registered
unconditionally by
-> amdgpu_device_ip_init()
-> pp_sw_init()
-> hwmgr_sw_init()
-> vega20_smu_init()
-> smu_v11_0_i2c_eeprom_control_init()
Fix simply adds HW support check (ras == NULL => no support) before
calling `smu_v11_0_i2c_eeprom_control_{init,fini}()`.
Please note that there is a chance that similar fix is also required for
CHIP_ARCTURUS. I do not know whether any actual Arcturus hardware without
RAS exist, and whether calling `smu_i2c_eeprom_init()` makes any sense
when there is no HW support.
Cc: stable(a)vger.kernel.org
Fixes: 9015d60c9ee1 ("drm/amdgpu: Move EEPROM I2C adapter to amdgpu_device")
Signed-off-by: Ivan Mironov <mironov.ivan(a)gmail.com>
---
drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c b/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c
index 2fb97554134f..c2e0fbbccf56 100644
--- a/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c
+++ b/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c
@@ -522,9 +522,11 @@ static int vega20_smu_init(struct pp_hwmgr *hwmgr)
priv->smu_tables.entry[TABLE_ACTIVITY_MONITOR_COEFF].version = 0x01;
priv->smu_tables.entry[TABLE_ACTIVITY_MONITOR_COEFF].size = sizeof(DpmActivityMonitorCoeffInt_t);
- ret = smu_v11_0_i2c_eeprom_control_init(&adev->pm.smu_i2c);
- if (ret)
- goto err4;
+ if (adev->psp.ras.ras) {
+ ret = smu_v11_0_i2c_eeprom_control_init(&adev->pm.smu_i2c);
+ if (ret)
+ goto err4;
+ }
return 0;
@@ -560,7 +562,8 @@ static int vega20_smu_fini(struct pp_hwmgr *hwmgr)
(struct vega20_smumgr *)(hwmgr->smu_backend);
struct amdgpu_device *adev = hwmgr->adev;
- smu_v11_0_i2c_eeprom_control_fini(&adev->pm.smu_i2c);
+ if (adev->psp.ras.ras)
+ smu_v11_0_i2c_eeprom_control_fini(&adev->pm.smu_i2c);
if (priv) {
amdgpu_bo_free_kernel(&priv->smu_tables.entry[TABLE_PPTABLE].handle,
--
2.26.2
Hugh reports:
=====
While stressing compaction, one run oopsed on NULL capc->cc in
__free_one_page()'s task_capc(zone): compact_zone_order() had been
interrupted, and a page was being freed in the return from interrupt.
Though you would not expect it from the source, both gccs I was using
(a 4.8.1 and a 7.5.0) had chosen to compile compact_zone_order() with
the ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before
callq compact_zone - long after the "current->capture_control = &capc".
An interrupt in between those finds capc->cc NULL (zeroed by an earlier
rep stos).
This could presumably be fixed by a barrier() before setting
current->capture_control in compact_zone_order(); but would also need
more care on return from compact_zone(), in order not to risk leaking
a page captured by interrupt just before capture_control is reset.
Maybe that is the preferable fix, but I felt safer for task_capc() to
exclude the rather surprising possibility of capture at interrupt time.
=====
I have checked that gcc10 also behaves the same.
The advantage of fix in compact_zone_order() is that we don't add another
test in the page freeing hot path, and that it might prevent future problems
if we stop exposing pointers to unitialized structures in current task.
So this patch implements the suggestion for compact_zone_order() with barrier()
(and WRITE_ONCE() to prevent store tearing) for setting
current->capture_control, and prevents page leaking with WRITE_ONCE/READ_ONCE
in the proper order.
Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
Cc: stable(a)vger.kernel.org # 5.1+
Reported-by: Hugh Dickins <hughd(a)google.com>
Suggested-by: Hugh Dickins <hughd(a)google.com>
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
---
mm/compaction.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index fd988b7e5f2b..86375605faa9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2316,15 +2316,26 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
.page = NULL,
};
- current->capture_control = &capc;
+ /*
+ * Make sure the structs are really initialized before we expose the
+ * capture control, in case we are interrupted and the interrupt handler
+ * frees a page.
+ */
+ barrier();
+ WRITE_ONCE(current->capture_control, &capc);
ret = compact_zone(&cc, &capc);
VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));
- *capture = capc.page;
- current->capture_control = NULL;
+ /*
+ * Make sure we hide capture control first before we read the captured
+ * page pointer, otherwise an interrupt could free and capture a page
+ * and we would leak it.
+ */
+ WRITE_ONCE(current->capture_control, NULL);
+ *capture = READ_ONCE(capc.page);
return ret;
}
--
2.27.0
The patch titled
Subject: mm/memcontrol.c: add missed css_put()
has been added to the -mm tree. Its filename is
mm-memcontrol-fix-do-not-put-the-css-reference.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-memcontrol-fix-do-not-put-the-c…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-memcontrol-fix-do-not-put-the-c…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Muchun Song <songmuchun(a)bytedance.com>
Subject: mm/memcontrol.c: add missed css_put()
We should put the css reference when memory allocation failed.
Link: http://lkml.kernel.org/r/20200614122653.98829-1-songmuchun@bytedance.com
Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
Signed-off-by: Muchun Song <songmuchun(a)bytedance.com>
Acked-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Qian Cai <cai(a)lca.pw>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/mm/memcontrol.c~mm-memcontrol-fix-do-not-put-the-css-reference
+++ a/mm/memcontrol.c
@@ -2772,8 +2772,10 @@ static void memcg_schedule_kmem_cache_cr
return;
cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
- if (!cw)
+ if (!cw) {
+ css_put(&memcg->css);
return;
+ }
cw->memcg = memcg;
cw->cachep = cachep;
_
Patches currently in -mm which might be from songmuchun(a)bytedance.com are
mm-memcontrol-fix-do-not-put-the-css-reference.patch
This is a note to let you know that I've just added the patch titled
Staging: rtl8723bs: prevent buffer overflow in
to my staging git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git
in the staging-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From b65a2d8c8614386f7e8d38ea150749f8a862f431 Mon Sep 17 00:00:00 2001
From: Dan Carpenter <dan.carpenter(a)oracle.com>
Date: Wed, 3 Jun 2020 13:19:58 +0300
Subject: Staging: rtl8723bs: prevent buffer overflow in
update_sta_support_rate()
The "ie_len" variable is in the 0-255 range and it comes from the
network. If it's over NDIS_802_11_LENGTH_RATES_EX (16) then that will
lead to memory corruption.
Fixes: 554c0a3abf21 ("staging: Add rtl8723bs sdio wifi driver")
Signed-off-by: Dan Carpenter <dan.carpenter(a)oracle.com>
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200603101958.GA1845750@mwanda
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/staging/rtl8723bs/core/rtw_wlan_util.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/staging/rtl8723bs/core/rtw_wlan_util.c b/drivers/staging/rtl8723bs/core/rtw_wlan_util.c
index 69bcd172b298..a3ea7ce3e12e 100644
--- a/drivers/staging/rtl8723bs/core/rtw_wlan_util.c
+++ b/drivers/staging/rtl8723bs/core/rtw_wlan_util.c
@@ -1824,12 +1824,14 @@ int update_sta_support_rate(struct adapter *padapter, u8 *pvar_ie, uint var_ie_l
pIE = (struct ndis_80211_var_ie *)rtw_get_ie(pvar_ie, _SUPPORTEDRATES_IE_, &ie_len, var_ie_len);
if (!pIE)
return _FAIL;
+ if (ie_len > sizeof(pmlmeinfo->FW_sta_info[cam_idx].SupportedRates))
+ return _FAIL;
memcpy(pmlmeinfo->FW_sta_info[cam_idx].SupportedRates, pIE->data, ie_len);
supportRateNum = ie_len;
pIE = (struct ndis_80211_var_ie *)rtw_get_ie(pvar_ie, _EXT_SUPPORTEDRATES_IE_, &ie_len, var_ie_len);
- if (pIE)
+ if (pIE && (ie_len <= sizeof(pmlmeinfo->FW_sta_info[cam_idx].SupportedRates) - supportRateNum))
memcpy((pmlmeinfo->FW_sta_info[cam_idx].SupportedRates + supportRateNum), pIE->data, ie_len);
return _SUCCESS;
--
2.27.0
The patch titled
Subject: mm: memcontrol: handle div0 crash race condition in memory.low
has been added to the -mm tree. Its filename is
mm-memcontrol-handle-div0-crash-race-condition-in-memorylow.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-memcontrol-handle-div0-crash-ra…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-memcontrol-handle-div0-crash-ra…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Johannes Weiner <hannes(a)cmpxchg.org>
Subject: mm: memcontrol: handle div0 crash race condition in memory.low
Tejun reports seeing rare div0 crashes in memory.low stress testing:
[37228.504582] RIP: 0010:mem_cgroup_calculate_protection+0xed/0x150
[37228.505059] Code: 0f 46 d1 4c 39 d8 72 57 f6 05 16 d6 42 01 40 74 1f 4c 39 d8 76 1a 4c 39 d1 76 15 4c 29 d1 4c 29 d8 4d 29 d9 31 d2 48 0f af c1 <49> f7 f1 49 01 c2 4c 89 96 38 01 00 00 5d c3 48 0f af c7 31 d2 49
[37228.506254] RSP: 0018:ffffa14e01d6fcd0 EFLAGS: 00010246
[37228.506769] RAX: 000000000243e384 RBX: 0000000000000000 RCX: 0000000000008f4b
[37228.507319] RDX: 0000000000000000 RSI: ffff8b89bee84000 RDI: 0000000000000000
[37228.507869] RBP: ffffa14e01d6fcd0 R08: ffff8b89ca7d40f8 R09: 0000000000000000
[37228.508376] R10: 0000000000000000 R11: 00000000006422f7 R12: 0000000000000000
[37228.508881] R13: ffff8b89d9617000 R14: ffff8b89bee84000 R15: ffffa14e01d6fdb8
[37228.509397] FS: 0000000000000000(0000) GS:ffff8b8a1f1c0000(0000) knlGS:0000000000000000
[37228.509917] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37228.510442] CR2: 00007f93b1fc175b CR3: 000000016100a000 CR4: 0000000000340ea0
[37228.511076] Call Trace:
[37228.511561] shrink_node+0x1e5/0x6c0
[37228.512044] balance_pgdat+0x32d/0x5f0
[37228.512521] kswapd+0x1d7/0x3d0
[37228.513346] ? wait_woken+0x80/0x80
[37228.514170] kthread+0x11c/0x160
[37228.514983] ? balance_pgdat+0x5f0/0x5f0
[37228.515797] ? kthread_park+0x90/0x90
[37228.516593] ret_from_fork+0x1f/0x30
This happens when parent_usage == siblings_protected. We check that usage
is bigger than protected, which should imply parent_usage being bigger
than siblings_protected. However, we don't read (or even update) these
values atomically, and they can be out of sync as the memory state changes
under us. A bit of fluctuation around the target protection isn't a big
deal, but we need to handle the div0 case.
Check the parent state explicitly to make sure we have a reasonable
positive value for the divisor.
Link: http://lkml.kernel.org/r/20200615140658.601684-1-hannes@cmpxchg.org
Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Reported-by: Tejun Heo <tj(a)kernel.org>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Chris Down <chris(a)chrisdown.name>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/mm/memcontrol.c~mm-memcontrol-handle-div0-crash-race-condition-in-memorylow
+++ a/mm/memcontrol.c
@@ -6360,11 +6360,16 @@ static unsigned long effective_protectio
* We're using unprotected memory for the weight so that if
* some cgroups DO claim explicit protection, we don't protect
* the same bytes twice.
+ *
+ * Check both usage and parent_usage against the respective
+ * protected values. One should imply the other, but they
+ * aren't read atomically - make sure the division is sane.
*/
if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT))
return ep;
-
- if (parent_effective > siblings_protected && usage > protected) {
+ if (parent_effective > siblings_protected &&
+ parent_usage > siblings_protected &&
+ usage > protected) {
unsigned long unclaimed;
unclaimed = parent_effective - siblings_protected;
_
Patches currently in -mm which might be from hannes(a)cmpxchg.org are
mm-workingset-age-nonresident-information-alongside-anonymous-pages.patch
mm-memcontrol-handle-div0-crash-race-condition-in-memorylow.patch
The patch titled
Subject: mm: fix swap cache node allocation mask
has been added to the -mm tree. Its filename is
mm-fix-swap-cache-node-allocation-mask.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-fix-swap-cache-node-allocation-…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-fix-swap-cache-node-allocation-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: fix swap cache node allocation mask
https://bugzilla.kernel.org/show_bug.cgi?id=208085 reports that a slightly
overcommitted load, testing swap and zram along with i915, splats and
keeps on splatting, when it had better fail less noisily:
gnome-shell: page allocation failure: order:0,
mode:0x400d0(__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_RECLAIMABLE),
nodemask=(null),cpuset=/,mems_allowed=0
CPU: 2 PID: 1155 Comm: gnome-shell Not tainted 5.7.0-1.fc33.x86_64 #1
Call Trace:
dump_stack+0x64/0x88
warn_alloc.cold+0x75/0xd9
__alloc_pages_slowpath.constprop.0+0xcfa/0xd30
__alloc_pages_nodemask+0x2df/0x320
alloc_slab_page+0x195/0x310
allocate_slab+0x3c5/0x440
___slab_alloc+0x40c/0x5f0
__slab_alloc+0x1c/0x30
kmem_cache_alloc+0x20e/0x220
xas_nomem+0x28/0x70
add_to_swap_cache+0x321/0x400
__read_swap_cache_async+0x105/0x240
swap_cluster_readahead+0x22c/0x2e0
shmem_swapin+0x8e/0xc0
shmem_swapin_page+0x196/0x740
shmem_getpage_gfp+0x3a2/0xa60
shmem_read_mapping_page_gfp+0x32/0x60
shmem_get_pages+0x155/0x5e0 [i915]
__i915_gem_object_get_pages+0x68/0xa0 [i915]
i915_vma_pin+0x3fe/0x6c0 [i915]
eb_add_vma+0x10b/0x2c0 [i915]
i915_gem_do_execbuffer+0x704/0x3430 [i915]
i915_gem_execbuffer2_ioctl+0x1ea/0x3e0 [i915]
drm_ioctl_kernel+0x86/0xd0 [drm]
drm_ioctl+0x206/0x390 [drm]
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5b/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported on 5.7, but it goes back really to 3.1: when
shmem_read_mapping_page_gfp() was implemented for use by i915, and
allowed for __GFP_NORETRY and __GFP_NOWARN flags in most places, but
missed swapin's "& GFP_KERNEL" mask for page tree node allocation in
__read_swap_cache_async() - that was to mask off HIGHUSER_MOVABLE bits
from what page cache uses, but GFP_RECLAIM_MASK is now what's needed.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2006151330070.11064@eggly.anvils
Fixes: 68da9f055755 ("tmpfs: pass gfp to shmem_getpage_gfp")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Reported-by: Chris Murphy <lists(a)colorremedies.com>
Analyzed-by: Vlastimil Babka <vbabka(a)suse.cz>
Analyzed-by: Matthew Wilcox <willy(a)infradead.org>
Tested-by: Chris Murphy <lists(a)colorremedies.com>
Cc: <stable(a)vger.kernel.org> [3.1+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/swap_state.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/swap_state.c~mm-fix-swap-cache-node-allocation-mask
+++ a/mm/swap_state.c
@@ -21,7 +21,7 @@
#include <linux/vmalloc.h>
#include <linux/swap_slots.h>
#include <linux/huge_mm.h>
-
+#include "internal.h"
/*
* swapper_space is a fiction, retained to simplify the path through
@@ -429,7 +429,7 @@ struct page *__read_swap_cache_async(swp
__SetPageSwapBacked(page);
/* May fail (-ENOMEM) if XArray node allocation failed. */
- if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
+ if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK)) {
put_swap_page(page, entry);
goto fail_unlock;
}
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-fix-swap-cache-node-allocation-mask.patch
mm-vmstat-add-events-for-pmd-based-thp-migration-without-split-fix.patch
The patch titled
Subject: mm/slab: use memzero_explicit() in kzfree()
has been added to the -mm tree. Its filename is
mm-slab-use-memzero_explicit-in-kzfree.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-slab-use-memzero_explicit-in-kz…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-slab-use-memzero_explicit-in-kz…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Waiman Long <longman(a)redhat.com>
Subject: mm/slab: use memzero_explicit() in kzfree()
The kzfree() function is normally used to clear some sensitive
information, like encryption keys, in the buffer before freeing it back to
the pool. Memset() is currently used for buffer clearing. However
unlikely, there is still a non-zero probability that the compiler may
choose to optimize away the memory clearing especially if LTO is being
used in the future. To make sure that this optimization will never
happen, memzero_explicit(), which is introduced in v3.18, is now used in
kzfree() to future-proof it.
Link: http://lkml.kernel.org/r/20200616154311.12314-2-longman@redhat.com
Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
Signed-off-by: Waiman Long <longman(a)redhat.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Cc: James Morris <jmorris(a)namei.org>
Cc: "Serge E. Hallyn" <serge(a)hallyn.com>
Cc: Joe Perches <joe(a)perches.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Dan Carpenter <dan.carpenter(a)oracle.com>
Cc: "Jason A . Donenfeld" <Jason(a)zx2c4.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab_common.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/slab_common.c~mm-slab-use-memzero_explicit-in-kzfree
+++ a/mm/slab_common.c
@@ -1726,7 +1726,7 @@ void kzfree(const void *p)
if (unlikely(ZERO_OR_NULL_PTR(mem)))
return;
ks = ksize(mem);
- memset(mem, 0, ks);
+ memzero_explicit(mem, ks);
kfree(mem);
}
EXPORT_SYMBOL(kzfree);
_
Patches currently in -mm which might be from longman(a)redhat.com are
mm-slab-use-memzero_explicit-in-kzfree.patch
mm-treewide-rename-kzfree-to-kfree_sensitive.patch
The patch titled
Subject: lib/lzo: fix ambiguous encoding bug in lzo-rle
has been removed from the -mm tree. Its filename was
lib-lzo-fix-ambiguous-encoding-bug-in-lzo-rle.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Dave Rodgman <dave.rodgman(a)arm.com>
Subject: lib/lzo: fix ambiguous encoding bug in lzo-rle
In some rare cases, for input data over 32 KB, lzo-rle could encode two
different inputs to the same compressed representation, so that
decompression is then ambiguous (i.e. data may be corrupted - although
zram is not affected because it operates over 4 KB pages).
This modifies the compressor without changing the decompressor or the
bitstream format, such that:
- there is no change to how data produced by the old compressor is
decompressed
- an old decompressor will correctly decode data from the updated
compressor
- performance and compression ratio are not affected
- we avoid introducing a new bitstream format
In testing over 12.8M real-world files totalling 903 GB, three files were
affected by this bug. I also constructed 37M semi-random 64 KB files
totalling 2.27 TB, and saw no affected files. Finally I tested over files
constructed to contain each of the ~1024 possible bad input sequences; for
all of these cases, updated lzo-rle worked correctly.
There is no significant impact to performance or compression ratio.
Link: http://lkml.kernel.org/r/20200507100203.29785-1-dave.rodgman@arm.com
Signed-off-by: Dave Rodgman <dave.rodgman(a)arm.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Dave Rodgman <dave.rodgman(a)arm.com>
Cc: Willy Tarreau <w(a)1wt.eu>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work(a)gmail.com>
Cc: Markus F.X.J. Oberhumer <markus(a)oberhumer.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Nitin Gupta <ngupta(a)vflare.org>
Cc: Chao Yu <yuchao0(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
Documentation/lzo.txt | 8 ++++++--
lib/lzo/lzo1x_compress.c | 13 +++++++++++++
2 files changed, 19 insertions(+), 2 deletions(-)
--- a/Documentation/lzo.txt~lib-lzo-fix-ambiguous-encoding-bug-in-lzo-rle
+++ a/Documentation/lzo.txt
@@ -159,11 +159,15 @@ Byte sequences
distance = 16384 + (H << 14) + D
state = S (copy S literals after this block)
End of stream is reached if distance == 16384
+ In version 1 only, to prevent ambiguity with the RLE case when
+ ((distance & 0x803f) == 0x803f) && (261 <= length <= 264), the
+ compressor must not emit block copies where distance and length
+ meet these conditions.
In version 1 only, this instruction is also used to encode a run of
- zeros if distance = 0xbfff, i.e. H = 1 and the D bits are all 1.
+ zeros if distance = 0xbfff, i.e. H = 1 and the D bits are all 1.
In this case, it is followed by a fourth byte, X.
- run length = ((X << 3) | (0 0 0 0 0 L L L)) + 4.
+ run length = ((X << 3) | (0 0 0 0 0 L L L)) + 4
0 0 1 L L L L L (32..63)
Copy of small block within 16kB distance (preferably less than 34B)
--- a/lib/lzo/lzo1x_compress.c~lib-lzo-fix-ambiguous-encoding-bug-in-lzo-rle
+++ a/lib/lzo/lzo1x_compress.c
@@ -268,6 +268,19 @@ m_len_done:
*op++ = (M4_MARKER | ((m_off >> 11) & 8)
| (m_len - 2));
else {
+ if (unlikely(((m_off & 0x403f) == 0x403f)
+ && (m_len >= 261)
+ && (m_len <= 264))
+ && likely(bitstream_version)) {
+ // Under lzo-rle, block copies
+ // for 261 <= length <= 264 and
+ // (distance & 0x80f3) == 0x80f3
+ // can result in ambiguous
+ // output. Adjust length
+ // to 260 to prevent ambiguity.
+ ip -= m_len - 260;
+ m_len = 260;
+ }
m_len -= M4_MAX_LEN;
*op++ = (M4_MARKER | ((m_off >> 11) & 8));
while (unlikely(m_len > 255)) {
_
Patches currently in -mm which might be from dave.rodgman(a)arm.com are
This is a note to let you know that I've just added the patch titled
serial: imx: Initialize lock for non-registered console
to my tty git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git
in the tty-next branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will also be merged in the next major kernel release
during the merge window.
If you have any questions about this process, please let me know.
>From 8f065acec7573672dd15916e31d1e9b2e785566c Mon Sep 17 00:00:00 2001
From: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Date: Mon, 25 May 2020 13:59:52 +0300
Subject: serial: imx: Initialize lock for non-registered console
The commit a3cb39d258ef
("serial: core: Allow detach and attach serial device for console")
changed a bit logic behind lock initialization since for most of the console
driver it's supposed to have lock already initialized even if console is not
enabled. However, it's not the case for Freescale IMX console.
Initialize lock explicitly in the ->probe().
Note, there is still an open question should or shouldn't not this driver
register console properly.
Fixes: a3cb39d258ef ("serial: core: Allow detach and attach serial device for console")
Reported-by: Guenter Roeck <linux(a)roeck-us.net>
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Link: https://lore.kernel.org/r/20200525105952.13744-1-andriy.shevchenko@linux.in…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/serial/imx.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index 986d902fb7fe..6b078e395931 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -2404,6 +2404,9 @@ static int imx_uart_probe(struct platform_device *pdev)
}
}
+ /* We need to initialize lock even for non-registered console */
+ spin_lock_init(&sport->port.lock);
+
imx_uart_ports[sport->port.line] = sport;
platform_set_drvdata(pdev, sport);
--
2.26.2
The kzfree() function is normally used to clear some sensitive
information, like encryption keys, in the buffer before freeing it back
to the pool. Memset() is currently used for the buffer clearing. However,
it is entirely possible that the compiler may choose to optimize away the
memory clearing especially if LTO is being used. To make sure that this
optimization will not happen, memzero_explicit(), which is introduced
in v3.18, is now used in kzfree() to do the clearing.
Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Waiman Long <longman(a)redhat.com>
---
mm/slab_common.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9e72ba224175..37d48a56431d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1726,7 +1726,7 @@ void kzfree(const void *p)
if (unlikely(ZERO_OR_NULL_PTR(mem)))
return;
ks = ksize(mem);
- memset(mem, 0, ks);
+ memzero_explicit(mem, ks);
kfree(mem);
}
EXPORT_SYMBOL(kzfree);
--
2.18.1
The kzfree() function is normally used to clear some sensitive
information, like encryption keys, in the buffer before freeing it back
to the pool. Memset() is currently used for buffer clearing. However
unlikely, there is still a non-zero probability that the compiler may
choose to optimize away the memory clearing especially if LTO is being
used in the future. To make sure that this optimization will never
happen, memzero_explicit(), which is introduced in v3.18, is now used
in kzfree() to future-proof it.
Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
Cc: stable(a)vger.kernel.org
Acked-by: Michal Hocko <mhocko(a)suse.com>
Signed-off-by: Waiman Long <longman(a)redhat.com>
---
mm/slab_common.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9e72ba224175..37d48a56431d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1726,7 +1726,7 @@ void kzfree(const void *p)
if (unlikely(ZERO_OR_NULL_PTR(mem)))
return;
ks = ksize(mem);
- memset(mem, 0, ks);
+ memzero_explicit(mem, ks);
kfree(mem);
}
EXPORT_SYMBOL(kzfree);
--
2.18.1
The Obins Anne Pro 2 keyboard (04d9:a293) disconnects after a few
minutes of inactivity when using it wired and typing does not result
in any input events any more. This is a common firmware flaw. So add
the ALWAYS_POLL quirk for this device.
GitHub user Dietrich Moerman (dietrichm) tested the quirk and
requested my help in my project
https://github.com/sriemer/fix-linux-mouse issue 22 to provide
this patch.
Link: https://www.reddit.com/r/AnnePro/comments/gruzcb/anne_pro_2_linux_cant_type…
Signed-off-by: Sebastian Parschauer <s.parschauer(a)gmx.de>
Cc: stable(a)vger.kernel.org # v4.16+
---
drivers/hid/hid-ids.h | 1 +
drivers/hid/hid-quirks.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/drivers/hid/hid-ids.h b/drivers/hid/hid-ids.h
index 1c71a1aa76b2..3a1047e143d2 100644
--- a/drivers/hid/hid-ids.h
+++ b/drivers/hid/hid-ids.h
@@ -624,6 +624,7 @@
#define USB_DEVICE_ID_HOLTEK_ALT_MOUSE_A081 0xa081
#define USB_DEVICE_ID_HOLTEK_ALT_MOUSE_A0C2 0xa0c2
#define USB_DEVICE_ID_HOLTEK_ALT_KEYBOARD_A096 0xa096
+#define USB_DEVICE_ID_HOLTEK_ALT_KEYBOARD_A293 0xa293
#define USB_VENDOR_ID_IMATION 0x0718
#define USB_DEVICE_ID_DISC_STAKKA 0xd000
diff --git a/drivers/hid/hid-quirks.c b/drivers/hid/hid-quirks.c
index e4cb543de0cd..67839d5eece8 100644
--- a/drivers/hid/hid-quirks.c
+++ b/drivers/hid/hid-quirks.c
@@ -88,6 +88,7 @@ static const struct hid_device_id hid_quirks[] = {
{ HID_USB_DEVICE(USB_VENDOR_ID_HAPP, USB_DEVICE_ID_UGCI_FIGHTING), HID_QUIRK_BADPAD | HID_QUIRK_MULTI_INPUT },
{ HID_USB_DEVICE(USB_VENDOR_ID_HAPP, USB_DEVICE_ID_UGCI_FLYING), HID_QUIRK_BADPAD | HID_QUIRK_MULTI_INPUT },
{ HID_USB_DEVICE(USB_VENDOR_ID_HOLTEK_ALT, USB_DEVICE_ID_HOLTEK_ALT_KEYBOARD_A096), HID_QUIRK_NO_INIT_REPORTS },
+ { HID_USB_DEVICE(USB_VENDOR_ID_HOLTEK_ALT, USB_DEVICE_ID_HOLTEK_ALT_KEYBOARD_A293), HID_QUIRK_ALWAYS_POLL },
{ HID_USB_DEVICE(USB_VENDOR_ID_HP, USB_PRODUCT_ID_HP_LOGITECH_OEM_USB_OPTICAL_MOUSE_0A4A), HID_QUIRK_ALWAYS_POLL },
{ HID_USB_DEVICE(USB_VENDOR_ID_HP, USB_PRODUCT_ID_HP_LOGITECH_OEM_USB_OPTICAL_MOUSE_0B4A), HID_QUIRK_ALWAYS_POLL },
{ HID_USB_DEVICE(USB_VENDOR_ID_HP, USB_PRODUCT_ID_HP_PIXART_OEM_USB_OPTICAL_MOUSE), HID_QUIRK_ALWAYS_POLL },
--
2.26.2
commit 0370964dd3ff7d3d406f292cb443a927952cbd05 upstream
On a VHE system, the EL1 state is left in the CPU most of the time,
and only syncronized back to memory when vcpu_put() is called (most
of the time on preemption).
Which means that when injecting an exception, we'd better have a way
to either:
(1) write directly to the EL1 sysregs
(2) synchronize the state back to memory, and do the changes there
For an AArch64, we already do (1), so we are safe. Unfortunately,
doing the same thing for AArch32 would be pretty invasive. Instead,
we can easily implement (2) by calling the put/load architectural
backends, and keep preemption disabled. We can then reload the
state back into EL1.
Cc: stable(a)vger.kernel.org
Reported-by: James Morse <james.morse(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
---
arch/arm/include/asm/kvm_host.h | 2 ++
arch/arm64/include/asm/kvm_host.h | 2 ++
virt/kvm/arm/aarch32.c | 28 ++++++++++++++++++++++++++++
3 files changed, 32 insertions(+)
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index d0d0227fc70d..c9128bb187f9 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -364,4 +364,6 @@ static inline void kvm_vcpu_put_sysregs(struct kvm_vcpu *vcpu) {}
struct kvm *kvm_arch_alloc_vm(void);
void kvm_arch_free_vm(struct kvm *kvm);
+#define kvm_arm_vcpu_loaded(vcpu) (false)
+
#endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 367b2e0b6d76..3c2f270f3b50 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -535,4 +535,6 @@ void kvm_vcpu_put_sysregs(struct kvm_vcpu *vcpu);
struct kvm *kvm_arch_alloc_vm(void);
void kvm_arch_free_vm(struct kvm *kvm);
+#define kvm_arm_vcpu_loaded(vcpu) ((vcpu)->arch.sysregs_loaded_on_cpu)
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/virt/kvm/arm/aarch32.c b/virt/kvm/arm/aarch32.c
index 18d6d5124397..92c9ad6c0182 100644
--- a/virt/kvm/arm/aarch32.c
+++ b/virt/kvm/arm/aarch32.c
@@ -44,6 +44,26 @@ static const u8 return_offsets[8][2] = {
[7] = { 4, 4 }, /* FIQ, unused */
};
+static bool pre_fault_synchronize(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ if (kvm_arm_vcpu_loaded(vcpu)) {
+ kvm_arch_vcpu_put(vcpu);
+ return true;
+ }
+
+ preempt_enable();
+ return false;
+}
+
+static void post_fault_synchronize(struct kvm_vcpu *vcpu, bool loaded)
+{
+ if (loaded) {
+ kvm_arch_vcpu_load(vcpu, smp_processor_id());
+ preempt_enable();
+ }
+}
+
/*
* When an exception is taken, most CPSR fields are left unchanged in the
* handler. However, some are explicitly overridden (e.g. M[4:0]).
@@ -166,7 +186,10 @@ static void prepare_fault32(struct kvm_vcpu *vcpu, u32 mode, u32 vect_offset)
void kvm_inject_undef32(struct kvm_vcpu *vcpu)
{
+ bool loaded = pre_fault_synchronize(vcpu);
+
prepare_fault32(vcpu, PSR_AA32_MODE_UND, 4);
+ post_fault_synchronize(vcpu, loaded);
}
/*
@@ -179,6 +202,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
u32 vect_offset;
u32 *far, *fsr;
bool is_lpae;
+ bool loaded;
+
+ loaded = pre_fault_synchronize(vcpu);
if (is_pabt) {
vect_offset = 12;
@@ -202,6 +228,8 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
/* no need to shuffle FS[4] into DFSR[10] as its 0 */
*fsr = DFSR_FSC_EXTABT_nLPAE;
}
+
+ post_fault_synchronize(vcpu, loaded);
}
void kvm_inject_dabt32(struct kvm_vcpu *vcpu, unsigned long addr)
--
2.27.0
commit 0370964dd3ff7d3d406f292cb443a927952cbd05 upstream
On a VHE system, the EL1 state is left in the CPU most of the time,
and only syncronized back to memory when vcpu_put() is called (most
of the time on preemption).
Which means that when injecting an exception, we'd better have a way
to either:
(1) write directly to the EL1 sysregs
(2) synchronize the state back to memory, and do the changes there
For an AArch64, we already do (1), so we are safe. Unfortunately,
doing the same thing for AArch32 would be pretty invasive. Instead,
we can easily implement (2) by calling the put/load architectural
backends, and keep preemption disabled. We can then reload the
state back into EL1.
Cc: stable(a)vger.kernel.org
Reported-by: James Morse <james.morse(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
---
arch/arm/include/asm/kvm_host.h | 2 ++
arch/arm64/include/asm/kvm_host.h | 2 ++
virt/kvm/arm/aarch32.c | 28 ++++++++++++++++++++++++++++
3 files changed, 32 insertions(+)
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index a827b4d60d38..03932e172730 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -453,4 +453,6 @@ static inline bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu)
return true;
}
+#define kvm_arm_vcpu_loaded(vcpu) (false)
+
#endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 57fd46acd058..06bb53a9d8dd 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -683,4 +683,6 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
#define kvm_arm_vcpu_sve_finalized(vcpu) \
((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
+#define kvm_arm_vcpu_loaded(vcpu) ((vcpu)->arch.sysregs_loaded_on_cpu)
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/virt/kvm/arm/aarch32.c b/virt/kvm/arm/aarch32.c
index 0a356aa91aa1..f2047fc69006 100644
--- a/virt/kvm/arm/aarch32.c
+++ b/virt/kvm/arm/aarch32.c
@@ -33,6 +33,26 @@ static const u8 return_offsets[8][2] = {
[7] = { 4, 4 }, /* FIQ, unused */
};
+static bool pre_fault_synchronize(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ if (kvm_arm_vcpu_loaded(vcpu)) {
+ kvm_arch_vcpu_put(vcpu);
+ return true;
+ }
+
+ preempt_enable();
+ return false;
+}
+
+static void post_fault_synchronize(struct kvm_vcpu *vcpu, bool loaded)
+{
+ if (loaded) {
+ kvm_arch_vcpu_load(vcpu, smp_processor_id());
+ preempt_enable();
+ }
+}
+
/*
* When an exception is taken, most CPSR fields are left unchanged in the
* handler. However, some are explicitly overridden (e.g. M[4:0]).
@@ -155,7 +175,10 @@ static void prepare_fault32(struct kvm_vcpu *vcpu, u32 mode, u32 vect_offset)
void kvm_inject_undef32(struct kvm_vcpu *vcpu)
{
+ bool loaded = pre_fault_synchronize(vcpu);
+
prepare_fault32(vcpu, PSR_AA32_MODE_UND, 4);
+ post_fault_synchronize(vcpu, loaded);
}
/*
@@ -168,6 +191,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
u32 vect_offset;
u32 *far, *fsr;
bool is_lpae;
+ bool loaded;
+
+ loaded = pre_fault_synchronize(vcpu);
if (is_pabt) {
vect_offset = 12;
@@ -191,6 +217,8 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
/* no need to shuffle FS[4] into DFSR[10] as its 0 */
*fsr = DFSR_FSC_EXTABT_nLPAE;
}
+
+ post_fault_synchronize(vcpu, loaded);
}
void kvm_inject_dabt32(struct kvm_vcpu *vcpu, unsigned long addr)
--
2.27.0
commit 0370964dd3ff7d3d406f292cb443a927952cbd05 upstream
On a VHE system, the EL1 state is left in the CPU most of the time,
and only syncronized back to memory when vcpu_put() is called (most
of the time on preemption).
Which means that when injecting an exception, we'd better have a way
to either:
(1) write directly to the EL1 sysregs
(2) synchronize the state back to memory, and do the changes there
For an AArch64, we already do (1), so we are safe. Unfortunately,
doing the same thing for AArch32 would be pretty invasive. Instead,
we can easily implement (2) by calling the put/load architectural
backends, and keep preemption disabled. We can then reload the
state back into EL1.
Cc: stable(a)vger.kernel.org
Reported-by: James Morse <james.morse(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
---
arch/arm/include/asm/kvm_host.h | 2 ++
arch/arm64/include/asm/kvm_host.h | 2 ++
virt/kvm/arm/aarch32.c | 28 ++++++++++++++++++++++++++++
3 files changed, 32 insertions(+)
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 8a37c8e89777..1b179b1f46bc 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -421,4 +421,6 @@ static inline bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu)
return true;
}
+#define kvm_arm_vcpu_loaded(vcpu) (false)
+
#endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index f656169db8c3..441899f0fb36 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -677,4 +677,6 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
#define kvm_arm_vcpu_sve_finalized(vcpu) \
((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
+#define kvm_arm_vcpu_loaded(vcpu) ((vcpu)->arch.sysregs_loaded_on_cpu)
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/virt/kvm/arm/aarch32.c b/virt/kvm/arm/aarch32.c
index 0a356aa91aa1..f2047fc69006 100644
--- a/virt/kvm/arm/aarch32.c
+++ b/virt/kvm/arm/aarch32.c
@@ -33,6 +33,26 @@ static const u8 return_offsets[8][2] = {
[7] = { 4, 4 }, /* FIQ, unused */
};
+static bool pre_fault_synchronize(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ if (kvm_arm_vcpu_loaded(vcpu)) {
+ kvm_arch_vcpu_put(vcpu);
+ return true;
+ }
+
+ preempt_enable();
+ return false;
+}
+
+static void post_fault_synchronize(struct kvm_vcpu *vcpu, bool loaded)
+{
+ if (loaded) {
+ kvm_arch_vcpu_load(vcpu, smp_processor_id());
+ preempt_enable();
+ }
+}
+
/*
* When an exception is taken, most CPSR fields are left unchanged in the
* handler. However, some are explicitly overridden (e.g. M[4:0]).
@@ -155,7 +175,10 @@ static void prepare_fault32(struct kvm_vcpu *vcpu, u32 mode, u32 vect_offset)
void kvm_inject_undef32(struct kvm_vcpu *vcpu)
{
+ bool loaded = pre_fault_synchronize(vcpu);
+
prepare_fault32(vcpu, PSR_AA32_MODE_UND, 4);
+ post_fault_synchronize(vcpu, loaded);
}
/*
@@ -168,6 +191,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
u32 vect_offset;
u32 *far, *fsr;
bool is_lpae;
+ bool loaded;
+
+ loaded = pre_fault_synchronize(vcpu);
if (is_pabt) {
vect_offset = 12;
@@ -191,6 +217,8 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
/* no need to shuffle FS[4] into DFSR[10] as its 0 */
*fsr = DFSR_FSC_EXTABT_nLPAE;
}
+
+ post_fault_synchronize(vcpu, loaded);
}
void kvm_inject_dabt32(struct kvm_vcpu *vcpu, unsigned long addr)
--
2.27.0
commit 0370964dd3ff7d3d406f292cb443a927952cbd05 upstream
On a VHE system, the EL1 state is left in the CPU most of the time,
and only syncronized back to memory when vcpu_put() is called (most
of the time on preemption).
Which means that when injecting an exception, we'd better have a way
to either:
(1) write directly to the EL1 sysregs
(2) synchronize the state back to memory, and do the changes there
For an AArch64, we already do (1), so we are safe. Unfortunately,
doing the same thing for AArch32 would be pretty invasive. Instead,
we can easily implement (2) by calling the put/load architectural
backends, and keep preemption disabled. We can then reload the
state back into EL1.
Cc: stable(a)vger.kernel.org
Reported-by: James Morse <james.morse(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
---
virt/kvm/arm/aarch32.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/virt/kvm/arm/aarch32.c b/virt/kvm/arm/aarch32.c
index 0a356aa91aa1..40a62a99fbf8 100644
--- a/virt/kvm/arm/aarch32.c
+++ b/virt/kvm/arm/aarch32.c
@@ -33,6 +33,26 @@ static const u8 return_offsets[8][2] = {
[7] = { 4, 4 }, /* FIQ, unused */
};
+static bool pre_fault_synchronize(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ if (vcpu->arch.sysregs_loaded_on_cpu) {
+ kvm_arch_vcpu_put(vcpu);
+ return true;
+ }
+
+ preempt_enable();
+ return false;
+}
+
+static void post_fault_synchronize(struct kvm_vcpu *vcpu, bool loaded)
+{
+ if (loaded) {
+ kvm_arch_vcpu_load(vcpu, smp_processor_id());
+ preempt_enable();
+ }
+}
+
/*
* When an exception is taken, most CPSR fields are left unchanged in the
* handler. However, some are explicitly overridden (e.g. M[4:0]).
@@ -155,7 +175,10 @@ static void prepare_fault32(struct kvm_vcpu *vcpu, u32 mode, u32 vect_offset)
void kvm_inject_undef32(struct kvm_vcpu *vcpu)
{
+ bool loaded = pre_fault_synchronize(vcpu);
+
prepare_fault32(vcpu, PSR_AA32_MODE_UND, 4);
+ post_fault_synchronize(vcpu, loaded);
}
/*
@@ -168,6 +191,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
u32 vect_offset;
u32 *far, *fsr;
bool is_lpae;
+ bool loaded;
+
+ loaded = pre_fault_synchronize(vcpu);
if (is_pabt) {
vect_offset = 12;
@@ -191,6 +217,8 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
/* no need to shuffle FS[4] into DFSR[10] as its 0 */
*fsr = DFSR_FSC_EXTABT_nLPAE;
}
+
+ post_fault_synchronize(vcpu, loaded);
}
void kvm_inject_dabt32(struct kvm_vcpu *vcpu, unsigned long addr)
--
2.27.0
commit 95e2c5b0fd6d7a022f37e7c762ea093aba7b8e34 upstream
platform/x86: sony-laptop: SNC calls should handle BUFFER types
After commit 6d232b29cfce ("ACPICA: Dispatcher: always generate buffer
objects for ASL create_field() operator") ACPICA creates buffers even
when new fields are small enough to fit into an integer.
Many SNC calls counted on the old behaviour.
Since sony-laptop already handles the INTEGER/BUFFER case in
sony_nc_buffer_call, switch sony_nc_int_call to use its more generic
function instead.
Fixes: 6d232b29cfce ("ACPICA: Dispatcher: always generate buffer objects for ASL create_field() operator")
Reported-by: Dominik Mierzejewski <dominik(a)greysector.net>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207491
Reported-by: William Bader <williambader(a)hotmail.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1830150
Signed-off-by: Mattia Dongili <malattia(a)linux.it>
---
ACPICA broke resume from s2ram for some Vaio laptops in 5.6. The
platform drivers maintainers didn't send the fix for 5.7, so the commit
is needed both on 5.6 and on 5.7.
Thanks
--
mattia
:wq!
Especially with memory hotplug, we can have offline sections (with a
garbage memmap) and overlapping zones. We have to make sure to only
touch initialized memmaps (online sections managed by the buddy) and that
the zone matches, to not move pages between zones.
To test if this can actually happen, I added a simple
BUG_ON(page_zone(page_i) != page_zone(page_j));
right before the swap. When hotplugging a 256M DIMM to a 4G x86-64 VM and
onlining the first memory block "online_movable" and the second memory
block "online_kernel", it will trigger the BUG, as both zones (NORMAL
and MOVABLE) overlap.
This might result in all kinds of weird situations (e.g., double
allocations, list corruptions, unmovable allocations ending up in the
movable zone).
Fixes: e900a918b098 ("mm: shuffle initial free memory to improve memory-side-cache utilization")
Cc: stable(a)vger.kernel.org # v5.2+
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: Wei Yang <richard.weiyang(a)gmail.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Signed-off-by: David Hildenbrand <david(a)redhat.com>
---
mm/shuffle.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/shuffle.c b/mm/shuffle.c
index 44406d9977c77..dd13ab851b3ee 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -58,25 +58,25 @@ module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
* For two pages to be swapped in the shuffle, they must be free (on a
* 'free_area' lru), have the same order, and have the same migratetype.
*/
-static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
+static struct page * __meminit shuffle_valid_page(struct zone *zone,
+ unsigned long pfn, int order)
{
- struct page *page;
+ struct page *page = pfn_to_online_page(pfn);
/*
* Given we're dealing with randomly selected pfns in a zone we
* need to ask questions like...
*/
- /* ...is the pfn even in the memmap? */
- if (!pfn_valid_within(pfn))
+ /* ... is the page managed by the buddy? */
+ if (!page)
return NULL;
- /* ...is the pfn in a present section or a hole? */
- if (!pfn_in_present_section(pfn))
+ /* ... is the page assigned to the same zone? */
+ if (page_zone(page) != zone)
return NULL;
/* ...is the page free and currently on a free_area list? */
- page = pfn_to_page(pfn);
if (!PageBuddy(page))
return NULL;
@@ -123,7 +123,7 @@ void __meminit __shuffle_zone(struct zone *z)
* page_j randomly selected in the span @zone_start_pfn to
* @spanned_pages.
*/
- page_i = shuffle_valid_page(i, order);
+ page_i = shuffle_valid_page(z, i, order);
if (!page_i)
continue;
@@ -137,7 +137,7 @@ void __meminit __shuffle_zone(struct zone *z)
j = z->zone_start_pfn +
ALIGN_DOWN(get_random_long() % z->spanned_pages,
order_pages);
- page_j = shuffle_valid_page(j, order);
+ page_j = shuffle_valid_page(z, j, order);
if (page_j && page_j != page_i)
break;
}
--
2.26.2
Hi
The issue fixed with da2311a6385c ("can: kvaser_usb: kvaser_usb_leaf:
Fix some info-leaks to USB devices") seem to be present as well before
4.19, introduced in 3.8 by commit 080f40a6fa28 "can: kvaser_usb: Add
support for Kvaser CAN/USB devices" already.
For Debian (for 4.9.210-1 upload) and for 3.16.y upstream Ben
Hutchings did backport the commit.
Can you apply it please as well for v4.9.y?
Regards,
Salvatore
>From fe8da238df185e6b55301eea0c762e1ab04df625 Mon Sep 17 00:00:00 2001
From: Xiaolong Huang <butterflyhuangxx(a)gmail.com>
Date: Sat, 7 Dec 2019 22:40:24 +0800
Subject: [PATCH] can: kvaser_usb: kvaser_usb_leaf: Fix some info-leaks to USB
devices
Uninitialized Kernel memory can leak to USB devices.
Fix this by using kzalloc() instead of kmalloc().
Signed-off-by: Xiaolong Huang <butterflyhuangxx(a)gmail.com>
Fixes: 7259124eac7d ("can: kvaser_usb: Split driver into kvaser_usb_core.c and kvaser_usb_leaf.c")
Cc: linux-stable <stable(a)vger.kernel.org> # >= v4.19
Signed-off-by: Marc Kleine-Budde <mkl(a)pengutronix.de>
[bwh: Backported to 4.9: adjust filename, context]
Signed-off-by: Ben Hutchings <ben(a)decadent.org.uk>
---
drivers/net/can/usb/kvaser_usb.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/can/usb/kvaser_usb.c b/drivers/net/can/usb/kvaser_usb.c
index 3a75352f632b..792a1afabf5d 100644
--- a/drivers/net/can/usb/kvaser_usb.c
+++ b/drivers/net/can/usb/kvaser_usb.c
@@ -791,7 +791,7 @@ static int kvaser_usb_simple_msg_async(struct kvaser_usb_net_priv *priv,
if (!urb)
return -ENOMEM;
- buf = kmalloc(sizeof(struct kvaser_msg), GFP_ATOMIC);
+ buf = kzalloc(sizeof(struct kvaser_msg), GFP_ATOMIC);
if (!buf) {
usb_free_urb(urb);
return -ENOMEM;
@@ -1459,7 +1459,7 @@ static int kvaser_usb_set_opt_mode(const struct kvaser_usb_net_priv *priv)
struct kvaser_msg *msg;
int rc;
- msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+ msg = kzalloc(sizeof(*msg), GFP_KERNEL);
if (!msg)
return -ENOMEM;
@@ -1592,7 +1592,7 @@ static int kvaser_usb_flush_queue(struct kvaser_usb_net_priv *priv)
struct kvaser_msg *msg;
int rc;
- msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+ msg = kzalloc(sizeof(*msg), GFP_KERNEL);
if (!msg)
return -ENOMEM;
--
2.27.0
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 14ed1c908a7a623cc0cbf0203f8201d1b7d31d16 Mon Sep 17 00:00:00 2001
From: Harry Wentland <harry.wentland(a)amd.com>
Date: Thu, 28 May 2020 09:44:44 -0400
Subject: [PATCH] Revert "drm/amd/display: disable dcn20 abm feature for bring
up"
This reverts commit 96cb7cf13d8530099c256c053648ad576588c387.
This change was used for DCN2 bringup and is no longer desired.
In fact it breaks backlight on DCN2 systems.
Cc: Alexander Monakov <amonakov(a)ispras.ru>
Cc: Hersen Wu <hersenxs.wu(a)amd.com>
Cc: Anthony Koo <Anthony.Koo(a)amd.com>
Cc: Michael Chiu <Michael.Chiu(a)amd.com>
Signed-off-by: Harry Wentland <harry.wentland(a)amd.com>
Acked-by: Alex Deucher <alexander.deucher(a)amd.com>
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
Reported-and-tested-by: Alexander Monakov <amonakov(a)ispras.ru>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index d53c60b37cc6..f42e7e67ddba 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1356,7 +1356,7 @@ static int dm_late_init(void *handle)
unsigned int linear_lut[16];
int i;
struct dmcu *dmcu = NULL;
- bool ret = false;
+ bool ret;
if (!adev->dm.fw_dmcu)
return detect_mst_link_for_all_connectors(adev->ddev);
@@ -1377,13 +1377,10 @@ static int dm_late_init(void *handle)
*/
params.min_abm_backlight = 0x28F;
- /* todo will enable for navi10 */
- if (adev->asic_type <= CHIP_RAVEN) {
- ret = dmcu_load_iram(dmcu, params);
+ ret = dmcu_load_iram(dmcu, params);
- if (!ret)
- return -EINVAL;
- }
+ if (!ret)
+ return -EINVAL;
return detect_mst_link_for_all_connectors(adev->ddev);
}
The patch below does not apply to the 5.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 14ed1c908a7a623cc0cbf0203f8201d1b7d31d16 Mon Sep 17 00:00:00 2001
From: Harry Wentland <harry.wentland(a)amd.com>
Date: Thu, 28 May 2020 09:44:44 -0400
Subject: [PATCH] Revert "drm/amd/display: disable dcn20 abm feature for bring
up"
This reverts commit 96cb7cf13d8530099c256c053648ad576588c387.
This change was used for DCN2 bringup and is no longer desired.
In fact it breaks backlight on DCN2 systems.
Cc: Alexander Monakov <amonakov(a)ispras.ru>
Cc: Hersen Wu <hersenxs.wu(a)amd.com>
Cc: Anthony Koo <Anthony.Koo(a)amd.com>
Cc: Michael Chiu <Michael.Chiu(a)amd.com>
Signed-off-by: Harry Wentland <harry.wentland(a)amd.com>
Acked-by: Alex Deucher <alexander.deucher(a)amd.com>
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
Reported-and-tested-by: Alexander Monakov <amonakov(a)ispras.ru>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index d53c60b37cc6..f42e7e67ddba 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1356,7 +1356,7 @@ static int dm_late_init(void *handle)
unsigned int linear_lut[16];
int i;
struct dmcu *dmcu = NULL;
- bool ret = false;
+ bool ret;
if (!adev->dm.fw_dmcu)
return detect_mst_link_for_all_connectors(adev->ddev);
@@ -1377,13 +1377,10 @@ static int dm_late_init(void *handle)
*/
params.min_abm_backlight = 0x28F;
- /* todo will enable for navi10 */
- if (adev->asic_type <= CHIP_RAVEN) {
- ret = dmcu_load_iram(dmcu, params);
+ ret = dmcu_load_iram(dmcu, params);
- if (!ret)
- return -EINVAL;
- }
+ if (!ret)
+ return -EINVAL;
return detect_mst_link_for_all_connectors(adev->ddev);
}
The patch below does not apply to the 5.7-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 14ed1c908a7a623cc0cbf0203f8201d1b7d31d16 Mon Sep 17 00:00:00 2001
From: Harry Wentland <harry.wentland(a)amd.com>
Date: Thu, 28 May 2020 09:44:44 -0400
Subject: [PATCH] Revert "drm/amd/display: disable dcn20 abm feature for bring
up"
This reverts commit 96cb7cf13d8530099c256c053648ad576588c387.
This change was used for DCN2 bringup and is no longer desired.
In fact it breaks backlight on DCN2 systems.
Cc: Alexander Monakov <amonakov(a)ispras.ru>
Cc: Hersen Wu <hersenxs.wu(a)amd.com>
Cc: Anthony Koo <Anthony.Koo(a)amd.com>
Cc: Michael Chiu <Michael.Chiu(a)amd.com>
Signed-off-by: Harry Wentland <harry.wentland(a)amd.com>
Acked-by: Alex Deucher <alexander.deucher(a)amd.com>
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
Reported-and-tested-by: Alexander Monakov <amonakov(a)ispras.ru>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index d53c60b37cc6..f42e7e67ddba 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1356,7 +1356,7 @@ static int dm_late_init(void *handle)
unsigned int linear_lut[16];
int i;
struct dmcu *dmcu = NULL;
- bool ret = false;
+ bool ret;
if (!adev->dm.fw_dmcu)
return detect_mst_link_for_all_connectors(adev->ddev);
@@ -1377,13 +1377,10 @@ static int dm_late_init(void *handle)
*/
params.min_abm_backlight = 0x28F;
- /* todo will enable for navi10 */
- if (adev->asic_type <= CHIP_RAVEN) {
- ret = dmcu_load_iram(dmcu, params);
+ ret = dmcu_load_iram(dmcu, params);
- if (!ret)
- return -EINVAL;
- }
+ if (!ret)
+ return -EINVAL;
return detect_mst_link_for_all_connectors(adev->ddev);
}
The patch below does not apply to the 5.7-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 96cb7cf13d8530099c256c053648ad576588c387 Mon Sep 17 00:00:00 2001
From: hersen wu <hersenxs.wu(a)amd.com>
Date: Thu, 28 Feb 2019 16:35:24 -0500
Subject: [PATCH] drm/amd/display: disable dcn20 abm feature for bring up
[WHY] dcn20 enable usb-c dp ALT mode in dmcu. There is bug
when enable abm feature which cause system crash. dal team
will debug this bug later.
[HOW] disable dcn abm feature for dcn20.
Signed-off-by: hersen wu <hersenxs.wu(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 5971aef4f033..72d14f680932 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -781,7 +781,7 @@ static int dm_late_init(void *handle)
unsigned int linear_lut[16];
int i;
struct dmcu *dmcu = adev->dm.dc->res_pool->dmcu;
- bool ret;
+ bool ret = false;
for (i = 0; i < 16; i++)
linear_lut[i] = 0xFFFF * i / 15;
@@ -792,10 +792,13 @@ static int dm_late_init(void *handle)
params.backlight_lut_array_size = 16;
params.backlight_lut_array = linear_lut;
- ret = dmcu_load_iram(dmcu, params);
+ /* todo will enable for navi10 */
+ if (adev->asic_type <= CHIP_RAVEN) {
+ ret = dmcu_load_iram(dmcu, params);
- if (!ret)
- return -EINVAL;
+ if (!ret)
+ return -EINVAL;
+ }
return detect_mst_link_for_all_connectors(adev->ddev);
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 21998a351512eba4ed5969006f0c55882d995ada Mon Sep 17 00:00:00 2001
From: Anthony Steinhauser <asteinhauser(a)google.com>
Date: Tue, 19 May 2020 06:40:42 -0700
Subject: [PATCH] x86/speculation: Avoid force-disabling IBPB based on STIBP
and enhanced IBRS.
When STIBP is unavailable or enhanced IBRS is available, Linux
force-disables the IBPB mitigation of Spectre-BTB even when simultaneous
multithreading is disabled. While attempts to enable IBPB using
prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, ...) fail with
EPERM, the seccomp syscall (or its prctl(PR_SET_SECCOMP, ...) equivalent)
which are used e.g. by Chromium or OpenSSH succeed with no errors but the
application remains silently vulnerable to cross-process Spectre v2 attacks
(classical BTB poisoning). At the same time the SYSFS reporting
(/sys/devices/system/cpu/vulnerabilities/spectre_v2) displays that IBPB is
conditionally enabled when in fact it is unconditionally disabled.
STIBP is useful only when SMT is enabled. When SMT is disabled and STIBP is
unavailable, it makes no sense to force-disable also IBPB, because IBPB
protects against cross-process Spectre-BTB attacks regardless of the SMT
state. At the same time since missing STIBP was only observed on AMD CPUs,
AMD does not recommend using STIBP, but recommends using IBPB, so disabling
IBPB because of missing STIBP goes directly against AMD's advice:
https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Upda…
Similarly, enhanced IBRS is designed to protect cross-core BTB poisoning
and BTB-poisoning attacks from user space against kernel (and
BTB-poisoning attacks from guest against hypervisor), it is not designed
to prevent cross-process (or cross-VM) BTB poisoning between processes (or
VMs) running on the same core. Therefore, even with enhanced IBRS it is
necessary to flush the BTB during context-switches, so there is no reason
to force disable IBPB when enhanced IBRS is available.
Enable the prctl control of IBPB even when STIBP is unavailable or enhanced
IBRS is available.
Fixes: 7cc765a67d8e ("x86/speculation: Enable prctl mode for spectre_v2_user")
Signed-off-by: Anthony Steinhauser <asteinhauser(a)google.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index ed54b3b21c39..8d57562b1d2c 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -495,7 +495,9 @@ early_param("nospectre_v1", nospectre_v1_cmdline);
static enum spectre_v2_mitigation spectre_v2_enabled __ro_after_init =
SPECTRE_V2_NONE;
-static enum spectre_v2_user_mitigation spectre_v2_user __ro_after_init =
+static enum spectre_v2_user_mitigation spectre_v2_user_stibp __ro_after_init =
+ SPECTRE_V2_USER_NONE;
+static enum spectre_v2_user_mitigation spectre_v2_user_ibpb __ro_after_init =
SPECTRE_V2_USER_NONE;
#ifdef CONFIG_RETPOLINE
@@ -641,15 +643,6 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd)
break;
}
- /*
- * At this point, an STIBP mode other than "off" has been set.
- * If STIBP support is not being forced, check if STIBP always-on
- * is preferred.
- */
- if (mode != SPECTRE_V2_USER_STRICT &&
- boot_cpu_has(X86_FEATURE_AMD_STIBP_ALWAYS_ON))
- mode = SPECTRE_V2_USER_STRICT_PREFERRED;
-
/* Initialize Indirect Branch Prediction Barrier */
if (boot_cpu_has(X86_FEATURE_IBPB)) {
setup_force_cpu_cap(X86_FEATURE_USE_IBPB);
@@ -672,23 +665,36 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd)
pr_info("mitigation: Enabling %s Indirect Branch Prediction Barrier\n",
static_key_enabled(&switch_mm_always_ibpb) ?
"always-on" : "conditional");
+
+ spectre_v2_user_ibpb = mode;
}
- /* If enhanced IBRS is enabled no STIBP required */
- if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
+ /*
+ * If enhanced IBRS is enabled or SMT impossible, STIBP is not
+ * required.
+ */
+ if (!smt_possible || spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
return;
/*
- * If SMT is not possible or STIBP is not available clear the STIBP
- * mode.
+ * At this point, an STIBP mode other than "off" has been set.
+ * If STIBP support is not being forced, check if STIBP always-on
+ * is preferred.
+ */
+ if (mode != SPECTRE_V2_USER_STRICT &&
+ boot_cpu_has(X86_FEATURE_AMD_STIBP_ALWAYS_ON))
+ mode = SPECTRE_V2_USER_STRICT_PREFERRED;
+
+ /*
+ * If STIBP is not available, clear the STIBP mode.
*/
- if (!smt_possible || !boot_cpu_has(X86_FEATURE_STIBP))
+ if (!boot_cpu_has(X86_FEATURE_STIBP))
mode = SPECTRE_V2_USER_NONE;
+
+ spectre_v2_user_stibp = mode;
+
set_mode:
- spectre_v2_user = mode;
- /* Only print the STIBP mode when SMT possible */
- if (smt_possible)
- pr_info("%s\n", spectre_v2_user_strings[mode]);
+ pr_info("%s\n", spectre_v2_user_strings[mode]);
}
static const char * const spectre_v2_strings[] = {
@@ -921,7 +927,7 @@ void cpu_bugs_smt_update(void)
{
mutex_lock(&spec_ctrl_mutex);
- switch (spectre_v2_user) {
+ switch (spectre_v2_user_stibp) {
case SPECTRE_V2_USER_NONE:
break;
case SPECTRE_V2_USER_STRICT:
@@ -1164,14 +1170,16 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
{
switch (ctrl) {
case PR_SPEC_ENABLE:
- if (spectre_v2_user == SPECTRE_V2_USER_NONE)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return 0;
/*
* Indirect branch speculation is always disabled in strict
* mode.
*/
- if (spectre_v2_user == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user == SPECTRE_V2_USER_STRICT_PREFERRED)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
return -EPERM;
task_clear_spec_ib_disable(task);
task_update_spec_tif(task);
@@ -1182,10 +1190,12 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
* Indirect branch speculation is always allowed when
* mitigation is force disabled.
*/
- if (spectre_v2_user == SPECTRE_V2_USER_NONE)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return -EPERM;
- if (spectre_v2_user == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user == SPECTRE_V2_USER_STRICT_PREFERRED)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
return 0;
task_set_spec_ib_disable(task);
if (ctrl == PR_SPEC_FORCE_DISABLE)
@@ -1216,7 +1226,8 @@ void arch_seccomp_spec_mitigate(struct task_struct *task)
{
if (ssb_mode == SPEC_STORE_BYPASS_SECCOMP)
ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE);
- if (spectre_v2_user == SPECTRE_V2_USER_SECCOMP)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_SECCOMP ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_SECCOMP)
ib_prctl_set(task, PR_SPEC_FORCE_DISABLE);
}
#endif
@@ -1247,22 +1258,24 @@ static int ib_prctl_get(struct task_struct *task)
if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
return PR_SPEC_NOT_AFFECTED;
- switch (spectre_v2_user) {
- case SPECTRE_V2_USER_NONE:
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return PR_SPEC_ENABLE;
- case SPECTRE_V2_USER_PRCTL:
- case SPECTRE_V2_USER_SECCOMP:
+ else if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
+ return PR_SPEC_DISABLE;
+ else if (spectre_v2_user_ibpb == SPECTRE_V2_USER_PRCTL ||
+ spectre_v2_user_ibpb == SPECTRE_V2_USER_SECCOMP ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_PRCTL ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_SECCOMP) {
if (task_spec_ib_force_disable(task))
return PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE;
if (task_spec_ib_disable(task))
return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
- case SPECTRE_V2_USER_STRICT:
- case SPECTRE_V2_USER_STRICT_PREFERRED:
- return PR_SPEC_DISABLE;
- default:
+ } else
return PR_SPEC_NOT_AFFECTED;
- }
}
int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
@@ -1501,7 +1514,7 @@ static char *stibp_state(void)
if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
return "";
- switch (spectre_v2_user) {
+ switch (spectre_v2_user_stibp) {
case SPECTRE_V2_USER_NONE:
return ", STIBP: disabled";
case SPECTRE_V2_USER_STRICT:
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0370964dd3ff7d3d406f292cb443a927952cbd05 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Tue, 9 Jun 2020 08:50:29 +0100
Subject: [PATCH] KVM: arm64: Synchronize sysreg state on injecting an AArch32
exception
On a VHE system, the EL1 state is left in the CPU most of the time,
and only syncronized back to memory when vcpu_put() is called (most
of the time on preemption).
Which means that when injecting an exception, we'd better have a way
to either:
(1) write directly to the EL1 sysregs
(2) synchronize the state back to memory, and do the changes there
For an AArch64, we already do (1), so we are safe. Unfortunately,
doing the same thing for AArch32 would be pretty invasive. Instead,
we can easily implement (2) by calling the put/load architectural
backends, and keep preemption disabled. We can then reload the
state back into EL1.
Cc: stable(a)vger.kernel.org
Reported-by: James Morse <james.morse(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
diff --git a/arch/arm64/kvm/aarch32.c b/arch/arm64/kvm/aarch32.c
index 0a356aa91aa1..40a62a99fbf8 100644
--- a/arch/arm64/kvm/aarch32.c
+++ b/arch/arm64/kvm/aarch32.c
@@ -33,6 +33,26 @@ static const u8 return_offsets[8][2] = {
[7] = { 4, 4 }, /* FIQ, unused */
};
+static bool pre_fault_synchronize(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ if (vcpu->arch.sysregs_loaded_on_cpu) {
+ kvm_arch_vcpu_put(vcpu);
+ return true;
+ }
+
+ preempt_enable();
+ return false;
+}
+
+static void post_fault_synchronize(struct kvm_vcpu *vcpu, bool loaded)
+{
+ if (loaded) {
+ kvm_arch_vcpu_load(vcpu, smp_processor_id());
+ preempt_enable();
+ }
+}
+
/*
* When an exception is taken, most CPSR fields are left unchanged in the
* handler. However, some are explicitly overridden (e.g. M[4:0]).
@@ -155,7 +175,10 @@ static void prepare_fault32(struct kvm_vcpu *vcpu, u32 mode, u32 vect_offset)
void kvm_inject_undef32(struct kvm_vcpu *vcpu)
{
+ bool loaded = pre_fault_synchronize(vcpu);
+
prepare_fault32(vcpu, PSR_AA32_MODE_UND, 4);
+ post_fault_synchronize(vcpu, loaded);
}
/*
@@ -168,6 +191,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
u32 vect_offset;
u32 *far, *fsr;
bool is_lpae;
+ bool loaded;
+
+ loaded = pre_fault_synchronize(vcpu);
if (is_pabt) {
vect_offset = 12;
@@ -191,6 +217,8 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
/* no need to shuffle FS[4] into DFSR[10] as its 0 */
*fsr = DFSR_FSC_EXTABT_nLPAE;
}
+
+ post_fault_synchronize(vcpu, loaded);
}
void kvm_inject_dabt32(struct kvm_vcpu *vcpu, unsigned long addr)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0370964dd3ff7d3d406f292cb443a927952cbd05 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Tue, 9 Jun 2020 08:50:29 +0100
Subject: [PATCH] KVM: arm64: Synchronize sysreg state on injecting an AArch32
exception
On a VHE system, the EL1 state is left in the CPU most of the time,
and only syncronized back to memory when vcpu_put() is called (most
of the time on preemption).
Which means that when injecting an exception, we'd better have a way
to either:
(1) write directly to the EL1 sysregs
(2) synchronize the state back to memory, and do the changes there
For an AArch64, we already do (1), so we are safe. Unfortunately,
doing the same thing for AArch32 would be pretty invasive. Instead,
we can easily implement (2) by calling the put/load architectural
backends, and keep preemption disabled. We can then reload the
state back into EL1.
Cc: stable(a)vger.kernel.org
Reported-by: James Morse <james.morse(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
diff --git a/arch/arm64/kvm/aarch32.c b/arch/arm64/kvm/aarch32.c
index 0a356aa91aa1..40a62a99fbf8 100644
--- a/arch/arm64/kvm/aarch32.c
+++ b/arch/arm64/kvm/aarch32.c
@@ -33,6 +33,26 @@ static const u8 return_offsets[8][2] = {
[7] = { 4, 4 }, /* FIQ, unused */
};
+static bool pre_fault_synchronize(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ if (vcpu->arch.sysregs_loaded_on_cpu) {
+ kvm_arch_vcpu_put(vcpu);
+ return true;
+ }
+
+ preempt_enable();
+ return false;
+}
+
+static void post_fault_synchronize(struct kvm_vcpu *vcpu, bool loaded)
+{
+ if (loaded) {
+ kvm_arch_vcpu_load(vcpu, smp_processor_id());
+ preempt_enable();
+ }
+}
+
/*
* When an exception is taken, most CPSR fields are left unchanged in the
* handler. However, some are explicitly overridden (e.g. M[4:0]).
@@ -155,7 +175,10 @@ static void prepare_fault32(struct kvm_vcpu *vcpu, u32 mode, u32 vect_offset)
void kvm_inject_undef32(struct kvm_vcpu *vcpu)
{
+ bool loaded = pre_fault_synchronize(vcpu);
+
prepare_fault32(vcpu, PSR_AA32_MODE_UND, 4);
+ post_fault_synchronize(vcpu, loaded);
}
/*
@@ -168,6 +191,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
u32 vect_offset;
u32 *far, *fsr;
bool is_lpae;
+ bool loaded;
+
+ loaded = pre_fault_synchronize(vcpu);
if (is_pabt) {
vect_offset = 12;
@@ -191,6 +217,8 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
/* no need to shuffle FS[4] into DFSR[10] as its 0 */
*fsr = DFSR_FSC_EXTABT_nLPAE;
}
+
+ post_fault_synchronize(vcpu, loaded);
}
void kvm_inject_dabt32(struct kvm_vcpu *vcpu, unsigned long addr)
The patch below does not apply to the 5.7-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0370964dd3ff7d3d406f292cb443a927952cbd05 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Tue, 9 Jun 2020 08:50:29 +0100
Subject: [PATCH] KVM: arm64: Synchronize sysreg state on injecting an AArch32
exception
On a VHE system, the EL1 state is left in the CPU most of the time,
and only syncronized back to memory when vcpu_put() is called (most
of the time on preemption).
Which means that when injecting an exception, we'd better have a way
to either:
(1) write directly to the EL1 sysregs
(2) synchronize the state back to memory, and do the changes there
For an AArch64, we already do (1), so we are safe. Unfortunately,
doing the same thing for AArch32 would be pretty invasive. Instead,
we can easily implement (2) by calling the put/load architectural
backends, and keep preemption disabled. We can then reload the
state back into EL1.
Cc: stable(a)vger.kernel.org
Reported-by: James Morse <james.morse(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
diff --git a/arch/arm64/kvm/aarch32.c b/arch/arm64/kvm/aarch32.c
index 0a356aa91aa1..40a62a99fbf8 100644
--- a/arch/arm64/kvm/aarch32.c
+++ b/arch/arm64/kvm/aarch32.c
@@ -33,6 +33,26 @@ static const u8 return_offsets[8][2] = {
[7] = { 4, 4 }, /* FIQ, unused */
};
+static bool pre_fault_synchronize(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ if (vcpu->arch.sysregs_loaded_on_cpu) {
+ kvm_arch_vcpu_put(vcpu);
+ return true;
+ }
+
+ preempt_enable();
+ return false;
+}
+
+static void post_fault_synchronize(struct kvm_vcpu *vcpu, bool loaded)
+{
+ if (loaded) {
+ kvm_arch_vcpu_load(vcpu, smp_processor_id());
+ preempt_enable();
+ }
+}
+
/*
* When an exception is taken, most CPSR fields are left unchanged in the
* handler. However, some are explicitly overridden (e.g. M[4:0]).
@@ -155,7 +175,10 @@ static void prepare_fault32(struct kvm_vcpu *vcpu, u32 mode, u32 vect_offset)
void kvm_inject_undef32(struct kvm_vcpu *vcpu)
{
+ bool loaded = pre_fault_synchronize(vcpu);
+
prepare_fault32(vcpu, PSR_AA32_MODE_UND, 4);
+ post_fault_synchronize(vcpu, loaded);
}
/*
@@ -168,6 +191,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
u32 vect_offset;
u32 *far, *fsr;
bool is_lpae;
+ bool loaded;
+
+ loaded = pre_fault_synchronize(vcpu);
if (is_pabt) {
vect_offset = 12;
@@ -191,6 +217,8 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
/* no need to shuffle FS[4] into DFSR[10] as its 0 */
*fsr = DFSR_FSC_EXTABT_nLPAE;
}
+
+ post_fault_synchronize(vcpu, loaded);
}
void kvm_inject_dabt32(struct kvm_vcpu *vcpu, unsigned long addr)
The patch below does not apply to the 5.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0370964dd3ff7d3d406f292cb443a927952cbd05 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Tue, 9 Jun 2020 08:50:29 +0100
Subject: [PATCH] KVM: arm64: Synchronize sysreg state on injecting an AArch32
exception
On a VHE system, the EL1 state is left in the CPU most of the time,
and only syncronized back to memory when vcpu_put() is called (most
of the time on preemption).
Which means that when injecting an exception, we'd better have a way
to either:
(1) write directly to the EL1 sysregs
(2) synchronize the state back to memory, and do the changes there
For an AArch64, we already do (1), so we are safe. Unfortunately,
doing the same thing for AArch32 would be pretty invasive. Instead,
we can easily implement (2) by calling the put/load architectural
backends, and keep preemption disabled. We can then reload the
state back into EL1.
Cc: stable(a)vger.kernel.org
Reported-by: James Morse <james.morse(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
diff --git a/arch/arm64/kvm/aarch32.c b/arch/arm64/kvm/aarch32.c
index 0a356aa91aa1..40a62a99fbf8 100644
--- a/arch/arm64/kvm/aarch32.c
+++ b/arch/arm64/kvm/aarch32.c
@@ -33,6 +33,26 @@ static const u8 return_offsets[8][2] = {
[7] = { 4, 4 }, /* FIQ, unused */
};
+static bool pre_fault_synchronize(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ if (vcpu->arch.sysregs_loaded_on_cpu) {
+ kvm_arch_vcpu_put(vcpu);
+ return true;
+ }
+
+ preempt_enable();
+ return false;
+}
+
+static void post_fault_synchronize(struct kvm_vcpu *vcpu, bool loaded)
+{
+ if (loaded) {
+ kvm_arch_vcpu_load(vcpu, smp_processor_id());
+ preempt_enable();
+ }
+}
+
/*
* When an exception is taken, most CPSR fields are left unchanged in the
* handler. However, some are explicitly overridden (e.g. M[4:0]).
@@ -155,7 +175,10 @@ static void prepare_fault32(struct kvm_vcpu *vcpu, u32 mode, u32 vect_offset)
void kvm_inject_undef32(struct kvm_vcpu *vcpu)
{
+ bool loaded = pre_fault_synchronize(vcpu);
+
prepare_fault32(vcpu, PSR_AA32_MODE_UND, 4);
+ post_fault_synchronize(vcpu, loaded);
}
/*
@@ -168,6 +191,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
u32 vect_offset;
u32 *far, *fsr;
bool is_lpae;
+ bool loaded;
+
+ loaded = pre_fault_synchronize(vcpu);
if (is_pabt) {
vect_offset = 12;
@@ -191,6 +217,8 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt,
/* no need to shuffle FS[4] into DFSR[10] as its 0 */
*fsr = DFSR_FSC_EXTABT_nLPAE;
}
+
+ post_fault_synchronize(vcpu, loaded);
}
void kvm_inject_dabt32(struct kvm_vcpu *vcpu, unsigned long addr)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ef3e40a7ea8dbe2abd0a345032cd7d5023b9684f Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Wed, 3 Jun 2020 18:24:01 +0100
Subject: [PATCH] KVM: arm64: Save the host's PtrAuth keys in non-preemptible
context
When using the PtrAuth feature in a guest, we need to save the host's
keys before allowing the guest to program them. For that, we dump
them in a per-CPU data structure (the so called host context).
But both call sites that do this are in preemptible context,
which may end up in disaster should the vcpu thread get preempted
before reentering the guest.
Instead, save the keys eagerly on each vcpu_load(). This has an
increased overhead, but is at least safe.
Cc: stable(a)vger.kernel.org
Reviewed-by: Mark Rutland <mark.rutland(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index a30b4eec7cb4..977843e4d5fb 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -112,12 +112,6 @@ static inline void vcpu_ptrauth_disable(struct kvm_vcpu *vcpu)
vcpu->arch.hcr_el2 &= ~(HCR_API | HCR_APK);
}
-static inline void vcpu_ptrauth_setup_lazy(struct kvm_vcpu *vcpu)
-{
- if (vcpu_has_ptrauth(vcpu))
- vcpu_ptrauth_disable(vcpu);
-}
-
static inline unsigned long vcpu_get_vsesr(struct kvm_vcpu *vcpu)
{
return vcpu->arch.vsesr_el2;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index d6988401c22a..152049c5055d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -337,6 +337,12 @@ void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
preempt_enable();
}
+#define __ptrauth_save_key(regs, key) \
+({ \
+ regs[key ## KEYLO_EL1] = read_sysreg_s(SYS_ ## key ## KEYLO_EL1); \
+ regs[key ## KEYHI_EL1] = read_sysreg_s(SYS_ ## key ## KEYHI_EL1); \
+})
+
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
int *last_ran;
@@ -370,7 +376,17 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
else
vcpu_set_wfx_traps(vcpu);
- vcpu_ptrauth_setup_lazy(vcpu);
+ if (vcpu_has_ptrauth(vcpu)) {
+ struct kvm_cpu_context *ctxt = vcpu->arch.host_cpu_context;
+
+ __ptrauth_save_key(ctxt->sys_regs, APIA);
+ __ptrauth_save_key(ctxt->sys_regs, APIB);
+ __ptrauth_save_key(ctxt->sys_regs, APDA);
+ __ptrauth_save_key(ctxt->sys_regs, APDB);
+ __ptrauth_save_key(ctxt->sys_regs, APGA);
+
+ vcpu_ptrauth_disable(vcpu);
+ }
}
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index eb194696ef62..065251efa2e6 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -162,31 +162,16 @@ static int handle_sve(struct kvm_vcpu *vcpu, struct kvm_run *run)
return 1;
}
-#define __ptrauth_save_key(regs, key) \
-({ \
- regs[key ## KEYLO_EL1] = read_sysreg_s(SYS_ ## key ## KEYLO_EL1); \
- regs[key ## KEYHI_EL1] = read_sysreg_s(SYS_ ## key ## KEYHI_EL1); \
-})
-
/*
* Handle the guest trying to use a ptrauth instruction, or trying to access a
* ptrauth register.
*/
void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu)
{
- struct kvm_cpu_context *ctxt;
-
- if (vcpu_has_ptrauth(vcpu)) {
+ if (vcpu_has_ptrauth(vcpu))
vcpu_ptrauth_enable(vcpu);
- ctxt = vcpu->arch.host_cpu_context;
- __ptrauth_save_key(ctxt->sys_regs, APIA);
- __ptrauth_save_key(ctxt->sys_regs, APIB);
- __ptrauth_save_key(ctxt->sys_regs, APDA);
- __ptrauth_save_key(ctxt->sys_regs, APDB);
- __ptrauth_save_key(ctxt->sys_regs, APGA);
- } else {
+ else
kvm_inject_undefined(vcpu);
- }
}
/*
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ef3e40a7ea8dbe2abd0a345032cd7d5023b9684f Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Wed, 3 Jun 2020 18:24:01 +0100
Subject: [PATCH] KVM: arm64: Save the host's PtrAuth keys in non-preemptible
context
When using the PtrAuth feature in a guest, we need to save the host's
keys before allowing the guest to program them. For that, we dump
them in a per-CPU data structure (the so called host context).
But both call sites that do this are in preemptible context,
which may end up in disaster should the vcpu thread get preempted
before reentering the guest.
Instead, save the keys eagerly on each vcpu_load(). This has an
increased overhead, but is at least safe.
Cc: stable(a)vger.kernel.org
Reviewed-by: Mark Rutland <mark.rutland(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index a30b4eec7cb4..977843e4d5fb 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -112,12 +112,6 @@ static inline void vcpu_ptrauth_disable(struct kvm_vcpu *vcpu)
vcpu->arch.hcr_el2 &= ~(HCR_API | HCR_APK);
}
-static inline void vcpu_ptrauth_setup_lazy(struct kvm_vcpu *vcpu)
-{
- if (vcpu_has_ptrauth(vcpu))
- vcpu_ptrauth_disable(vcpu);
-}
-
static inline unsigned long vcpu_get_vsesr(struct kvm_vcpu *vcpu)
{
return vcpu->arch.vsesr_el2;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index d6988401c22a..152049c5055d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -337,6 +337,12 @@ void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
preempt_enable();
}
+#define __ptrauth_save_key(regs, key) \
+({ \
+ regs[key ## KEYLO_EL1] = read_sysreg_s(SYS_ ## key ## KEYLO_EL1); \
+ regs[key ## KEYHI_EL1] = read_sysreg_s(SYS_ ## key ## KEYHI_EL1); \
+})
+
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
int *last_ran;
@@ -370,7 +376,17 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
else
vcpu_set_wfx_traps(vcpu);
- vcpu_ptrauth_setup_lazy(vcpu);
+ if (vcpu_has_ptrauth(vcpu)) {
+ struct kvm_cpu_context *ctxt = vcpu->arch.host_cpu_context;
+
+ __ptrauth_save_key(ctxt->sys_regs, APIA);
+ __ptrauth_save_key(ctxt->sys_regs, APIB);
+ __ptrauth_save_key(ctxt->sys_regs, APDA);
+ __ptrauth_save_key(ctxt->sys_regs, APDB);
+ __ptrauth_save_key(ctxt->sys_regs, APGA);
+
+ vcpu_ptrauth_disable(vcpu);
+ }
}
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index eb194696ef62..065251efa2e6 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -162,31 +162,16 @@ static int handle_sve(struct kvm_vcpu *vcpu, struct kvm_run *run)
return 1;
}
-#define __ptrauth_save_key(regs, key) \
-({ \
- regs[key ## KEYLO_EL1] = read_sysreg_s(SYS_ ## key ## KEYLO_EL1); \
- regs[key ## KEYHI_EL1] = read_sysreg_s(SYS_ ## key ## KEYHI_EL1); \
-})
-
/*
* Handle the guest trying to use a ptrauth instruction, or trying to access a
* ptrauth register.
*/
void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu)
{
- struct kvm_cpu_context *ctxt;
-
- if (vcpu_has_ptrauth(vcpu)) {
+ if (vcpu_has_ptrauth(vcpu))
vcpu_ptrauth_enable(vcpu);
- ctxt = vcpu->arch.host_cpu_context;
- __ptrauth_save_key(ctxt->sys_regs, APIA);
- __ptrauth_save_key(ctxt->sys_regs, APIB);
- __ptrauth_save_key(ctxt->sys_regs, APDA);
- __ptrauth_save_key(ctxt->sys_regs, APDB);
- __ptrauth_save_key(ctxt->sys_regs, APGA);
- } else {
+ else
kvm_inject_undefined(vcpu);
- }
}
/*
The patch below does not apply to the 5.7-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ef3e40a7ea8dbe2abd0a345032cd7d5023b9684f Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Wed, 3 Jun 2020 18:24:01 +0100
Subject: [PATCH] KVM: arm64: Save the host's PtrAuth keys in non-preemptible
context
When using the PtrAuth feature in a guest, we need to save the host's
keys before allowing the guest to program them. For that, we dump
them in a per-CPU data structure (the so called host context).
But both call sites that do this are in preemptible context,
which may end up in disaster should the vcpu thread get preempted
before reentering the guest.
Instead, save the keys eagerly on each vcpu_load(). This has an
increased overhead, but is at least safe.
Cc: stable(a)vger.kernel.org
Reviewed-by: Mark Rutland <mark.rutland(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index a30b4eec7cb4..977843e4d5fb 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -112,12 +112,6 @@ static inline void vcpu_ptrauth_disable(struct kvm_vcpu *vcpu)
vcpu->arch.hcr_el2 &= ~(HCR_API | HCR_APK);
}
-static inline void vcpu_ptrauth_setup_lazy(struct kvm_vcpu *vcpu)
-{
- if (vcpu_has_ptrauth(vcpu))
- vcpu_ptrauth_disable(vcpu);
-}
-
static inline unsigned long vcpu_get_vsesr(struct kvm_vcpu *vcpu)
{
return vcpu->arch.vsesr_el2;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index d6988401c22a..152049c5055d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -337,6 +337,12 @@ void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
preempt_enable();
}
+#define __ptrauth_save_key(regs, key) \
+({ \
+ regs[key ## KEYLO_EL1] = read_sysreg_s(SYS_ ## key ## KEYLO_EL1); \
+ regs[key ## KEYHI_EL1] = read_sysreg_s(SYS_ ## key ## KEYHI_EL1); \
+})
+
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
int *last_ran;
@@ -370,7 +376,17 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
else
vcpu_set_wfx_traps(vcpu);
- vcpu_ptrauth_setup_lazy(vcpu);
+ if (vcpu_has_ptrauth(vcpu)) {
+ struct kvm_cpu_context *ctxt = vcpu->arch.host_cpu_context;
+
+ __ptrauth_save_key(ctxt->sys_regs, APIA);
+ __ptrauth_save_key(ctxt->sys_regs, APIB);
+ __ptrauth_save_key(ctxt->sys_regs, APDA);
+ __ptrauth_save_key(ctxt->sys_regs, APDB);
+ __ptrauth_save_key(ctxt->sys_regs, APGA);
+
+ vcpu_ptrauth_disable(vcpu);
+ }
}
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index eb194696ef62..065251efa2e6 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -162,31 +162,16 @@ static int handle_sve(struct kvm_vcpu *vcpu, struct kvm_run *run)
return 1;
}
-#define __ptrauth_save_key(regs, key) \
-({ \
- regs[key ## KEYLO_EL1] = read_sysreg_s(SYS_ ## key ## KEYLO_EL1); \
- regs[key ## KEYHI_EL1] = read_sysreg_s(SYS_ ## key ## KEYHI_EL1); \
-})
-
/*
* Handle the guest trying to use a ptrauth instruction, or trying to access a
* ptrauth register.
*/
void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu)
{
- struct kvm_cpu_context *ctxt;
-
- if (vcpu_has_ptrauth(vcpu)) {
+ if (vcpu_has_ptrauth(vcpu))
vcpu_ptrauth_enable(vcpu);
- ctxt = vcpu->arch.host_cpu_context;
- __ptrauth_save_key(ctxt->sys_regs, APIA);
- __ptrauth_save_key(ctxt->sys_regs, APIB);
- __ptrauth_save_key(ctxt->sys_regs, APDA);
- __ptrauth_save_key(ctxt->sys_regs, APDB);
- __ptrauth_save_key(ctxt->sys_regs, APGA);
- } else {
+ else
kvm_inject_undefined(vcpu);
- }
}
/*
The patch below does not apply to the 5.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ef3e40a7ea8dbe2abd0a345032cd7d5023b9684f Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Wed, 3 Jun 2020 18:24:01 +0100
Subject: [PATCH] KVM: arm64: Save the host's PtrAuth keys in non-preemptible
context
When using the PtrAuth feature in a guest, we need to save the host's
keys before allowing the guest to program them. For that, we dump
them in a per-CPU data structure (the so called host context).
But both call sites that do this are in preemptible context,
which may end up in disaster should the vcpu thread get preempted
before reentering the guest.
Instead, save the keys eagerly on each vcpu_load(). This has an
increased overhead, but is at least safe.
Cc: stable(a)vger.kernel.org
Reviewed-by: Mark Rutland <mark.rutland(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index a30b4eec7cb4..977843e4d5fb 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -112,12 +112,6 @@ static inline void vcpu_ptrauth_disable(struct kvm_vcpu *vcpu)
vcpu->arch.hcr_el2 &= ~(HCR_API | HCR_APK);
}
-static inline void vcpu_ptrauth_setup_lazy(struct kvm_vcpu *vcpu)
-{
- if (vcpu_has_ptrauth(vcpu))
- vcpu_ptrauth_disable(vcpu);
-}
-
static inline unsigned long vcpu_get_vsesr(struct kvm_vcpu *vcpu)
{
return vcpu->arch.vsesr_el2;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index d6988401c22a..152049c5055d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -337,6 +337,12 @@ void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
preempt_enable();
}
+#define __ptrauth_save_key(regs, key) \
+({ \
+ regs[key ## KEYLO_EL1] = read_sysreg_s(SYS_ ## key ## KEYLO_EL1); \
+ regs[key ## KEYHI_EL1] = read_sysreg_s(SYS_ ## key ## KEYHI_EL1); \
+})
+
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
int *last_ran;
@@ -370,7 +376,17 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
else
vcpu_set_wfx_traps(vcpu);
- vcpu_ptrauth_setup_lazy(vcpu);
+ if (vcpu_has_ptrauth(vcpu)) {
+ struct kvm_cpu_context *ctxt = vcpu->arch.host_cpu_context;
+
+ __ptrauth_save_key(ctxt->sys_regs, APIA);
+ __ptrauth_save_key(ctxt->sys_regs, APIB);
+ __ptrauth_save_key(ctxt->sys_regs, APDA);
+ __ptrauth_save_key(ctxt->sys_regs, APDB);
+ __ptrauth_save_key(ctxt->sys_regs, APGA);
+
+ vcpu_ptrauth_disable(vcpu);
+ }
}
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index eb194696ef62..065251efa2e6 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -162,31 +162,16 @@ static int handle_sve(struct kvm_vcpu *vcpu, struct kvm_run *run)
return 1;
}
-#define __ptrauth_save_key(regs, key) \
-({ \
- regs[key ## KEYLO_EL1] = read_sysreg_s(SYS_ ## key ## KEYLO_EL1); \
- regs[key ## KEYHI_EL1] = read_sysreg_s(SYS_ ## key ## KEYHI_EL1); \
-})
-
/*
* Handle the guest trying to use a ptrauth instruction, or trying to access a
* ptrauth register.
*/
void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu)
{
- struct kvm_cpu_context *ctxt;
-
- if (vcpu_has_ptrauth(vcpu)) {
+ if (vcpu_has_ptrauth(vcpu))
vcpu_ptrauth_enable(vcpu);
- ctxt = vcpu->arch.host_cpu_context;
- __ptrauth_save_key(ctxt->sys_regs, APIA);
- __ptrauth_save_key(ctxt->sys_regs, APIB);
- __ptrauth_save_key(ctxt->sys_regs, APDA);
- __ptrauth_save_key(ctxt->sys_regs, APDB);
- __ptrauth_save_key(ctxt->sys_regs, APGA);
- } else {
+ else
kvm_inject_undefined(vcpu);
- }
}
/*
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a3535be731c2a343912578465021f50937f7b099 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Sat, 16 May 2020 09:19:06 -0400
Subject: [PATCH] KVM: nSVM: fix condition for filtering async PF
Async page faults have to be trapped in the host (L1 in this case),
since the APF reason was passed from L0 to L1 and stored in the L1 APF
data page. This was completely reversed: the page faults were passed
to the guest, a L2 hypervisor.
Cc: stable(a)vger.kernel.org
Reviewed-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index a89a166d1cb8..f4cd2d0cc360 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -880,8 +880,8 @@ int nested_svm_exit_special(struct vcpu_svm *svm)
return NESTED_EXIT_HOST;
break;
case SVM_EXIT_EXCP_BASE + PF_VECTOR:
- /* When we're shadowing, trap PFs, but not async PF */
- if (!npt_enabled && svm->vcpu.arch.apf.host_apf_reason == 0)
+ /* Trap async PF even if not shadowing */
+ if (!npt_enabled || svm->vcpu.arch.apf.host_apf_reason)
return NESTED_EXIT_HOST;
break;
default:
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a3535be731c2a343912578465021f50937f7b099 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Sat, 16 May 2020 09:19:06 -0400
Subject: [PATCH] KVM: nSVM: fix condition for filtering async PF
Async page faults have to be trapped in the host (L1 in this case),
since the APF reason was passed from L0 to L1 and stored in the L1 APF
data page. This was completely reversed: the page faults were passed
to the guest, a L2 hypervisor.
Cc: stable(a)vger.kernel.org
Reviewed-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index a89a166d1cb8..f4cd2d0cc360 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -880,8 +880,8 @@ int nested_svm_exit_special(struct vcpu_svm *svm)
return NESTED_EXIT_HOST;
break;
case SVM_EXIT_EXCP_BASE + PF_VECTOR:
- /* When we're shadowing, trap PFs, but not async PF */
- if (!npt_enabled && svm->vcpu.arch.apf.host_apf_reason == 0)
+ /* Trap async PF even if not shadowing */
+ if (!npt_enabled || svm->vcpu.arch.apf.host_apf_reason)
return NESTED_EXIT_HOST;
break;
default:
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5c911beff20aa8639e7a1f28988736c13e03ed54 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson(a)intel.com>
Date: Fri, 1 May 2020 09:31:17 -0700
Subject: [PATCH] KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS
switch when running with spectre_v2_user=on/auto if the switch is
between two VMCSes in the same guest, i.e. between vmcs01 and vmcs02.
The IBPB is intended to prevent one guest from attacking another, which
is unnecessary in the nested case as it's the same guest from KVM's
perspective.
This all but eliminates the overhead observed for nested VMX transitions
when running with CONFIG_RETPOLINE=y and spectre_v2_user=on/auto, which
can be significant, e.g. roughly 3x on current systems.
Reported-by: Alexander Graf <graf(a)amazon.com>
Cc: KarimAllah Raslan <karahmed(a)amazon.de>
Cc: stable(a)vger.kernel.org
Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
Signed-off-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Message-Id: <20200501163117.4655-1-sean.j.christopherson(a)intel.com>
[Invert direction of bool argument. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 669445136144..7f754b3bc6dd 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -303,7 +303,7 @@ static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
cpu = get_cpu();
prev = vmx->loaded_vmcs;
vmx->loaded_vmcs = vmcs;
- vmx_vcpu_load_vmcs(vcpu, cpu);
+ vmx_vcpu_load_vmcs(vcpu, cpu, prev);
vmx_sync_vmcs_host_state(vmx, prev);
put_cpu();
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fbede089d0f7..db842ce74e5d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1311,10 +1311,12 @@ static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
pi_set_on(pi_desc);
}
-void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
+ struct loaded_vmcs *buddy)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
bool already_loaded = vmx->loaded_vmcs->cpu == cpu;
+ struct vmcs *prev;
if (!already_loaded) {
loaded_vmcs_clear(vmx->loaded_vmcs);
@@ -1333,10 +1335,18 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
local_irq_enable();
}
- if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
+ prev = per_cpu(current_vmcs, cpu);
+ if (prev != vmx->loaded_vmcs->vmcs) {
per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
vmcs_load(vmx->loaded_vmcs->vmcs);
- indirect_branch_prediction_barrier();
+
+ /*
+ * No indirect branch prediction barrier needed when switching
+ * the active VMCS within a guest, e.g. on nested VM-Enter.
+ * The L1 VMM can protect itself with retpolines, IBPB or IBRS.
+ */
+ if (!buddy || WARN_ON_ONCE(buddy->vmcs != prev))
+ indirect_branch_prediction_barrier();
}
if (!already_loaded) {
@@ -1377,7 +1387,7 @@ void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- vmx_vcpu_load_vmcs(vcpu, cpu);
+ vmx_vcpu_load_vmcs(vcpu, cpu, NULL);
vmx_vcpu_pi_load(vcpu, cpu);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index b5e773267abe..d3d48acc6bd9 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -320,7 +320,8 @@ struct kvm_vmx {
};
bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
-void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
+ struct loaded_vmcs *buddy);
void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
int allocate_vpid(void);
void free_vpid(int vpid);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5c911beff20aa8639e7a1f28988736c13e03ed54 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson(a)intel.com>
Date: Fri, 1 May 2020 09:31:17 -0700
Subject: [PATCH] KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS
switch when running with spectre_v2_user=on/auto if the switch is
between two VMCSes in the same guest, i.e. between vmcs01 and vmcs02.
The IBPB is intended to prevent one guest from attacking another, which
is unnecessary in the nested case as it's the same guest from KVM's
perspective.
This all but eliminates the overhead observed for nested VMX transitions
when running with CONFIG_RETPOLINE=y and spectre_v2_user=on/auto, which
can be significant, e.g. roughly 3x on current systems.
Reported-by: Alexander Graf <graf(a)amazon.com>
Cc: KarimAllah Raslan <karahmed(a)amazon.de>
Cc: stable(a)vger.kernel.org
Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
Signed-off-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Message-Id: <20200501163117.4655-1-sean.j.christopherson(a)intel.com>
[Invert direction of bool argument. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 669445136144..7f754b3bc6dd 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -303,7 +303,7 @@ static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
cpu = get_cpu();
prev = vmx->loaded_vmcs;
vmx->loaded_vmcs = vmcs;
- vmx_vcpu_load_vmcs(vcpu, cpu);
+ vmx_vcpu_load_vmcs(vcpu, cpu, prev);
vmx_sync_vmcs_host_state(vmx, prev);
put_cpu();
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fbede089d0f7..db842ce74e5d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1311,10 +1311,12 @@ static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
pi_set_on(pi_desc);
}
-void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
+ struct loaded_vmcs *buddy)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
bool already_loaded = vmx->loaded_vmcs->cpu == cpu;
+ struct vmcs *prev;
if (!already_loaded) {
loaded_vmcs_clear(vmx->loaded_vmcs);
@@ -1333,10 +1335,18 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
local_irq_enable();
}
- if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
+ prev = per_cpu(current_vmcs, cpu);
+ if (prev != vmx->loaded_vmcs->vmcs) {
per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
vmcs_load(vmx->loaded_vmcs->vmcs);
- indirect_branch_prediction_barrier();
+
+ /*
+ * No indirect branch prediction barrier needed when switching
+ * the active VMCS within a guest, e.g. on nested VM-Enter.
+ * The L1 VMM can protect itself with retpolines, IBPB or IBRS.
+ */
+ if (!buddy || WARN_ON_ONCE(buddy->vmcs != prev))
+ indirect_branch_prediction_barrier();
}
if (!already_loaded) {
@@ -1377,7 +1387,7 @@ void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- vmx_vcpu_load_vmcs(vcpu, cpu);
+ vmx_vcpu_load_vmcs(vcpu, cpu, NULL);
vmx_vcpu_pi_load(vcpu, cpu);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index b5e773267abe..d3d48acc6bd9 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -320,7 +320,8 @@ struct kvm_vmx {
};
bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
-void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
+ struct loaded_vmcs *buddy);
void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
int allocate_vpid(void);
void free_vpid(int vpid);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5c911beff20aa8639e7a1f28988736c13e03ed54 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson(a)intel.com>
Date: Fri, 1 May 2020 09:31:17 -0700
Subject: [PATCH] KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS
switch when running with spectre_v2_user=on/auto if the switch is
between two VMCSes in the same guest, i.e. between vmcs01 and vmcs02.
The IBPB is intended to prevent one guest from attacking another, which
is unnecessary in the nested case as it's the same guest from KVM's
perspective.
This all but eliminates the overhead observed for nested VMX transitions
when running with CONFIG_RETPOLINE=y and spectre_v2_user=on/auto, which
can be significant, e.g. roughly 3x on current systems.
Reported-by: Alexander Graf <graf(a)amazon.com>
Cc: KarimAllah Raslan <karahmed(a)amazon.de>
Cc: stable(a)vger.kernel.org
Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
Signed-off-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Message-Id: <20200501163117.4655-1-sean.j.christopherson(a)intel.com>
[Invert direction of bool argument. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 669445136144..7f754b3bc6dd 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -303,7 +303,7 @@ static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
cpu = get_cpu();
prev = vmx->loaded_vmcs;
vmx->loaded_vmcs = vmcs;
- vmx_vcpu_load_vmcs(vcpu, cpu);
+ vmx_vcpu_load_vmcs(vcpu, cpu, prev);
vmx_sync_vmcs_host_state(vmx, prev);
put_cpu();
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fbede089d0f7..db842ce74e5d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1311,10 +1311,12 @@ static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
pi_set_on(pi_desc);
}
-void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
+ struct loaded_vmcs *buddy)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
bool already_loaded = vmx->loaded_vmcs->cpu == cpu;
+ struct vmcs *prev;
if (!already_loaded) {
loaded_vmcs_clear(vmx->loaded_vmcs);
@@ -1333,10 +1335,18 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
local_irq_enable();
}
- if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
+ prev = per_cpu(current_vmcs, cpu);
+ if (prev != vmx->loaded_vmcs->vmcs) {
per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
vmcs_load(vmx->loaded_vmcs->vmcs);
- indirect_branch_prediction_barrier();
+
+ /*
+ * No indirect branch prediction barrier needed when switching
+ * the active VMCS within a guest, e.g. on nested VM-Enter.
+ * The L1 VMM can protect itself with retpolines, IBPB or IBRS.
+ */
+ if (!buddy || WARN_ON_ONCE(buddy->vmcs != prev))
+ indirect_branch_prediction_barrier();
}
if (!already_loaded) {
@@ -1377,7 +1387,7 @@ void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- vmx_vcpu_load_vmcs(vcpu, cpu);
+ vmx_vcpu_load_vmcs(vcpu, cpu, NULL);
vmx_vcpu_pi_load(vcpu, cpu);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index b5e773267abe..d3d48acc6bd9 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -320,7 +320,8 @@ struct kvm_vmx {
};
bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
-void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
+ struct loaded_vmcs *buddy);
void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
int allocate_vpid(void);
void free_vpid(int vpid);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5c911beff20aa8639e7a1f28988736c13e03ed54 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson(a)intel.com>
Date: Fri, 1 May 2020 09:31:17 -0700
Subject: [PATCH] KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS
switch when running with spectre_v2_user=on/auto if the switch is
between two VMCSes in the same guest, i.e. between vmcs01 and vmcs02.
The IBPB is intended to prevent one guest from attacking another, which
is unnecessary in the nested case as it's the same guest from KVM's
perspective.
This all but eliminates the overhead observed for nested VMX transitions
when running with CONFIG_RETPOLINE=y and spectre_v2_user=on/auto, which
can be significant, e.g. roughly 3x on current systems.
Reported-by: Alexander Graf <graf(a)amazon.com>
Cc: KarimAllah Raslan <karahmed(a)amazon.de>
Cc: stable(a)vger.kernel.org
Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
Signed-off-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Message-Id: <20200501163117.4655-1-sean.j.christopherson(a)intel.com>
[Invert direction of bool argument. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 669445136144..7f754b3bc6dd 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -303,7 +303,7 @@ static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
cpu = get_cpu();
prev = vmx->loaded_vmcs;
vmx->loaded_vmcs = vmcs;
- vmx_vcpu_load_vmcs(vcpu, cpu);
+ vmx_vcpu_load_vmcs(vcpu, cpu, prev);
vmx_sync_vmcs_host_state(vmx, prev);
put_cpu();
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fbede089d0f7..db842ce74e5d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1311,10 +1311,12 @@ static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
pi_set_on(pi_desc);
}
-void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
+ struct loaded_vmcs *buddy)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
bool already_loaded = vmx->loaded_vmcs->cpu == cpu;
+ struct vmcs *prev;
if (!already_loaded) {
loaded_vmcs_clear(vmx->loaded_vmcs);
@@ -1333,10 +1335,18 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
local_irq_enable();
}
- if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
+ prev = per_cpu(current_vmcs, cpu);
+ if (prev != vmx->loaded_vmcs->vmcs) {
per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
vmcs_load(vmx->loaded_vmcs->vmcs);
- indirect_branch_prediction_barrier();
+
+ /*
+ * No indirect branch prediction barrier needed when switching
+ * the active VMCS within a guest, e.g. on nested VM-Enter.
+ * The L1 VMM can protect itself with retpolines, IBPB or IBRS.
+ */
+ if (!buddy || WARN_ON_ONCE(buddy->vmcs != prev))
+ indirect_branch_prediction_barrier();
}
if (!already_loaded) {
@@ -1377,7 +1387,7 @@ void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- vmx_vcpu_load_vmcs(vcpu, cpu);
+ vmx_vcpu_load_vmcs(vcpu, cpu, NULL);
vmx_vcpu_pi_load(vcpu, cpu);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index b5e773267abe..d3d48acc6bd9 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -320,7 +320,8 @@ struct kvm_vmx {
};
bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
-void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
+ struct loaded_vmcs *buddy);
void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
int allocate_vpid(void);
void free_vpid(int vpid);
Hi,
This is the 2nd version of bootconfig bugfixes.
The previous version is here.
https://lkml.kernel.org/r/159197538852.80267.10091816844311950396.stgit@dev…
This version fixes the patch description and modify(cleanup) code
according to Steve's comment.
Thank you,
---
Masami Hiramatsu (4):
proc/bootconfig: Fix to use correct quotes for value
tools/bootconfig: Fix to use correct quotes for value
tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig
tools/bootconfig: Add testcase for show-command and quotes test
fs/proc/bootconfig.c | 15 ++++++++++-----
tools/bootconfig/main.c | 24 ++++++++++++++----------
tools/bootconfig/test-bootconfig.sh | 10 ++++++++++
3 files changed, 34 insertions(+), 15 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
When "ovl_is_inuse" true case, trap inode reference not put.
plus adding the comment explaining sequence of
ovl_is_inuse after ovl_setup_trap.
Fixes: 0be0bfd2de9d ("ovl: fix regression caused by overlapping layers..")
Cc: <stable(a)vger.kernel.org> # v4.19+
Reviewed-by: Amir Goldstein <amir73il(a)gmail.com>
Signed-off-by: youngjun <her0gyugyu(a)gmail.com>
---
fs/overlayfs/super.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 91476bc422f9..3097142b1e23 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -1493,14 +1493,22 @@ static int ovl_get_layers(struct super_block *sb, struct ovl_fs *ofs,
if (err < 0)
goto out;
+ /*
+ * Check if lower root conflicts with this overlay layers before checking
+ * if it is in-use as upperdir/workdir of "another" mount, because we do
+ * not bother to check in ovl_is_inuse() if the upperdir/workdir is in fact
+ * in-use by our upperdir/workdir.
+ */
err = ovl_setup_trap(sb, stack[i].dentry, &trap, "lowerdir");
if (err)
goto out;
if (ovl_is_inuse(stack[i].dentry)) {
err = ovl_report_in_use(ofs, "lowerdir");
- if (err)
+ if (err) {
+ iput(trap);
goto out;
+ }
}
mnt = clone_private_mount(&stack[i]);
--
2.17.1
Sorry. I Wrongly sent subect version. I changed it to v4.
Thank you Amir.
This is the start of the stable review cycle for the 4.19.128 release.
There are 25 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 11 Jun 2020 17:40:24 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.128-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.19.128-rc1
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "net/mlx5: Annotate mutex destroy for root ns"
Oleg Nesterov <oleg(a)redhat.com>
uprobes: ensure that uprobe->offset and ->ref_ctr_offset are properly aligned
Josh Poimboeuf <jpoimboe(a)redhat.com>
x86/speculation: Add Ivy Bridge to affected list
Mark Gross <mgross(a)linux.intel.com>
x86/speculation: Add SRBDS vulnerability and mitigation documentation
Mark Gross <mgross(a)linux.intel.com>
x86/speculation: Add Special Register Buffer Data Sampling (SRBDS) mitigation
Mark Gross <mgross(a)linux.intel.com>
x86/cpu: Add 'table' argument to cpu_matches()
Mark Gross <mgross(a)linux.intel.com>
x86/cpu: Add a steppings field to struct x86_cpu_id
Srinivas Kandagatla <srinivas.kandagatla(a)linaro.org>
nvmem: qfprom: remove incorrect write support
Oliver Neukum <oneukum(a)suse.com>
CDC-ACM: heed quirk also in error handling
Pascal Terjan <pterjan(a)google.com>
staging: rtl8712: Fix IEEE80211_ADDBA_PARAM_BUF_SIZE_MASK
Jiri Slaby <jslaby(a)suse.cz>
tty: hvc_console, fix crashes on parallel open/close
Dmitry Torokhov <dmitry.torokhov(a)gmail.com>
vt: keyboard: avoid signed integer overflow in k_ascii
Dinghao Liu <dinghao.liu(a)zju.edu.cn>
usb: musb: Fix runtime PM imbalance on error
Bin Liu <b-liu(a)ti.com>
usb: musb: start session in resume for host port
Mathieu Othacehe <m.othacehe(a)gmail.com>
iio: vcnl4000: Fix i2c swapped word reading.
Daniele Palmas <dnlplm(a)gmail.com>
USB: serial: option: add Telit LE910C1-EUX compositions
Bin Liu <b-liu(a)ti.com>
USB: serial: usb_wwan: do not resubmit rx urb on fatal errors
Matt Jolly <Kangie(a)footclan.ninja>
USB: serial: qcserial: add DW5816e QDL support
Willem de Bruijn <willemb(a)google.com>
net: check untrusted gso_size at kernel entry
Stefano Garzarella <sgarzare(a)redhat.com>
vsock: fix timeout in vsock_accept()
Chuhong Yuan <hslester96(a)gmail.com>
NFC: st21nfca: add missed kfree_skb() in an error path
Daniele Palmas <dnlplm(a)gmail.com>
net: usb: qmi_wwan: add Telit LE910C1-EUX composition
Eric Dumazet <edumazet(a)google.com>
l2tp: do not use inet_hash()/inet_unhash()
Eric Dumazet <edumazet(a)google.com>
l2tp: add sk_family checks to l2tp_validate_socket
Yang Yingliang <yangyingliang(a)huawei.com>
devinet: fix memleak in inetdev_init()
-------------
Diffstat:
Documentation/ABI/testing/sysfs-devices-system-cpu | 1 +
Documentation/admin-guide/hw-vuln/index.rst | 1 +
.../special-register-buffer-data-sampling.rst | 149 +++++++++++++++++++++
Documentation/admin-guide/kernel-parameters.txt | 20 +++
Makefile | 4 +-
arch/x86/include/asm/cpu_device_id.h | 27 ++++
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/msr-index.h | 4 +
arch/x86/kernel/cpu/bugs.c | 106 +++++++++++++++
arch/x86/kernel/cpu/common.c | 54 ++++++--
arch/x86/kernel/cpu/cpu.h | 1 +
arch/x86/kernel/cpu/match.c | 7 +-
drivers/base/cpu.c | 8 ++
drivers/iio/light/vcnl4000.c | 6 +-
drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 6 -
drivers/net/usb/qmi_wwan.c | 1 +
drivers/nfc/st21nfca/dep.c | 4 +-
drivers/nvmem/qfprom.c | 14 --
drivers/staging/rtl8712/wifi.h | 9 +-
drivers/tty/hvc/hvc_console.c | 23 ++--
drivers/tty/vt/keyboard.c | 26 ++--
drivers/usb/class/cdc-acm.c | 2 +-
drivers/usb/musb/musb_core.c | 7 +
drivers/usb/musb/musb_debugfs.c | 10 +-
drivers/usb/serial/option.c | 4 +
drivers/usb/serial/qcserial.c | 1 +
drivers/usb/serial/usb_wwan.c | 4 +
include/linux/mod_devicetable.h | 6 +
include/linux/virtio_net.h | 14 +-
kernel/events/uprobes.c | 16 ++-
net/ipv4/devinet.c | 1 +
net/l2tp/l2tp_core.c | 3 +
net/l2tp/l2tp_ip.c | 29 +++-
net/l2tp/l2tp_ip6.c | 30 +++--
net/vmw_vsock/af_vsock.c | 2 +-
35 files changed, 503 insertions(+), 99 deletions(-)
Hi Steve,
I found 2 bugs in /proc/bootconfig and tools/bootconfig.
- They always use double-quote to quote values. For the values
which includes double-quote, it should use single-quote instead.
- tools/bootconfig always returns error code if it shows the
bootconfig in initrd (executed without options)
This series fixes those bugs and add testcases to ensure
no regressions.
Thank you,
---
Masami Hiramatsu (4):
proc/bootconfig: Fix to use correct quotes for value
tools/bootconfig: Fix to use correct quotes for value
tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig
tools/bootconfig: Add testcase for show-command and quotes test
fs/proc/bootconfig.c | 13 +++++++++----
tools/bootconfig/main.c | 18 +++++++++++-------
tools/bootconfig/test-bootconfig.sh | 10 ++++++++++
3 files changed, 30 insertions(+), 11 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
When "ovl_is_inuse" true case, trap inode reference not put.
plus adding the comment explaining sequence of
ovl_is_inuse after ovl_setup_trap.
Fixes: 0be0bfd2de9d ("ovl: fix regression caused by overlapping layers..")
Cc: <stable(a)vger.kernel.org> # v4.19+
Reviewed-by: Amir Goldstein <amir73il(a)gmail.com>
> Signed-off-by: youngjun <her0gyugyu(a)gmail.com>
> ---
> fs/overlayfs/super.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 91476bc422f9..0396793dadb8 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -1029,6 +1029,12 @@ static const struct xattr_handler *ovl_xattr_handlers[] = {
> NULL
> };
>
> +/*
> + * Check if lower root conflicts with this overlay layers before checking
> + * if it is in-use as upperdir/workdir of "another" mount, because we do
> + * not bother to check in ovl_is_inuse() if the upperdir/workdir is in fact
> + * in-use by our upperdir/workdir.
> + */
Signed-off-by: youngjun <her0gyugyu(a)gmail.com>
---
fs/overlayfs/super.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 91476bc422f9..3097142b1e23 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -1493,14 +1493,22 @@ static int ovl_get_layers(struct super_block *sb, struct ovl_fs *ofs,
if (err < 0)
goto out;
+ /*
+ * Check if lower root conflicts with this overlay layers before checking
+ * if it is in-use as upperdir/workdir of "another" mount, because we do
+ * not bother to check in ovl_is_inuse() if the upperdir/workdir is in fact
+ * in-use by our upperdir/workdir.
+ */
err = ovl_setup_trap(sb, stack[i].dentry, &trap, "lowerdir");
if (err)
goto out;
if (ovl_is_inuse(stack[i].dentry)) {
err = ovl_report_in_use(ofs, "lowerdir");
- if (err)
+ if (err) {
+ iput(trap);
goto out;
+ }
}
mnt = clone_private_mount(&stack[i]);
--
2.17.1
Great thanks Amir. guidance to me really helpful.
As you comment, I modified my patch.
1) I revised three patch so version is 3.
2) I misunderstood your comment(adding annotation) firstly.
But, I clearly know at now and modified it.
the annotation before "ovl_setup_trap" is clearly understood as I see.
And I have some questions.
1) As I understand, The CC you said valid is 'linux-unionfs'?
2) The comment "Please add these lines to the bottom of commit message..."
Is it needed manually when I revised patch?
Currently the system will be woken up via WOL(Wake On Lan) even if the
device wakeup ability has been disabled via sysfs:
cat /sys/devices/pci0000:00/0000:00:1f.6/power/wakeup
disabled
The system should not be woken up if the user has explicitly
disabled the wake up ability for this device.
This patch clears the WOL ability of this network device if the
user has disabled the wake up ability in sysfs.
Fixes: bc7f75fa9788 ("[E1000E]: New pci-express e1000 driver")
Reported-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Chen Yu <yu.c.chen(a)intel.com>
---
drivers/net/ethernet/intel/e1000e/netdev.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 177c6da80c57..f6f730388705 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -6516,11 +6516,17 @@ static int __e1000_shutdown(struct pci_dev *pdev, bool runtime)
struct net_device *netdev = pci_get_drvdata(pdev);
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
- u32 ctrl, ctrl_ext, rctl, status;
- /* Runtime suspend should only enable wakeup for link changes */
- u32 wufc = runtime ? E1000_WUFC_LNKC : adapter->wol;
+ u32 ctrl, ctrl_ext, rctl, status, wufc;
int retval = 0;
+ /* Runtime suspend should only enable wakeup for link changes */
+ if (runtime)
+ wufc = E1000_WUFC_LNKC;
+ else if (device_may_wakeup(&pdev->dev))
+ wufc = adapter->wol;
+ else
+ wufc = 0;
+
status = er32(STATUS);
if (status & E1000_STATUS_LU)
wufc &= ~E1000_WUFC_LNKC;
@@ -6577,7 +6583,7 @@ static int __e1000_shutdown(struct pci_dev *pdev, bool runtime)
if (adapter->hw.phy.type == e1000_phy_igp_3) {
e1000e_igp3_phy_powerdown_workaround_ich8lan(&adapter->hw);
} else if (hw->mac.type >= e1000_pch_lpt) {
- if (!(wufc & (E1000_WUFC_EX | E1000_WUFC_MC | E1000_WUFC_BC)))
+ if (wufc && !(wufc & (E1000_WUFC_EX | E1000_WUFC_MC | E1000_WUFC_BC)))
/* ULP does not support wake from unicast, multicast
* or broadcast.
*/
--
2.17.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 9dd277ff92d06f6aa95b39936ad83981d781f49b Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Fri, 15 May 2020 17:58:02 +0200
Subject: [PATCH] spi: bcm2835: Fix controller unregister order
The BCM2835 SPI driver uses devm_spi_register_controller() on bind.
As a consequence, on unbind, __device_release_driver() first invokes
bcm2835_spi_remove() before unregistering the SPI controller via
devres_release_all().
This order is incorrect: bcm2835_spi_remove() tears down the DMA
channels and turns off the SPI controller, including its interrupts
and clock. The SPI controller is thus no longer usable.
When the SPI controller is subsequently unregistered, it unbinds all
its slave devices. If their drivers need to access the SPI bus,
e.g. to quiesce their interrupts, unbinding will fail.
As a rule, devm_spi_register_controller() must not be used if the
->remove() hook performs teardown steps which shall be performed
after unbinding of slaves.
Fix by using the non-devm variant spi_register_controller(). Note that
the struct spi_controller as well as the driver-private data are not
freed until after bcm2835_spi_remove() has finished, so accessing them
is safe.
Fixes: 247263dba208 ("spi: bcm2835: use devm_spi_register_master()")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Cc: stable(a)vger.kernel.org # v3.13+
Link: https://lore.kernel.org/r/2397dd70cdbe95e0bc4da2b9fca0f31cb94e5aed.15895575…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c
index 11c235879bb7..fd887a6492f4 100644
--- a/drivers/spi/spi-bcm2835.c
+++ b/drivers/spi/spi-bcm2835.c
@@ -1347,7 +1347,7 @@ static int bcm2835_spi_probe(struct platform_device *pdev)
goto out_dma_release;
}
- err = devm_spi_register_controller(&pdev->dev, ctlr);
+ err = spi_register_controller(ctlr);
if (err) {
dev_err(&pdev->dev, "could not register SPI controller: %d\n",
err);
@@ -1374,6 +1374,8 @@ static int bcm2835_spi_remove(struct platform_device *pdev)
bcm2835_debugfs_remove(bs);
+ spi_unregister_controller(ctlr);
+
/* Clear FIFOs, and disable the HW block */
bcm2835_wr(bs, BCM2835_SPI_CS,
BCM2835_SPI_CS_CLEAR_RX | BCM2835_SPI_CS_CLEAR_TX);
When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced it had the wrong
direction flag set. While this isn't a big deal as nothing currently
enforces these bits in the kernel, it should be defined correctly. Fix
the define and provide support for the old command until it is no longer
needed for backward compatibility.
Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
Cc: stable(a)vger.kernel.org
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
include/uapi/linux/seccomp.h | 2 +-
kernel/seccomp.c | 9 +++++++++
tools/testing/selftests/seccomp/seccomp_bpf.c | 2 +-
3 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 473a61695ac3..6ba18b82a02e 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -142,7 +142,7 @@ struct seccomp_notif_addfd {
#define SECCOMP_IOCTL_NOTIF_RECV SECCOMP_IOWR(0, struct seccomp_notif)
#define SECCOMP_IOCTL_NOTIF_SEND SECCOMP_IOWR(1, \
struct seccomp_notif_resp)
-#define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOR(2, __u64)
+#define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOW(2, __u64)
/* On success, the return value is the remote process's added fd number */
#define SECCOMP_IOCTL_NOTIF_ADDFD SECCOMP_IOW(3, \
struct seccomp_notif_addfd)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 9660abf91135..61e556bca338 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -43,6 +43,14 @@
#include <linux/anon_inodes.h>
#include <linux/lockdep.h>
+/*
+ * When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced, it had the
+ * wrong direction flag in the ioctl number. This is the broken one,
+ * which the kernel needs to keep supporting until all userspaces stop
+ * using the wrong command number.
+ */
+#define SECCOMP_IOCTL_NOTIF_ID_VALID_WRONG_DIR SECCOMP_IOR(2, __u64)
+
enum notify_state {
SECCOMP_NOTIFY_INIT,
SECCOMP_NOTIFY_SENT,
@@ -1397,6 +1405,7 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
return seccomp_notify_recv(filter, buf);
case SECCOMP_IOCTL_NOTIF_SEND:
return seccomp_notify_send(filter, buf);
+ case SECCOMP_IOCTL_NOTIF_ID_VALID_WRONG_DIR:
case SECCOMP_IOCTL_NOTIF_ID_VALID:
return seccomp_notify_id_valid(filter, buf);
}
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index cf1480e498ea..403c6d0f149e 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -184,7 +184,7 @@ struct seccomp_metadata {
#define SECCOMP_IOCTL_NOTIF_RECV SECCOMP_IOWR(0, struct seccomp_notif)
#define SECCOMP_IOCTL_NOTIF_SEND SECCOMP_IOWR(1, \
struct seccomp_notif_resp)
-#define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOR(2, __u64)
+#define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOW(2, __u64)
struct seccomp_notif {
__u64 id;
--
2.25.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d90ca42012db2863a9a30b564a2ace6016594bda Mon Sep 17 00:00:00 2001
From: "Longpeng(Mike)" <longpeng2(a)huawei.com>
Date: Tue, 2 Jun 2020 15:05:01 +0800
Subject: [PATCH] crypto: virtio: Fix dest length calculation in
__virtio_crypto_skcipher_do_req()
The src/dst length is not aligned with AES_BLOCK_SIZE(which is 16) in some
testcases in tcrypto.ko.
For example, the src/dst length of one of cts(cbc(aes))'s testcase is 17, the
crypto_virtio driver will set @src_data_len=16 but @dst_data_len=17 in this
case and get a wrong at then end.
SRC: pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp (17 bytes)
EXP: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc pp (17 bytes)
DST: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00 (pollute the last bytes)
(pp: plaintext cc:ciphertext)
Fix this issue by limit the length of dest buffer.
Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Cc: Gonglei <arei.gonglei(a)huawei.com>
Cc: Herbert Xu <herbert(a)gondor.apana.org.au>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Jason Wang <jasowang(a)redhat.com>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: virtualization(a)lists.linux-foundation.org
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Signed-off-by: Longpeng(Mike) <longpeng2(a)huawei.com>
Link: https://lore.kernel.org/r/20200602070501.2023-4-longpeng2@huawei.com
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index 52261b6c247e..cb8a6ea2a4bc 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -407,6 +407,7 @@ __virtio_crypto_skcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
goto free;
}
+ dst_len = min_t(unsigned int, req->cryptlen, dst_len);
pr_debug("virtio_crypto: src_len: %u, dst_len: %llu\n",
req->cryptlen, dst_len);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b02989f37fc5e865ceeee9070907e4493b3a21e2 Mon Sep 17 00:00:00 2001
From: "Longpeng(Mike)" <longpeng2(a)huawei.com>
Date: Tue, 2 Jun 2020 15:04:59 +0800
Subject: [PATCH] crypto: virtio: Fix src/dst scatterlist calculation in
__virtio_crypto_skcipher_do_req()
The system will crash when the users insmod crypto/tcrypt.ko with mode=38
( testing "cts(cbc(aes))" ).
Usually the next entry of one sg will be @sg@ + 1, but if this sg element
is part of a chained scatterlist, it could jump to the start of a new
scatterlist array. Fix it by sg_next() on calculation of src/dst
scatterlist.
Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Reported-by: LABBE Corentin <clabbe(a)baylibre.com>
Cc: Herbert Xu <herbert(a)gondor.apana.org.au>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Jason Wang <jasowang(a)redhat.com>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: virtualization(a)lists.linux-foundation.org
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20200123101000.GB24255@Red
Signed-off-by: Gonglei <arei.gonglei(a)huawei.com>
Signed-off-by: Longpeng(Mike) <longpeng2(a)huawei.com>
Link: https://lore.kernel.org/r/20200602070501.2023-2-longpeng2@huawei.com
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index fd045e64972a..5f8243563009 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -350,13 +350,18 @@ __virtio_crypto_skcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
int err;
unsigned long flags;
struct scatterlist outhdr, iv_sg, status_sg, **sgs;
- int i;
u64 dst_len;
unsigned int num_out = 0, num_in = 0;
int sg_total;
uint8_t *iv;
+ struct scatterlist *sg;
src_nents = sg_nents_for_len(req->src, req->cryptlen);
+ if (src_nents < 0) {
+ pr_err("Invalid number of src SG.\n");
+ return src_nents;
+ }
+
dst_nents = sg_nents(req->dst);
pr_debug("virtio_crypto: Number of sgs (src_nents: %d, dst_nents: %d)\n",
@@ -442,12 +447,12 @@ __virtio_crypto_skcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
vc_sym_req->iv = iv;
/* Source data */
- for (i = 0; i < src_nents; i++)
- sgs[num_out++] = &req->src[i];
+ for (sg = req->src; src_nents; sg = sg_next(sg), src_nents--)
+ sgs[num_out++] = sg;
/* Destination data */
- for (i = 0; i < dst_nents; i++)
- sgs[num_out + num_in++] = &req->dst[i];
+ for (sg = req->dst; sg; sg = sg_next(sg))
+ sgs[num_out + num_in++] = sg;
/* Status */
sg_init_one(&status_sg, &vc_req->status, sizeof(vc_req->status));
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 8c855f0720ff006d75d0a2512c7f6c4f60ff60ee Mon Sep 17 00:00:00 2001
From: "Longpeng(Mike)" <longpeng2(a)huawei.com>
Date: Tue, 2 Jun 2020 15:05:00 +0800
Subject: [PATCH] crypto: virtio: Fix use-after-free in
virtio_crypto_skcipher_finalize_req()
The system'll crash when the users insmod crypto/tcrypto.ko with mode=155
( testing "authenc(hmac(sha1),cbc(aes))" ). It's caused by reuse the memory
of request structure.
In crypto_authenc_init_tfm(), the reqsize is set to:
[PART 1] sizeof(authenc_request_ctx) +
[PART 2] ictx->reqoff +
[PART 3] MAX(ahash part, skcipher part)
and the 'PART 3' is used by both ahash and skcipher in turn.
When the virtio_crypto driver finish skcipher req, it'll call ->complete
callback(in crypto_finalize_skcipher_request) and then free its
resources whose pointers are recorded in 'skcipher parts'.
However, the ->complete is 'crypto_authenc_encrypt_done' in this case,
it will use the 'ahash part' of the request and change its content,
so virtio_crypto driver will get the wrong pointer after ->complete
finish and mistakenly free some other's memory. So the system will crash
when these memory will be used again.
The resources which need to be cleaned up are not used any more. But the
pointers of these resources may be changed in the function
"crypto_finalize_skcipher_request". Thus release specific resources before
calling this function.
Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Reported-by: LABBE Corentin <clabbe(a)baylibre.com>
Cc: Gonglei <arei.gonglei(a)huawei.com>
Cc: Herbert Xu <herbert(a)gondor.apana.org.au>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Jason Wang <jasowang(a)redhat.com>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: virtualization(a)lists.linux-foundation.org
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20200123101000.GB24255@Red
Acked-by: Gonglei <arei.gonglei(a)huawei.com>
Signed-off-by: Longpeng(Mike) <longpeng2(a)huawei.com>
Link: https://lore.kernel.org/r/20200602070501.2023-3-longpeng2@huawei.com
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index 5f8243563009..52261b6c247e 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -582,10 +582,11 @@ static void virtio_crypto_skcipher_finalize_req(
scatterwalk_map_and_copy(req->iv, req->dst,
req->cryptlen - AES_BLOCK_SIZE,
AES_BLOCK_SIZE, 0);
- crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
- req, err);
kzfree(vc_sym_req->iv);
virtcrypto_clear_request(&vc_sym_req->base);
+
+ crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
+ req, err);
}
static struct virtio_crypto_algo virtio_crypto_algs[] = { {
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 65e318e17358a3fd4fcb5a69d89b14016dee2f06 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Mon, 25 May 2020 14:25:03 +0200
Subject: [PATCH] spi: pxa2xx: Fix runtime PM ref imbalance on probe error
The PXA2xx SPI driver releases a runtime PM ref in the probe error path
even though it hasn't acquired a ref earlier.
Apparently commit e2b714afee32 ("spi: pxa2xx: Disable runtime PM if
controller registration fails") sought to copy-paste the invocation of
pm_runtime_disable() from pxa2xx_spi_remove(), but erroneously copied
the call to pm_runtime_put_noidle() as well. Drop it.
Fixes: e2b714afee32 ("spi: pxa2xx: Disable runtime PM if controller registration fails")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Reviewed-by: Jarkko Nikula <jarkko.nikula(a)linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Cc: stable(a)vger.kernel.org # v4.17+
Cc: Jarkko Nikula <jarkko.nikula(a)linux.intel.com>
Link: https://lore.kernel.org/r/58b2ac6942ca1f91aaeeafe512144bc5343e1d84.15904084…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c
index f456ce18f79e..f6e87344a36c 100644
--- a/drivers/spi/spi-pxa2xx.c
+++ b/drivers/spi/spi-pxa2xx.c
@@ -1893,7 +1893,6 @@ static int pxa2xx_spi_probe(struct platform_device *pdev)
return status;
out_error_pm_runtime_enabled:
- pm_runtime_put_noidle(&pdev->dev);
pm_runtime_disable(&pdev->dev);
out_error_clock_enabled:
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 32e5b57232c0411e7dea96625c415510430ac079 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Mon, 25 May 2020 14:25:02 +0200
Subject: [PATCH] spi: pxa2xx: Fix controller unregister order
The PXA2xx SPI driver uses devm_spi_register_controller() on bind.
As a consequence, on unbind, __device_release_driver() first invokes
pxa2xx_spi_remove() before unregistering the SPI controller via
devres_release_all().
This order is incorrect: pxa2xx_spi_remove() disables the chip,
rendering the SPI bus inaccessible even though the SPI controller is
still registered. When the SPI controller is subsequently unregistered,
it unbinds all its slave devices. Because their drivers cannot access
the SPI bus, e.g. to quiesce interrupts, the slave devices may be left
in an improper state.
As a rule, devm_spi_register_controller() must not be used if the
->remove() hook performs teardown steps which shall be performed after
unregistering the controller and specifically after unbinding of slaves.
Fix by reverting to the non-devm variant of spi_register_controller().
An alternative approach would be to use device-managed functions for all
steps in pxa2xx_spi_remove(), e.g. by calling devm_add_action_or_reset()
on probe. However that approach would add more LoC to the driver and
it wouldn't lend itself as well to backporting to stable.
The improper use of devm_spi_register_controller() was introduced in 2013
by commit a807fcd090d6 ("spi: pxa2xx: use devm_spi_register_master()"),
but all earlier versions of the driver going back to 2006 were likewise
broken because they invoked spi_unregister_master() at the end of
pxa2xx_spi_remove(), rather than at the beginning.
Fixes: e0c9905e87ac ("[PATCH] SPI: add PXA2xx SSP SPI Driver")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Cc: stable(a)vger.kernel.org # v2.6.17+
Cc: Tsuchiya Yuto <kitakar(a)gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206403#c1
Link: https://lore.kernel.org/r/834c446b1cf3284d2660f1bee1ebe3e737cd02a9.15904084…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c
index 73d2a65d0b6e..f456ce18f79e 100644
--- a/drivers/spi/spi-pxa2xx.c
+++ b/drivers/spi/spi-pxa2xx.c
@@ -1884,7 +1884,7 @@ static int pxa2xx_spi_probe(struct platform_device *pdev)
/* Register with the SPI framework */
platform_set_drvdata(pdev, drv_data);
- status = devm_spi_register_controller(&pdev->dev, controller);
+ status = spi_register_controller(controller);
if (status != 0) {
dev_err(&pdev->dev, "problem registering spi controller\n");
goto out_error_pm_runtime_enabled;
@@ -1916,6 +1916,8 @@ static int pxa2xx_spi_remove(struct platform_device *pdev)
pm_runtime_get_sync(&pdev->dev);
+ spi_unregister_controller(drv_data->controller);
+
/* Disable the SSP at the peripheral and SOC level */
pxa2xx_spi_write(drv_data, SSCR0, 0);
clk_disable_unprepare(ssp->clk);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 84855678add8aba927faf76bc2f130a40f94b6f7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Fri, 15 May 2020 17:58:01 +0200
Subject: [PATCH] spi: Fix controller unregister order
When an SPI controller unregisters, it unbinds all its slave devices.
For this, their drivers may need to access the SPI bus, e.g. to quiesce
interrupts.
However since commit ffbbdd21329f ("spi: create a message queueing
infrastructure"), spi_destroy_queue() is executed before unbinding the
slaves. It sets ctlr->running = false, thereby preventing SPI bus
access and causing unbinding of slave devices to fail.
Fix by unbinding slaves before calling spi_destroy_queue().
Fixes: ffbbdd21329f ("spi: create a message queueing infrastructure")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Cc: stable(a)vger.kernel.org # v3.4+
Cc: Linus Walleij <linus.walleij(a)linaro.org>
Link: https://lore.kernel.org/r/8aaf9d44c153fe233b17bc2dec4eb679898d7e7b.15895575…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 531d1de85f7f..25eb96d329ab 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -2763,6 +2763,8 @@ void spi_unregister_controller(struct spi_controller *ctlr)
struct spi_controller *found;
int id = ctlr->bus_num;
+ device_for_each_child(&ctlr->dev, NULL, __unregister);
+
/* First make sure that this controller was ever added */
mutex_lock(&board_lock);
found = idr_find(&spi_master_idr, id);
@@ -2775,7 +2777,6 @@ void spi_unregister_controller(struct spi_controller *ctlr)
list_del(&ctlr->list);
mutex_unlock(&board_lock);
- device_for_each_child(&ctlr->dev, NULL, __unregister);
device_unregister(&ctlr->dev);
/* free bus id */
mutex_lock(&board_lock);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ca8b19d61e3fce5d2d7790cde27a0b57bcb3f341 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Mon, 25 May 2020 14:25:01 +0200
Subject: [PATCH] spi: dw: Fix controller unregister order
The Designware SPI driver uses devm_spi_register_controller() on bind.
As a consequence, on unbind, __device_release_driver() first invokes
dw_spi_remove_host() before unregistering the SPI controller via
devres_release_all().
This order is incorrect: dw_spi_remove_host() shuts down the chip,
rendering the SPI bus inaccessible even though the SPI controller is
still registered. When the SPI controller is subsequently unregistered,
it unbinds all its slave devices. Because their drivers cannot access
the SPI bus, e.g. to quiesce interrupts, the slave devices may be left
in an improper state.
As a rule, devm_spi_register_controller() must not be used if the
->remove() hook performs teardown steps which shall be performed after
unregistering the controller and specifically after unbinding of slaves.
Fix by reverting to the non-devm variant of spi_register_controller().
An alternative approach would be to use device-managed functions for all
steps in dw_spi_remove_host(), e.g. by calling devm_add_action_or_reset()
on probe. However that approach would add more LoC to the driver and
it wouldn't lend itself as well to backporting to stable.
Fixes: 04f421e7b0b1 ("spi: dw: use managed resources")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Cc: stable(a)vger.kernel.org # v3.14+
Cc: Baruch Siach <baruch(a)tkos.co.il>
Link: https://lore.kernel.org/r/3fff8cb8ae44a9893840d0688be15bb88c090a14.15904084…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-dw.c b/drivers/spi/spi-dw.c
index 31e3f866d11a..780ffad64a91 100644
--- a/drivers/spi/spi-dw.c
+++ b/drivers/spi/spi-dw.c
@@ -526,7 +526,7 @@ int dw_spi_add_host(struct device *dev, struct dw_spi *dws)
}
}
- ret = devm_spi_register_controller(dev, master);
+ ret = spi_register_controller(master);
if (ret) {
dev_err(&master->dev, "problem registering spi master\n");
goto err_dma_exit;
@@ -550,6 +550,8 @@ void dw_spi_remove_host(struct dw_spi *dws)
{
dw_spi_debugfs_remove(dws);
+ spi_unregister_controller(dws->master);
+
if (dws->dma_ops && dws->dma_ops->dma_exit)
dws->dma_ops->dma_exit(dws);
The patch below does not apply to the 5.7-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f5f27b79eab80de0287c243a22169e4876b08d5e Mon Sep 17 00:00:00 2001
From: Franck LENORMAND <franck.lenormand(a)nxp.com>
Date: Thu, 26 Mar 2020 00:00:05 +0200
Subject: [PATCH] firmware: imx: scu: Fix corruption of header
The header of the message to send can be changed if the
response is longer than the request:
- 1st word, the header is sent
- the remaining words of the message are sent
- the response is received asynchronously during the
execution of the loop, changing the size field in
the header
- the for loop test the termination condition using
the corrupted header
It is the case for the API build_info which has just a
header as request but 3 words in response.
This issue is fixed storing the header locally instead of
using a pointer on it.
Fixes: edbee095fafb (firmware: imx: add SCU firmware driver support)
Signed-off-by: Franck LENORMAND <franck.lenormand(a)nxp.com>
Reviewed-by: Leonard Crestez <leonard.crestez(a)nxp.com>
Signed-off-by: Leonard Crestez <leonard.crestez(a)nxp.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Dong Aisheng <aisheng.dong(a)nxp.com>
Signed-off-by: Shawn Guo <shawnguo(a)kernel.org>
diff --git a/drivers/firmware/imx/imx-scu.c b/drivers/firmware/imx/imx-scu.c
index e94a5585b698..b3da2e193ad2 100644
--- a/drivers/firmware/imx/imx-scu.c
+++ b/drivers/firmware/imx/imx-scu.c
@@ -158,7 +158,7 @@ static void imx_scu_rx_callback(struct mbox_client *c, void *msg)
static int imx_scu_ipc_write(struct imx_sc_ipc *sc_ipc, void *msg)
{
- struct imx_sc_rpc_msg *hdr = msg;
+ struct imx_sc_rpc_msg hdr = *(struct imx_sc_rpc_msg *)msg;
struct imx_sc_chan *sc_chan;
u32 *data = msg;
int ret;
@@ -166,13 +166,13 @@ static int imx_scu_ipc_write(struct imx_sc_ipc *sc_ipc, void *msg)
int i;
/* Check size */
- if (hdr->size > IMX_SC_RPC_MAX_MSG)
+ if (hdr.size > IMX_SC_RPC_MAX_MSG)
return -EINVAL;
- dev_dbg(sc_ipc->dev, "RPC SVC %u FUNC %u SIZE %u\n", hdr->svc,
- hdr->func, hdr->size);
+ dev_dbg(sc_ipc->dev, "RPC SVC %u FUNC %u SIZE %u\n", hdr.svc,
+ hdr.func, hdr.size);
- size = sc_ipc->fast_ipc ? 1 : hdr->size;
+ size = sc_ipc->fast_ipc ? 1 : hdr.size;
for (i = 0; i < size; i++) {
sc_chan = &sc_ipc->chans[i % 4];
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d90ca42012db2863a9a30b564a2ace6016594bda Mon Sep 17 00:00:00 2001
From: "Longpeng(Mike)" <longpeng2(a)huawei.com>
Date: Tue, 2 Jun 2020 15:05:01 +0800
Subject: [PATCH] crypto: virtio: Fix dest length calculation in
__virtio_crypto_skcipher_do_req()
The src/dst length is not aligned with AES_BLOCK_SIZE(which is 16) in some
testcases in tcrypto.ko.
For example, the src/dst length of one of cts(cbc(aes))'s testcase is 17, the
crypto_virtio driver will set @src_data_len=16 but @dst_data_len=17 in this
case and get a wrong at then end.
SRC: pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp (17 bytes)
EXP: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc pp (17 bytes)
DST: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00 (pollute the last bytes)
(pp: plaintext cc:ciphertext)
Fix this issue by limit the length of dest buffer.
Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Cc: Gonglei <arei.gonglei(a)huawei.com>
Cc: Herbert Xu <herbert(a)gondor.apana.org.au>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Jason Wang <jasowang(a)redhat.com>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: virtualization(a)lists.linux-foundation.org
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Signed-off-by: Longpeng(Mike) <longpeng2(a)huawei.com>
Link: https://lore.kernel.org/r/20200602070501.2023-4-longpeng2@huawei.com
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index 52261b6c247e..cb8a6ea2a4bc 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -407,6 +407,7 @@ __virtio_crypto_skcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
goto free;
}
+ dst_len = min_t(unsigned int, req->cryptlen, dst_len);
pr_debug("virtio_crypto: src_len: %u, dst_len: %llu\n",
req->cryptlen, dst_len);
From: "Luck, Tony" <tony.luck(a)intel.com>
commit 17fae1294ad9d711b2c3dd0edef479d40c76a5e8 upstream
An interesting thing happened when a guest Linux instance took
a machine check. The VMM unmapped the bad page from guest physical
space and passed the machine check to the guest.
Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:
do_memory_failure
set_mce_nospec
set_memory_uc
_set_memory_uc
change_page_attr_set_clr
cpa_flush
clflush_cache_range_opt
This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because is appeared that the
guest was accessing the bad page.
Fix is to check the scope of the poison by checking the MCi_MISC register.
If the entire page is affected, then unmap the page. If only part of the
page is affected, then mark the page as uncacheable.
This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).
Reported-by: Jue Wang <juew(a)google.com>
Tested-by: Jue Wang <juew(a)google.com>
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Link: https://lore.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel…
---
arch/x86/include/asm/set_memory.h | 19 +++++++++++++------
arch/x86/kernel/cpu/mce/core.c | 11 +++++++++--
include/linux/set_memory.h | 2 +-
3 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 2ee8e469dcf5..162128cdfbf2 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -85,28 +85,35 @@ void set_kernel_text_rw(void);
void set_kernel_text_ro(void);
#ifdef CONFIG_X86_64
-static inline int set_mce_nospec(unsigned long pfn)
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
unsigned long decoy_addr;
int rc;
/*
- * Mark the linear address as UC to make sure we don't log more
- * errors because of speculative access to the page.
* We would like to just call:
- * set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+ * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
* but doing that would radically increase the odds of a
* speculative access to the poison page because we'd have
* the virtual address of the kernel 1:1 mapping sitting
* around in registers.
* Instead we get tricky. We create a non-canonical address
* that looks just like the one we want, but has bit 63 flipped.
- * This relies on set_memory_uc() properly sanitizing any __pa()
+ * This relies on set_memory_XX() properly sanitizing any __pa()
* results with __PHYSICAL_MASK or PTE_PFN_MASK.
*/
decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
- rc = set_memory_uc(decoy_addr, 1);
+ if (unmap)
+ rc = set_memory_np(decoy_addr, 1);
+ else
+ rc = set_memory_uc(decoy_addr, 1);
if (rc)
pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
return rc;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index aecb15ba66cd..fd76e3733dd3 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -533,6 +533,13 @@ bool mce_is_memory_error(struct mce *m)
}
EXPORT_SYMBOL_GPL(mce_is_memory_error);
+static bool whole_page(struct mce *m)
+{
+ if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+ return true;
+ return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
bool mce_is_correctable(struct mce *m)
{
if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
@@ -601,7 +608,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
if (mce_usable_address(mce) && (mce->severity == MCE_AO_SEVERITY)) {
pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0))
- set_mce_nospec(pfn);
+ set_mce_nospec(pfn, whole_page(mce));
}
return NOTIFY_OK;
@@ -1103,7 +1110,7 @@ static int do_memory_failure(struct mce *m)
if (ret)
pr_err("Memory error not recovered");
else
- set_mce_nospec(m->addr >> PAGE_SHIFT);
+ set_mce_nospec(m->addr >> PAGE_SHIFT, whole_page(m));
return ret;
}
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 86281ac7c305..860e0f843c12 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -26,7 +26,7 @@ static inline int set_direct_map_default_noflush(struct page *page)
#endif
#ifndef set_mce_nospec
-static inline int set_mce_nospec(unsigned long pfn)
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
return 0;
}
--
2.21.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d90ca42012db2863a9a30b564a2ace6016594bda Mon Sep 17 00:00:00 2001
From: "Longpeng(Mike)" <longpeng2(a)huawei.com>
Date: Tue, 2 Jun 2020 15:05:01 +0800
Subject: [PATCH] crypto: virtio: Fix dest length calculation in
__virtio_crypto_skcipher_do_req()
The src/dst length is not aligned with AES_BLOCK_SIZE(which is 16) in some
testcases in tcrypto.ko.
For example, the src/dst length of one of cts(cbc(aes))'s testcase is 17, the
crypto_virtio driver will set @src_data_len=16 but @dst_data_len=17 in this
case and get a wrong at then end.
SRC: pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp (17 bytes)
EXP: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc pp (17 bytes)
DST: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00 (pollute the last bytes)
(pp: plaintext cc:ciphertext)
Fix this issue by limit the length of dest buffer.
Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Cc: Gonglei <arei.gonglei(a)huawei.com>
Cc: Herbert Xu <herbert(a)gondor.apana.org.au>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Jason Wang <jasowang(a)redhat.com>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: virtualization(a)lists.linux-foundation.org
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Signed-off-by: Longpeng(Mike) <longpeng2(a)huawei.com>
Link: https://lore.kernel.org/r/20200602070501.2023-4-longpeng2@huawei.com
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index 52261b6c247e..cb8a6ea2a4bc 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -407,6 +407,7 @@ __virtio_crypto_skcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
goto free;
}
+ dst_len = min_t(unsigned int, req->cryptlen, dst_len);
pr_debug("virtio_crypto: src_len: %u, dst_len: %llu\n",
req->cryptlen, dst_len);
The patch below does not apply to the 5.7-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 17fae1294ad9d711b2c3dd0edef479d40c76a5e8 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck(a)intel.com>
Date: Wed, 20 May 2020 09:35:46 -0700
Subject: [PATCH] x86/{mce,mm}: Unmap the entire page if the whole page is
affected and poisoned
An interesting thing happened when a guest Linux instance took a machine
check. The VMM unmapped the bad page from guest physical space and
passed the machine check to the guest.
Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:
do_memory_failure
set_mce_nospec
set_memory_uc
_set_memory_uc
change_page_attr_set_clr
cpa_flush
clflush_cache_range_opt
This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because is appeared that the
guest was accessing the bad page.
Fix is to check the scope of the poison by checking the MCi_MISC register.
If the entire page is affected, then unmap the page. If only part of the
page is affected, then mark the page as uncacheable.
This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).
[ bp: Adjust to x86/entry changes. ]
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <juew(a)google.com>
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Jue Wang <juew(a)google.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel…
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ec2c0a094b5d..5948218f35c5 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,28 +86,35 @@ int set_direct_map_default_noflush(struct page *page);
extern int kernel_set_to_readonly;
#ifdef CONFIG_X86_64
-static inline int set_mce_nospec(unsigned long pfn)
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
unsigned long decoy_addr;
int rc;
/*
- * Mark the linear address as UC to make sure we don't log more
- * errors because of speculative access to the page.
* We would like to just call:
- * set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+ * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
* but doing that would radically increase the odds of a
* speculative access to the poison page because we'd have
* the virtual address of the kernel 1:1 mapping sitting
* around in registers.
* Instead we get tricky. We create a non-canonical address
* that looks just like the one we want, but has bit 63 flipped.
- * This relies on set_memory_uc() properly sanitizing any __pa()
+ * This relies on set_memory_XX() properly sanitizing any __pa()
* results with __PHYSICAL_MASK or PTE_PFN_MASK.
*/
decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
- rc = set_memory_uc(decoy_addr, 1);
+ if (unmap)
+ rc = set_memory_np(decoy_addr, 1);
+ else
+ rc = set_memory_uc(decoy_addr, 1);
if (rc)
pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
return rc;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 30413325de22..ce9120c4f740 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -520,6 +520,14 @@ bool mce_is_memory_error(struct mce *m)
}
EXPORT_SYMBOL_GPL(mce_is_memory_error);
+static bool whole_page(struct mce *m)
+{
+ if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+ return true;
+
+ return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
bool mce_is_correctable(struct mce *m)
{
if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
@@ -573,7 +581,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0)) {
- set_mce_nospec(pfn);
+ set_mce_nospec(pfn, whole_page(mce));
mce->kflags |= MCE_HANDLED_UC;
}
@@ -1173,11 +1181,12 @@ static void kill_me_maybe(struct callback_head *cb)
int flags = MF_ACTION_REQUIRED;
pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
- if (!(p->mce_status & MCG_STATUS_RIPV))
+
+ if (!p->mce_ripv)
flags |= MF_MUST_KILL;
if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags)) {
- set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
+ set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
return;
}
@@ -1331,7 +1340,8 @@ void noinstr do_machine_check(struct pt_regs *regs)
BUG_ON(!on_thread_stack() || !user_mode(regs));
current->mce_addr = m.addr;
- current->mce_status = m.mcgstatus;
+ current->mce_ripv = !!(m.mcgstatus & MCG_STATUS_RIPV);
+ current->mce_whole_page = whole_page(&m);
current->mce_kill_me.func = kill_me_maybe;
if (kill_it)
current->mce_kill_me.func = kill_me_now;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5d96e3e7fff..62c1de522fc5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1304,7 +1304,9 @@ struct task_struct {
#ifdef CONFIG_X86_MCE
u64 mce_addr;
- u64 mce_status;
+ __u64 mce_ripv : 1,
+ mce_whole_page : 1,
+ __mce_reserved : 62;
struct callback_head mce_kill_me;
#endif
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 86281ac7c305..860e0f843c12 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -26,7 +26,7 @@ static inline int set_direct_map_default_noflush(struct page *page)
#endif
#ifndef set_mce_nospec
-static inline int set_mce_nospec(unsigned long pfn)
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
return 0;
}
From: Michał Mirosław <mirq-linux(a)rere.qmqm.pl>
commit 951e2736f4b11b58dc44d41964fa17c3527d882a upstream.
Prevent SNDRV_PCM_IOCTL_LINK linking stream to itself - the code
can't handle it. Fixed commit is not where bug was introduced, but
changes the context significantly.
Cc: stable(a)vger.kernel.org
Fixes: 0888c321de70 ("pcm_native: switch to fdget()/fdput()")
Signed-off-by: Michał Mirosław <mirq-linux(a)rere.qmqm.pl>
Link: https://lore.kernel.org/r/89c4a2487609a0ed6af3ecf01cc972bdc59a7a2d.15916349…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
---
Backported to 4.19.y. It can be used for older branches, too.
sound/core/pcm_native.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index 14b1ee29509d..071e09c3d855 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -1950,6 +1950,11 @@ static int snd_pcm_link(struct snd_pcm_substream *substream, int fd)
}
pcm_file = f.file->private_data;
substream1 = pcm_file->substream;
+ if (substream == substream1) {
+ res = -EINVAL;
+ goto _badf;
+ }
+
group = kmalloc(sizeof(*group), GFP_KERNEL);
if (!group) {
res = -ENOMEM;
--
2.16.4
The patch below does not apply to the 5.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4518a3cc273cf82efdd36522fb1f13baad173c70 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Tue, 26 May 2020 20:34:02 +0300
Subject: [PATCH] io_uring: fix flush req->refs underflow
In io_uring_cancel_files(), after refcount_sub_and_test() leaves 0
req->refs, it calls io_put_req(), which would also put a ref. Call
io_free_req() instead.
Cc: stable(a)vger.kernel.org
Fixes: 2ca10259b418 ("io_uring: prune request from overflow list on flush")
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0b51f21e5432..37422fcdaa7f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7534,7 +7534,7 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx,
* all we had, then we're done with this request.
*/
if (refcount_sub_and_test(2, &cancel_req->refs)) {
- io_put_req(cancel_req);
+ io_free_req(cancel_req);
finish_wait(&ctx->inflight_wait, &wait);
continue;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b02989f37fc5e865ceeee9070907e4493b3a21e2 Mon Sep 17 00:00:00 2001
From: "Longpeng(Mike)" <longpeng2(a)huawei.com>
Date: Tue, 2 Jun 2020 15:04:59 +0800
Subject: [PATCH] crypto: virtio: Fix src/dst scatterlist calculation in
__virtio_crypto_skcipher_do_req()
The system will crash when the users insmod crypto/tcrypt.ko with mode=38
( testing "cts(cbc(aes))" ).
Usually the next entry of one sg will be @sg@ + 1, but if this sg element
is part of a chained scatterlist, it could jump to the start of a new
scatterlist array. Fix it by sg_next() on calculation of src/dst
scatterlist.
Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Reported-by: LABBE Corentin <clabbe(a)baylibre.com>
Cc: Herbert Xu <herbert(a)gondor.apana.org.au>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Jason Wang <jasowang(a)redhat.com>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: virtualization(a)lists.linux-foundation.org
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20200123101000.GB24255@Red
Signed-off-by: Gonglei <arei.gonglei(a)huawei.com>
Signed-off-by: Longpeng(Mike) <longpeng2(a)huawei.com>
Link: https://lore.kernel.org/r/20200602070501.2023-2-longpeng2@huawei.com
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index fd045e64972a..5f8243563009 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -350,13 +350,18 @@ __virtio_crypto_skcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
int err;
unsigned long flags;
struct scatterlist outhdr, iv_sg, status_sg, **sgs;
- int i;
u64 dst_len;
unsigned int num_out = 0, num_in = 0;
int sg_total;
uint8_t *iv;
+ struct scatterlist *sg;
src_nents = sg_nents_for_len(req->src, req->cryptlen);
+ if (src_nents < 0) {
+ pr_err("Invalid number of src SG.\n");
+ return src_nents;
+ }
+
dst_nents = sg_nents(req->dst);
pr_debug("virtio_crypto: Number of sgs (src_nents: %d, dst_nents: %d)\n",
@@ -442,12 +447,12 @@ __virtio_crypto_skcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
vc_sym_req->iv = iv;
/* Source data */
- for (i = 0; i < src_nents; i++)
- sgs[num_out++] = &req->src[i];
+ for (sg = req->src; src_nents; sg = sg_next(sg), src_nents--)
+ sgs[num_out++] = sg;
/* Destination data */
- for (i = 0; i < dst_nents; i++)
- sgs[num_out + num_in++] = &req->dst[i];
+ for (sg = req->dst; sg; sg = sg_next(sg))
+ sgs[num_out + num_in++] = sg;
/* Status */
sg_init_one(&status_sg, &vc_req->status, sizeof(vc_req->status));
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b02989f37fc5e865ceeee9070907e4493b3a21e2 Mon Sep 17 00:00:00 2001
From: "Longpeng(Mike)" <longpeng2(a)huawei.com>
Date: Tue, 2 Jun 2020 15:04:59 +0800
Subject: [PATCH] crypto: virtio: Fix src/dst scatterlist calculation in
__virtio_crypto_skcipher_do_req()
The system will crash when the users insmod crypto/tcrypt.ko with mode=38
( testing "cts(cbc(aes))" ).
Usually the next entry of one sg will be @sg@ + 1, but if this sg element
is part of a chained scatterlist, it could jump to the start of a new
scatterlist array. Fix it by sg_next() on calculation of src/dst
scatterlist.
Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Reported-by: LABBE Corentin <clabbe(a)baylibre.com>
Cc: Herbert Xu <herbert(a)gondor.apana.org.au>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Jason Wang <jasowang(a)redhat.com>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: virtualization(a)lists.linux-foundation.org
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20200123101000.GB24255@Red
Signed-off-by: Gonglei <arei.gonglei(a)huawei.com>
Signed-off-by: Longpeng(Mike) <longpeng2(a)huawei.com>
Link: https://lore.kernel.org/r/20200602070501.2023-2-longpeng2@huawei.com
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index fd045e64972a..5f8243563009 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -350,13 +350,18 @@ __virtio_crypto_skcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
int err;
unsigned long flags;
struct scatterlist outhdr, iv_sg, status_sg, **sgs;
- int i;
u64 dst_len;
unsigned int num_out = 0, num_in = 0;
int sg_total;
uint8_t *iv;
+ struct scatterlist *sg;
src_nents = sg_nents_for_len(req->src, req->cryptlen);
+ if (src_nents < 0) {
+ pr_err("Invalid number of src SG.\n");
+ return src_nents;
+ }
+
dst_nents = sg_nents(req->dst);
pr_debug("virtio_crypto: Number of sgs (src_nents: %d, dst_nents: %d)\n",
@@ -442,12 +447,12 @@ __virtio_crypto_skcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
vc_sym_req->iv = iv;
/* Source data */
- for (i = 0; i < src_nents; i++)
- sgs[num_out++] = &req->src[i];
+ for (sg = req->src; src_nents; sg = sg_next(sg), src_nents--)
+ sgs[num_out++] = sg;
/* Destination data */
- for (i = 0; i < dst_nents; i++)
- sgs[num_out + num_in++] = &req->dst[i];
+ for (sg = req->dst; sg; sg = sg_next(sg))
+ sgs[num_out + num_in++] = sg;
/* Status */
sg_init_one(&status_sg, &vc_req->status, sizeof(vc_req->status));
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 8c855f0720ff006d75d0a2512c7f6c4f60ff60ee Mon Sep 17 00:00:00 2001
From: "Longpeng(Mike)" <longpeng2(a)huawei.com>
Date: Tue, 2 Jun 2020 15:05:00 +0800
Subject: [PATCH] crypto: virtio: Fix use-after-free in
virtio_crypto_skcipher_finalize_req()
The system'll crash when the users insmod crypto/tcrypto.ko with mode=155
( testing "authenc(hmac(sha1),cbc(aes))" ). It's caused by reuse the memory
of request structure.
In crypto_authenc_init_tfm(), the reqsize is set to:
[PART 1] sizeof(authenc_request_ctx) +
[PART 2] ictx->reqoff +
[PART 3] MAX(ahash part, skcipher part)
and the 'PART 3' is used by both ahash and skcipher in turn.
When the virtio_crypto driver finish skcipher req, it'll call ->complete
callback(in crypto_finalize_skcipher_request) and then free its
resources whose pointers are recorded in 'skcipher parts'.
However, the ->complete is 'crypto_authenc_encrypt_done' in this case,
it will use the 'ahash part' of the request and change its content,
so virtio_crypto driver will get the wrong pointer after ->complete
finish and mistakenly free some other's memory. So the system will crash
when these memory will be used again.
The resources which need to be cleaned up are not used any more. But the
pointers of these resources may be changed in the function
"crypto_finalize_skcipher_request". Thus release specific resources before
calling this function.
Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Reported-by: LABBE Corentin <clabbe(a)baylibre.com>
Cc: Gonglei <arei.gonglei(a)huawei.com>
Cc: Herbert Xu <herbert(a)gondor.apana.org.au>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Jason Wang <jasowang(a)redhat.com>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: virtualization(a)lists.linux-foundation.org
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20200123101000.GB24255@Red
Acked-by: Gonglei <arei.gonglei(a)huawei.com>
Signed-off-by: Longpeng(Mike) <longpeng2(a)huawei.com>
Link: https://lore.kernel.org/r/20200602070501.2023-3-longpeng2@huawei.com
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index 5f8243563009..52261b6c247e 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -582,10 +582,11 @@ static void virtio_crypto_skcipher_finalize_req(
scatterwalk_map_and_copy(req->iv, req->dst,
req->cryptlen - AES_BLOCK_SIZE,
AES_BLOCK_SIZE, 0);
- crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
- req, err);
kzfree(vc_sym_req->iv);
virtcrypto_clear_request(&vc_sym_req->base);
+
+ crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
+ req, err);
}
static struct virtio_crypto_algo virtio_crypto_algs[] = { {
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 8c855f0720ff006d75d0a2512c7f6c4f60ff60ee Mon Sep 17 00:00:00 2001
From: "Longpeng(Mike)" <longpeng2(a)huawei.com>
Date: Tue, 2 Jun 2020 15:05:00 +0800
Subject: [PATCH] crypto: virtio: Fix use-after-free in
virtio_crypto_skcipher_finalize_req()
The system'll crash when the users insmod crypto/tcrypto.ko with mode=155
( testing "authenc(hmac(sha1),cbc(aes))" ). It's caused by reuse the memory
of request structure.
In crypto_authenc_init_tfm(), the reqsize is set to:
[PART 1] sizeof(authenc_request_ctx) +
[PART 2] ictx->reqoff +
[PART 3] MAX(ahash part, skcipher part)
and the 'PART 3' is used by both ahash and skcipher in turn.
When the virtio_crypto driver finish skcipher req, it'll call ->complete
callback(in crypto_finalize_skcipher_request) and then free its
resources whose pointers are recorded in 'skcipher parts'.
However, the ->complete is 'crypto_authenc_encrypt_done' in this case,
it will use the 'ahash part' of the request and change its content,
so virtio_crypto driver will get the wrong pointer after ->complete
finish and mistakenly free some other's memory. So the system will crash
when these memory will be used again.
The resources which need to be cleaned up are not used any more. But the
pointers of these resources may be changed in the function
"crypto_finalize_skcipher_request". Thus release specific resources before
calling this function.
Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Reported-by: LABBE Corentin <clabbe(a)baylibre.com>
Cc: Gonglei <arei.gonglei(a)huawei.com>
Cc: Herbert Xu <herbert(a)gondor.apana.org.au>
Cc: "Michael S. Tsirkin" <mst(a)redhat.com>
Cc: Jason Wang <jasowang(a)redhat.com>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: virtualization(a)lists.linux-foundation.org
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20200123101000.GB24255@Red
Acked-by: Gonglei <arei.gonglei(a)huawei.com>
Signed-off-by: Longpeng(Mike) <longpeng2(a)huawei.com>
Link: https://lore.kernel.org/r/20200602070501.2023-3-longpeng2@huawei.com
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index 5f8243563009..52261b6c247e 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -582,10 +582,11 @@ static void virtio_crypto_skcipher_finalize_req(
scatterwalk_map_and_copy(req->iv, req->dst,
req->cryptlen - AES_BLOCK_SIZE,
AES_BLOCK_SIZE, 0);
- crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
- req, err);
kzfree(vc_sym_req->iv);
virtcrypto_clear_request(&vc_sym_req->base);
+
+ crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
+ req, err);
}
static struct virtio_crypto_algo virtio_crypto_algs[] = { {
Commit 17839856fd58 ("gup: document and work around "COW can break
either way" issue") is a real fix, but wasn't marked for stable
because I wanted it to get more coverage testing in mainline first.
Not because the patch is all that complex or scary, but because I was
worried we'd find some odd case where it would make things slower by
triggering the GUP slowpath much more often due to people doing odd
things.
It turns out my worry seems to have been misplaced. The kernel test
robot did indeed trigger a case where this made a big difference, but
rather than being bad, it improved the odd corner-case test-case
performance by a factor of 20x by breaking the COW and triggering the
fast-case code that way, rather than the other way around.
See
https://lore.kernel.org/lkml/20200611040453.GK12456@shao2-debian/
for details.
So that commit fixes a bug, isn't expected to really make any
difference on any sane workload, and can apparently help the crazy
cases by a huge amount. Let's just push it to stable..
Linus
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4d8df8cbb9156b0a0ab3f802b80cb5db57acc0bf Mon Sep 17 00:00:00 2001
From: Anthony Steinhauser <asteinhauser(a)google.com>
Date: Sun, 7 Jun 2020 05:44:19 -0700
Subject: [PATCH] x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for
indirect branches.
Currently, it is possible to enable indirect branch speculation even after
it was force-disabled using the PR_SPEC_FORCE_DISABLE option. Moreover, the
PR_GET_SPECULATION_CTRL command gives afterwards an incorrect result
(force-disabled when it is in fact enabled). This also is inconsistent
vs. STIBP and the documention which cleary states that
PR_SPEC_FORCE_DISABLE cannot be undone.
Fix this by actually enforcing force-disabled indirect branch
speculation. PR_SPEC_ENABLE called after PR_SPEC_FORCE_DISABLE now fails
with -EPERM as described in the documentation.
Fixes: 9137bb27e60e ("x86/speculation: Add prctl() control for indirect branch speculation")
Signed-off-by: Anthony Steinhauser <asteinhauser(a)google.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 8d57562b1d2c..56f573aa764f 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1175,11 +1175,14 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
return 0;
/*
* Indirect branch speculation is always disabled in strict
- * mode.
+ * mode. It can neither be enabled if it was force-disabled
+ * by a previous prctl call.
+
*/
if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED ||
+ task_spec_ib_force_disable(task))
return -EPERM;
task_clear_spec_ib_disable(task);
task_update_spec_tif(task);
From: "Luck, Tony" <tony.luck(a)intel.com>
commit 17fae1294ad9d711b2c3dd0edef479d40c76a5e8 upstream
An interesting thing happened when a guest Linux instance took
a machine check. The VMM unmapped the bad page from guest physical
space and passed the machine check to the guest.
Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:
do_memory_failure
set_mce_nospec
set_memory_uc
_set_memory_uc
change_page_attr_set_clr
cpa_flush
clflush_cache_range_opt
This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because is appeared that the
guest was accessing the bad page.
Fix is to check the scope of the poison by checking the MCi_MISC register.
If the entire page is affected, then unmap the page. If only part of the
page is affected, then mark the page as uncacheable.
This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).
Reported-by: Jue Wang <juew(a)google.com>
Tested-by: Jue Wang <juew(a)google.com>
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Link: https://lore.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel…
---
arch/x86/include/asm/set_memory.h | 19 +++++++++++++------
arch/x86/kernel/cpu/mce/core.c | 11 +++++++++--
include/linux/set_memory.h | 2 +-
3 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ec2c0a094b5d..5948218f35c5 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,28 +86,35 @@ int set_direct_map_default_noflush(struct page *page);
extern int kernel_set_to_readonly;
#ifdef CONFIG_X86_64
-static inline int set_mce_nospec(unsigned long pfn)
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
unsigned long decoy_addr;
int rc;
/*
- * Mark the linear address as UC to make sure we don't log more
- * errors because of speculative access to the page.
* We would like to just call:
- * set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+ * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
* but doing that would radically increase the odds of a
* speculative access to the poison page because we'd have
* the virtual address of the kernel 1:1 mapping sitting
* around in registers.
* Instead we get tricky. We create a non-canonical address
* that looks just like the one we want, but has bit 63 flipped.
- * This relies on set_memory_uc() properly sanitizing any __pa()
+ * This relies on set_memory_XX() properly sanitizing any __pa()
* results with __PHYSICAL_MASK or PTE_PFN_MASK.
*/
decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
- rc = set_memory_uc(decoy_addr, 1);
+ if (unmap)
+ rc = set_memory_np(decoy_addr, 1);
+ else
+ rc = set_memory_uc(decoy_addr, 1);
if (rc)
pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
return rc;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 54165f3569e8..c1a480a27164 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -529,6 +529,13 @@ bool mce_is_memory_error(struct mce *m)
}
EXPORT_SYMBOL_GPL(mce_is_memory_error);
+static bool whole_page(struct mce *m)
+{
+ if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+ return true;
+ return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
bool mce_is_correctable(struct mce *m)
{
if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
@@ -600,7 +607,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0))
- set_mce_nospec(pfn);
+ set_mce_nospec(pfn, whole_page(mce));
return NOTIFY_OK;
}
@@ -1098,7 +1105,7 @@ static int do_memory_failure(struct mce *m)
if (ret)
pr_err("Memory error not recovered");
else
- set_mce_nospec(m->addr >> PAGE_SHIFT);
+ set_mce_nospec(m->addr >> PAGE_SHIFT, whole_page(m));
return ret;
}
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 86281ac7c305..860e0f843c12 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -26,7 +26,7 @@ static inline int set_direct_map_default_noflush(struct page *page)
#endif
#ifndef set_mce_nospec
-static inline int set_mce_nospec(unsigned long pfn)
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
return 0;
}
--
2.21.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d43e2675e96fc6ae1a633b6a69d296394448cc32 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Tue, 19 May 2020 05:34:41 -0400
Subject: [PATCH] KVM: x86: only do L1TF workaround on affected processors
KVM stores the gfn in MMIO SPTEs as a caching optimization. These are split
in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
in an L1TF attack. This works as long as there are 5 free bits between
MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
access triggers a reserved-bit-set page fault.
The bit positions however were computed wrongly for AMD processors that have
encryption support. In this case, x86_phys_bits is reduced (for example
from 48 to 43, to account for the C bit at position 47 and four bits used
internally to store the SEV ASID and other stuff) while x86_cache_bits in
would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
and bit 51 are set. Then low_phys_bits would also cover some of the
bits that are set in the shadow_mmio_value, terribly confusing the gfn
caching mechanism.
To fix this, avoid splitting gfns as long as the processor does not have
the L1TF bug (which includes all AMD processors). When there is no
splitting, low_phys_bits can be set to the reduced MAXPHYADDR removing
the overlap. This fixes "npt=0" operation on EPYC processors.
Thanks to Maxim Levitsky for bisecting this bug.
Cc: stable(a)vger.kernel.org
Fixes: 52918ed5fcf0 ("KVM: SVM: Override default MMIO mask if memory encryption is enabled")
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8071952e9cf2..86619631ff6a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -335,6 +335,8 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value, u64 access_mask)
{
BUG_ON((u64)(unsigned)access_mask != access_mask);
BUG_ON((mmio_mask & mmio_value) != mmio_value);
+ WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask << shadow_nonpresent_or_rsvd_mask_len));
+ WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
shadow_mmio_access_mask = access_mask;
@@ -583,16 +585,15 @@ static void kvm_mmu_reset_all_pte_masks(void)
* the most significant bits of legal physical address space.
*/
shadow_nonpresent_or_rsvd_mask = 0;
- low_phys_bits = boot_cpu_data.x86_cache_bits;
- if (boot_cpu_data.x86_cache_bits <
- 52 - shadow_nonpresent_or_rsvd_mask_len) {
+ low_phys_bits = boot_cpu_data.x86_phys_bits;
+ if (boot_cpu_has_bug(X86_BUG_L1TF) &&
+ !WARN_ON_ONCE(boot_cpu_data.x86_cache_bits >=
+ 52 - shadow_nonpresent_or_rsvd_mask_len)) {
+ low_phys_bits = boot_cpu_data.x86_cache_bits
+ - shadow_nonpresent_or_rsvd_mask_len;
shadow_nonpresent_or_rsvd_mask =
- rsvd_bits(boot_cpu_data.x86_cache_bits -
- shadow_nonpresent_or_rsvd_mask_len,
- boot_cpu_data.x86_cache_bits - 1);
- low_phys_bits -= shadow_nonpresent_or_rsvd_mask_len;
- } else
- WARN_ON_ONCE(boot_cpu_has_bug(X86_BUG_L1TF));
+ rsvd_bits(low_phys_bits, boot_cpu_data.x86_cache_bits - 1);
+ }
shadow_nonpresent_or_rsvd_lower_gfn_mask =
GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 951e2736f4b11b58dc44d41964fa17c3527d882a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Miros=C5=82aw?= <mirq-linux(a)rere.qmqm.pl>
Date: Mon, 8 Jun 2020 18:50:39 +0200
Subject: [PATCH] ALSA: pcm: disallow linking stream to itself
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Prevent SNDRV_PCM_IOCTL_LINK linking stream to itself - the code
can't handle it. Fixed commit is not where bug was introduced, but
changes the context significantly.
Cc: stable(a)vger.kernel.org
Fixes: 0888c321de70 ("pcm_native: switch to fdget()/fdput()")
Signed-off-by: Michał Mirosław <mirq-linux(a)rere.qmqm.pl>
Link: https://lore.kernel.org/r/89c4a2487609a0ed6af3ecf01cc972bdc59a7a2d.15916349…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index c08732998a42..eeab8850ed76 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -2176,6 +2176,12 @@ static int snd_pcm_link(struct snd_pcm_substream *substream, int fd)
}
pcm_file = f.file->private_data;
substream1 = pcm_file->substream;
+
+ if (substream == substream1) {
+ res = -EINVAL;
+ goto _badf;
+ }
+
group = kzalloc(sizeof(*group), GFP_KERNEL);
if (!group) {
res = -ENOMEM;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 96cb7cf13d8530099c256c053648ad576588c387 Mon Sep 17 00:00:00 2001
From: hersen wu <hersenxs.wu(a)amd.com>
Date: Thu, 28 Feb 2019 16:35:24 -0500
Subject: [PATCH] drm/amd/display: disable dcn20 abm feature for bring up
[WHY] dcn20 enable usb-c dp ALT mode in dmcu. There is bug
when enable abm feature which cause system crash. dal team
will debug this bug later.
[HOW] disable dcn abm feature for dcn20.
Signed-off-by: hersen wu <hersenxs.wu(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 5971aef4f033..72d14f680932 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -781,7 +781,7 @@ static int dm_late_init(void *handle)
unsigned int linear_lut[16];
int i;
struct dmcu *dmcu = adev->dm.dc->res_pool->dmcu;
- bool ret;
+ bool ret = false;
for (i = 0; i < 16; i++)
linear_lut[i] = 0xFFFF * i / 15;
@@ -792,10 +792,13 @@ static int dm_late_init(void *handle)
params.backlight_lut_array_size = 16;
params.backlight_lut_array = linear_lut;
- ret = dmcu_load_iram(dmcu, params);
+ /* todo will enable for navi10 */
+ if (adev->asic_type <= CHIP_RAVEN) {
+ ret = dmcu_load_iram(dmcu, params);
- if (!ret)
- return -EINVAL;
+ if (!ret)
+ return -EINVAL;
+ }
return detect_mst_link_for_all_connectors(adev->ddev);
}
The patch below does not apply to the 5.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 96cb7cf13d8530099c256c053648ad576588c387 Mon Sep 17 00:00:00 2001
From: hersen wu <hersenxs.wu(a)amd.com>
Date: Thu, 28 Feb 2019 16:35:24 -0500
Subject: [PATCH] drm/amd/display: disable dcn20 abm feature for bring up
[WHY] dcn20 enable usb-c dp ALT mode in dmcu. There is bug
when enable abm feature which cause system crash. dal team
will debug this bug later.
[HOW] disable dcn abm feature for dcn20.
Signed-off-by: hersen wu <hersenxs.wu(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 5971aef4f033..72d14f680932 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -781,7 +781,7 @@ static int dm_late_init(void *handle)
unsigned int linear_lut[16];
int i;
struct dmcu *dmcu = adev->dm.dc->res_pool->dmcu;
- bool ret;
+ bool ret = false;
for (i = 0; i < 16; i++)
linear_lut[i] = 0xFFFF * i / 15;
@@ -792,10 +792,13 @@ static int dm_late_init(void *handle)
params.backlight_lut_array_size = 16;
params.backlight_lut_array = linear_lut;
- ret = dmcu_load_iram(dmcu, params);
+ /* todo will enable for navi10 */
+ if (adev->asic_type <= CHIP_RAVEN) {
+ ret = dmcu_load_iram(dmcu, params);
- if (!ret)
- return -EINVAL;
+ if (!ret)
+ return -EINVAL;
+ }
return detect_mst_link_for_all_connectors(adev->ddev);
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 9dd277ff92d06f6aa95b39936ad83981d781f49b Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Fri, 15 May 2020 17:58:02 +0200
Subject: [PATCH] spi: bcm2835: Fix controller unregister order
The BCM2835 SPI driver uses devm_spi_register_controller() on bind.
As a consequence, on unbind, __device_release_driver() first invokes
bcm2835_spi_remove() before unregistering the SPI controller via
devres_release_all().
This order is incorrect: bcm2835_spi_remove() tears down the DMA
channels and turns off the SPI controller, including its interrupts
and clock. The SPI controller is thus no longer usable.
When the SPI controller is subsequently unregistered, it unbinds all
its slave devices. If their drivers need to access the SPI bus,
e.g. to quiesce their interrupts, unbinding will fail.
As a rule, devm_spi_register_controller() must not be used if the
->remove() hook performs teardown steps which shall be performed
after unbinding of slaves.
Fix by using the non-devm variant spi_register_controller(). Note that
the struct spi_controller as well as the driver-private data are not
freed until after bcm2835_spi_remove() has finished, so accessing them
is safe.
Fixes: 247263dba208 ("spi: bcm2835: use devm_spi_register_master()")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Cc: stable(a)vger.kernel.org # v3.13+
Link: https://lore.kernel.org/r/2397dd70cdbe95e0bc4da2b9fca0f31cb94e5aed.15895575…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c
index 11c235879bb7..fd887a6492f4 100644
--- a/drivers/spi/spi-bcm2835.c
+++ b/drivers/spi/spi-bcm2835.c
@@ -1347,7 +1347,7 @@ static int bcm2835_spi_probe(struct platform_device *pdev)
goto out_dma_release;
}
- err = devm_spi_register_controller(&pdev->dev, ctlr);
+ err = spi_register_controller(ctlr);
if (err) {
dev_err(&pdev->dev, "could not register SPI controller: %d\n",
err);
@@ -1374,6 +1374,8 @@ static int bcm2835_spi_remove(struct platform_device *pdev)
bcm2835_debugfs_remove(bs);
+ spi_unregister_controller(ctlr);
+
/* Clear FIFOs, and disable the HW block */
bcm2835_wr(bs, BCM2835_SPI_CS,
BCM2835_SPI_CS_CLEAR_RX | BCM2835_SPI_CS_CLEAR_TX);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 9dd277ff92d06f6aa95b39936ad83981d781f49b Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Fri, 15 May 2020 17:58:02 +0200
Subject: [PATCH] spi: bcm2835: Fix controller unregister order
The BCM2835 SPI driver uses devm_spi_register_controller() on bind.
As a consequence, on unbind, __device_release_driver() first invokes
bcm2835_spi_remove() before unregistering the SPI controller via
devres_release_all().
This order is incorrect: bcm2835_spi_remove() tears down the DMA
channels and turns off the SPI controller, including its interrupts
and clock. The SPI controller is thus no longer usable.
When the SPI controller is subsequently unregistered, it unbinds all
its slave devices. If their drivers need to access the SPI bus,
e.g. to quiesce their interrupts, unbinding will fail.
As a rule, devm_spi_register_controller() must not be used if the
->remove() hook performs teardown steps which shall be performed
after unbinding of slaves.
Fix by using the non-devm variant spi_register_controller(). Note that
the struct spi_controller as well as the driver-private data are not
freed until after bcm2835_spi_remove() has finished, so accessing them
is safe.
Fixes: 247263dba208 ("spi: bcm2835: use devm_spi_register_master()")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Cc: stable(a)vger.kernel.org # v3.13+
Link: https://lore.kernel.org/r/2397dd70cdbe95e0bc4da2b9fca0f31cb94e5aed.15895575…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c
index 11c235879bb7..fd887a6492f4 100644
--- a/drivers/spi/spi-bcm2835.c
+++ b/drivers/spi/spi-bcm2835.c
@@ -1347,7 +1347,7 @@ static int bcm2835_spi_probe(struct platform_device *pdev)
goto out_dma_release;
}
- err = devm_spi_register_controller(&pdev->dev, ctlr);
+ err = spi_register_controller(ctlr);
if (err) {
dev_err(&pdev->dev, "could not register SPI controller: %d\n",
err);
@@ -1374,6 +1374,8 @@ static int bcm2835_spi_remove(struct platform_device *pdev)
bcm2835_debugfs_remove(bs);
+ spi_unregister_controller(ctlr);
+
/* Clear FIFOs, and disable the HW block */
bcm2835_wr(bs, BCM2835_SPI_CS,
BCM2835_SPI_CS_CLEAR_RX | BCM2835_SPI_CS_CLEAR_TX);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 9dd277ff92d06f6aa95b39936ad83981d781f49b Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Fri, 15 May 2020 17:58:02 +0200
Subject: [PATCH] spi: bcm2835: Fix controller unregister order
The BCM2835 SPI driver uses devm_spi_register_controller() on bind.
As a consequence, on unbind, __device_release_driver() first invokes
bcm2835_spi_remove() before unregistering the SPI controller via
devres_release_all().
This order is incorrect: bcm2835_spi_remove() tears down the DMA
channels and turns off the SPI controller, including its interrupts
and clock. The SPI controller is thus no longer usable.
When the SPI controller is subsequently unregistered, it unbinds all
its slave devices. If their drivers need to access the SPI bus,
e.g. to quiesce their interrupts, unbinding will fail.
As a rule, devm_spi_register_controller() must not be used if the
->remove() hook performs teardown steps which shall be performed
after unbinding of slaves.
Fix by using the non-devm variant spi_register_controller(). Note that
the struct spi_controller as well as the driver-private data are not
freed until after bcm2835_spi_remove() has finished, so accessing them
is safe.
Fixes: 247263dba208 ("spi: bcm2835: use devm_spi_register_master()")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Cc: stable(a)vger.kernel.org # v3.13+
Link: https://lore.kernel.org/r/2397dd70cdbe95e0bc4da2b9fca0f31cb94e5aed.15895575…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c
index 11c235879bb7..fd887a6492f4 100644
--- a/drivers/spi/spi-bcm2835.c
+++ b/drivers/spi/spi-bcm2835.c
@@ -1347,7 +1347,7 @@ static int bcm2835_spi_probe(struct platform_device *pdev)
goto out_dma_release;
}
- err = devm_spi_register_controller(&pdev->dev, ctlr);
+ err = spi_register_controller(ctlr);
if (err) {
dev_err(&pdev->dev, "could not register SPI controller: %d\n",
err);
@@ -1374,6 +1374,8 @@ static int bcm2835_spi_remove(struct platform_device *pdev)
bcm2835_debugfs_remove(bs);
+ spi_unregister_controller(ctlr);
+
/* Clear FIFOs, and disable the HW block */
bcm2835_wr(bs, BCM2835_SPI_CS,
BCM2835_SPI_CS_CLEAR_RX | BCM2835_SPI_CS_CLEAR_TX);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 32e5b57232c0411e7dea96625c415510430ac079 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Mon, 25 May 2020 14:25:02 +0200
Subject: [PATCH] spi: pxa2xx: Fix controller unregister order
The PXA2xx SPI driver uses devm_spi_register_controller() on bind.
As a consequence, on unbind, __device_release_driver() first invokes
pxa2xx_spi_remove() before unregistering the SPI controller via
devres_release_all().
This order is incorrect: pxa2xx_spi_remove() disables the chip,
rendering the SPI bus inaccessible even though the SPI controller is
still registered. When the SPI controller is subsequently unregistered,
it unbinds all its slave devices. Because their drivers cannot access
the SPI bus, e.g. to quiesce interrupts, the slave devices may be left
in an improper state.
As a rule, devm_spi_register_controller() must not be used if the
->remove() hook performs teardown steps which shall be performed after
unregistering the controller and specifically after unbinding of slaves.
Fix by reverting to the non-devm variant of spi_register_controller().
An alternative approach would be to use device-managed functions for all
steps in pxa2xx_spi_remove(), e.g. by calling devm_add_action_or_reset()
on probe. However that approach would add more LoC to the driver and
it wouldn't lend itself as well to backporting to stable.
The improper use of devm_spi_register_controller() was introduced in 2013
by commit a807fcd090d6 ("spi: pxa2xx: use devm_spi_register_master()"),
but all earlier versions of the driver going back to 2006 were likewise
broken because they invoked spi_unregister_master() at the end of
pxa2xx_spi_remove(), rather than at the beginning.
Fixes: e0c9905e87ac ("[PATCH] SPI: add PXA2xx SSP SPI Driver")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Cc: stable(a)vger.kernel.org # v2.6.17+
Cc: Tsuchiya Yuto <kitakar(a)gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206403#c1
Link: https://lore.kernel.org/r/834c446b1cf3284d2660f1bee1ebe3e737cd02a9.15904084…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c
index 73d2a65d0b6e..f456ce18f79e 100644
--- a/drivers/spi/spi-pxa2xx.c
+++ b/drivers/spi/spi-pxa2xx.c
@@ -1884,7 +1884,7 @@ static int pxa2xx_spi_probe(struct platform_device *pdev)
/* Register with the SPI framework */
platform_set_drvdata(pdev, drv_data);
- status = devm_spi_register_controller(&pdev->dev, controller);
+ status = spi_register_controller(controller);
if (status != 0) {
dev_err(&pdev->dev, "problem registering spi controller\n");
goto out_error_pm_runtime_enabled;
@@ -1916,6 +1916,8 @@ static int pxa2xx_spi_remove(struct platform_device *pdev)
pm_runtime_get_sync(&pdev->dev);
+ spi_unregister_controller(drv_data->controller);
+
/* Disable the SSP at the peripheral and SOC level */
pxa2xx_spi_write(drv_data, SSCR0, 0);
clk_disable_unprepare(ssp->clk);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 32e5b57232c0411e7dea96625c415510430ac079 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Mon, 25 May 2020 14:25:02 +0200
Subject: [PATCH] spi: pxa2xx: Fix controller unregister order
The PXA2xx SPI driver uses devm_spi_register_controller() on bind.
As a consequence, on unbind, __device_release_driver() first invokes
pxa2xx_spi_remove() before unregistering the SPI controller via
devres_release_all().
This order is incorrect: pxa2xx_spi_remove() disables the chip,
rendering the SPI bus inaccessible even though the SPI controller is
still registered. When the SPI controller is subsequently unregistered,
it unbinds all its slave devices. Because their drivers cannot access
the SPI bus, e.g. to quiesce interrupts, the slave devices may be left
in an improper state.
As a rule, devm_spi_register_controller() must not be used if the
->remove() hook performs teardown steps which shall be performed after
unregistering the controller and specifically after unbinding of slaves.
Fix by reverting to the non-devm variant of spi_register_controller().
An alternative approach would be to use device-managed functions for all
steps in pxa2xx_spi_remove(), e.g. by calling devm_add_action_or_reset()
on probe. However that approach would add more LoC to the driver and
it wouldn't lend itself as well to backporting to stable.
The improper use of devm_spi_register_controller() was introduced in 2013
by commit a807fcd090d6 ("spi: pxa2xx: use devm_spi_register_master()"),
but all earlier versions of the driver going back to 2006 were likewise
broken because they invoked spi_unregister_master() at the end of
pxa2xx_spi_remove(), rather than at the beginning.
Fixes: e0c9905e87ac ("[PATCH] SPI: add PXA2xx SSP SPI Driver")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Cc: stable(a)vger.kernel.org # v2.6.17+
Cc: Tsuchiya Yuto <kitakar(a)gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206403#c1
Link: https://lore.kernel.org/r/834c446b1cf3284d2660f1bee1ebe3e737cd02a9.15904084…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c
index 73d2a65d0b6e..f456ce18f79e 100644
--- a/drivers/spi/spi-pxa2xx.c
+++ b/drivers/spi/spi-pxa2xx.c
@@ -1884,7 +1884,7 @@ static int pxa2xx_spi_probe(struct platform_device *pdev)
/* Register with the SPI framework */
platform_set_drvdata(pdev, drv_data);
- status = devm_spi_register_controller(&pdev->dev, controller);
+ status = spi_register_controller(controller);
if (status != 0) {
dev_err(&pdev->dev, "problem registering spi controller\n");
goto out_error_pm_runtime_enabled;
@@ -1916,6 +1916,8 @@ static int pxa2xx_spi_remove(struct platform_device *pdev)
pm_runtime_get_sync(&pdev->dev);
+ spi_unregister_controller(drv_data->controller);
+
/* Disable the SSP at the peripheral and SOC level */
pxa2xx_spi_write(drv_data, SSCR0, 0);
clk_disable_unprepare(ssp->clk);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 32e5b57232c0411e7dea96625c415510430ac079 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Mon, 25 May 2020 14:25:02 +0200
Subject: [PATCH] spi: pxa2xx: Fix controller unregister order
The PXA2xx SPI driver uses devm_spi_register_controller() on bind.
As a consequence, on unbind, __device_release_driver() first invokes
pxa2xx_spi_remove() before unregistering the SPI controller via
devres_release_all().
This order is incorrect: pxa2xx_spi_remove() disables the chip,
rendering the SPI bus inaccessible even though the SPI controller is
still registered. When the SPI controller is subsequently unregistered,
it unbinds all its slave devices. Because their drivers cannot access
the SPI bus, e.g. to quiesce interrupts, the slave devices may be left
in an improper state.
As a rule, devm_spi_register_controller() must not be used if the
->remove() hook performs teardown steps which shall be performed after
unregistering the controller and specifically after unbinding of slaves.
Fix by reverting to the non-devm variant of spi_register_controller().
An alternative approach would be to use device-managed functions for all
steps in pxa2xx_spi_remove(), e.g. by calling devm_add_action_or_reset()
on probe. However that approach would add more LoC to the driver and
it wouldn't lend itself as well to backporting to stable.
The improper use of devm_spi_register_controller() was introduced in 2013
by commit a807fcd090d6 ("spi: pxa2xx: use devm_spi_register_master()"),
but all earlier versions of the driver going back to 2006 were likewise
broken because they invoked spi_unregister_master() at the end of
pxa2xx_spi_remove(), rather than at the beginning.
Fixes: e0c9905e87ac ("[PATCH] SPI: add PXA2xx SSP SPI Driver")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Cc: stable(a)vger.kernel.org # v2.6.17+
Cc: Tsuchiya Yuto <kitakar(a)gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206403#c1
Link: https://lore.kernel.org/r/834c446b1cf3284d2660f1bee1ebe3e737cd02a9.15904084…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c
index 73d2a65d0b6e..f456ce18f79e 100644
--- a/drivers/spi/spi-pxa2xx.c
+++ b/drivers/spi/spi-pxa2xx.c
@@ -1884,7 +1884,7 @@ static int pxa2xx_spi_probe(struct platform_device *pdev)
/* Register with the SPI framework */
platform_set_drvdata(pdev, drv_data);
- status = devm_spi_register_controller(&pdev->dev, controller);
+ status = spi_register_controller(controller);
if (status != 0) {
dev_err(&pdev->dev, "problem registering spi controller\n");
goto out_error_pm_runtime_enabled;
@@ -1916,6 +1916,8 @@ static int pxa2xx_spi_remove(struct platform_device *pdev)
pm_runtime_get_sync(&pdev->dev);
+ spi_unregister_controller(drv_data->controller);
+
/* Disable the SSP at the peripheral and SOC level */
pxa2xx_spi_write(drv_data, SSCR0, 0);
clk_disable_unprepare(ssp->clk);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 84855678add8aba927faf76bc2f130a40f94b6f7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Fri, 15 May 2020 17:58:01 +0200
Subject: [PATCH] spi: Fix controller unregister order
When an SPI controller unregisters, it unbinds all its slave devices.
For this, their drivers may need to access the SPI bus, e.g. to quiesce
interrupts.
However since commit ffbbdd21329f ("spi: create a message queueing
infrastructure"), spi_destroy_queue() is executed before unbinding the
slaves. It sets ctlr->running = false, thereby preventing SPI bus
access and causing unbinding of slave devices to fail.
Fix by unbinding slaves before calling spi_destroy_queue().
Fixes: ffbbdd21329f ("spi: create a message queueing infrastructure")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Cc: stable(a)vger.kernel.org # v3.4+
Cc: Linus Walleij <linus.walleij(a)linaro.org>
Link: https://lore.kernel.org/r/8aaf9d44c153fe233b17bc2dec4eb679898d7e7b.15895575…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 531d1de85f7f..25eb96d329ab 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -2763,6 +2763,8 @@ void spi_unregister_controller(struct spi_controller *ctlr)
struct spi_controller *found;
int id = ctlr->bus_num;
+ device_for_each_child(&ctlr->dev, NULL, __unregister);
+
/* First make sure that this controller was ever added */
mutex_lock(&board_lock);
found = idr_find(&spi_master_idr, id);
@@ -2775,7 +2777,6 @@ void spi_unregister_controller(struct spi_controller *ctlr)
list_del(&ctlr->list);
mutex_unlock(&board_lock);
- device_for_each_child(&ctlr->dev, NULL, __unregister);
device_unregister(&ctlr->dev);
/* free bus id */
mutex_lock(&board_lock);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 84855678add8aba927faf76bc2f130a40f94b6f7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Fri, 15 May 2020 17:58:01 +0200
Subject: [PATCH] spi: Fix controller unregister order
When an SPI controller unregisters, it unbinds all its slave devices.
For this, their drivers may need to access the SPI bus, e.g. to quiesce
interrupts.
However since commit ffbbdd21329f ("spi: create a message queueing
infrastructure"), spi_destroy_queue() is executed before unbinding the
slaves. It sets ctlr->running = false, thereby preventing SPI bus
access and causing unbinding of slave devices to fail.
Fix by unbinding slaves before calling spi_destroy_queue().
Fixes: ffbbdd21329f ("spi: create a message queueing infrastructure")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Cc: stable(a)vger.kernel.org # v3.4+
Cc: Linus Walleij <linus.walleij(a)linaro.org>
Link: https://lore.kernel.org/r/8aaf9d44c153fe233b17bc2dec4eb679898d7e7b.15895575…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 531d1de85f7f..25eb96d329ab 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -2763,6 +2763,8 @@ void spi_unregister_controller(struct spi_controller *ctlr)
struct spi_controller *found;
int id = ctlr->bus_num;
+ device_for_each_child(&ctlr->dev, NULL, __unregister);
+
/* First make sure that this controller was ever added */
mutex_lock(&board_lock);
found = idr_find(&spi_master_idr, id);
@@ -2775,7 +2777,6 @@ void spi_unregister_controller(struct spi_controller *ctlr)
list_del(&ctlr->list);
mutex_unlock(&board_lock);
- device_for_each_child(&ctlr->dev, NULL, __unregister);
device_unregister(&ctlr->dev);
/* free bus id */
mutex_lock(&board_lock);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 84855678add8aba927faf76bc2f130a40f94b6f7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Fri, 15 May 2020 17:58:01 +0200
Subject: [PATCH] spi: Fix controller unregister order
When an SPI controller unregisters, it unbinds all its slave devices.
For this, their drivers may need to access the SPI bus, e.g. to quiesce
interrupts.
However since commit ffbbdd21329f ("spi: create a message queueing
infrastructure"), spi_destroy_queue() is executed before unbinding the
slaves. It sets ctlr->running = false, thereby preventing SPI bus
access and causing unbinding of slave devices to fail.
Fix by unbinding slaves before calling spi_destroy_queue().
Fixes: ffbbdd21329f ("spi: create a message queueing infrastructure")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Cc: stable(a)vger.kernel.org # v3.4+
Cc: Linus Walleij <linus.walleij(a)linaro.org>
Link: https://lore.kernel.org/r/8aaf9d44c153fe233b17bc2dec4eb679898d7e7b.15895575…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 531d1de85f7f..25eb96d329ab 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -2763,6 +2763,8 @@ void spi_unregister_controller(struct spi_controller *ctlr)
struct spi_controller *found;
int id = ctlr->bus_num;
+ device_for_each_child(&ctlr->dev, NULL, __unregister);
+
/* First make sure that this controller was ever added */
mutex_lock(&board_lock);
found = idr_find(&spi_master_idr, id);
@@ -2775,7 +2777,6 @@ void spi_unregister_controller(struct spi_controller *ctlr)
list_del(&ctlr->list);
mutex_unlock(&board_lock);
- device_for_each_child(&ctlr->dev, NULL, __unregister);
device_unregister(&ctlr->dev);
/* free bus id */
mutex_lock(&board_lock);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ca8b19d61e3fce5d2d7790cde27a0b57bcb3f341 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Mon, 25 May 2020 14:25:01 +0200
Subject: [PATCH] spi: dw: Fix controller unregister order
The Designware SPI driver uses devm_spi_register_controller() on bind.
As a consequence, on unbind, __device_release_driver() first invokes
dw_spi_remove_host() before unregistering the SPI controller via
devres_release_all().
This order is incorrect: dw_spi_remove_host() shuts down the chip,
rendering the SPI bus inaccessible even though the SPI controller is
still registered. When the SPI controller is subsequently unregistered,
it unbinds all its slave devices. Because their drivers cannot access
the SPI bus, e.g. to quiesce interrupts, the slave devices may be left
in an improper state.
As a rule, devm_spi_register_controller() must not be used if the
->remove() hook performs teardown steps which shall be performed after
unregistering the controller and specifically after unbinding of slaves.
Fix by reverting to the non-devm variant of spi_register_controller().
An alternative approach would be to use device-managed functions for all
steps in dw_spi_remove_host(), e.g. by calling devm_add_action_or_reset()
on probe. However that approach would add more LoC to the driver and
it wouldn't lend itself as well to backporting to stable.
Fixes: 04f421e7b0b1 ("spi: dw: use managed resources")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Cc: stable(a)vger.kernel.org # v3.14+
Cc: Baruch Siach <baruch(a)tkos.co.il>
Link: https://lore.kernel.org/r/3fff8cb8ae44a9893840d0688be15bb88c090a14.15904084…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-dw.c b/drivers/spi/spi-dw.c
index 31e3f866d11a..780ffad64a91 100644
--- a/drivers/spi/spi-dw.c
+++ b/drivers/spi/spi-dw.c
@@ -526,7 +526,7 @@ int dw_spi_add_host(struct device *dev, struct dw_spi *dws)
}
}
- ret = devm_spi_register_controller(dev, master);
+ ret = spi_register_controller(master);
if (ret) {
dev_err(&master->dev, "problem registering spi master\n");
goto err_dma_exit;
@@ -550,6 +550,8 @@ void dw_spi_remove_host(struct dw_spi *dws)
{
dw_spi_debugfs_remove(dws);
+ spi_unregister_controller(dws->master);
+
if (dws->dma_ops && dws->dma_ops->dma_exit)
dws->dma_ops->dma_exit(dws);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ca8b19d61e3fce5d2d7790cde27a0b57bcb3f341 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Mon, 25 May 2020 14:25:01 +0200
Subject: [PATCH] spi: dw: Fix controller unregister order
The Designware SPI driver uses devm_spi_register_controller() on bind.
As a consequence, on unbind, __device_release_driver() first invokes
dw_spi_remove_host() before unregistering the SPI controller via
devres_release_all().
This order is incorrect: dw_spi_remove_host() shuts down the chip,
rendering the SPI bus inaccessible even though the SPI controller is
still registered. When the SPI controller is subsequently unregistered,
it unbinds all its slave devices. Because their drivers cannot access
the SPI bus, e.g. to quiesce interrupts, the slave devices may be left
in an improper state.
As a rule, devm_spi_register_controller() must not be used if the
->remove() hook performs teardown steps which shall be performed after
unregistering the controller and specifically after unbinding of slaves.
Fix by reverting to the non-devm variant of spi_register_controller().
An alternative approach would be to use device-managed functions for all
steps in dw_spi_remove_host(), e.g. by calling devm_add_action_or_reset()
on probe. However that approach would add more LoC to the driver and
it wouldn't lend itself as well to backporting to stable.
Fixes: 04f421e7b0b1 ("spi: dw: use managed resources")
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Cc: stable(a)vger.kernel.org # v3.14+
Cc: Baruch Siach <baruch(a)tkos.co.il>
Link: https://lore.kernel.org/r/3fff8cb8ae44a9893840d0688be15bb88c090a14.15904084…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-dw.c b/drivers/spi/spi-dw.c
index 31e3f866d11a..780ffad64a91 100644
--- a/drivers/spi/spi-dw.c
+++ b/drivers/spi/spi-dw.c
@@ -526,7 +526,7 @@ int dw_spi_add_host(struct device *dev, struct dw_spi *dws)
}
}
- ret = devm_spi_register_controller(dev, master);
+ ret = spi_register_controller(master);
if (ret) {
dev_err(&master->dev, "problem registering spi master\n");
goto err_dma_exit;
@@ -550,6 +550,8 @@ void dw_spi_remove_host(struct dw_spi *dws)
{
dw_spi_debugfs_remove(dws);
+ spi_unregister_controller(dws->master);
+
if (dws->dma_ops && dws->dma_ops->dma_exit)
dws->dma_ops->dma_exit(dws);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 951e2736f4b11b58dc44d41964fa17c3527d882a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Miros=C5=82aw?= <mirq-linux(a)rere.qmqm.pl>
Date: Mon, 8 Jun 2020 18:50:39 +0200
Subject: [PATCH] ALSA: pcm: disallow linking stream to itself
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Prevent SNDRV_PCM_IOCTL_LINK linking stream to itself - the code
can't handle it. Fixed commit is not where bug was introduced, but
changes the context significantly.
Cc: stable(a)vger.kernel.org
Fixes: 0888c321de70 ("pcm_native: switch to fdget()/fdput()")
Signed-off-by: Michał Mirosław <mirq-linux(a)rere.qmqm.pl>
Link: https://lore.kernel.org/r/89c4a2487609a0ed6af3ecf01cc972bdc59a7a2d.15916349…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index c08732998a42..eeab8850ed76 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -2176,6 +2176,12 @@ static int snd_pcm_link(struct snd_pcm_substream *substream, int fd)
}
pcm_file = f.file->private_data;
substream1 = pcm_file->substream;
+
+ if (substream == substream1) {
+ res = -EINVAL;
+ goto _badf;
+ }
+
group = kzalloc(sizeof(*group), GFP_KERNEL);
if (!group) {
res = -ENOMEM;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 951e2736f4b11b58dc44d41964fa17c3527d882a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Miros=C5=82aw?= <mirq-linux(a)rere.qmqm.pl>
Date: Mon, 8 Jun 2020 18:50:39 +0200
Subject: [PATCH] ALSA: pcm: disallow linking stream to itself
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Prevent SNDRV_PCM_IOCTL_LINK linking stream to itself - the code
can't handle it. Fixed commit is not where bug was introduced, but
changes the context significantly.
Cc: stable(a)vger.kernel.org
Fixes: 0888c321de70 ("pcm_native: switch to fdget()/fdput()")
Signed-off-by: Michał Mirosław <mirq-linux(a)rere.qmqm.pl>
Link: https://lore.kernel.org/r/89c4a2487609a0ed6af3ecf01cc972bdc59a7a2d.15916349…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index c08732998a42..eeab8850ed76 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -2176,6 +2176,12 @@ static int snd_pcm_link(struct snd_pcm_substream *substream, int fd)
}
pcm_file = f.file->private_data;
substream1 = pcm_file->substream;
+
+ if (substream == substream1) {
+ res = -EINVAL;
+ goto _badf;
+ }
+
group = kzalloc(sizeof(*group), GFP_KERNEL);
if (!group) {
res = -ENOMEM;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 951e2736f4b11b58dc44d41964fa17c3527d882a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Miros=C5=82aw?= <mirq-linux(a)rere.qmqm.pl>
Date: Mon, 8 Jun 2020 18:50:39 +0200
Subject: [PATCH] ALSA: pcm: disallow linking stream to itself
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Prevent SNDRV_PCM_IOCTL_LINK linking stream to itself - the code
can't handle it. Fixed commit is not where bug was introduced, but
changes the context significantly.
Cc: stable(a)vger.kernel.org
Fixes: 0888c321de70 ("pcm_native: switch to fdget()/fdput()")
Signed-off-by: Michał Mirosław <mirq-linux(a)rere.qmqm.pl>
Link: https://lore.kernel.org/r/89c4a2487609a0ed6af3ecf01cc972bdc59a7a2d.15916349…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index c08732998a42..eeab8850ed76 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -2176,6 +2176,12 @@ static int snd_pcm_link(struct snd_pcm_substream *substream, int fd)
}
pcm_file = f.file->private_data;
substream1 = pcm_file->substream;
+
+ if (substream == substream1) {
+ res = -EINVAL;
+ goto _badf;
+ }
+
group = kzalloc(sizeof(*group), GFP_KERNEL);
if (!group) {
res = -ENOMEM;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f5f27b79eab80de0287c243a22169e4876b08d5e Mon Sep 17 00:00:00 2001
From: Franck LENORMAND <franck.lenormand(a)nxp.com>
Date: Thu, 26 Mar 2020 00:00:05 +0200
Subject: [PATCH] firmware: imx: scu: Fix corruption of header
The header of the message to send can be changed if the
response is longer than the request:
- 1st word, the header is sent
- the remaining words of the message are sent
- the response is received asynchronously during the
execution of the loop, changing the size field in
the header
- the for loop test the termination condition using
the corrupted header
It is the case for the API build_info which has just a
header as request but 3 words in response.
This issue is fixed storing the header locally instead of
using a pointer on it.
Fixes: edbee095fafb (firmware: imx: add SCU firmware driver support)
Signed-off-by: Franck LENORMAND <franck.lenormand(a)nxp.com>
Reviewed-by: Leonard Crestez <leonard.crestez(a)nxp.com>
Signed-off-by: Leonard Crestez <leonard.crestez(a)nxp.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Dong Aisheng <aisheng.dong(a)nxp.com>
Signed-off-by: Shawn Guo <shawnguo(a)kernel.org>
diff --git a/drivers/firmware/imx/imx-scu.c b/drivers/firmware/imx/imx-scu.c
index e94a5585b698..b3da2e193ad2 100644
--- a/drivers/firmware/imx/imx-scu.c
+++ b/drivers/firmware/imx/imx-scu.c
@@ -158,7 +158,7 @@ static void imx_scu_rx_callback(struct mbox_client *c, void *msg)
static int imx_scu_ipc_write(struct imx_sc_ipc *sc_ipc, void *msg)
{
- struct imx_sc_rpc_msg *hdr = msg;
+ struct imx_sc_rpc_msg hdr = *(struct imx_sc_rpc_msg *)msg;
struct imx_sc_chan *sc_chan;
u32 *data = msg;
int ret;
@@ -166,13 +166,13 @@ static int imx_scu_ipc_write(struct imx_sc_ipc *sc_ipc, void *msg)
int i;
/* Check size */
- if (hdr->size > IMX_SC_RPC_MAX_MSG)
+ if (hdr.size > IMX_SC_RPC_MAX_MSG)
return -EINVAL;
- dev_dbg(sc_ipc->dev, "RPC SVC %u FUNC %u SIZE %u\n", hdr->svc,
- hdr->func, hdr->size);
+ dev_dbg(sc_ipc->dev, "RPC SVC %u FUNC %u SIZE %u\n", hdr.svc,
+ hdr.func, hdr.size);
- size = sc_ipc->fast_ipc ? 1 : hdr->size;
+ size = sc_ipc->fast_ipc ? 1 : hdr.size;
for (i = 0; i < size; i++) {
sc_chan = &sc_ipc->chans[i % 4];
The patch below does not apply to the 5.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f5f27b79eab80de0287c243a22169e4876b08d5e Mon Sep 17 00:00:00 2001
From: Franck LENORMAND <franck.lenormand(a)nxp.com>
Date: Thu, 26 Mar 2020 00:00:05 +0200
Subject: [PATCH] firmware: imx: scu: Fix corruption of header
The header of the message to send can be changed if the
response is longer than the request:
- 1st word, the header is sent
- the remaining words of the message are sent
- the response is received asynchronously during the
execution of the loop, changing the size field in
the header
- the for loop test the termination condition using
the corrupted header
It is the case for the API build_info which has just a
header as request but 3 words in response.
This issue is fixed storing the header locally instead of
using a pointer on it.
Fixes: edbee095fafb (firmware: imx: add SCU firmware driver support)
Signed-off-by: Franck LENORMAND <franck.lenormand(a)nxp.com>
Reviewed-by: Leonard Crestez <leonard.crestez(a)nxp.com>
Signed-off-by: Leonard Crestez <leonard.crestez(a)nxp.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Dong Aisheng <aisheng.dong(a)nxp.com>
Signed-off-by: Shawn Guo <shawnguo(a)kernel.org>
diff --git a/drivers/firmware/imx/imx-scu.c b/drivers/firmware/imx/imx-scu.c
index e94a5585b698..b3da2e193ad2 100644
--- a/drivers/firmware/imx/imx-scu.c
+++ b/drivers/firmware/imx/imx-scu.c
@@ -158,7 +158,7 @@ static void imx_scu_rx_callback(struct mbox_client *c, void *msg)
static int imx_scu_ipc_write(struct imx_sc_ipc *sc_ipc, void *msg)
{
- struct imx_sc_rpc_msg *hdr = msg;
+ struct imx_sc_rpc_msg hdr = *(struct imx_sc_rpc_msg *)msg;
struct imx_sc_chan *sc_chan;
u32 *data = msg;
int ret;
@@ -166,13 +166,13 @@ static int imx_scu_ipc_write(struct imx_sc_ipc *sc_ipc, void *msg)
int i;
/* Check size */
- if (hdr->size > IMX_SC_RPC_MAX_MSG)
+ if (hdr.size > IMX_SC_RPC_MAX_MSG)
return -EINVAL;
- dev_dbg(sc_ipc->dev, "RPC SVC %u FUNC %u SIZE %u\n", hdr->svc,
- hdr->func, hdr->size);
+ dev_dbg(sc_ipc->dev, "RPC SVC %u FUNC %u SIZE %u\n", hdr.svc,
+ hdr.func, hdr.size);
- size = sc_ipc->fast_ipc ? 1 : hdr->size;
+ size = sc_ipc->fast_ipc ? 1 : hdr.size;
for (i = 0; i < size; i++) {
sc_chan = &sc_ipc->chans[i % 4];
The patch below was submitted to be applied to the 5.7-stable tree.
I fail to see how this patch meets the stable kernel rules as found at
Documentation/process/stable-kernel-rules.rst.
I could be totally wrong, and if so, please respond to
<stable(a)vger.kernel.org> and let me know why this patch should be
applied. Otherwise, it is now dropped from my patch queues, never to be
seen again.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 82e9367c43890cb6a870f700c9180c7eb2035684 Mon Sep 17 00:00:00 2001
From: Steve French <stfrench(a)microsoft.com>
Date: Tue, 19 May 2020 03:06:57 -0500
Subject: [PATCH] smb3: Add new parm "nodelete"
In order to handle workloads where it is important to make sure that
a buggy app did not delete content on the drive, the new mount option
"nodelete" allows standard permission checks on the server to work,
but prevents on the client any attempts to unlink a file or delete
a directory on that mount point. This can be helpful when running
a little understood app on a network mount that contains important
content that should not be deleted.
Signed-off-by: Steve French <stfrench(a)microsoft.com>
CC: Stable <stable(a)vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov(a)microsoft.com>
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index c31f362fa098..889f9c71049b 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -534,6 +534,8 @@ cifs_show_options(struct seq_file *s, struct dentry *root)
seq_puts(s, ",signloosely");
if (tcon->nocase)
seq_puts(s, ",nocase");
+ if (tcon->nodelete)
+ seq_puts(s, ",nodelete");
if (tcon->local_lease)
seq_puts(s, ",locallease");
if (tcon->retry)
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 39b708d9d86d..4d261fd78fcb 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -562,6 +562,7 @@ struct smb_vol {
bool override_gid:1;
bool dynperm:1;
bool noperm:1;
+ bool nodelete:1;
bool mode_ace:1;
bool no_psx_acl:1; /* set if posix acl support should be disabled */
bool cifs_acl:1;
@@ -1136,6 +1137,7 @@ struct cifs_tcon {
bool retry:1;
bool nocase:1;
bool nohandlecache:1; /* if strange server resource prob can turn off */
+ bool nodelete:1;
bool seal:1; /* transport encryption for this mounted share */
bool unix_ext:1; /* if false disable Linux extensions to CIFS protocol
for this mount even if server would support */
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 329babc6b18a..57d1cc6bf86f 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -75,7 +75,7 @@ enum {
Opt_forceuid, Opt_noforceuid,
Opt_forcegid, Opt_noforcegid,
Opt_noblocksend, Opt_noautotune, Opt_nolease,
- Opt_hard, Opt_soft, Opt_perm, Opt_noperm,
+ Opt_hard, Opt_soft, Opt_perm, Opt_noperm, Opt_nodelete,
Opt_mapposix, Opt_nomapposix,
Opt_mapchars, Opt_nomapchars, Opt_sfu,
Opt_nosfu, Opt_nodfs, Opt_posixpaths,
@@ -141,6 +141,7 @@ static const match_table_t cifs_mount_option_tokens = {
{ Opt_soft, "soft" },
{ Opt_perm, "perm" },
{ Opt_noperm, "noperm" },
+ { Opt_nodelete, "nodelete" },
{ Opt_mapchars, "mapchars" }, /* SFU style */
{ Opt_nomapchars, "nomapchars" },
{ Opt_mapposix, "mapposix" }, /* SFM style */
@@ -1760,6 +1761,9 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
case Opt_noperm:
vol->noperm = 1;
break;
+ case Opt_nodelete:
+ vol->nodelete = 1;
+ break;
case Opt_mapchars:
vol->sfu_remap = true;
vol->remap = false; /* disable SFM mapping */
@@ -3362,6 +3366,8 @@ static int match_tcon(struct cifs_tcon *tcon, struct smb_vol *volume_info)
return 0;
if (tcon->no_lease != volume_info->no_lease)
return 0;
+ if (tcon->nodelete != volume_info->nodelete)
+ return 0;
return 1;
}
@@ -3597,6 +3603,7 @@ cifs_get_tcon(struct cifs_ses *ses, struct smb_vol *volume_info)
tcon->retry = volume_info->retry;
tcon->nocase = volume_info->nocase;
tcon->nohandlecache = volume_info->nohandlecache;
+ tcon->nodelete = volume_info->nodelete;
tcon->local_lease = volume_info->local_lease;
INIT_LIST_HEAD(&tcon->pending_opens);
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 5d2965a23730..873b1effd412 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1418,6 +1418,11 @@ int cifs_unlink(struct inode *dir, struct dentry *dentry)
xid = get_xid();
+ if (tcon->nodelete) {
+ rc = -EACCES;
+ goto unlink_out;
+ }
+
/* Unlink can be called from rename so we can not take the
* sb->s_vfs_rename_mutex here */
full_path = build_path_from_dentry(dentry);
@@ -1746,6 +1751,12 @@ int cifs_rmdir(struct inode *inode, struct dentry *direntry)
goto rmdir_exit;
}
+ if (tcon->nodelete) {
+ rc = -EACCES;
+ cifs_put_tlink(tlink);
+ goto rmdir_exit;
+ }
+
rc = server->ops->rmdir(xid, tcon, full_path, cifs_sb);
cifs_put_tlink(tlink);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 17fae1294ad9d711b2c3dd0edef479d40c76a5e8 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck(a)intel.com>
Date: Wed, 20 May 2020 09:35:46 -0700
Subject: [PATCH] x86/{mce,mm}: Unmap the entire page if the whole page is
affected and poisoned
An interesting thing happened when a guest Linux instance took a machine
check. The VMM unmapped the bad page from guest physical space and
passed the machine check to the guest.
Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:
do_memory_failure
set_mce_nospec
set_memory_uc
_set_memory_uc
change_page_attr_set_clr
cpa_flush
clflush_cache_range_opt
This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because is appeared that the
guest was accessing the bad page.
Fix is to check the scope of the poison by checking the MCi_MISC register.
If the entire page is affected, then unmap the page. If only part of the
page is affected, then mark the page as uncacheable.
This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).
[ bp: Adjust to x86/entry changes. ]
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <juew(a)google.com>
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Jue Wang <juew(a)google.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel…
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ec2c0a094b5d..5948218f35c5 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,28 +86,35 @@ int set_direct_map_default_noflush(struct page *page);
extern int kernel_set_to_readonly;
#ifdef CONFIG_X86_64
-static inline int set_mce_nospec(unsigned long pfn)
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
unsigned long decoy_addr;
int rc;
/*
- * Mark the linear address as UC to make sure we don't log more
- * errors because of speculative access to the page.
* We would like to just call:
- * set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+ * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
* but doing that would radically increase the odds of a
* speculative access to the poison page because we'd have
* the virtual address of the kernel 1:1 mapping sitting
* around in registers.
* Instead we get tricky. We create a non-canonical address
* that looks just like the one we want, but has bit 63 flipped.
- * This relies on set_memory_uc() properly sanitizing any __pa()
+ * This relies on set_memory_XX() properly sanitizing any __pa()
* results with __PHYSICAL_MASK or PTE_PFN_MASK.
*/
decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
- rc = set_memory_uc(decoy_addr, 1);
+ if (unmap)
+ rc = set_memory_np(decoy_addr, 1);
+ else
+ rc = set_memory_uc(decoy_addr, 1);
if (rc)
pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
return rc;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 30413325de22..ce9120c4f740 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -520,6 +520,14 @@ bool mce_is_memory_error(struct mce *m)
}
EXPORT_SYMBOL_GPL(mce_is_memory_error);
+static bool whole_page(struct mce *m)
+{
+ if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+ return true;
+
+ return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
bool mce_is_correctable(struct mce *m)
{
if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
@@ -573,7 +581,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0)) {
- set_mce_nospec(pfn);
+ set_mce_nospec(pfn, whole_page(mce));
mce->kflags |= MCE_HANDLED_UC;
}
@@ -1173,11 +1181,12 @@ static void kill_me_maybe(struct callback_head *cb)
int flags = MF_ACTION_REQUIRED;
pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
- if (!(p->mce_status & MCG_STATUS_RIPV))
+
+ if (!p->mce_ripv)
flags |= MF_MUST_KILL;
if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags)) {
- set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
+ set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
return;
}
@@ -1331,7 +1340,8 @@ void noinstr do_machine_check(struct pt_regs *regs)
BUG_ON(!on_thread_stack() || !user_mode(regs));
current->mce_addr = m.addr;
- current->mce_status = m.mcgstatus;
+ current->mce_ripv = !!(m.mcgstatus & MCG_STATUS_RIPV);
+ current->mce_whole_page = whole_page(&m);
current->mce_kill_me.func = kill_me_maybe;
if (kill_it)
current->mce_kill_me.func = kill_me_now;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5d96e3e7fff..62c1de522fc5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1304,7 +1304,9 @@ struct task_struct {
#ifdef CONFIG_X86_MCE
u64 mce_addr;
- u64 mce_status;
+ __u64 mce_ripv : 1,
+ mce_whole_page : 1,
+ __mce_reserved : 62;
struct callback_head mce_kill_me;
#endif
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 86281ac7c305..860e0f843c12 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -26,7 +26,7 @@ static inline int set_direct_map_default_noflush(struct page *page)
#endif
#ifndef set_mce_nospec
-static inline int set_mce_nospec(unsigned long pfn)
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
return 0;
}
The patch below does not apply to the 5.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 17fae1294ad9d711b2c3dd0edef479d40c76a5e8 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck(a)intel.com>
Date: Wed, 20 May 2020 09:35:46 -0700
Subject: [PATCH] x86/{mce,mm}: Unmap the entire page if the whole page is
affected and poisoned
An interesting thing happened when a guest Linux instance took a machine
check. The VMM unmapped the bad page from guest physical space and
passed the machine check to the guest.
Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:
do_memory_failure
set_mce_nospec
set_memory_uc
_set_memory_uc
change_page_attr_set_clr
cpa_flush
clflush_cache_range_opt
This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because is appeared that the
guest was accessing the bad page.
Fix is to check the scope of the poison by checking the MCi_MISC register.
If the entire page is affected, then unmap the page. If only part of the
page is affected, then mark the page as uncacheable.
This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).
[ bp: Adjust to x86/entry changes. ]
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <juew(a)google.com>
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Jue Wang <juew(a)google.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel…
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ec2c0a094b5d..5948218f35c5 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,28 +86,35 @@ int set_direct_map_default_noflush(struct page *page);
extern int kernel_set_to_readonly;
#ifdef CONFIG_X86_64
-static inline int set_mce_nospec(unsigned long pfn)
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
unsigned long decoy_addr;
int rc;
/*
- * Mark the linear address as UC to make sure we don't log more
- * errors because of speculative access to the page.
* We would like to just call:
- * set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+ * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
* but doing that would radically increase the odds of a
* speculative access to the poison page because we'd have
* the virtual address of the kernel 1:1 mapping sitting
* around in registers.
* Instead we get tricky. We create a non-canonical address
* that looks just like the one we want, but has bit 63 flipped.
- * This relies on set_memory_uc() properly sanitizing any __pa()
+ * This relies on set_memory_XX() properly sanitizing any __pa()
* results with __PHYSICAL_MASK or PTE_PFN_MASK.
*/
decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
- rc = set_memory_uc(decoy_addr, 1);
+ if (unmap)
+ rc = set_memory_np(decoy_addr, 1);
+ else
+ rc = set_memory_uc(decoy_addr, 1);
if (rc)
pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
return rc;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 30413325de22..ce9120c4f740 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -520,6 +520,14 @@ bool mce_is_memory_error(struct mce *m)
}
EXPORT_SYMBOL_GPL(mce_is_memory_error);
+static bool whole_page(struct mce *m)
+{
+ if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+ return true;
+
+ return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
bool mce_is_correctable(struct mce *m)
{
if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
@@ -573,7 +581,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0)) {
- set_mce_nospec(pfn);
+ set_mce_nospec(pfn, whole_page(mce));
mce->kflags |= MCE_HANDLED_UC;
}
@@ -1173,11 +1181,12 @@ static void kill_me_maybe(struct callback_head *cb)
int flags = MF_ACTION_REQUIRED;
pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
- if (!(p->mce_status & MCG_STATUS_RIPV))
+
+ if (!p->mce_ripv)
flags |= MF_MUST_KILL;
if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags)) {
- set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
+ set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
return;
}
@@ -1331,7 +1340,8 @@ void noinstr do_machine_check(struct pt_regs *regs)
BUG_ON(!on_thread_stack() || !user_mode(regs));
current->mce_addr = m.addr;
- current->mce_status = m.mcgstatus;
+ current->mce_ripv = !!(m.mcgstatus & MCG_STATUS_RIPV);
+ current->mce_whole_page = whole_page(&m);
current->mce_kill_me.func = kill_me_maybe;
if (kill_it)
current->mce_kill_me.func = kill_me_now;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5d96e3e7fff..62c1de522fc5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1304,7 +1304,9 @@ struct task_struct {
#ifdef CONFIG_X86_MCE
u64 mce_addr;
- u64 mce_status;
+ __u64 mce_ripv : 1,
+ mce_whole_page : 1,
+ __mce_reserved : 62;
struct callback_head mce_kill_me;
#endif
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 86281ac7c305..860e0f843c12 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -26,7 +26,7 @@ static inline int set_direct_map_default_noflush(struct page *page)
#endif
#ifndef set_mce_nospec
-static inline int set_mce_nospec(unsigned long pfn)
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
return 0;
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 17fae1294ad9d711b2c3dd0edef479d40c76a5e8 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck(a)intel.com>
Date: Wed, 20 May 2020 09:35:46 -0700
Subject: [PATCH] x86/{mce,mm}: Unmap the entire page if the whole page is
affected and poisoned
An interesting thing happened when a guest Linux instance took a machine
check. The VMM unmapped the bad page from guest physical space and
passed the machine check to the guest.
Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:
do_memory_failure
set_mce_nospec
set_memory_uc
_set_memory_uc
change_page_attr_set_clr
cpa_flush
clflush_cache_range_opt
This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because is appeared that the
guest was accessing the bad page.
Fix is to check the scope of the poison by checking the MCi_MISC register.
If the entire page is affected, then unmap the page. If only part of the
page is affected, then mark the page as uncacheable.
This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).
[ bp: Adjust to x86/entry changes. ]
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <juew(a)google.com>
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Jue Wang <juew(a)google.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel…
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ec2c0a094b5d..5948218f35c5 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,28 +86,35 @@ int set_direct_map_default_noflush(struct page *page);
extern int kernel_set_to_readonly;
#ifdef CONFIG_X86_64
-static inline int set_mce_nospec(unsigned long pfn)
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
unsigned long decoy_addr;
int rc;
/*
- * Mark the linear address as UC to make sure we don't log more
- * errors because of speculative access to the page.
* We would like to just call:
- * set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+ * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
* but doing that would radically increase the odds of a
* speculative access to the poison page because we'd have
* the virtual address of the kernel 1:1 mapping sitting
* around in registers.
* Instead we get tricky. We create a non-canonical address
* that looks just like the one we want, but has bit 63 flipped.
- * This relies on set_memory_uc() properly sanitizing any __pa()
+ * This relies on set_memory_XX() properly sanitizing any __pa()
* results with __PHYSICAL_MASK or PTE_PFN_MASK.
*/
decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
- rc = set_memory_uc(decoy_addr, 1);
+ if (unmap)
+ rc = set_memory_np(decoy_addr, 1);
+ else
+ rc = set_memory_uc(decoy_addr, 1);
if (rc)
pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
return rc;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 30413325de22..ce9120c4f740 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -520,6 +520,14 @@ bool mce_is_memory_error(struct mce *m)
}
EXPORT_SYMBOL_GPL(mce_is_memory_error);
+static bool whole_page(struct mce *m)
+{
+ if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+ return true;
+
+ return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
bool mce_is_correctable(struct mce *m)
{
if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
@@ -573,7 +581,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0)) {
- set_mce_nospec(pfn);
+ set_mce_nospec(pfn, whole_page(mce));
mce->kflags |= MCE_HANDLED_UC;
}
@@ -1173,11 +1181,12 @@ static void kill_me_maybe(struct callback_head *cb)
int flags = MF_ACTION_REQUIRED;
pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
- if (!(p->mce_status & MCG_STATUS_RIPV))
+
+ if (!p->mce_ripv)
flags |= MF_MUST_KILL;
if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags)) {
- set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
+ set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
return;
}
@@ -1331,7 +1340,8 @@ void noinstr do_machine_check(struct pt_regs *regs)
BUG_ON(!on_thread_stack() || !user_mode(regs));
current->mce_addr = m.addr;
- current->mce_status = m.mcgstatus;
+ current->mce_ripv = !!(m.mcgstatus & MCG_STATUS_RIPV);
+ current->mce_whole_page = whole_page(&m);
current->mce_kill_me.func = kill_me_maybe;
if (kill_it)
current->mce_kill_me.func = kill_me_now;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5d96e3e7fff..62c1de522fc5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1304,7 +1304,9 @@ struct task_struct {
#ifdef CONFIG_X86_MCE
u64 mce_addr;
- u64 mce_status;
+ __u64 mce_ripv : 1,
+ mce_whole_page : 1,
+ __mce_reserved : 62;
struct callback_head mce_kill_me;
#endif
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 86281ac7c305..860e0f843c12 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -26,7 +26,7 @@ static inline int set_direct_map_default_noflush(struct page *page)
#endif
#ifndef set_mce_nospec
-static inline int set_mce_nospec(unsigned long pfn)
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
return 0;
}
Commit "aio: fix async fsync creds"
(530f32fc370fd1431ea9802dbc53ab5601dfccdb) was committed for v5.8-rc1,
but I don't see it in any stable branches or rc queues (it is marked for
4.18+). Did it slip through the cracks? Or is it on some other queue?
Thanks, Avi
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f4cfcd2d5aea4e96c5d483c476f3057b6b7baf6a Mon Sep 17 00:00:00 2001
From: Maxim Levitsky <mlevitsk(a)redhat.com>
Date: Sat, 23 May 2020 19:14:55 +0300
Subject: [PATCH] KVM: x86: don't expose MSR_IA32_UMWAIT_CONTROL
unconditionally
This msr is only available when the host supports WAITPKG feature.
This breaks a nested guest, if the L1 hypervisor is set to ignore
unknown msrs, because the only other safety check that the
kernel does is that it attempts to read the msr and
rejects it if it gets an exception.
Cc: stable(a)vger.kernel.org
Fixes: 6e3ba4abce ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
Message-Id: <20200523161455.3940-3-mlevitsk(a)redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c17e6eb9ad43..e0083a08da9e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5242,6 +5242,10 @@ static void kvm_init_msr_list(void)
if (!kvm_cpu_cap_has(X86_FEATURE_RDTSCP))
continue;
break;
+ case MSR_IA32_UMWAIT_CONTROL:
+ if (!kvm_cpu_cap_has(X86_FEATURE_WAITPKG))
+ continue;
+ break;
case MSR_IA32_RTIT_CTL:
case MSR_IA32_RTIT_STATUS:
if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT))
The patch below does not apply to the 5.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f4cfcd2d5aea4e96c5d483c476f3057b6b7baf6a Mon Sep 17 00:00:00 2001
From: Maxim Levitsky <mlevitsk(a)redhat.com>
Date: Sat, 23 May 2020 19:14:55 +0300
Subject: [PATCH] KVM: x86: don't expose MSR_IA32_UMWAIT_CONTROL
unconditionally
This msr is only available when the host supports WAITPKG feature.
This breaks a nested guest, if the L1 hypervisor is set to ignore
unknown msrs, because the only other safety check that the
kernel does is that it attempts to read the msr and
rejects it if it gets an exception.
Cc: stable(a)vger.kernel.org
Fixes: 6e3ba4abce ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
Message-Id: <20200523161455.3940-3-mlevitsk(a)redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c17e6eb9ad43..e0083a08da9e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5242,6 +5242,10 @@ static void kvm_init_msr_list(void)
if (!kvm_cpu_cap_has(X86_FEATURE_RDTSCP))
continue;
break;
+ case MSR_IA32_UMWAIT_CONTROL:
+ if (!kvm_cpu_cap_has(X86_FEATURE_WAITPKG))
+ continue;
+ break;
case MSR_IA32_RTIT_CTL:
case MSR_IA32_RTIT_STATUS:
if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT))
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e649b3f0188f8fd34dd0dde8d43fd3312b902fb2 Mon Sep 17 00:00:00 2001
From: Eiichi Tsukata <eiichi.tsukata(a)nutanix.com>
Date: Sat, 6 Jun 2020 13:26:27 +0900
Subject: [PATCH] KVM: x86: Fix APIC page invalidation race
Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried
to fix inappropriate APIC page invalidation by re-introducing arch
specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
kvm_mmu_notifier_invalidate_range_start. However, the patch left a
possible race where the VMCS APIC address cache is updated *before*
it is unmapped:
(Invalidator) kvm_mmu_notifier_invalidate_range_start()
(Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
(KVM VCPU) vcpu_enter_guest()
(KVM VCPU) kvm_vcpu_reload_apic_access_page()
(Invalidator) actually unmap page
Because of the above race, there can be a mismatch between the
host physical address stored in the APIC_ACCESS_PAGE VMCS field and
the host physical address stored in the EPT entry for the APIC GPA
(0xfee0000). When this happens, the processor will not trap APIC
accesses, and will instead show the raw contents of the APIC-access page.
Because Windows OS periodically checks for unexpected modifications to
the LAPIC register, this will show up as a BSOD crash with BugCheck
CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
cannot guarantee that no additional references are taken to the pages in
the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately,
this case is supported by the MMU notifier API, as documented in
include/linux/mmu_notifier.h:
* If the subsystem
* can't guarantee that no additional references are taken to
* the pages in the range, it has to implement the
* invalidate_range() notifier to remove any references taken
* after invalidate_range_start().
The fix therefore is to reload the APIC-access page field in the VMCS
from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
Cc: stable(a)vger.kernel.org
Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation")
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
Signed-off-by: Eiichi Tsukata <eiichi.tsukata(a)nutanix.com>
Message-Id: <20200606042627.61070-1-eiichi.tsukata(a)nutanix.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c26dd1363151..24de847af52e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8270,9 +8270,8 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
kvm_x86_ops.load_eoi_exitmap(vcpu, eoi_exit_bitmap);
}
-int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
- unsigned long start, unsigned long end,
- bool blockable)
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+ unsigned long start, unsigned long end)
{
unsigned long apic_address;
@@ -8283,8 +8282,6 @@ int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
if (start <= apic_address && apic_address < end)
kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
-
- return 0;
}
void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d38d6b9c24be..e2f82131bb3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1420,8 +1420,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
}
#endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
-int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
- unsigned long start, unsigned long end, bool blockable);
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+ unsigned long start, unsigned long end);
#ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4db151f6101e..7b6013f2ba19 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -155,10 +155,9 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
static unsigned long long kvm_createvm_count;
static unsigned long long kvm_active_vms;
-__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
- unsigned long start, unsigned long end, bool blockable)
+__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+ unsigned long start, unsigned long end)
{
- return 0;
}
bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
@@ -384,6 +383,18 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
return container_of(mn, struct kvm, mmu_notifier);
}
+static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+ srcu_read_unlock(&kvm->srcu, idx);
+}
+
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
@@ -408,7 +419,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
- int ret;
idx = srcu_read_lock(&kvm->srcu);
spin_lock(&kvm->mmu_lock);
@@ -425,14 +435,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
-
- ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range->start,
- range->end,
- mmu_notifier_range_blockable(range));
-
srcu_read_unlock(&kvm->srcu, idx);
- return ret;
+ return 0;
}
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -538,6 +543,7 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
}
static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+ .invalidate_range = kvm_mmu_notifier_invalidate_range,
.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
.invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e649b3f0188f8fd34dd0dde8d43fd3312b902fb2 Mon Sep 17 00:00:00 2001
From: Eiichi Tsukata <eiichi.tsukata(a)nutanix.com>
Date: Sat, 6 Jun 2020 13:26:27 +0900
Subject: [PATCH] KVM: x86: Fix APIC page invalidation race
Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried
to fix inappropriate APIC page invalidation by re-introducing arch
specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
kvm_mmu_notifier_invalidate_range_start. However, the patch left a
possible race where the VMCS APIC address cache is updated *before*
it is unmapped:
(Invalidator) kvm_mmu_notifier_invalidate_range_start()
(Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
(KVM VCPU) vcpu_enter_guest()
(KVM VCPU) kvm_vcpu_reload_apic_access_page()
(Invalidator) actually unmap page
Because of the above race, there can be a mismatch between the
host physical address stored in the APIC_ACCESS_PAGE VMCS field and
the host physical address stored in the EPT entry for the APIC GPA
(0xfee0000). When this happens, the processor will not trap APIC
accesses, and will instead show the raw contents of the APIC-access page.
Because Windows OS periodically checks for unexpected modifications to
the LAPIC register, this will show up as a BSOD crash with BugCheck
CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
cannot guarantee that no additional references are taken to the pages in
the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately,
this case is supported by the MMU notifier API, as documented in
include/linux/mmu_notifier.h:
* If the subsystem
* can't guarantee that no additional references are taken to
* the pages in the range, it has to implement the
* invalidate_range() notifier to remove any references taken
* after invalidate_range_start().
The fix therefore is to reload the APIC-access page field in the VMCS
from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
Cc: stable(a)vger.kernel.org
Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation")
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
Signed-off-by: Eiichi Tsukata <eiichi.tsukata(a)nutanix.com>
Message-Id: <20200606042627.61070-1-eiichi.tsukata(a)nutanix.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c26dd1363151..24de847af52e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8270,9 +8270,8 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
kvm_x86_ops.load_eoi_exitmap(vcpu, eoi_exit_bitmap);
}
-int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
- unsigned long start, unsigned long end,
- bool blockable)
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+ unsigned long start, unsigned long end)
{
unsigned long apic_address;
@@ -8283,8 +8282,6 @@ int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
if (start <= apic_address && apic_address < end)
kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
-
- return 0;
}
void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d38d6b9c24be..e2f82131bb3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1420,8 +1420,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
}
#endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
-int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
- unsigned long start, unsigned long end, bool blockable);
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+ unsigned long start, unsigned long end);
#ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4db151f6101e..7b6013f2ba19 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -155,10 +155,9 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
static unsigned long long kvm_createvm_count;
static unsigned long long kvm_active_vms;
-__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
- unsigned long start, unsigned long end, bool blockable)
+__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+ unsigned long start, unsigned long end)
{
- return 0;
}
bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
@@ -384,6 +383,18 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
return container_of(mn, struct kvm, mmu_notifier);
}
+static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+ srcu_read_unlock(&kvm->srcu, idx);
+}
+
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
@@ -408,7 +419,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
- int ret;
idx = srcu_read_lock(&kvm->srcu);
spin_lock(&kvm->mmu_lock);
@@ -425,14 +435,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
-
- ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range->start,
- range->end,
- mmu_notifier_range_blockable(range));
-
srcu_read_unlock(&kvm->srcu, idx);
- return ret;
+ return 0;
}
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -538,6 +543,7 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
}
static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+ .invalidate_range = kvm_mmu_notifier_invalidate_range,
.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
.invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e649b3f0188f8fd34dd0dde8d43fd3312b902fb2 Mon Sep 17 00:00:00 2001
From: Eiichi Tsukata <eiichi.tsukata(a)nutanix.com>
Date: Sat, 6 Jun 2020 13:26:27 +0900
Subject: [PATCH] KVM: x86: Fix APIC page invalidation race
Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried
to fix inappropriate APIC page invalidation by re-introducing arch
specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
kvm_mmu_notifier_invalidate_range_start. However, the patch left a
possible race where the VMCS APIC address cache is updated *before*
it is unmapped:
(Invalidator) kvm_mmu_notifier_invalidate_range_start()
(Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
(KVM VCPU) vcpu_enter_guest()
(KVM VCPU) kvm_vcpu_reload_apic_access_page()
(Invalidator) actually unmap page
Because of the above race, there can be a mismatch between the
host physical address stored in the APIC_ACCESS_PAGE VMCS field and
the host physical address stored in the EPT entry for the APIC GPA
(0xfee0000). When this happens, the processor will not trap APIC
accesses, and will instead show the raw contents of the APIC-access page.
Because Windows OS periodically checks for unexpected modifications to
the LAPIC register, this will show up as a BSOD crash with BugCheck
CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
cannot guarantee that no additional references are taken to the pages in
the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately,
this case is supported by the MMU notifier API, as documented in
include/linux/mmu_notifier.h:
* If the subsystem
* can't guarantee that no additional references are taken to
* the pages in the range, it has to implement the
* invalidate_range() notifier to remove any references taken
* after invalidate_range_start().
The fix therefore is to reload the APIC-access page field in the VMCS
from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
Cc: stable(a)vger.kernel.org
Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation")
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
Signed-off-by: Eiichi Tsukata <eiichi.tsukata(a)nutanix.com>
Message-Id: <20200606042627.61070-1-eiichi.tsukata(a)nutanix.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c26dd1363151..24de847af52e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8270,9 +8270,8 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
kvm_x86_ops.load_eoi_exitmap(vcpu, eoi_exit_bitmap);
}
-int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
- unsigned long start, unsigned long end,
- bool blockable)
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+ unsigned long start, unsigned long end)
{
unsigned long apic_address;
@@ -8283,8 +8282,6 @@ int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
if (start <= apic_address && apic_address < end)
kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
-
- return 0;
}
void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d38d6b9c24be..e2f82131bb3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1420,8 +1420,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
}
#endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
-int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
- unsigned long start, unsigned long end, bool blockable);
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+ unsigned long start, unsigned long end);
#ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4db151f6101e..7b6013f2ba19 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -155,10 +155,9 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
static unsigned long long kvm_createvm_count;
static unsigned long long kvm_active_vms;
-__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
- unsigned long start, unsigned long end, bool blockable)
+__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+ unsigned long start, unsigned long end)
{
- return 0;
}
bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
@@ -384,6 +383,18 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
return container_of(mn, struct kvm, mmu_notifier);
}
+static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+ srcu_read_unlock(&kvm->srcu, idx);
+}
+
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
@@ -408,7 +419,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
- int ret;
idx = srcu_read_lock(&kvm->srcu);
spin_lock(&kvm->mmu_lock);
@@ -425,14 +435,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
-
- ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range->start,
- range->end,
- mmu_notifier_range_blockable(range));
-
srcu_read_unlock(&kvm->srcu, idx);
- return ret;
+ return 0;
}
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -538,6 +543,7 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
}
static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+ .invalidate_range = kvm_mmu_notifier_invalidate_range,
.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
.invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 6129ed877d409037b79866327102c9dc59a302fe Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson(a)intel.com>
Date: Wed, 27 May 2020 01:49:09 -0700
Subject: [PATCH] KVM: x86/mmu: Set mmio_value to '0' if reserved #PF can't be
generated
Set the mmio_value to '0' instead of simply clearing the present bit to
squash a benign warning in kvm_mmu_set_mmio_spte_mask() that complains
about the mmio_value overlapping the lower GFN mask on systems with 52
bits of PA space.
Opportunistically clean up the code and comments.
Cc: stable(a)vger.kernel.org
Fixes: d43e2675e96fc ("KVM: x86: only do L1TF workaround on affected processors")
Signed-off-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Message-Id: <20200527084909.23492-1-sean.j.christopherson(a)intel.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 86619631ff6a..92d056954194 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6143,25 +6143,16 @@ static void kvm_set_mmio_spte_mask(void)
u64 mask;
/*
- * Set the reserved bits and the present bit of an paging-structure
- * entry to generate page fault with PFER.RSV = 1.
+ * Set a reserved PA bit in MMIO SPTEs to generate page faults with
+ * PFEC.RSVD=1 on MMIO accesses. 64-bit PTEs (PAE, x86-64, and EPT
+ * paging) support a maximum of 52 bits of PA, i.e. if the CPU supports
+ * 52-bit physical addresses then there are no reserved PA bits in the
+ * PTEs and so the reserved PA approach must be disabled.
*/
-
- /*
- * Mask the uppermost physical address bit, which would be reserved as
- * long as the supported physical address width is less than 52.
- */
- mask = 1ull << 51;
-
- /* Set the present bit. */
- mask |= 1ull;
-
- /*
- * If reserved bit is not supported, clear the present bit to disable
- * mmio page fault.
- */
- if (shadow_phys_bits == 52)
- mask &= ~1ull;
+ if (shadow_phys_bits < 52)
+ mask = BIT_ULL(51) | PT_PRESENT_MASK;
+ else
+ mask = 0;
kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 6129ed877d409037b79866327102c9dc59a302fe Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson(a)intel.com>
Date: Wed, 27 May 2020 01:49:09 -0700
Subject: [PATCH] KVM: x86/mmu: Set mmio_value to '0' if reserved #PF can't be
generated
Set the mmio_value to '0' instead of simply clearing the present bit to
squash a benign warning in kvm_mmu_set_mmio_spte_mask() that complains
about the mmio_value overlapping the lower GFN mask on systems with 52
bits of PA space.
Opportunistically clean up the code and comments.
Cc: stable(a)vger.kernel.org
Fixes: d43e2675e96fc ("KVM: x86: only do L1TF workaround on affected processors")
Signed-off-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Message-Id: <20200527084909.23492-1-sean.j.christopherson(a)intel.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 86619631ff6a..92d056954194 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6143,25 +6143,16 @@ static void kvm_set_mmio_spte_mask(void)
u64 mask;
/*
- * Set the reserved bits and the present bit of an paging-structure
- * entry to generate page fault with PFER.RSV = 1.
+ * Set a reserved PA bit in MMIO SPTEs to generate page faults with
+ * PFEC.RSVD=1 on MMIO accesses. 64-bit PTEs (PAE, x86-64, and EPT
+ * paging) support a maximum of 52 bits of PA, i.e. if the CPU supports
+ * 52-bit physical addresses then there are no reserved PA bits in the
+ * PTEs and so the reserved PA approach must be disabled.
*/
-
- /*
- * Mask the uppermost physical address bit, which would be reserved as
- * long as the supported physical address width is less than 52.
- */
- mask = 1ull << 51;
-
- /* Set the present bit. */
- mask |= 1ull;
-
- /*
- * If reserved bit is not supported, clear the present bit to disable
- * mmio page fault.
- */
- if (shadow_phys_bits == 52)
- mask &= ~1ull;
+ if (shadow_phys_bits < 52)
+ mask = BIT_ULL(51) | PT_PRESENT_MASK;
+ else
+ mask = 0;
kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0abcc8f65cc23b65bc8d1614cc64b02b1641ed7c Mon Sep 17 00:00:00 2001
From: Maxim Levitsky <mlevitsk(a)redhat.com>
Date: Sat, 23 May 2020 19:14:54 +0300
Subject: [PATCH] KVM: VMX: enable X86_FEATURE_WAITPKG in KVM capabilities
Even though we might not allow the guest to use WAITPKG's new
instructions, we should tell KVM that the feature is supported by the
host CPU.
Note that vmx_waitpkg_supported checks that WAITPKG _can_ be set in
secondary execution controls as specified by VMX capability MSR, rather
that we actually enable it for a guest.
Cc: stable(a)vger.kernel.org
Fixes: e69e72faa3a0 ("KVM: x86: Add support for user wait instructions")
Suggested-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
Message-Id: <20200523161455.3940-2-mlevitsk(a)redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 89c766fad889..9b63ac8d97ee 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7138,6 +7138,9 @@ static __init void vmx_set_cpu_caps(void)
/* CPUID 0x80000001 */
if (!cpu_has_vmx_rdtscp())
kvm_cpu_cap_clear(X86_FEATURE_RDTSCP);
+
+ if (vmx_waitpkg_supported())
+ kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
}
static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
The patch below does not apply to the 5.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0abcc8f65cc23b65bc8d1614cc64b02b1641ed7c Mon Sep 17 00:00:00 2001
From: Maxim Levitsky <mlevitsk(a)redhat.com>
Date: Sat, 23 May 2020 19:14:54 +0300
Subject: [PATCH] KVM: VMX: enable X86_FEATURE_WAITPKG in KVM capabilities
Even though we might not allow the guest to use WAITPKG's new
instructions, we should tell KVM that the feature is supported by the
host CPU.
Note that vmx_waitpkg_supported checks that WAITPKG _can_ be set in
secondary execution controls as specified by VMX capability MSR, rather
that we actually enable it for a guest.
Cc: stable(a)vger.kernel.org
Fixes: e69e72faa3a0 ("KVM: x86: Add support for user wait instructions")
Suggested-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
Message-Id: <20200523161455.3940-2-mlevitsk(a)redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 89c766fad889..9b63ac8d97ee 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7138,6 +7138,9 @@ static __init void vmx_set_cpu_caps(void)
/* CPUID 0x80000001 */
if (!cpu_has_vmx_rdtscp())
kvm_cpu_cap_clear(X86_FEATURE_RDTSCP);
+
+ if (vmx_waitpkg_supported())
+ kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
}
static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4d8df8cbb9156b0a0ab3f802b80cb5db57acc0bf Mon Sep 17 00:00:00 2001
From: Anthony Steinhauser <asteinhauser(a)google.com>
Date: Sun, 7 Jun 2020 05:44:19 -0700
Subject: [PATCH] x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for
indirect branches.
Currently, it is possible to enable indirect branch speculation even after
it was force-disabled using the PR_SPEC_FORCE_DISABLE option. Moreover, the
PR_GET_SPECULATION_CTRL command gives afterwards an incorrect result
(force-disabled when it is in fact enabled). This also is inconsistent
vs. STIBP and the documention which cleary states that
PR_SPEC_FORCE_DISABLE cannot be undone.
Fix this by actually enforcing force-disabled indirect branch
speculation. PR_SPEC_ENABLE called after PR_SPEC_FORCE_DISABLE now fails
with -EPERM as described in the documentation.
Fixes: 9137bb27e60e ("x86/speculation: Add prctl() control for indirect branch speculation")
Signed-off-by: Anthony Steinhauser <asteinhauser(a)google.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 8d57562b1d2c..56f573aa764f 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1175,11 +1175,14 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
return 0;
/*
* Indirect branch speculation is always disabled in strict
- * mode.
+ * mode. It can neither be enabled if it was force-disabled
+ * by a previous prctl call.
+
*/
if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED ||
+ task_spec_ib_force_disable(task))
return -EPERM;
task_clear_spec_ib_disable(task);
task_update_spec_tif(task);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4d8df8cbb9156b0a0ab3f802b80cb5db57acc0bf Mon Sep 17 00:00:00 2001
From: Anthony Steinhauser <asteinhauser(a)google.com>
Date: Sun, 7 Jun 2020 05:44:19 -0700
Subject: [PATCH] x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for
indirect branches.
Currently, it is possible to enable indirect branch speculation even after
it was force-disabled using the PR_SPEC_FORCE_DISABLE option. Moreover, the
PR_GET_SPECULATION_CTRL command gives afterwards an incorrect result
(force-disabled when it is in fact enabled). This also is inconsistent
vs. STIBP and the documention which cleary states that
PR_SPEC_FORCE_DISABLE cannot be undone.
Fix this by actually enforcing force-disabled indirect branch
speculation. PR_SPEC_ENABLE called after PR_SPEC_FORCE_DISABLE now fails
with -EPERM as described in the documentation.
Fixes: 9137bb27e60e ("x86/speculation: Add prctl() control for indirect branch speculation")
Signed-off-by: Anthony Steinhauser <asteinhauser(a)google.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 8d57562b1d2c..56f573aa764f 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1175,11 +1175,14 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
return 0;
/*
* Indirect branch speculation is always disabled in strict
- * mode.
+ * mode. It can neither be enabled if it was force-disabled
+ * by a previous prctl call.
+
*/
if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED ||
+ task_spec_ib_force_disable(task))
return -EPERM;
task_clear_spec_ib_disable(task);
task_update_spec_tif(task);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4d8df8cbb9156b0a0ab3f802b80cb5db57acc0bf Mon Sep 17 00:00:00 2001
From: Anthony Steinhauser <asteinhauser(a)google.com>
Date: Sun, 7 Jun 2020 05:44:19 -0700
Subject: [PATCH] x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for
indirect branches.
Currently, it is possible to enable indirect branch speculation even after
it was force-disabled using the PR_SPEC_FORCE_DISABLE option. Moreover, the
PR_GET_SPECULATION_CTRL command gives afterwards an incorrect result
(force-disabled when it is in fact enabled). This also is inconsistent
vs. STIBP and the documention which cleary states that
PR_SPEC_FORCE_DISABLE cannot be undone.
Fix this by actually enforcing force-disabled indirect branch
speculation. PR_SPEC_ENABLE called after PR_SPEC_FORCE_DISABLE now fails
with -EPERM as described in the documentation.
Fixes: 9137bb27e60e ("x86/speculation: Add prctl() control for indirect branch speculation")
Signed-off-by: Anthony Steinhauser <asteinhauser(a)google.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 8d57562b1d2c..56f573aa764f 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1175,11 +1175,14 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
return 0;
/*
* Indirect branch speculation is always disabled in strict
- * mode.
+ * mode. It can neither be enabled if it was force-disabled
+ * by a previous prctl call.
+
*/
if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED ||
+ task_spec_ib_force_disable(task))
return -EPERM;
task_clear_spec_ib_disable(task);
task_update_spec_tif(task);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 21998a351512eba4ed5969006f0c55882d995ada Mon Sep 17 00:00:00 2001
From: Anthony Steinhauser <asteinhauser(a)google.com>
Date: Tue, 19 May 2020 06:40:42 -0700
Subject: [PATCH] x86/speculation: Avoid force-disabling IBPB based on STIBP
and enhanced IBRS.
When STIBP is unavailable or enhanced IBRS is available, Linux
force-disables the IBPB mitigation of Spectre-BTB even when simultaneous
multithreading is disabled. While attempts to enable IBPB using
prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, ...) fail with
EPERM, the seccomp syscall (or its prctl(PR_SET_SECCOMP, ...) equivalent)
which are used e.g. by Chromium or OpenSSH succeed with no errors but the
application remains silently vulnerable to cross-process Spectre v2 attacks
(classical BTB poisoning). At the same time the SYSFS reporting
(/sys/devices/system/cpu/vulnerabilities/spectre_v2) displays that IBPB is
conditionally enabled when in fact it is unconditionally disabled.
STIBP is useful only when SMT is enabled. When SMT is disabled and STIBP is
unavailable, it makes no sense to force-disable also IBPB, because IBPB
protects against cross-process Spectre-BTB attacks regardless of the SMT
state. At the same time since missing STIBP was only observed on AMD CPUs,
AMD does not recommend using STIBP, but recommends using IBPB, so disabling
IBPB because of missing STIBP goes directly against AMD's advice:
https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Upda…
Similarly, enhanced IBRS is designed to protect cross-core BTB poisoning
and BTB-poisoning attacks from user space against kernel (and
BTB-poisoning attacks from guest against hypervisor), it is not designed
to prevent cross-process (or cross-VM) BTB poisoning between processes (or
VMs) running on the same core. Therefore, even with enhanced IBRS it is
necessary to flush the BTB during context-switches, so there is no reason
to force disable IBPB when enhanced IBRS is available.
Enable the prctl control of IBPB even when STIBP is unavailable or enhanced
IBRS is available.
Fixes: 7cc765a67d8e ("x86/speculation: Enable prctl mode for spectre_v2_user")
Signed-off-by: Anthony Steinhauser <asteinhauser(a)google.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index ed54b3b21c39..8d57562b1d2c 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -495,7 +495,9 @@ early_param("nospectre_v1", nospectre_v1_cmdline);
static enum spectre_v2_mitigation spectre_v2_enabled __ro_after_init =
SPECTRE_V2_NONE;
-static enum spectre_v2_user_mitigation spectre_v2_user __ro_after_init =
+static enum spectre_v2_user_mitigation spectre_v2_user_stibp __ro_after_init =
+ SPECTRE_V2_USER_NONE;
+static enum spectre_v2_user_mitigation spectre_v2_user_ibpb __ro_after_init =
SPECTRE_V2_USER_NONE;
#ifdef CONFIG_RETPOLINE
@@ -641,15 +643,6 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd)
break;
}
- /*
- * At this point, an STIBP mode other than "off" has been set.
- * If STIBP support is not being forced, check if STIBP always-on
- * is preferred.
- */
- if (mode != SPECTRE_V2_USER_STRICT &&
- boot_cpu_has(X86_FEATURE_AMD_STIBP_ALWAYS_ON))
- mode = SPECTRE_V2_USER_STRICT_PREFERRED;
-
/* Initialize Indirect Branch Prediction Barrier */
if (boot_cpu_has(X86_FEATURE_IBPB)) {
setup_force_cpu_cap(X86_FEATURE_USE_IBPB);
@@ -672,23 +665,36 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd)
pr_info("mitigation: Enabling %s Indirect Branch Prediction Barrier\n",
static_key_enabled(&switch_mm_always_ibpb) ?
"always-on" : "conditional");
+
+ spectre_v2_user_ibpb = mode;
}
- /* If enhanced IBRS is enabled no STIBP required */
- if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
+ /*
+ * If enhanced IBRS is enabled or SMT impossible, STIBP is not
+ * required.
+ */
+ if (!smt_possible || spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
return;
/*
- * If SMT is not possible or STIBP is not available clear the STIBP
- * mode.
+ * At this point, an STIBP mode other than "off" has been set.
+ * If STIBP support is not being forced, check if STIBP always-on
+ * is preferred.
+ */
+ if (mode != SPECTRE_V2_USER_STRICT &&
+ boot_cpu_has(X86_FEATURE_AMD_STIBP_ALWAYS_ON))
+ mode = SPECTRE_V2_USER_STRICT_PREFERRED;
+
+ /*
+ * If STIBP is not available, clear the STIBP mode.
*/
- if (!smt_possible || !boot_cpu_has(X86_FEATURE_STIBP))
+ if (!boot_cpu_has(X86_FEATURE_STIBP))
mode = SPECTRE_V2_USER_NONE;
+
+ spectre_v2_user_stibp = mode;
+
set_mode:
- spectre_v2_user = mode;
- /* Only print the STIBP mode when SMT possible */
- if (smt_possible)
- pr_info("%s\n", spectre_v2_user_strings[mode]);
+ pr_info("%s\n", spectre_v2_user_strings[mode]);
}
static const char * const spectre_v2_strings[] = {
@@ -921,7 +927,7 @@ void cpu_bugs_smt_update(void)
{
mutex_lock(&spec_ctrl_mutex);
- switch (spectre_v2_user) {
+ switch (spectre_v2_user_stibp) {
case SPECTRE_V2_USER_NONE:
break;
case SPECTRE_V2_USER_STRICT:
@@ -1164,14 +1170,16 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
{
switch (ctrl) {
case PR_SPEC_ENABLE:
- if (spectre_v2_user == SPECTRE_V2_USER_NONE)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return 0;
/*
* Indirect branch speculation is always disabled in strict
* mode.
*/
- if (spectre_v2_user == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user == SPECTRE_V2_USER_STRICT_PREFERRED)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
return -EPERM;
task_clear_spec_ib_disable(task);
task_update_spec_tif(task);
@@ -1182,10 +1190,12 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
* Indirect branch speculation is always allowed when
* mitigation is force disabled.
*/
- if (spectre_v2_user == SPECTRE_V2_USER_NONE)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return -EPERM;
- if (spectre_v2_user == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user == SPECTRE_V2_USER_STRICT_PREFERRED)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
return 0;
task_set_spec_ib_disable(task);
if (ctrl == PR_SPEC_FORCE_DISABLE)
@@ -1216,7 +1226,8 @@ void arch_seccomp_spec_mitigate(struct task_struct *task)
{
if (ssb_mode == SPEC_STORE_BYPASS_SECCOMP)
ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE);
- if (spectre_v2_user == SPECTRE_V2_USER_SECCOMP)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_SECCOMP ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_SECCOMP)
ib_prctl_set(task, PR_SPEC_FORCE_DISABLE);
}
#endif
@@ -1247,22 +1258,24 @@ static int ib_prctl_get(struct task_struct *task)
if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
return PR_SPEC_NOT_AFFECTED;
- switch (spectre_v2_user) {
- case SPECTRE_V2_USER_NONE:
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return PR_SPEC_ENABLE;
- case SPECTRE_V2_USER_PRCTL:
- case SPECTRE_V2_USER_SECCOMP:
+ else if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
+ return PR_SPEC_DISABLE;
+ else if (spectre_v2_user_ibpb == SPECTRE_V2_USER_PRCTL ||
+ spectre_v2_user_ibpb == SPECTRE_V2_USER_SECCOMP ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_PRCTL ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_SECCOMP) {
if (task_spec_ib_force_disable(task))
return PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE;
if (task_spec_ib_disable(task))
return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
- case SPECTRE_V2_USER_STRICT:
- case SPECTRE_V2_USER_STRICT_PREFERRED:
- return PR_SPEC_DISABLE;
- default:
+ } else
return PR_SPEC_NOT_AFFECTED;
- }
}
int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
@@ -1501,7 +1514,7 @@ static char *stibp_state(void)
if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
return "";
- switch (spectre_v2_user) {
+ switch (spectre_v2_user_stibp) {
case SPECTRE_V2_USER_NONE:
return ", STIBP: disabled";
case SPECTRE_V2_USER_STRICT:
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 21998a351512eba4ed5969006f0c55882d995ada Mon Sep 17 00:00:00 2001
From: Anthony Steinhauser <asteinhauser(a)google.com>
Date: Tue, 19 May 2020 06:40:42 -0700
Subject: [PATCH] x86/speculation: Avoid force-disabling IBPB based on STIBP
and enhanced IBRS.
When STIBP is unavailable or enhanced IBRS is available, Linux
force-disables the IBPB mitigation of Spectre-BTB even when simultaneous
multithreading is disabled. While attempts to enable IBPB using
prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, ...) fail with
EPERM, the seccomp syscall (or its prctl(PR_SET_SECCOMP, ...) equivalent)
which are used e.g. by Chromium or OpenSSH succeed with no errors but the
application remains silently vulnerable to cross-process Spectre v2 attacks
(classical BTB poisoning). At the same time the SYSFS reporting
(/sys/devices/system/cpu/vulnerabilities/spectre_v2) displays that IBPB is
conditionally enabled when in fact it is unconditionally disabled.
STIBP is useful only when SMT is enabled. When SMT is disabled and STIBP is
unavailable, it makes no sense to force-disable also IBPB, because IBPB
protects against cross-process Spectre-BTB attacks regardless of the SMT
state. At the same time since missing STIBP was only observed on AMD CPUs,
AMD does not recommend using STIBP, but recommends using IBPB, so disabling
IBPB because of missing STIBP goes directly against AMD's advice:
https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Upda…
Similarly, enhanced IBRS is designed to protect cross-core BTB poisoning
and BTB-poisoning attacks from user space against kernel (and
BTB-poisoning attacks from guest against hypervisor), it is not designed
to prevent cross-process (or cross-VM) BTB poisoning between processes (or
VMs) running on the same core. Therefore, even with enhanced IBRS it is
necessary to flush the BTB during context-switches, so there is no reason
to force disable IBPB when enhanced IBRS is available.
Enable the prctl control of IBPB even when STIBP is unavailable or enhanced
IBRS is available.
Fixes: 7cc765a67d8e ("x86/speculation: Enable prctl mode for spectre_v2_user")
Signed-off-by: Anthony Steinhauser <asteinhauser(a)google.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index ed54b3b21c39..8d57562b1d2c 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -495,7 +495,9 @@ early_param("nospectre_v1", nospectre_v1_cmdline);
static enum spectre_v2_mitigation spectre_v2_enabled __ro_after_init =
SPECTRE_V2_NONE;
-static enum spectre_v2_user_mitigation spectre_v2_user __ro_after_init =
+static enum spectre_v2_user_mitigation spectre_v2_user_stibp __ro_after_init =
+ SPECTRE_V2_USER_NONE;
+static enum spectre_v2_user_mitigation spectre_v2_user_ibpb __ro_after_init =
SPECTRE_V2_USER_NONE;
#ifdef CONFIG_RETPOLINE
@@ -641,15 +643,6 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd)
break;
}
- /*
- * At this point, an STIBP mode other than "off" has been set.
- * If STIBP support is not being forced, check if STIBP always-on
- * is preferred.
- */
- if (mode != SPECTRE_V2_USER_STRICT &&
- boot_cpu_has(X86_FEATURE_AMD_STIBP_ALWAYS_ON))
- mode = SPECTRE_V2_USER_STRICT_PREFERRED;
-
/* Initialize Indirect Branch Prediction Barrier */
if (boot_cpu_has(X86_FEATURE_IBPB)) {
setup_force_cpu_cap(X86_FEATURE_USE_IBPB);
@@ -672,23 +665,36 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd)
pr_info("mitigation: Enabling %s Indirect Branch Prediction Barrier\n",
static_key_enabled(&switch_mm_always_ibpb) ?
"always-on" : "conditional");
+
+ spectre_v2_user_ibpb = mode;
}
- /* If enhanced IBRS is enabled no STIBP required */
- if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
+ /*
+ * If enhanced IBRS is enabled or SMT impossible, STIBP is not
+ * required.
+ */
+ if (!smt_possible || spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
return;
/*
- * If SMT is not possible or STIBP is not available clear the STIBP
- * mode.
+ * At this point, an STIBP mode other than "off" has been set.
+ * If STIBP support is not being forced, check if STIBP always-on
+ * is preferred.
+ */
+ if (mode != SPECTRE_V2_USER_STRICT &&
+ boot_cpu_has(X86_FEATURE_AMD_STIBP_ALWAYS_ON))
+ mode = SPECTRE_V2_USER_STRICT_PREFERRED;
+
+ /*
+ * If STIBP is not available, clear the STIBP mode.
*/
- if (!smt_possible || !boot_cpu_has(X86_FEATURE_STIBP))
+ if (!boot_cpu_has(X86_FEATURE_STIBP))
mode = SPECTRE_V2_USER_NONE;
+
+ spectre_v2_user_stibp = mode;
+
set_mode:
- spectre_v2_user = mode;
- /* Only print the STIBP mode when SMT possible */
- if (smt_possible)
- pr_info("%s\n", spectre_v2_user_strings[mode]);
+ pr_info("%s\n", spectre_v2_user_strings[mode]);
}
static const char * const spectre_v2_strings[] = {
@@ -921,7 +927,7 @@ void cpu_bugs_smt_update(void)
{
mutex_lock(&spec_ctrl_mutex);
- switch (spectre_v2_user) {
+ switch (spectre_v2_user_stibp) {
case SPECTRE_V2_USER_NONE:
break;
case SPECTRE_V2_USER_STRICT:
@@ -1164,14 +1170,16 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
{
switch (ctrl) {
case PR_SPEC_ENABLE:
- if (spectre_v2_user == SPECTRE_V2_USER_NONE)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return 0;
/*
* Indirect branch speculation is always disabled in strict
* mode.
*/
- if (spectre_v2_user == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user == SPECTRE_V2_USER_STRICT_PREFERRED)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
return -EPERM;
task_clear_spec_ib_disable(task);
task_update_spec_tif(task);
@@ -1182,10 +1190,12 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
* Indirect branch speculation is always allowed when
* mitigation is force disabled.
*/
- if (spectre_v2_user == SPECTRE_V2_USER_NONE)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return -EPERM;
- if (spectre_v2_user == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user == SPECTRE_V2_USER_STRICT_PREFERRED)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
return 0;
task_set_spec_ib_disable(task);
if (ctrl == PR_SPEC_FORCE_DISABLE)
@@ -1216,7 +1226,8 @@ void arch_seccomp_spec_mitigate(struct task_struct *task)
{
if (ssb_mode == SPEC_STORE_BYPASS_SECCOMP)
ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE);
- if (spectre_v2_user == SPECTRE_V2_USER_SECCOMP)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_SECCOMP ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_SECCOMP)
ib_prctl_set(task, PR_SPEC_FORCE_DISABLE);
}
#endif
@@ -1247,22 +1258,24 @@ static int ib_prctl_get(struct task_struct *task)
if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
return PR_SPEC_NOT_AFFECTED;
- switch (spectre_v2_user) {
- case SPECTRE_V2_USER_NONE:
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return PR_SPEC_ENABLE;
- case SPECTRE_V2_USER_PRCTL:
- case SPECTRE_V2_USER_SECCOMP:
+ else if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
+ return PR_SPEC_DISABLE;
+ else if (spectre_v2_user_ibpb == SPECTRE_V2_USER_PRCTL ||
+ spectre_v2_user_ibpb == SPECTRE_V2_USER_SECCOMP ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_PRCTL ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_SECCOMP) {
if (task_spec_ib_force_disable(task))
return PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE;
if (task_spec_ib_disable(task))
return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
- case SPECTRE_V2_USER_STRICT:
- case SPECTRE_V2_USER_STRICT_PREFERRED:
- return PR_SPEC_DISABLE;
- default:
+ } else
return PR_SPEC_NOT_AFFECTED;
- }
}
int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
@@ -1501,7 +1514,7 @@ static char *stibp_state(void)
if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
return "";
- switch (spectre_v2_user) {
+ switch (spectre_v2_user_stibp) {
case SPECTRE_V2_USER_NONE:
return ", STIBP: disabled";
case SPECTRE_V2_USER_STRICT:
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 21998a351512eba4ed5969006f0c55882d995ada Mon Sep 17 00:00:00 2001
From: Anthony Steinhauser <asteinhauser(a)google.com>
Date: Tue, 19 May 2020 06:40:42 -0700
Subject: [PATCH] x86/speculation: Avoid force-disabling IBPB based on STIBP
and enhanced IBRS.
When STIBP is unavailable or enhanced IBRS is available, Linux
force-disables the IBPB mitigation of Spectre-BTB even when simultaneous
multithreading is disabled. While attempts to enable IBPB using
prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, ...) fail with
EPERM, the seccomp syscall (or its prctl(PR_SET_SECCOMP, ...) equivalent)
which are used e.g. by Chromium or OpenSSH succeed with no errors but the
application remains silently vulnerable to cross-process Spectre v2 attacks
(classical BTB poisoning). At the same time the SYSFS reporting
(/sys/devices/system/cpu/vulnerabilities/spectre_v2) displays that IBPB is
conditionally enabled when in fact it is unconditionally disabled.
STIBP is useful only when SMT is enabled. When SMT is disabled and STIBP is
unavailable, it makes no sense to force-disable also IBPB, because IBPB
protects against cross-process Spectre-BTB attacks regardless of the SMT
state. At the same time since missing STIBP was only observed on AMD CPUs,
AMD does not recommend using STIBP, but recommends using IBPB, so disabling
IBPB because of missing STIBP goes directly against AMD's advice:
https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Upda…
Similarly, enhanced IBRS is designed to protect cross-core BTB poisoning
and BTB-poisoning attacks from user space against kernel (and
BTB-poisoning attacks from guest against hypervisor), it is not designed
to prevent cross-process (or cross-VM) BTB poisoning between processes (or
VMs) running on the same core. Therefore, even with enhanced IBRS it is
necessary to flush the BTB during context-switches, so there is no reason
to force disable IBPB when enhanced IBRS is available.
Enable the prctl control of IBPB even when STIBP is unavailable or enhanced
IBRS is available.
Fixes: 7cc765a67d8e ("x86/speculation: Enable prctl mode for spectre_v2_user")
Signed-off-by: Anthony Steinhauser <asteinhauser(a)google.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index ed54b3b21c39..8d57562b1d2c 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -495,7 +495,9 @@ early_param("nospectre_v1", nospectre_v1_cmdline);
static enum spectre_v2_mitigation spectre_v2_enabled __ro_after_init =
SPECTRE_V2_NONE;
-static enum spectre_v2_user_mitigation spectre_v2_user __ro_after_init =
+static enum spectre_v2_user_mitigation spectre_v2_user_stibp __ro_after_init =
+ SPECTRE_V2_USER_NONE;
+static enum spectre_v2_user_mitigation spectre_v2_user_ibpb __ro_after_init =
SPECTRE_V2_USER_NONE;
#ifdef CONFIG_RETPOLINE
@@ -641,15 +643,6 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd)
break;
}
- /*
- * At this point, an STIBP mode other than "off" has been set.
- * If STIBP support is not being forced, check if STIBP always-on
- * is preferred.
- */
- if (mode != SPECTRE_V2_USER_STRICT &&
- boot_cpu_has(X86_FEATURE_AMD_STIBP_ALWAYS_ON))
- mode = SPECTRE_V2_USER_STRICT_PREFERRED;
-
/* Initialize Indirect Branch Prediction Barrier */
if (boot_cpu_has(X86_FEATURE_IBPB)) {
setup_force_cpu_cap(X86_FEATURE_USE_IBPB);
@@ -672,23 +665,36 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd)
pr_info("mitigation: Enabling %s Indirect Branch Prediction Barrier\n",
static_key_enabled(&switch_mm_always_ibpb) ?
"always-on" : "conditional");
+
+ spectre_v2_user_ibpb = mode;
}
- /* If enhanced IBRS is enabled no STIBP required */
- if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
+ /*
+ * If enhanced IBRS is enabled or SMT impossible, STIBP is not
+ * required.
+ */
+ if (!smt_possible || spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
return;
/*
- * If SMT is not possible or STIBP is not available clear the STIBP
- * mode.
+ * At this point, an STIBP mode other than "off" has been set.
+ * If STIBP support is not being forced, check if STIBP always-on
+ * is preferred.
+ */
+ if (mode != SPECTRE_V2_USER_STRICT &&
+ boot_cpu_has(X86_FEATURE_AMD_STIBP_ALWAYS_ON))
+ mode = SPECTRE_V2_USER_STRICT_PREFERRED;
+
+ /*
+ * If STIBP is not available, clear the STIBP mode.
*/
- if (!smt_possible || !boot_cpu_has(X86_FEATURE_STIBP))
+ if (!boot_cpu_has(X86_FEATURE_STIBP))
mode = SPECTRE_V2_USER_NONE;
+
+ spectre_v2_user_stibp = mode;
+
set_mode:
- spectre_v2_user = mode;
- /* Only print the STIBP mode when SMT possible */
- if (smt_possible)
- pr_info("%s\n", spectre_v2_user_strings[mode]);
+ pr_info("%s\n", spectre_v2_user_strings[mode]);
}
static const char * const spectre_v2_strings[] = {
@@ -921,7 +927,7 @@ void cpu_bugs_smt_update(void)
{
mutex_lock(&spec_ctrl_mutex);
- switch (spectre_v2_user) {
+ switch (spectre_v2_user_stibp) {
case SPECTRE_V2_USER_NONE:
break;
case SPECTRE_V2_USER_STRICT:
@@ -1164,14 +1170,16 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
{
switch (ctrl) {
case PR_SPEC_ENABLE:
- if (spectre_v2_user == SPECTRE_V2_USER_NONE)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return 0;
/*
* Indirect branch speculation is always disabled in strict
* mode.
*/
- if (spectre_v2_user == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user == SPECTRE_V2_USER_STRICT_PREFERRED)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
return -EPERM;
task_clear_spec_ib_disable(task);
task_update_spec_tif(task);
@@ -1182,10 +1190,12 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
* Indirect branch speculation is always allowed when
* mitigation is force disabled.
*/
- if (spectre_v2_user == SPECTRE_V2_USER_NONE)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return -EPERM;
- if (spectre_v2_user == SPECTRE_V2_USER_STRICT ||
- spectre_v2_user == SPECTRE_V2_USER_STRICT_PREFERRED)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
return 0;
task_set_spec_ib_disable(task);
if (ctrl == PR_SPEC_FORCE_DISABLE)
@@ -1216,7 +1226,8 @@ void arch_seccomp_spec_mitigate(struct task_struct *task)
{
if (ssb_mode == SPEC_STORE_BYPASS_SECCOMP)
ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE);
- if (spectre_v2_user == SPECTRE_V2_USER_SECCOMP)
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_SECCOMP ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_SECCOMP)
ib_prctl_set(task, PR_SPEC_FORCE_DISABLE);
}
#endif
@@ -1247,22 +1258,24 @@ static int ib_prctl_get(struct task_struct *task)
if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
return PR_SPEC_NOT_AFFECTED;
- switch (spectre_v2_user) {
- case SPECTRE_V2_USER_NONE:
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
return PR_SPEC_ENABLE;
- case SPECTRE_V2_USER_PRCTL:
- case SPECTRE_V2_USER_SECCOMP:
+ else if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
+ return PR_SPEC_DISABLE;
+ else if (spectre_v2_user_ibpb == SPECTRE_V2_USER_PRCTL ||
+ spectre_v2_user_ibpb == SPECTRE_V2_USER_SECCOMP ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_PRCTL ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_SECCOMP) {
if (task_spec_ib_force_disable(task))
return PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE;
if (task_spec_ib_disable(task))
return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
- case SPECTRE_V2_USER_STRICT:
- case SPECTRE_V2_USER_STRICT_PREFERRED:
- return PR_SPEC_DISABLE;
- default:
+ } else
return PR_SPEC_NOT_AFFECTED;
- }
}
int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
@@ -1501,7 +1514,7 @@ static char *stibp_state(void)
if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
return "";
- switch (spectre_v2_user) {
+ switch (spectre_v2_user_stibp) {
case SPECTRE_V2_USER_NONE:
return ", STIBP: disabled";
case SPECTRE_V2_USER_STRICT:
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d43e2675e96fc6ae1a633b6a69d296394448cc32 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Tue, 19 May 2020 05:34:41 -0400
Subject: [PATCH] KVM: x86: only do L1TF workaround on affected processors
KVM stores the gfn in MMIO SPTEs as a caching optimization. These are split
in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
in an L1TF attack. This works as long as there are 5 free bits between
MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
access triggers a reserved-bit-set page fault.
The bit positions however were computed wrongly for AMD processors that have
encryption support. In this case, x86_phys_bits is reduced (for example
from 48 to 43, to account for the C bit at position 47 and four bits used
internally to store the SEV ASID and other stuff) while x86_cache_bits in
would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
and bit 51 are set. Then low_phys_bits would also cover some of the
bits that are set in the shadow_mmio_value, terribly confusing the gfn
caching mechanism.
To fix this, avoid splitting gfns as long as the processor does not have
the L1TF bug (which includes all AMD processors). When there is no
splitting, low_phys_bits can be set to the reduced MAXPHYADDR removing
the overlap. This fixes "npt=0" operation on EPYC processors.
Thanks to Maxim Levitsky for bisecting this bug.
Cc: stable(a)vger.kernel.org
Fixes: 52918ed5fcf0 ("KVM: SVM: Override default MMIO mask if memory encryption is enabled")
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8071952e9cf2..86619631ff6a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -335,6 +335,8 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value, u64 access_mask)
{
BUG_ON((u64)(unsigned)access_mask != access_mask);
BUG_ON((mmio_mask & mmio_value) != mmio_value);
+ WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask << shadow_nonpresent_or_rsvd_mask_len));
+ WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
shadow_mmio_access_mask = access_mask;
@@ -583,16 +585,15 @@ static void kvm_mmu_reset_all_pte_masks(void)
* the most significant bits of legal physical address space.
*/
shadow_nonpresent_or_rsvd_mask = 0;
- low_phys_bits = boot_cpu_data.x86_cache_bits;
- if (boot_cpu_data.x86_cache_bits <
- 52 - shadow_nonpresent_or_rsvd_mask_len) {
+ low_phys_bits = boot_cpu_data.x86_phys_bits;
+ if (boot_cpu_has_bug(X86_BUG_L1TF) &&
+ !WARN_ON_ONCE(boot_cpu_data.x86_cache_bits >=
+ 52 - shadow_nonpresent_or_rsvd_mask_len)) {
+ low_phys_bits = boot_cpu_data.x86_cache_bits
+ - shadow_nonpresent_or_rsvd_mask_len;
shadow_nonpresent_or_rsvd_mask =
- rsvd_bits(boot_cpu_data.x86_cache_bits -
- shadow_nonpresent_or_rsvd_mask_len,
- boot_cpu_data.x86_cache_bits - 1);
- low_phys_bits -= shadow_nonpresent_or_rsvd_mask_len;
- } else
- WARN_ON_ONCE(boot_cpu_has_bug(X86_BUG_L1TF));
+ rsvd_bits(low_phys_bits, boot_cpu_data.x86_cache_bits - 1);
+ }
shadow_nonpresent_or_rsvd_lower_gfn_mask =
GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
stable team,
please consider
commit 49f2d2419d60a103752e5fbaf158cf8d07c0d884
usercopy: mark dma-kmalloc caches as usercopy caches
for stable.
-------- Forwarded Message --------
Subject: [merged] usercopy-mark-dma-kmalloc-caches-as-usercopy-caches.patch removed from -mm tree
Date: Tue, 02 Jun 2020 14:37:44 -0700
From: akpm(a)linux-foundation.org
To: borntraeger(a)de.ibm.com, christoffer.dall(a)linaro.org, cl(a)linux.com, dave.kleikamp(a)oracle.com, dave(a)nullcore.net, davem(a)davemloft.net, hch(a)infradead.org, iamjoonsoo.kim(a)lge.com, jack(a)suse.cz, jannh(a)google.com, jslaby(a)suse.cz, jwi(a)linux.ibm.com, labbott(a)redhat.com, luisbg(a)kernel.org, luto(a)kernel.org, marc.zyngier(a)arm.com, mark.rutland(a)arm.com, martin.petersen(a)oracle.com, mjg59(a)google.com, mkubecek(a)suse.cz, mm-commits(a)vger.kernel.org, pbonzini(a)redhat.com, penberg(a)kernel.org, riel(a)surriel.com, rientjes(a)google.com, torvalds(a)linux-foundation.org, ubraun(a)linux.ibm.com, vbabka(a)suse.cz, viro(a)zeniv.linux.org.uk
The patch titled
Subject: usercopy: mark dma-kmalloc caches as usercopy caches
has been removed from the -mm tree. Its filename was
usercopy-mark-dma-kmalloc-caches-as-usercopy-caches.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Vlastimil Babka <vbabka(a)suse.cz>
Subject: usercopy: mark dma-kmalloc caches as usercopy caches
We have seen a "usercopy: Kernel memory overwrite attempt detected to SLUB
object 'dma-kmalloc-1 k' (offset 0, size 11)!" error on s390x, as IUCV
uses kmalloc() with __GFP_DMA because of memory address restrictions. The
issue has been discussed [2] and it has been noted that if all the kmalloc
caches are marked as usercopy, there's little reason not to mark
dma-kmalloc caches too. The 'dma' part merely means that __GFP_DMA is
used to restrict memory address range.
As Jann Horn put it [3]:
"I think dma-kmalloc slabs should be handled the same way as normal
kmalloc slabs. When a dma-kmalloc allocation is freshly created, it is
just normal kernel memory - even if it might later be used for DMA -, and
it should be perfectly fine to copy_from_user() into such allocations at
that point, and to copy_to_user() out of them at the end. If you look at
the places where such allocations are created, you can see things like
kmemdup(), memcpy() and so on - all normal operations that shouldn't
conceptually be different from usercopy in any relevant way."
Thus this patch marks the dma-kmalloc-* caches as usercopy.
[1] https://bugzilla.suse.com/show_bug.cgi?id=1156053
[2] https://lore.kernel.org/kernel-hardening/bfca96db-bbd0-d958-7732-76e36c667c…
[3] https://lore.kernel.org/kernel-hardening/CAG48ez1a4waGk9kB0WLaSbs4muSoK0AYA…
Link: http://lkml.kernel.org/r/7d810f6d-8085-ea2f-7805-47ba3842dc50@suse.cz
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
Acked-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Acked-by: Jiri Slaby <jslaby(a)suse.cz>
Cc: Jann Horn <jannh(a)google.com>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: Christopher Lameter <cl(a)linux.com>
Cc: Julian Wiedmann <jwi(a)linux.ibm.com>
Cc: Ursula Braun <ubraun(a)linux.ibm.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: David Windsor <dave(a)nullcore.net>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: Laura Abbott <labbott(a)redhat.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: "Martin K. Petersen" <martin.petersen(a)oracle.com>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Christoffer Dall <christoffer.dall(a)linaro.org>
Cc: Dave Kleikamp <dave.kleikamp(a)oracle.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Luis de Bethencourt <luisbg(a)kernel.org>
Cc: Marc Zyngier <marc.zyngier(a)arm.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Matthew Garrett <mjg59(a)google.com>
Cc: Michal Kubecek <mkubecek(a)suse.cz>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab_common.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/slab_common.c~usercopy-mark-dma-kmalloc-caches-as-usercopy-caches
+++ a/mm/slab_common.c
@@ -1303,7 +1303,8 @@ void __init create_kmalloc_caches(slab_f
kmalloc_caches[KMALLOC_DMA][i] = create_kmalloc_cache(
kmalloc_info[i].name[KMALLOC_DMA],
kmalloc_info[i].size,
- SLAB_CACHE_DMA | flags, 0, 0);
+ SLAB_CACHE_DMA | flags, 0,
+ kmalloc_info[i].size);
}
}
#endif
_
Patches currently in -mm which might be from vbabka(a)suse.cz are
kernel-sysctl-support-setting-sysctl-parameters-from-kernel-command-line.patch
kernel-sysctl-support-handling-command-line-aliases.patch
kernel-hung_task-convert-hung_task_panic-boot-parameter-to-sysctl.patch
tools-testing-selftests-sysctl-sysctlsh-support-config_test_sysctl=y.patch
lib-test_sysctl-support-testing-of-sysctl-boot-parameter.patch
lib-test_sysctl-support-testing-of-sysctl-boot-parameter-fix.patch