From: Niklas Cassel <niklas.cassel(a)wdc.com>
Zone management send operations (BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE
and BLKFINISHZONE) should be allowed under the same permissions as write().
(write() does not require CAP_SYS_ADMIN).
Additionally, other ioctls like BLKSECDISCARD and BLKZEROOUT only check if
the fd was successfully opened with FMODE_WRITE.
(They do not require CAP_SYS_ADMIN).
Currently, zone management send operations require both CAP_SYS_ADMIN
and that the fd was successfully opened with FMODE_WRITE.
Remove the CAP_SYS_ADMIN requirement, so that zone management send
operations match the access control requirement of write(), BLKSECDISCARD
and BLKZEROOUT.
Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
Signed-off-by: Niklas Cassel <niklas.cassel(a)wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal(a)wdc.com>
Cc: stable(a)vger.kernel.org # v4.10+
---
Changes since v2:
-None
Note to backporter:
Function was added as blkdev_reset_zones_ioctl() in v4.10.
Function was renamed to blkdev_zone_mgmt_ioctl() in v5.5.
The patch is valid both before and after the function rename.
block/blk-zoned.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 250cb76ee615..0789e6e9f7db 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -349,9 +349,6 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
if (!blk_queue_is_zoned(q))
return -ENOTTY;
- if (!capable(CAP_SYS_ADMIN))
- return -EACCES;
-
if (!(mode & FMODE_WRITE))
return -EBADF;
--
2.31.1
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 28e5e44aa3f4e0e0370864ed008fb5e2d85f4dc8
Gitweb: https://git.kernel.org/tip/28e5e44aa3f4e0e0370864ed008fb5e2d85f4dc8
Author: Fan Du <fan.du(a)intel.com>
AuthorDate: Thu, 17 Jun 2021 12:46:57 -07:00
Committer: Borislav Petkov <bp(a)suse.de>
CommitterDate: Fri, 18 Jun 2021 19:37:01 +02:00
x86/mm: Avoid truncating memblocks for SGX memory
tl;dr:
Several SGX users reported seeing the following message on NUMA systems:
sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0.
This turned out to be the memblock code mistakenly throwing away SGX
memory.
=== Full Changelog ===
The 'max_pfn' variable represents the highest known RAM address. It can
be used, for instance, to quickly determine for which physical addresses
there is mem_map[] space allocated. The numa_meminfo code makes an
effort to throw out ("trim") all memory blocks which are above 'max_pfn'.
SGX memory is not considered RAM (it is marked as "Reserved" in the
e820) and is not taken into account by max_pfn. Despite this, SGX memory
areas have NUMA affinity and are enumerated in the ACPI SRAT table. The
existing SGX code uses the numa_meminfo mechanism to look up the NUMA
affinity for its memory areas.
In cases where SGX memory was above max_pfn (usually just the one EPC
section in the last highest NUMA node), the numa_memblock is truncated
at 'max_pfn', which is below the SGX memory. When the SGX code tries to
look up the affinity of this memory, it fails and produces an error message:
sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0.
and assigns the memory to NUMA node 0.
Instead of silently truncating the memory block at 'max_pfn' and
dropping the SGX memory, add the truncated portion to
'numa_reserved_meminfo'. This allows the SGX code to later determine
the NUMA affinity of its 'Reserved' area.
Before, numa_meminfo looked like this (from 'crash'):
blk = { start = 0x0, end = 0x2080000000, nid = 0x0 }
{ start = 0x2080000000, end = 0x4000000000, nid = 0x1 }
numa_reserved_meminfo is empty.
With this, numa_meminfo looks like this:
blk = { start = 0x0, end = 0x2080000000, nid = 0x0 }
{ start = 0x2080000000, end = 0x4000000000, nid = 0x1 }
and numa_reserved_meminfo has an entry for node 1's SGX memory:
blk = { start = 0x4000000000, end = 0x4080000000, nid = 0x1 }
[ daveh: completely rewrote/reworked changelog ]
Fixes: 5d30f92e7631 ("x86/NUMA: Provide a range-to-target_node lookup facility")
Reported-by: Reinette Chatre <reinette.chatre(a)intel.com>
Signed-off-by: Fan Du <fan.du(a)intel.com>
Signed-off-by: Dave Hansen <dave.hansen(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko(a)kernel.org>
Reviewed-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Dave Hansen <dave.hansen(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/20210617194657.0A99CB22@viggo.jf.intel.com
---
arch/x86/mm/numa.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 5eb4dc2..e94da74 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -254,7 +254,13 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
/* make sure all non-reserved blocks are inside the limits */
bi->start = max(bi->start, low);
- bi->end = min(bi->end, high);
+
+ /* preserve info for non-RAM areas above 'max_pfn': */
+ if (bi->end > high) {
+ numa_add_memblk_to(bi->nid, high, bi->end,
+ &numa_reserved_meminfo);
+ bi->end = high;
+ }
/* and there's no empty block */
if (bi->start >= bi->end)
Bad counter reads are experienced sometimes when bit 10 or greater rolls
over. Originally, testing showed that at least 10 lower bits would be
set to the same value during these bad reads. However, some users still
reported time skips.
Wider testing revealed that on some chips, occasionally only the lowest
9 bits would read as the anomalous value. During these reads (which
still happen only when bit 10), bit 9 would read as the correct value.
Reduce the mask by one bit to cover these cases as well.
Cc: stable(a)vger.kernel.org
Fixes: c950ca8c35ee ("clocksource/drivers/arch_timer: Workaround for Allwinner A64 timer instability")
Reported-by: Roman Stratiienko <r.stratiienko(a)gmail.com>
Signed-off-by: Samuel Holland <samuel(a)sholland.org>
---
The tool used for testing is here:
https://github.com/smaeul/timer-tools
For examples of the 9-bit pattern, see the data here:
https://github.com/8bitgc/timer-tools/tree/master/output
I was able to reproduce the same pattern (although _extremely_ rarely)
on 1 of the 8 A64 boards I currently have access to.
This explanation is consistent with the earlier report here:
https://lore.kernel.org/lkml/20200929111347.1967438-1-r.stratiienko@gmail.c…
In that report, the time went backward 20542 ns == 493 cycles @ 24 MHz,
which matches the expected 2^9 == 512 cycles minus system call overhead.
drivers/clocksource/arm_arch_timer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index d0177824c518..f4881764bf8f 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -352,7 +352,7 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
do { \
_val = read_sysreg(reg); \
_retries--; \
- } while (((_val + 1) & GENMASK(9, 0)) <= 1 && _retries); \
+ } while (((_val + 1) & GENMASK(8, 0)) <= 1 && _retries); \
\
WARN_ON_ONCE(!_retries); \
_val; \
--
2.26.3