Changes since v3 [1] (note v4 was a partial re-roll, but more feedback came in requiring a v5): - Fix a kbuild robot report for a missing header include of cc_platform.h - Switch to selecting STRICT_DEVMEM and IOSTRICT_DEVMEM rather than "depends on". (Naveen) - Clarify the "SEPT violation" vs "crash" and other changelog fixups for devmem maintainers and other arch maintainers. (Dave) - Drop patch numbering since patch2 is a fix and has no dependencies on patch1
[1]: http://lore.kernel.org/174491711228.1395340.3647010925173796093.stgit@dwilli...
--- The original response to Nikolay's report of a "crash" (unhandled SEPT violation) triggered by /dev/mem access to private memory was "let's just turn off /dev/mem".
After some machinations of x86_platform_ops to block a subset of problematic access, spelunking the history of devmem_is_allowed() returning "2" to enable some compatibility benefits while blocking access, and discovering that userspace depends buggy kernel behavior for mmap(2) of the first 1MB of memory on x86, the proposal has circled back to "disable /dev/mem".
Require both STRICT_DEVMEM and IO_STRICT_DEVMEM for x86 confidential guests to close /dev/mem hole while still allowing for userspace mapping of PCI MMIO as long as the kernel and userspace are not mapping the range at the same time.
The range_is_allowed() cleanup is not strictly necessary, but might as well close a 17 year-old "TODO".
Dan Williams (2): x86/devmem: Remove duplicate range_is_allowed() definition x86/devmem: Drop /dev/mem access for confidential guests
arch/x86/Kconfig | 4 ++++ arch/x86/mm/pat/memtype.c | 31 ++++--------------------------- drivers/char/mem.c | 28 ++++++++++------------------ include/linux/io.h | 21 +++++++++++++++++++++ 4 files changed, 39 insertions(+), 45 deletions(-)
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
17 years ago, Venki suggested [1] "A future improvement would be to avoid the range_is_allowed duplication".
The only thing preventing a common implementation is that phys_mem_access_prot_allowed() expects the range check to exit immediately when PAT is disabled [2]. I.e. there is no cache conflict to manage in that case. This cleanup was noticed on the path to considering changing range_is_allowed() policy to blanket deny /dev/mem for private (confidential computing) memory.
Note, however that phys_mem_access_prot_allowed() has long since stopped being relevant for managing cache-type validation due to [3], and [4].
Commit 0124cecfc85a ("x86, PAT: disable /dev/mem mmap RAM with PAT") [1] Commit 9e41bff2708e ("x86: fix /dev/mem mmap breakage when PAT is disabled") [2] Commit 1886297ce0c8 ("x86/mm/pat: Fix BUG_ON() in mmap_mem() on QEMU/i386") [3] Commit 0c3c8a18361a ("x86, PAT: Remove duplicate memtype reserve in devmem mmap") [4]
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Ingo Molnar mingo@kernel.org Cc: "Naveen N Rao" naveen@kernel.org Reviewed-by: Nikolay Borisov nik.borisov@suse.com Signed-off-by: Dan Williams dan.j.williams@intel.com --- arch/x86/mm/pat/memtype.c | 31 ++++--------------------------- drivers/char/mem.c | 18 ------------------ include/linux/io.h | 21 +++++++++++++++++++++ 3 files changed, 25 insertions(+), 45 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index 72d8cbc61158..c97b6598f187 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -38,6 +38,7 @@ #include <linux/kernel.h> #include <linux/pfn_t.h> #include <linux/slab.h> +#include <linux/io.h> #include <linux/mm.h> #include <linux/highmem.h> #include <linux/fs.h> @@ -773,38 +774,14 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, return vma_prot; }
-#ifdef CONFIG_STRICT_DEVMEM -/* This check is done in drivers/char/mem.c in case of STRICT_DEVMEM */ -static inline int range_is_allowed(unsigned long pfn, unsigned long size) -{ - return 1; -} -#else -/* This check is needed to avoid cache aliasing when PAT is enabled */ -static inline int range_is_allowed(unsigned long pfn, unsigned long size) -{ - u64 from = ((u64)pfn) << PAGE_SHIFT; - u64 to = from + size; - u64 cursor = from; - - if (!pat_enabled()) - return 1; - - while (cursor < to) { - if (!devmem_is_allowed(pfn)) - return 0; - cursor += PAGE_SIZE; - pfn++; - } - return 1; -} -#endif /* CONFIG_STRICT_DEVMEM */ - int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn, unsigned long size, pgprot_t *vma_prot) { enum page_cache_mode pcm = _PAGE_CACHE_MODE_WB;
+ if (!pat_enabled()) + return 1; + if (!range_is_allowed(pfn, size)) return 0;
diff --git a/drivers/char/mem.c b/drivers/char/mem.c index 169eed162a7f..48839958b0b1 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -61,29 +61,11 @@ static inline int page_is_allowed(unsigned long pfn) { return devmem_is_allowed(pfn); } -static inline int range_is_allowed(unsigned long pfn, unsigned long size) -{ - u64 from = ((u64)pfn) << PAGE_SHIFT; - u64 to = from + size; - u64 cursor = from; - - while (cursor < to) { - if (!devmem_is_allowed(pfn)) - return 0; - cursor += PAGE_SIZE; - pfn++; - } - return 1; -} #else static inline int page_is_allowed(unsigned long pfn) { return 1; } -static inline int range_is_allowed(unsigned long pfn, unsigned long size) -{ - return 1; -} #endif
static inline bool should_stop_iteration(void) diff --git a/include/linux/io.h b/include/linux/io.h index 6a6bc4d46d0a..0642c7ee41db 100644 --- a/include/linux/io.h +++ b/include/linux/io.h @@ -183,4 +183,25 @@ static inline void arch_io_free_memtype_wc(resource_size_t base, int devm_arch_io_reserve_memtype_wc(struct device *dev, resource_size_t start, resource_size_t size);
+#ifdef CONFIG_STRICT_DEVMEM +static inline int range_is_allowed(unsigned long pfn, unsigned long size) +{ + u64 from = ((u64)pfn) << PAGE_SHIFT; + u64 to = from + size; + u64 cursor = from; + + while (cursor < to) { + if (!devmem_is_allowed(pfn)) + return 0; + cursor += PAGE_SIZE; + pfn++; + } + return 1; +} +#else +static inline int range_is_allowed(unsigned long pfn, unsigned long size) +{ + return 1; +} +#endif #endif /* _LINUX_IO_H */
Nikolay reports that accessing BIOS data (first 1MB of the physical address space) via /dev/mem results in a "crash" / terminated VM (unhandled SEPT violation). See report [1] for details.
The cause is ioremap() (via xlate_dev_mem_ptr()) establishes an unencrypted mapping where the kernel had established an encrypted mapping previously. The CPU enforces mapping consistency with a fault upon detecting a mismatch. A similar situation arises with devmem access to "unaccepted" confidential memory. In summary, it is fraught to allow uncoordinated userspace mapping of confidential memory.
While there is an existing mitigation to simulate and redirect access to the BIOS data area with STRICT_DEVMEM=y, it is insufficient. Specifically, STRICT_DEVMEM=y traps read(2) access to the BIOS data area, and returns a zeroed buffer. However, it turns out the kernel fails to enforce the same via mmap(2), and a direct mapping is established. This is a hole, and unfortunately userspace has learned to exploit it [2].
This means the kernel either needs: a mechanism to ensure consistent plus accepted "encrypted" mappings of this /dev/mem mmap() hole, close the hole by mapping the zero page in the mmap(2) path, block only BIOS data access and let typical STRICT_DEVMEM protect the rest, or disable /dev/mem altogether.
The simplest option for now is arrange for /dev/mem to always behave as if lockdown is enabled for confidential guests. Require confidential guest userspace to jettison legacy dependencies on /dev/mem similar to how other legacy mechanisms are jettisoned for confidential operation. Recall that modern methods for BIOS data access are available like /sys/firmware/dmi/tables.
Now, this begs the question what to do with PCI sysfs which allows userspace mappings of confidential MMIO with similar mapping consistency and acceptance expectations? Here, the existing mitigation of IO_STRICT_DEVMEM is likely sufficient. The kernel is expected to use request_mem_region() when toggling the state of MMIO. With IO_STRICT_DEVMEM that enforces kernel-exclusive access until release_mem_region(), i.e. mapping conflicts are prevented.
Cc: x86@kernel.org Cc: Kees Cook kees@kernel.org Cc: Ingo Molnar mingo@kernel.org Cc: Vishal Annapurve vannapurve@google.com Cc: Kirill Shutemov kirill.shutemov@linux.intel.com Cc: Arnd Bergmann arnd@arndb.de Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Suzuki K Poulose suzuki.poulose@arm.com Link: https://sources.debian.org/src/libdebian-installer/0.125/src/system/subarch-... [2] Reported-by: Nikolay Borisov nik.borisov@suse.com Closes: http://lore.kernel.org/20250318113604.297726-1-nik.borisov@suse.com [1] Fixes: 9aa6ea69852c ("x86/tdx: Make pages shared in ioremap()") Cc: stable@vger.kernel.org Acked-by: Dave Hansen dave.hansen@linux.intel.com Reviewed-by: Nikolay Borisov nik.borisov@suse.com Acked-by: Naveen N Rao (AMD) naveen@kernel.org Tested-by: Naveen N Rao (AMD) naveen@kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com --- arch/x86/Kconfig | 4 ++++ drivers/char/mem.c | 10 ++++++++++ 2 files changed, 14 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 4b9f378e05f6..36f11aad1ae5 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -891,6 +891,8 @@ config INTEL_TDX_GUEST depends on X86_X2APIC depends on EFI_STUB depends on PARAVIRT + select STRICT_DEVMEM + select IO_STRICT_DEVMEM select ARCH_HAS_CC_PLATFORM select X86_MEM_ENCRYPT select X86_MCE @@ -1510,6 +1512,8 @@ config AMD_MEM_ENCRYPT bool "AMD Secure Memory Encryption (SME) support" depends on X86_64 && CPU_SUP_AMD depends on EFI_STUB + select STRICT_DEVMEM + select IO_STRICT_DEVMEM select DMA_COHERENT_POOL select ARCH_USE_MEMREMAP_PROT select INSTRUCTION_DECODER diff --git a/drivers/char/mem.c b/drivers/char/mem.c index 48839958b0b1..47729606b817 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -30,6 +30,7 @@ #include <linux/uio.h> #include <linux/uaccess.h> #include <linux/security.h> +#include <linux/cc_platform.h>
#define DEVMEM_MINOR 1 #define DEVPORT_MINOR 4 @@ -595,6 +596,15 @@ static int open_port(struct inode *inode, struct file *filp) if (rc) return rc;
+ /* + * Enforce encrypted mapping consistency and avoid unaccepted + * memory conflicts, "lockdown" /dev/mem for confidential + * guests. + */ + if (IS_ENABLED(CONFIG_STRICT_DEVMEM) && + cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) + return -EPERM; + if (iminor(inode) != DEVMEM_MINOR) return 0;
linux-stable-mirror@lists.linaro.org