Changes since v1 [1]: * Fix the fact that devmem_is_allowed() == 2 does not prevent mmap access (Kees) * Rather than teach devmem_is_allowed() == 2 to map zero pages in the mmap case, just fail (Nikolay)
[1]: http://lore.kernel.org/67f5b75c37143_71fe2949b@dwillia2-xfh.jf.intel.com.not...
--- The story starts with Nikolay reporting an SEPT violation due to mismatched encrypted/non-encrypted mappings of the BIOS data space [2].
An initial suggestion to just make sure that the BIOS data space is mapped consistently [3] ran into another issue that TDX and SEV-SNP disagree about when that space can be mapped as encrypted.
Then, in response to a partial patch to allow SEV-SNP to block BIOS data space for other reasons [4], Dave asked why not just give up on /dev/mem access entirely in the confidential VM case [5].
Enter this series to:
1/ Close a subtle hole whereby /dev/mem that is supposed return zeros in lieu of access only enforces that for read()/write()
2/ Use that new closed hole to reliably disable all /dev/mem access for confidential x86 VMs
[2]: http://lore.kernel.org/20250318113604.297726-1-nik.borisov@suse.com [3]: http://lore.kernel.org/174346288005.2166708.14425674491111625620.stgit@dwill... [4]: http://lore.kernel.org/20250403120228.2344377-1-naveen@kernel.org [5]: http://lore.kernel.org/fd683daa-d953-48ca-8c5d-6f4688ad442c@intel.com ---
Dan Williams (3): x86/devmem: Remove duplicate range_is_allowed() definition devmem: Block mmap access when read/write access is restricted x86/devmem: Restrict /dev/mem access for potentially unaccepted memory by default
arch/x86/Kconfig | 2 ++ arch/x86/include/asm/x86_init.h | 2 ++ arch/x86/kernel/x86_init.c | 6 ++++++ arch/x86/mm/init.c | 23 +++++++++++++++++------ arch/x86/mm/pat/memtype.c | 31 ++++--------------------------- drivers/char/mem.c | 18 ------------------ include/linux/io.h | 26 ++++++++++++++++++++++++++ 7 files changed, 57 insertions(+), 51 deletions(-)
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
Nikolay reports [1] that accessing BIOS data (first 1MB of the physical address space) via /dev/mem results in an SEPT violation.
The cause is ioremap() (via xlate_dev_mem_ptr()) establishes an unencrypted mapping where the kernel had established an encrypted mapping previously.
An initial attempt to fix this revealed that TDX and SEV-SNP have different expectations about which and when address ranges can be mapped via /dev/mem.
Rather than develop a precise set of allowed /dev/mem capable TVM address ranges, teach devmem_is_allowed() to always restrict access to the BIOS data space. This means return 0s for read(), drop write(), and -EPERM mmap(). This can still be later relaxed as specific needs arise, but in the meantime, close off this source of mismatched IORES_MAP_ENCRYPTED expectations.
Cc: x86@kernel.org Cc: Ingo Molnar mingo@kernel.org Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Vishal Annapurve vannapurve@google.com Cc: Kirill Shutemov kirill.shutemov@linux.intel.com Reported-by: Nikolay Borisov nik.borisov@suse.com Closes: http://lore.kernel.org/20250318113604.297726-1-nik.borisov@suse.com [1] Reviewed-by: Nikolay Borisov nik.borisov@suse.com Fixes: 9aa6ea69852c ("x86/tdx: Make pages shared in ioremap()") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com --- arch/x86/Kconfig | 2 ++ arch/x86/include/asm/x86_init.h | 2 ++ arch/x86/kernel/x86_init.c | 6 ++++++ arch/x86/mm/init.c | 23 +++++++++++++++++------ 4 files changed, 27 insertions(+), 6 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 4b9f378e05f6..12a1b5acd55b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -891,6 +891,7 @@ config INTEL_TDX_GUEST depends on X86_X2APIC depends on EFI_STUB depends on PARAVIRT + depends on STRICT_DEVMEM select ARCH_HAS_CC_PLATFORM select X86_MEM_ENCRYPT select X86_MCE @@ -1510,6 +1511,7 @@ config AMD_MEM_ENCRYPT bool "AMD Secure Memory Encryption (SME) support" depends on X86_64 && CPU_SUP_AMD depends on EFI_STUB + depends on STRICT_DEVMEM select DMA_COHERENT_POOL select ARCH_USE_MEMREMAP_PROT select INSTRUCTION_DECODER diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 213cf5379a5a..0ae436b34b88 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -305,6 +305,7 @@ struct x86_hyper_runtime { * semantics. * @realmode_reserve: reserve memory for realmode trampoline * @realmode_init: initialize realmode trampoline + * @devmem_is_allowed restrict /dev/mem and PCI sysfs resource access * @hyper: x86 hypervisor specific runtime callbacks */ struct x86_platform_ops { @@ -323,6 +324,7 @@ struct x86_platform_ops { void (*set_legacy_features)(void); void (*realmode_reserve)(void); void (*realmode_init)(void); + bool (*devmem_is_allowed)(unsigned long pfn); struct x86_hyper_runtime hyper; struct x86_guest guest; }; diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c index 0a2bbd674a6d..346301375bd4 100644 --- a/arch/x86/kernel/x86_init.c +++ b/arch/x86/kernel/x86_init.c @@ -143,6 +143,11 @@ static void enc_kexec_begin_noop(void) {} static void enc_kexec_finish_noop(void) {} static bool is_private_mmio_noop(u64 addr) {return false; }
+static bool platform_devmem_is_allowed(unsigned long pfn) +{ + return !cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT); +} + struct x86_platform_ops x86_platform __ro_after_init = { .calibrate_cpu = native_calibrate_cpu_early, .calibrate_tsc = native_calibrate_tsc, @@ -156,6 +161,7 @@ struct x86_platform_ops x86_platform __ro_after_init = { .restore_sched_clock_state = tsc_restore_sched_clock_state, .realmode_reserve = reserve_real_mode, .realmode_init = init_real_mode, + .devmem_is_allowed = platform_devmem_is_allowed, .hyper.pin_vcpu = x86_op_int_noop, .hyper.is_private_mmio = is_private_mmio_noop,
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index bfa444a7dbb0..df5435c8dbea 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -861,18 +861,23 @@ void __init poking_init(void) * area traditionally contains BIOS code and data regions used by X, dosemu, * and similar apps. Since they map the entire memory range, the whole range * must be allowed (for mapping), but any areas that would otherwise be - * disallowed are flagged as being "zero filled" instead of rejected. + * disallowed are flagged as being "zero filled" instead of rejected, for + * read()/write(). + * * Access has to be given to non-kernel-ram areas as well, these contain the * PCI mmio resources as well as potential bios/acpi data regions. */ int devmem_is_allowed(unsigned long pagenr) { + bool platform_allowed = x86_platform.devmem_is_allowed(pagenr); + if (region_intersects(PFN_PHYS(pagenr), PAGE_SIZE, IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE) != REGION_DISJOINT) { /* - * For disallowed memory regions in the low 1MB range, - * request that the page be shown as all zeros. + * For disallowed memory regions in the low 1MB range, request + * that the page be shown as all zeros for read()/write(), fail + * mmap() */ if (pagenr < 256) return 2; @@ -885,14 +890,20 @@ int devmem_is_allowed(unsigned long pagenr) * restricted resource under CONFIG_STRICT_DEVMEM. */ if (iomem_is_exclusive(pagenr << PAGE_SHIFT)) { - /* Low 1MB bypasses iomem restrictions. */ - if (pagenr < 256) + /* + * Low 1MB bypasses iomem restrictions unless the platform says + * the physical address is not suitable for direct access. + */ + if (pagenr < 256) { + if (!platform_allowed) + return 2; return 1; + }
return 0; }
- return 1; + return platform_allowed; }
void free_init_pages(const char *what, unsigned long begin, unsigned long end)
On Thu, Apr 10, 2025 at 06:22:38PM -0700, Dan Williams wrote:
Nikolay reports [1] that accessing BIOS data (first 1MB of the physical address space) via /dev/mem results in an SEPT violation.
The cause is ioremap() (via xlate_dev_mem_ptr()) establishes an unencrypted mapping where the kernel had established an encrypted mapping previously.
An initial attempt to fix this revealed that TDX and SEV-SNP have different expectations about which and when address ranges can be mapped via /dev/mem.
Rather than develop a precise set of allowed /dev/mem capable TVM address ranges, teach devmem_is_allowed() to always restrict access to the BIOS data space.
This patch does more than just restrict the BIOS data space - it rejects all accesses to /dev/mem _apart_ from the first 1MB. That should be made clear here.
This means return 0s for read(), drop write(), and -EPERM mmap(). This can still be later relaxed as specific needs arise, but in the meantime, close off this source of mismatched IORES_MAP_ENCRYPTED expectations.
Cc: x86@kernel.org Cc: Ingo Molnar mingo@kernel.org Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Vishal Annapurve vannapurve@google.com Cc: Kirill Shutemov kirill.shutemov@linux.intel.com Reported-by: Nikolay Borisov nik.borisov@suse.com Closes: http://lore.kernel.org/20250318113604.297726-1-nik.borisov@suse.com [1] Reviewed-by: Nikolay Borisov nik.borisov@suse.com Fixes: 9aa6ea69852c ("x86/tdx: Make pages shared in ioremap()") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com
arch/x86/Kconfig | 2 ++ arch/x86/include/asm/x86_init.h | 2 ++ arch/x86/kernel/x86_init.c | 6 ++++++ arch/x86/mm/init.c | 23 +++++++++++++++++------ 4 files changed, 27 insertions(+), 6 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 4b9f378e05f6..12a1b5acd55b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -891,6 +891,7 @@ config INTEL_TDX_GUEST depends on X86_X2APIC depends on EFI_STUB depends on PARAVIRT
- depends on STRICT_DEVMEM select ARCH_HAS_CC_PLATFORM select X86_MEM_ENCRYPT select X86_MCE
@@ -1510,6 +1511,7 @@ config AMD_MEM_ENCRYPT bool "AMD Secure Memory Encryption (SME) support" depends on X86_64 && CPU_SUP_AMD depends on EFI_STUB
- depends on STRICT_DEVMEM select DMA_COHERENT_POOL select ARCH_USE_MEMREMAP_PROT select INSTRUCTION_DECODER
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 213cf5379a5a..0ae436b34b88 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -305,6 +305,7 @@ struct x86_hyper_runtime {
semantics.
- @realmode_reserve: reserve memory for realmode trampoline
- @realmode_init: initialize realmode trampoline
*/
- @devmem_is_allowed restrict /dev/mem and PCI sysfs resource access
- @hyper: x86 hypervisor specific runtime callbacks
struct x86_platform_ops { @@ -323,6 +324,7 @@ struct x86_platform_ops { void (*set_legacy_features)(void); void (*realmode_reserve)(void); void (*realmode_init)(void);
- bool (*devmem_is_allowed)(unsigned long pfn); struct x86_hyper_runtime hyper; struct x86_guest guest;
}; diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c index 0a2bbd674a6d..346301375bd4 100644 --- a/arch/x86/kernel/x86_init.c +++ b/arch/x86/kernel/x86_init.c @@ -143,6 +143,11 @@ static void enc_kexec_begin_noop(void) {} static void enc_kexec_finish_noop(void) {} static bool is_private_mmio_noop(u64 addr) {return false; } +static bool platform_devmem_is_allowed(unsigned long pfn) +{
- return !cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT);
+}
struct x86_platform_ops x86_platform __ro_after_init = { .calibrate_cpu = native_calibrate_cpu_early, .calibrate_tsc = native_calibrate_tsc, @@ -156,6 +161,7 @@ struct x86_platform_ops x86_platform __ro_after_init = { .restore_sched_clock_state = tsc_restore_sched_clock_state, .realmode_reserve = reserve_real_mode, .realmode_init = init_real_mode,
- .devmem_is_allowed = platform_devmem_is_allowed, .hyper.pin_vcpu = x86_op_int_noop, .hyper.is_private_mmio = is_private_mmio_noop,
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index bfa444a7dbb0..df5435c8dbea 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -861,18 +861,23 @@ void __init poking_init(void)
- area traditionally contains BIOS code and data regions used by X, dosemu,
- and similar apps. Since they map the entire memory range, the whole range
- must be allowed (for mapping), but any areas that would otherwise be
- disallowed are flagged as being "zero filled" instead of rejected.
- disallowed are flagged as being "zero filled" instead of rejected, for
- read()/write().
*/
- Access has to be given to non-kernel-ram areas as well, these contain the
- PCI mmio resources as well as potential bios/acpi data regions.
int devmem_is_allowed(unsigned long pagenr) {
- bool platform_allowed = x86_platform.devmem_is_allowed(pagenr);
If we are going to do this, I don't see the point of having an x86_platform_op. It may be better to simply gate this on cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) directly here.
Thanks, Naveen
if (region_intersects(PFN_PHYS(pagenr), PAGE_SIZE, IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE) != REGION_DISJOINT) { /*
* For disallowed memory regions in the low 1MB range,
* request that the page be shown as all zeros.
* For disallowed memory regions in the low 1MB range, request
* that the page be shown as all zeros for read()/write(), fail
*/ if (pagenr < 256) return 2;* mmap()
@@ -885,14 +890,20 @@ int devmem_is_allowed(unsigned long pagenr) * restricted resource under CONFIG_STRICT_DEVMEM. */ if (iomem_is_exclusive(pagenr << PAGE_SHIFT)) {
/* Low 1MB bypasses iomem restrictions. */
if (pagenr < 256)
/*
* Low 1MB bypasses iomem restrictions unless the platform says
* the physical address is not suitable for direct access.
*/
if (pagenr < 256) {
if (!platform_allowed)
return 2; return 1;
}
return 0; }
- return 1;
- return platform_allowed;
} void free_init_pages(const char *what, unsigned long begin, unsigned long end)
On 11.04.25 г. 4:22 ч., Dan Williams wrote:
Changes since v1 [1]:
- Fix the fact that devmem_is_allowed() == 2 does not prevent mmap access (Kees)
- Rather than teach devmem_is_allowed() == 2 to map zero pages in the mmap case, just fail (Nikolay)
The story starts with Nikolay reporting an SEPT violation due to mismatched encrypted/non-encrypted mappings of the BIOS data space [2].
An initial suggestion to just make sure that the BIOS data space is mapped consistently [3] ran into another issue that TDX and SEV-SNP disagree about when that space can be mapped as encrypted.
Then, in response to a partial patch to allow SEV-SNP to block BIOS data space for other reasons [4], Dave asked why not just give up on /dev/mem access entirely in the confidential VM case [5].
Enter this series to:
1/ Close a subtle hole whereby /dev/mem that is supposed return zeros in lieu of access only enforces that for read()/write()
2/ Use that new closed hole to reliably disable all /dev/mem access for confidential x86 VMs
[5]: http://lore.kernel.org/fd683daa-d953-48ca-8c5d-6f4688ad442c@intel.com
Dan Williams (3): x86/devmem: Remove duplicate range_is_allowed() definition devmem: Block mmap access when read/write access is restricted x86/devmem: Restrict /dev/mem access for potentially unaccepted memory by default
arch/x86/Kconfig | 2 ++ arch/x86/include/asm/x86_init.h | 2 ++ arch/x86/kernel/x86_init.c | 6 ++++++ arch/x86/mm/init.c | 23 +++++++++++++++++------ arch/x86/mm/pat/memtype.c | 31 ++++--------------------------- drivers/char/mem.c | 18 ------------------ include/linux/io.h | 26 ++++++++++++++++++++++++++ 7 files changed, 57 insertions(+), 51 deletions(-)
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
Reviewed-by: Nikolay Borisov nik.borisov@suse.com
linux-stable-mirror@lists.linaro.org