From: Ira Weiny ira.weiny@intel.com
Changes from V2 [4] Rebased on tip-tree/core/entry From Thomas Gleixner Address bisectability Drop Patch: x86/entry: Move nmi entry/exit into common code From Greg KH Remove WARN_ON's From Dan Williams Add __must_check to pks_key_alloc() New patch: x86/pks: Add PKS defines and config options Split from Enable patch to build on through the series Fix compile errors
Changes from V1 Rebase to TIP master; resolve conflicts and test Clean up some kernel docs updates missed in V1 Add irqentry_state_t kernel doc for PKRS field Removed redundant irq_state->pkrs This is only needed when we add the global state and somehow ended up in this patch series. That will come back when we add the global functionality in. From Thomas Gleixner Update commit messages Add kernel doc for struct irqentry_state_t From Dave Hansen add flags to pks_key_alloc()
Changes from RFC V3[3] Rebase to TIP master Update test error output Standardize on 'irq_state' for state variables From Dave Hansen Update commit messages Add/clean up comments Add X86_FEATURE_PKS to disabled-features.h and remove some explicit CONFIG checks Move saved_pkrs member of thread_struct Remove superfluous preempt_disable() s/irq_save_pks/irq_save_set_pks/ Ensure PKRS is not seen in faults if not configured or not supported s/pks_mknoaccess/pks_mk_noaccess/ s/pks_mkread/pks_mk_readonly/ s/pks_mkrdwr/pks_mk_readwrite/ Change pks_key_alloc return to -EOPNOTSUPP when not supported From Peter Zijlstra Clean up Attribution Remove superfluous preempt_disable() Add union to differentiate exit_rcu/lockdep use in irqentry_state_t From Thomas Gleixner Add preliminary clean up patch and adjust series as needed
Introduce a new page protection mechanism for supervisor pages, Protection Key Supervisor (PKS).
2 use cases for PKS are being developed, trusted keys and PMEM. Trusted keys is a newer use case which is still being explored. PMEM was submitted as part of the RFC (v2) series[1]. However, since then it was found that some callers of kmap() require a global implementation of PKS. Specifically some users of kmap() expect mappings to be available to all kernel threads. While global use of PKS is rare it needs to be included for correctness. Unfortunately the kmap() updates required a large patch series to make the needed changes at the various kmap() call sites so that patch set has been split out. Because the global PKS feature is only required for that use case it will be deferred to that set as well.[2] This patch set is being submitted as a precursor to both of the use cases.
For an overview of the entire PKS ecosystem, a git tree including this series and 2 proposed use cases can be found here:
https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/ https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.weiny@intel.com/
PKS enables protections on 'domains' of supervisor pages to limit supervisor mode access to those pages beyond the normal paging protections. PKS works in a similar fashion to user space pkeys, PKU. As with PKU, supervisor pkeys are checked in addition to normal paging protections and Access or Writes can be disabled via a MSR update without TLB flushes when permissions change. Also like PKU, a page mapping is assigned to a domain by setting pkey bits in the page table entry for that mapping.
Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.
XSAVE is not supported for the PKRS MSR. Therefore the implementation saves/restores the MSR across context switches and during exceptions. Nested exceptions are supported by each exception getting a new PKS state.
For consistent behavior with current paging protections, pkey 0 is reserved and configured to allow full access via the pkey mechanism, thus preserving the default paging protections on mappings with the default pkey value of 0.
Other keys, (1-15) are allocated by an allocator which prepares us for key contention from day one. Kernel users should be prepared for the allocator to fail either because of key exhaustion or due to PKS not being supported on the arch and/or CPU instance.
The following are key attributes of PKS.
1) Fast switching of permissions 1a) Prevents access without page table manipulations 1b) No TLB flushes required 2) Works on a per thread basis
PKS is available with 4 and 5 level paging. Like PKRU it consumes 4 bits from the PTE to store the pkey within the entry.
[1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/ [2] https://lore.kernel.org/lkml/20201009195033.3208459-2-ira.weiny@intel.com/ [3] https://lore.kernel.org/lkml/20201009194258.3207172-1-ira.weiny@intel.com/ [4] https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
Fenghua Yu (2): x86/pks: Add PKS kernel API x86/pks: Enable Protection Keys Supervisor (PKS)
Ira Weiny (8): x86/pkeys: Create pkeys_common.h x86/fpu: Refactor arch_set_user_pkey_access() for PKS support x86/pks: Add PKS defines and Kconfig options x86/pks: Preserve the PKRS MSR on context switch x86/entry: Pass irqentry_state_t by reference x86/entry: Preserve PKRS MSR across exceptions x86/fault: Report the PKRS state on fault x86/pks: Add PKS test code
Documentation/core-api/protection-keys.rst | 103 ++- arch/x86/Kconfig | 1 + arch/x86/entry/common.c | 46 +- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/disabled-features.h | 8 +- arch/x86/include/asm/idtentry.h | 25 +- arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/asm/pgtable.h | 13 +- arch/x86/include/asm/pgtable_types.h | 12 + arch/x86/include/asm/pkeys.h | 15 + arch/x86/include/asm/pkeys_common.h | 40 ++ arch/x86/include/asm/processor.h | 18 +- arch/x86/include/uapi/asm/processor-flags.h | 2 + arch/x86/kernel/cpu/common.c | 15 + arch/x86/kernel/cpu/mce/core.c | 4 +- arch/x86/kernel/fpu/xstate.c | 22 +- arch/x86/kernel/kvm.c | 6 +- arch/x86/kernel/nmi.c | 4 +- arch/x86/kernel/process.c | 26 + arch/x86/kernel/traps.c | 21 +- arch/x86/mm/fault.c | 87 ++- arch/x86/mm/pkeys.c | 196 +++++- include/linux/entry-common.h | 31 +- include/linux/pgtable.h | 4 + include/linux/pkeys.h | 24 + kernel/entry/common.c | 44 +- lib/Kconfig.debug | 12 + lib/Makefile | 3 + lib/pks/Makefile | 3 + lib/pks/pks_test.c | 692 ++++++++++++++++++++ mm/Kconfig | 2 + tools/testing/selftests/x86/Makefile | 3 +- tools/testing/selftests/x86/test_pks.c | 66 ++ 33 files changed, 1410 insertions(+), 140 deletions(-) create mode 100644 arch/x86/include/asm/pkeys_common.h create mode 100644 lib/pks/Makefile create mode 100644 lib/pks/pks_test.c create mode 100644 tools/testing/selftests/x86/test_pks.c
From: Ira Weiny ira.weiny@intel.com
Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in similar fashions and can share common defines. Specifically PKS and PKU each have:
1. A single control register 2. The same number of keys 3. The same number of bits in the register per key 4. Access and Write disable in the same bit locations
That means that we can share all the macros that synthesize and manipulate register values between the two features. Normally, these macros would be put in asm/pkeys.h to be used internally and externally to the arch code. However, the defines are required in pgtable.h and inclusion of pkeys.h in that header creates complex dependencies which are best resolved in a separate header.
Share these defines by moving them into a new header, change their names to reflect the common use, and include the header where needed.
Reviewed-by: Dave Hansen dave.hansen@linux.intel.com Signed-off-by: Ira Weiny ira.weiny@intel.com
--- NOTE: The initialization of init_pkru_value cause checkpatch errors because of the space after the '(' in the macros. We leave this as is because it is more readable in this format. And it was existing code.
--- Changes from RFC V3 Per Dave Hansen Update commit message Add comment to PKR_AD_KEY macro --- arch/x86/include/asm/pgtable.h | 13 ++++++------- arch/x86/include/asm/pkeys.h | 2 ++ arch/x86/include/asm/pkeys_common.h | 15 +++++++++++++++ arch/x86/kernel/fpu/xstate.c | 8 ++++---- arch/x86/mm/pkeys.c | 14 ++++++-------- 5 files changed, 33 insertions(+), 19 deletions(-) create mode 100644 arch/x86/include/asm/pkeys_common.h
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index a02c67291cfc..bfbfb951fe65 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1360,9 +1360,7 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd) } #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
-#define PKRU_AD_BIT 0x1 -#define PKRU_WD_BIT 0x2 -#define PKRU_BITS_PER_PKEY 2 +#include <asm/pkeys_common.h>
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS extern u32 init_pkru_value; @@ -1372,18 +1370,19 @@ extern u32 init_pkru_value;
static inline bool __pkru_allows_read(u32 pkru, u16 pkey) { - int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY; - return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits)); + int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY; + + return !(pkru & (PKR_AD_BIT << pkru_pkey_bits)); }
static inline bool __pkru_allows_write(u32 pkru, u16 pkey) { - int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY; + int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY; /* * Access-disable disables writes too so we need to check * both bits here. */ - return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits)); + return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits)); }
static inline u16 pte_flags_pkey(unsigned long pte_flags) diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h index 2ff9b98812b7..f9feba80894b 100644 --- a/arch/x86/include/asm/pkeys.h +++ b/arch/x86/include/asm/pkeys.h @@ -2,6 +2,8 @@ #ifndef _ASM_X86_PKEYS_H #define _ASM_X86_PKEYS_H
+#include <asm/pkeys_common.h> + #define ARCH_DEFAULT_PKEY 0
/* diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h new file mode 100644 index 000000000000..737d916f476c --- /dev/null +++ b/arch/x86/include/asm/pkeys_common.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_PKEYS_INTERNAL_H +#define _ASM_X86_PKEYS_INTERNAL_H + +#define PKR_AD_BIT 0x1 +#define PKR_WD_BIT 0x2 +#define PKR_BITS_PER_PKEY 2 + +/* + * Generate an Access-Disable mask for the given pkey. Several of these can be + * OR'd together to generate pkey register values. + */ +#define PKR_AD_KEY(pkey) (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY)) + +#endif /*_ASM_X86_PKEYS_INTERNAL_H */ diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 5d8047441a0a..a99afc70cc0a 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -995,7 +995,7 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, unsigned long init_val) { u32 old_pkru; - int pkey_shift = (pkey * PKRU_BITS_PER_PKEY); + int pkey_shift = (pkey * PKR_BITS_PER_PKEY); u32 new_pkru_bits = 0;
/* @@ -1014,16 +1014,16 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
/* Set the bits we need in PKRU: */ if (init_val & PKEY_DISABLE_ACCESS) - new_pkru_bits |= PKRU_AD_BIT; + new_pkru_bits |= PKR_AD_BIT; if (init_val & PKEY_DISABLE_WRITE) - new_pkru_bits |= PKRU_WD_BIT; + new_pkru_bits |= PKR_WD_BIT;
/* Shift the bits in to the correct place in PKRU for pkey: */ new_pkru_bits <<= pkey_shift;
/* Get old PKRU and mask off any old bits in place: */ old_pkru = read_pkru(); - old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift); + old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
/* Write old part along with new part: */ write_pkru(old_pkru | new_pkru_bits); diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 8873ed1438a9..f5efb4007e74 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -111,19 +111,17 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey return vma_pkey(vma); }
-#define PKRU_AD_KEY(pkey) (PKRU_AD_BIT << ((pkey) * PKRU_BITS_PER_PKEY)) - /* * Make the default PKRU value (at execve() time) as restrictive * as possible. This ensures that any threads clone()'d early * in the process's lifetime will not accidentally get access * to data which is pkey-protected later on. */ -u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) | - PKRU_AD_KEY( 4) | PKRU_AD_KEY( 5) | PKRU_AD_KEY( 6) | - PKRU_AD_KEY( 7) | PKRU_AD_KEY( 8) | PKRU_AD_KEY( 9) | - PKRU_AD_KEY(10) | PKRU_AD_KEY(11) | PKRU_AD_KEY(12) | - PKRU_AD_KEY(13) | PKRU_AD_KEY(14) | PKRU_AD_KEY(15); +u32 init_pkru_value = PKR_AD_KEY( 1) | PKR_AD_KEY( 2) | PKR_AD_KEY( 3) | + PKR_AD_KEY( 4) | PKR_AD_KEY( 5) | PKR_AD_KEY( 6) | + PKR_AD_KEY( 7) | PKR_AD_KEY( 8) | PKR_AD_KEY( 9) | + PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | + PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15);
/* * Called from the FPU code when creating a fresh set of FPU @@ -173,7 +171,7 @@ static ssize_t init_pkru_write_file(struct file *file, * up immediately if someone attempts to disable access * or writes to pkey 0. */ - if (new_init_pkru & (PKRU_AD_BIT|PKRU_WD_BIT)) + if (new_init_pkru & (PKR_AD_BIT|PKR_WD_BIT)) return -EINVAL;
WRITE_ONCE(init_pkru_value, new_init_pkru);
From: Ira Weiny ira.weiny@intel.com
Define a helper, update_pkey_val(), which will be used to support both Protection Key User (PKU) and the new Protection Key for Supervisor (PKS) in subsequent patches.
Co-developed-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Ira Weiny ira.weiny@intel.com
--- Changes from RFC V3: Per Dave Hansen Update and add comments per Dave's review Per Peter Correct attribution --- arch/x86/include/asm/pkeys.h | 2 ++ arch/x86/kernel/fpu/xstate.c | 22 ++++------------------ arch/x86/mm/pkeys.c | 23 +++++++++++++++++++++++ 3 files changed, 29 insertions(+), 18 deletions(-)
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h index f9feba80894b..4526245b03e5 100644 --- a/arch/x86/include/asm/pkeys.h +++ b/arch/x86/include/asm/pkeys.h @@ -136,4 +136,6 @@ static inline int vma_pkey(struct vm_area_struct *vma) return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT; }
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags); + #endif /*_ASM_X86_PKEYS_H */ diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index a99afc70cc0a..a3bca3211eba 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -994,9 +994,7 @@ const void *get_xsave_field_ptr(int xfeature_nr) int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, unsigned long init_val) { - u32 old_pkru; - int pkey_shift = (pkey * PKR_BITS_PER_PKEY); - u32 new_pkru_bits = 0; + u32 pkru;
/* * This check implies XSAVE support. OSPKE only gets @@ -1012,21 +1010,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, */ WARN_ON_ONCE(pkey >= arch_max_pkey());
- /* Set the bits we need in PKRU: */ - if (init_val & PKEY_DISABLE_ACCESS) - new_pkru_bits |= PKR_AD_BIT; - if (init_val & PKEY_DISABLE_WRITE) - new_pkru_bits |= PKR_WD_BIT; - - /* Shift the bits in to the correct place in PKRU for pkey: */ - new_pkru_bits <<= pkey_shift; - - /* Get old PKRU and mask off any old bits in place: */ - old_pkru = read_pkru(); - old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift); - - /* Write old part along with new part: */ - write_pkru(old_pkru | new_pkru_bits); + pkru = read_pkru(); + pkru = update_pkey_val(pkru, pkey, init_val); + write_pkru(pkru);
return 0; } diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index f5efb4007e74..d1dfe743e79f 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -208,3 +208,26 @@ static __init int setup_init_pkru(char *opt) return 1; } __setup("init_pkru=", setup_init_pkru); + +/* + * Replace disable bits for @pkey with values from @flags + * + * Kernel users use the same flags as user space: + * PKEY_DISABLE_ACCESS + * PKEY_DISABLE_WRITE + */ +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags) +{ + int pkey_shift = pkey * PKR_BITS_PER_PKEY; + + /* Mask out old bit values */ + pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift); + + /* Or in new values */ + if (flags & PKEY_DISABLE_ACCESS) + pk_reg |= PKR_AD_BIT << pkey_shift; + if (flags & PKEY_DISABLE_WRITE) + pk_reg |= PKR_WD_BIT << pkey_shift; + + return pk_reg; +}
From: Ira Weiny ira.weiny@intel.com
Protection Keys for Supervisor pages (PKS) enables fast, hardware thread specific, manipulation of permission restrictions on supervisor page mappings. It uses the same mechanism of Protection Keys as those on User mappings but applies that mechanism to supervisor mappings using a supervisor specific MSR.
Kernel users can thus defines 'domains' of page mappings which have an extra level of protection beyond those specified in the supervisor page table entries.
Add the Kconfig ARCH_HAS_SUPERVISOR_PKEYS to indicate to core code that an architecture support pkeys. Select it for x86.
Define the CPU features bit needed but leave DISABLE_PKS set to disable the feature until the implementation can be completed and enabled in a final patch.
Co-developed-by: Fenghua Yu fenghua.yu@intel.com Signed-off-by: Fenghua Yu fenghua.yu@intel.com Signed-off-by: Ira Weiny ira.weiny@intel.com
--- Changes from V2 New patch for V3: Split this off from the enable patch to be able to create cleaner bisectability --- arch/x86/Kconfig | 1 + arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/disabled-features.h | 4 +++- arch/x86/include/uapi/asm/processor-flags.h | 2 ++ mm/Kconfig | 2 ++ 5 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f6946b81f74a..78c4c749c6a9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1876,6 +1876,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD) select ARCH_USES_HIGH_VMA_FLAGS select ARCH_HAS_PKEYS + select ARCH_HAS_SUPERVISOR_PKEYS help Memory Protection Keys provides a mechanism for enforcing page-based protections, but without requiring modification of the diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index dad350d42ecf..4deb580324e8 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -356,6 +356,7 @@ #define X86_FEATURE_MOVDIRI (16*32+27) /* MOVDIRI instruction */ #define X86_FEATURE_MOVDIR64B (16*32+28) /* MOVDIR64B instruction */ #define X86_FEATURE_ENQCMD (16*32+29) /* ENQCMD and ENQCMDS instructions */ +#define X86_FEATURE_PKS (16*32+31) /* Protection Keys for Supervisor pages */
/* AMD-defined CPU features, CPUID level 0x80000007 (EBX), word 17 */ #define X86_FEATURE_OVERFLOW_RECOV (17*32+ 0) /* MCA overflow recovery support */ diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h index 5861d34f9771..164587177152 100644 --- a/arch/x86/include/asm/disabled-features.h +++ b/arch/x86/include/asm/disabled-features.h @@ -44,6 +44,8 @@ # define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31)) #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+#define DISABLE_PKS (1<<(X86_FEATURE_PKS & 31)) + #ifdef CONFIG_X86_5LEVEL # define DISABLE_LA57 0 #else @@ -82,7 +84,7 @@ #define DISABLED_MASK14 0 #define DISABLED_MASK15 0 #define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \ - DISABLE_ENQCMD) + DISABLE_ENQCMD|DISABLE_PKS) #define DISABLED_MASK17 0 #define DISABLED_MASK18 0 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19) diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h index bcba3c643e63..191c574b2390 100644 --- a/arch/x86/include/uapi/asm/processor-flags.h +++ b/arch/x86/include/uapi/asm/processor-flags.h @@ -130,6 +130,8 @@ #define X86_CR4_SMAP _BITUL(X86_CR4_SMAP_BIT) #define X86_CR4_PKE_BIT 22 /* enable Protection Keys support */ #define X86_CR4_PKE _BITUL(X86_CR4_PKE_BIT) +#define X86_CR4_PKS_BIT 24 /* enable Protection Keys for Supervisor */ +#define X86_CR4_PKS _BITUL(X86_CR4_PKS_BIT)
/* * x86-64 Task Priority Register, CR8 diff --git a/mm/Kconfig b/mm/Kconfig index d42423f884a7..fc9ce7f65683 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -826,6 +826,8 @@ config ARCH_USES_HIGH_VMA_FLAGS bool config ARCH_HAS_PKEYS bool +config ARCH_HAS_SUPERVISOR_PKEYS + bool
config PERCPU_STATS bool "Collect percpu memory statistics"
From: Ira Weiny ira.weiny@intel.com
The PKRS MSR is defined as a per-logical-processor register. This isolates memory access by logical CPU. Unfortunately, the MSR is not managed by XSAVE. Therefore, tasks must save/restore the MSR value on context switch.
Define a saved PKRS value in the task struct, as well as a cached per-logical-processor MSR value which mirrors the MSR value of the current CPU. Initialize all tasks with the default MSR value. Then, on schedule in, check the saved task MSR vs the per-cpu value. If different proceed to write the MSR. If not avoid the overhead of the MSR write and continue.
Follow on patches will update the saved PKRS as well as the MSR if needed.
Finally it should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing but still maintains ordering properties similar to WRPKRU. The current SDM section on PKRS needs updating but should be the same as that of WRPKRU. So to quote from the WRPKRU text:
WRPKRU will never execute transiently. Memory accesses affected by PKRU register will not execute (even transiently) until all prior executions of WRPKRU have completed execution and updated the PKRU register.
Co-developed-by: Fenghua Yu fenghua.yu@intel.com Signed-off-by: Fenghua Yu fenghua.yu@intel.com Co-developed-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Ira Weiny ira.weiny@intel.com
--- Changes from V2 Adjust for PKS enable being final patch.
Changes from V1 Rebase to latest tip/master Resolve conflicts with INIT_THREAD changes
Changes since RFC V3 Per Dave Hansen Update commit message move saved_pkrs to be in a nicer place Per Peter Zijlstra Add Comment from Peter Clean up white space Update authorship --- arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/asm/pkeys_common.h | 20 +++++++++++++++++++ arch/x86/include/asm/processor.h | 18 ++++++++++++++++- arch/x86/kernel/process.c | 26 ++++++++++++++++++++++++ arch/x86/mm/pkeys.c | 31 +++++++++++++++++++++++++++++ 5 files changed, 95 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 972a34d93505..ddb125e44408 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -754,6 +754,7 @@
#define MSR_IA32_TSC_DEADLINE 0x000006E0
+#define MSR_IA32_PKRS 0x000006E1
#define MSR_TSX_FORCE_ABORT 0x0000010F
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h index 737d916f476c..801a75615209 100644 --- a/arch/x86/include/asm/pkeys_common.h +++ b/arch/x86/include/asm/pkeys_common.h @@ -12,4 +12,24 @@ */ #define PKR_AD_KEY(pkey) (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+/* + * Define a default PKRS value for each task. + * + * Key 0 has no restriction. All other keys are set to the most restrictive + * value which is access disabled (AD=1). + * + * NOTE: This needs to be a macro to be used as part of the INIT_THREAD macro. + */ +#define INIT_PKRS_VALUE (PKR_AD_KEY(1) | PKR_AD_KEY(2) | PKR_AD_KEY(3) | \ + PKR_AD_KEY(4) | PKR_AD_KEY(5) | PKR_AD_KEY(6) | \ + PKR_AD_KEY(7) | PKR_AD_KEY(8) | PKR_AD_KEY(9) | \ + PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \ + PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15)) + +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +void write_pkrs(u32 new_pkrs); +#else +static inline void write_pkrs(u32 new_pkrs) { } +#endif + #endif /*_ASM_X86_PKEYS_INTERNAL_H */ diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 82a08b585818..e9c65368b0b2 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -18,6 +18,7 @@ struct vm86; #include <asm/cpufeatures.h> #include <asm/page.h> #include <asm/pgtable_types.h> +#include <asm/pkeys_common.h> #include <asm/percpu.h> #include <asm/msr.h> #include <asm/desc_defs.h> @@ -520,6 +521,12 @@ struct thread_struct { unsigned long cr2; unsigned long trap_nr; unsigned long error_code; + +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS + /* Saved Protection key register for supervisor mappings */ + u32 saved_pkrs; +#endif + #ifdef CONFIG_VM86 /* Virtual 86 mode info */ struct vm86 *vm86; @@ -785,7 +792,16 @@ static inline void spin_lock_prefetch(const void *x) #define KSTK_ESP(task) (task_pt_regs(task)->sp)
#else -#define INIT_THREAD { } + +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +#define INIT_THREAD_PKRS .saved_pkrs = INIT_PKRS_VALUE +#else +#define INIT_THREAD_PKRS 0 +#endif + +#define INIT_THREAD { \ + INIT_THREAD_PKRS, \ +}
extern unsigned long KSTK_ESP(struct task_struct *task);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index ba4593a913fa..aa2ae5292ff1 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -43,6 +43,7 @@ #include <asm/io_bitmap.h> #include <asm/proto.h> #include <asm/frame.h> +#include <asm/pkeys_common.h>
#include "process.h"
@@ -187,6 +188,27 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg, return ret; }
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +DECLARE_PER_CPU(u32, pkrs_cache); +static inline void pks_init_task(struct task_struct *tsk) +{ + /* New tasks get the most restrictive PKRS value */ + tsk->thread.saved_pkrs = INIT_PKRS_VALUE; +} +static inline void pks_sched_in(void) +{ + /* + * PKRS is only temporarily changed during specific code paths. Only a + * preemption during these windows away from the default value would + * require updating the MSR. write_pkrs() handles this optimization. + */ + write_pkrs(current->thread.saved_pkrs); +} +#else +static inline void pks_init_task(struct task_struct *tsk) { } +static inline void pks_sched_in(void) { } +#endif + void flush_thread(void) { struct task_struct *tsk = current; @@ -195,6 +217,8 @@ void flush_thread(void) memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
fpu__clear_all(&tsk->thread.fpu); + + pks_init_task(tsk); }
void disable_TSC(void) @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
if ((tifp ^ tifn) & _TIF_SLD) switch_to_sld(tifn); + + pks_sched_in(); }
/* diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index d1dfe743e79f..76a62419c446 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -231,3 +231,34 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
return pk_reg; } + +DEFINE_PER_CPU(u32, pkrs_cache); + +/** + * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can + * be checked quickly. + * + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not + * serializing but still maintains ordering properties similar to WRPKRU. + * The current SDM section on PKRS needs updating but should be the same as + * that of WRPKRU. So to quote from the WRPKRU text: + * + * WRPKRU will never execute transiently. Memory accesses + * affected by PKRU register will not execute (even transiently) + * until all prior executions of WRPKRU have completed execution + * and updated the PKRU register. + */ +void write_pkrs(u32 new_pkrs) +{ + u32 *pkrs; + + if (!static_cpu_has(X86_FEATURE_PKS)) + return; + + pkrs = get_cpu_ptr(&pkrs_cache); + if (*pkrs != new_pkrs) { + *pkrs = new_pkrs; + wrmsrl(MSR_IA32_PKRS, new_pkrs); + } + put_cpu_ptr(pkrs); +}
From: Ira Weiny ira.weiny@intel.com
Currently struct irqentry_state_t only contains a single bool value which makes passing it by value is reasonable. However, future patches propose to add information to this struct, for example the PKRS register/thread state.
Adding information to irqentry_state_t makes passing by value less efficient. Therefore, change the entry/exit calls to pass irq_state by reference.
While at it, make the code easier to follow by changing all the usage sites to consistently use the variable name 'irq_state'.
Signed-off-by: Ira Weiny ira.weiny@intel.com
--- Changes from V1 From Thomas: Update commit message Further clean up Kernel doc and comments Missed some 'return' comments which are no longer valid
Changes from RFC V3 Clean up @irq_state comments Standardize on 'irq_state' for the state variable name Refactor based on new patch from Thomas Gleixner Also addresses Peter Zijlstra's comment --- arch/x86/entry/common.c | 8 ++++---- arch/x86/include/asm/idtentry.h | 25 ++++++++++++++---------- arch/x86/kernel/cpu/mce/core.c | 4 ++-- arch/x86/kernel/kvm.c | 6 +++--- arch/x86/kernel/nmi.c | 4 ++-- arch/x86/kernel/traps.c | 21 ++++++++++++-------- arch/x86/mm/fault.c | 6 +++--- include/linux/entry-common.h | 18 +++++++++-------- kernel/entry/common.c | 34 +++++++++++++-------------------- 9 files changed, 65 insertions(+), 61 deletions(-)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 18d8f17f755c..87dea56a15d2 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -259,9 +259,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs) { struct pt_regs *old_regs; bool inhcall; - irqentry_state_t state; + irqentry_state_t irq_state;
- state = irqentry_enter(regs); + irqentry_enter(regs, &irq_state); old_regs = set_irq_regs(regs);
instrumentation_begin(); @@ -271,13 +271,13 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs) set_irq_regs(old_regs);
inhcall = get_and_clear_inhcall(); - if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) { + if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) { instrumentation_begin(); irqentry_exit_cond_resched(); instrumentation_end(); restore_inhcall(inhcall); } else { - irqentry_exit(regs, state); + irqentry_exit(regs, &irq_state); } } #endif /* CONFIG_XEN_PV */ diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 247a60a47331..282d2413b6a1 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -49,12 +49,13 @@ static __always_inline void __##func(struct pt_regs *regs); \ \ __visible noinstr void func(struct pt_regs *regs) \ { \ - irqentry_state_t state = irqentry_enter(regs); \ + irqentry_state_t irq_state; \ \ + irqentry_enter(regs, &irq_state); \ instrumentation_begin(); \ __##func (regs); \ instrumentation_end(); \ - irqentry_exit(regs, state); \ + irqentry_exit(regs, &irq_state); \ } \ \ static __always_inline void __##func(struct pt_regs *regs) @@ -96,12 +97,13 @@ static __always_inline void __##func(struct pt_regs *regs, \ __visible noinstr void func(struct pt_regs *regs, \ unsigned long error_code) \ { \ - irqentry_state_t state = irqentry_enter(regs); \ + irqentry_state_t irq_state; \ \ + irqentry_enter(regs, &irq_state); \ instrumentation_begin(); \ __##func (regs, error_code); \ instrumentation_end(); \ - irqentry_exit(regs, state); \ + irqentry_exit(regs, &irq_state); \ } \ \ static __always_inline void __##func(struct pt_regs *regs, \ @@ -192,15 +194,16 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector); \ __visible noinstr void func(struct pt_regs *regs, \ unsigned long error_code) \ { \ - irqentry_state_t state = irqentry_enter(regs); \ + irqentry_state_t irq_state; \ \ + irqentry_enter(regs, &irq_state); \ instrumentation_begin(); \ irq_enter_rcu(); \ kvm_set_cpu_l1tf_flush_l1d(); \ __##func (regs, (u8)error_code); \ irq_exit_rcu(); \ instrumentation_end(); \ - irqentry_exit(regs, state); \ + irqentry_exit(regs, &irq_state); \ } \ \ static __always_inline void __##func(struct pt_regs *regs, u8 vector) @@ -234,15 +237,16 @@ static void __##func(struct pt_regs *regs); \ \ __visible noinstr void func(struct pt_regs *regs) \ { \ - irqentry_state_t state = irqentry_enter(regs); \ + irqentry_state_t irq_state; \ \ + irqentry_enter(regs, &irq_state); \ instrumentation_begin(); \ irq_enter_rcu(); \ kvm_set_cpu_l1tf_flush_l1d(); \ run_sysvec_on_irqstack_cond(__##func, regs); \ irq_exit_rcu(); \ instrumentation_end(); \ - irqentry_exit(regs, state); \ + irqentry_exit(regs, &irq_state); \ } \ \ static noinline void __##func(struct pt_regs *regs) @@ -263,15 +267,16 @@ static __always_inline void __##func(struct pt_regs *regs); \ \ __visible noinstr void func(struct pt_regs *regs) \ { \ - irqentry_state_t state = irqentry_enter(regs); \ + irqentry_state_t irq_state; \ \ + irqentry_enter(regs, &irq_state); \ instrumentation_begin(); \ __irq_enter_raw(); \ kvm_set_cpu_l1tf_flush_l1d(); \ __##func (regs); \ __irq_exit_raw(); \ instrumentation_end(); \ - irqentry_exit(regs, state); \ + irqentry_exit(regs, &irq_state); \ } \ \ static __always_inline void __##func(struct pt_regs *regs) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index f5c860b1a50b..6ed2fa2ea321 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1995,7 +1995,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs) mce_check_crashing_cpu()) return;
- irq_state = irqentry_nmi_enter(regs); + irqentry_nmi_enter(regs, &irq_state); /* * The call targets are marked noinstr, but objtool can't figure * that out because it's an indirect call. Annotate it. @@ -2006,7 +2006,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs) if (regs->flags & X86_EFLAGS_IF) trace_hardirqs_on_prepare(); instrumentation_end(); - irqentry_nmi_exit(regs, irq_state); + irqentry_nmi_exit(regs, &irq_state); }
static __always_inline void exc_machine_check_user(struct pt_regs *regs) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 7f57ede3cb8e..ed7427c6e74d 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -238,12 +238,12 @@ EXPORT_SYMBOL_GPL(kvm_read_and_reset_apf_flags); noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token) { u32 flags = kvm_read_and_reset_apf_flags(); - irqentry_state_t state; + irqentry_state_t irq_state;
if (!flags) return false;
- state = irqentry_enter(regs); + irqentry_enter(regs, &irq_state); instrumentation_begin();
/* @@ -264,7 +264,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token) }
instrumentation_end(); - irqentry_exit(regs, state); + irqentry_exit(regs, &irq_state); return true; }
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index bf250a339655..1fd7780e99dd 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
this_cpu_write(nmi_dr7, local_db_save());
- irq_state = irqentry_nmi_enter(regs); + irqentry_nmi_enter(regs, &irq_state);
inc_irq_stat(__nmi_count);
if (!ignore_nmis) default_do_nmi(regs);
- irqentry_nmi_exit(regs, irq_state); + irqentry_nmi_exit(regs, &irq_state);
local_db_restore(this_cpu_read(nmi_dr7));
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index e1b78829d909..8481cc373794 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -245,7 +245,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
DEFINE_IDTENTRY_RAW(exc_invalid_op) { - irqentry_state_t state; + irqentry_state_t irq_state;
/* * We use UD2 as a short encoding for 'CALL __WARN', as such @@ -255,11 +255,11 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op) if (!user_mode(regs) && handle_bug(regs)) return;
- state = irqentry_enter(regs); + irqentry_enter(regs, &irq_state); instrumentation_begin(); handle_invalid_op(regs); instrumentation_end(); - irqentry_exit(regs, state); + irqentry_exit(regs, &irq_state); }
DEFINE_IDTENTRY(exc_coproc_segment_overrun) @@ -343,6 +343,7 @@ __visible void __noreturn handle_stack_overflow(const char *message, */ DEFINE_IDTENTRY_DF(exc_double_fault) { + irqentry_state_t irq_state; static const char str[] = "double fault"; struct task_struct *tsk = current;
@@ -405,7 +406,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault) } #endif
- irqentry_nmi_enter(regs); + irqentry_nmi_enter(regs, &irq_state); instrumentation_begin(); notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
@@ -651,13 +652,15 @@ DEFINE_IDTENTRY_RAW(exc_int3) instrumentation_end(); irqentry_exit_to_user_mode(regs); } else { - irqentry_state_t irq_state = irqentry_nmi_enter(regs); + irqentry_state_t irq_state; + + irqentry_nmi_enter(regs, &irq_state);
instrumentation_begin(); if (!do_int3(regs)) die("int3", regs, 0); instrumentation_end(); - irqentry_nmi_exit(regs, irq_state); + irqentry_nmi_exit(regs, &irq_state); } }
@@ -852,7 +855,9 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs, * includes the entry stack is excluded for everything. */ unsigned long dr7 = local_db_save(); - irqentry_state_t irq_state = irqentry_nmi_enter(regs); + irqentry_state_t irq_state; + + irqentry_nmi_enter(regs, &irq_state); instrumentation_begin();
/* @@ -909,7 +914,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs, regs->flags &= ~X86_EFLAGS_TF; out: instrumentation_end(); - irqentry_nmi_exit(regs, irq_state); + irqentry_nmi_exit(regs, &irq_state);
local_db_restore(dr7); } diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 82bf37a5c9ec..8d20c4c13abf 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1441,7 +1441,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code, DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) { unsigned long address = read_cr2(); - irqentry_state_t state; + irqentry_state_t irq_state;
prefetchw(¤t->mm->mmap_lock);
@@ -1479,11 +1479,11 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) * code reenabled RCU to avoid subsequent wreckage which helps * debugability. */ - state = irqentry_enter(regs); + irqentry_enter(regs, &irq_state);
instrumentation_begin(); handle_page_fault(regs, error_code, address); instrumentation_end();
- irqentry_exit(regs, state); + irqentry_exit(regs, &irq_state); } diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 66938121c4b1..1193a70bcf1b 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -372,6 +372,8 @@ typedef struct irqentry_state { /** * irqentry_enter - Handle state tracking on ordinary interrupt entries * @regs: Pointer to pt_regs of interrupted context + * @irq_state: Pointer to an opaque object to store state information; to be + * passed back to irqentry_exit() * * Invokes: * - lockdep irqflag state tracking as low level ASM entry disabled @@ -397,10 +399,8 @@ typedef struct irqentry_state { * For user mode entries irqentry_enter_from_user_mode() is invoked to * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit * would not be possible. - * - * Returns: An opaque object that must be passed to idtentry_exit() */ -irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs); +void noinstr irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state);
/** * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt @@ -412,7 +412,7 @@ void irqentry_exit_cond_resched(void); /** * irqentry_exit - Handle return from exception that used irqentry_enter() * @regs: Pointer to pt_regs (exception entry regs) - * @state: Return value from matching call to irqentry_enter() + * @irq_state: Pointer to state information passed to irqentry_enter() * * Depending on the return target (kernel/user) this runs the necessary * preemption and work checks if possible and required and returns to @@ -423,25 +423,27 @@ void irqentry_exit_cond_resched(void); * * Counterpart to irqentry_enter(). */ -void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state); +void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state);
/** * irqentry_nmi_enter - Handle NMI entry * @regs: Pointer to currents pt_regs + * @irq_state: Pointer to an opaque object to store state information; to be + * passed back to irqentry_nmi_exit() * * Similar to irqentry_enter() but taking care of the NMI constraints. */ -irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs); +void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state);
/** * irqentry_nmi_exit - Handle return from NMI handling * @regs: Pointer to pt_regs (NMI entry regs) - * @irq_state: Return value from matching call to irqentry_nmi_enter() + * @irq_state: Pointer to state information passed to irqentry_nmi_enter() * * Last action before returning to the low level assmebly code. * * Counterpart to irqentry_nmi_enter(). */ -void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state); +void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state);
#endif diff --git a/kernel/entry/common.c b/kernel/entry/common.c index fa17baadf63e..5ed9d45fb41b 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -289,15 +289,13 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) exit_to_user_mode(); }
-noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) +noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state) { - irqentry_state_t ret = { - .exit_rcu = false, - }; + irq_state->exit_rcu = false;
if (user_mode(regs)) { irqentry_enter_from_user_mode(regs); - return ret; + return; }
/* @@ -335,8 +333,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) trace_hardirqs_off_finish(); instrumentation_end();
- ret.exit_rcu = true; - return ret; + irq_state->exit_rcu = true; + return; }
/* @@ -350,8 +348,6 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) rcu_irq_enter_check_tick(); trace_hardirqs_off_finish(); instrumentation_end(); - - return ret; }
void irqentry_exit_cond_resched(void) @@ -366,7 +362,7 @@ void irqentry_exit_cond_resched(void) } }
-noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) +noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state) { lockdep_assert_irqs_disabled();
@@ -379,7 +375,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) * carefully and needs the same ordering of lockdep/tracing * and RCU as the return to user mode path. */ - if (state.exit_rcu) { + if (irq_state->exit_rcu) { instrumentation_begin(); /* Tell the tracer that IRET will enable interrupts */ trace_hardirqs_on_prepare(); @@ -401,16 +397,14 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) * IRQ flags state is correct already. Just tell RCU if it * was not watching on entry. */ - if (state.exit_rcu) + if (irq_state->exit_rcu) rcu_irq_exit(); } }
-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs) +void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state) { - irqentry_state_t irq_state; - - irq_state.lockdep = lockdep_hardirqs_enabled(); + irq_state->lockdep = lockdep_hardirqs_enabled();
__nmi_enter(); lockdep_hardirqs_off(CALLER_ADDR0); @@ -421,15 +415,13 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs) trace_hardirqs_off_finish(); ftrace_nmi_enter(); instrumentation_end(); - - return irq_state; }
-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state) +void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state) { instrumentation_begin(); ftrace_nmi_exit(); - if (irq_state.lockdep) { + if (irq_state->lockdep) { trace_hardirqs_on_prepare(); lockdep_hardirqs_on_prepare(CALLER_ADDR0); } @@ -437,7 +429,7 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
rcu_nmi_exit(); lockdep_hardirq_exit(); - if (irq_state.lockdep) + if (irq_state->lockdep) lockdep_hardirqs_on(CALLER_ADDR0); __nmi_exit(); }
From: Ira Weiny ira.weiny@intel.com
The PKRS MSR is not managed by XSAVE. It is preserved through a context switch but this support leaves exception handling code open to memory accesses during exceptions.
2 possible places for preserving this state were considered, irqentry_state_t or pt_regs.[1] pt_regs was much more complicated and was potentially fraught with unintended consequences.[2] irqentry_state_t was already an object being used in the exception handling and is straightforward. It is also easy for any number of nested states to be tracked and eventually can be enhanced to store the reference counting required to support PKS through kmap reentry
Preserve the current task's PKRS values in irqentry_state_t on exception entry and restoring them on exception exit.
Each nested exception is further saved allowing for any number of levels of exception handling.
Peter and Thomas both suggested parts of the patch, IDT and NMI respectively.
[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M3... [2] https://lore.kernel.org/lkml/874kpxx4jf.fsf@nanos.tec.linutronix.de/#t
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Andy Lutomirski luto@kernel.org Suggested-by: Peter Zijlstra peterz@infradead.org Suggested-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ira Weiny ira.weiny@intel.com
--- Changes from V1 remove redundant irq_state->pkrs This value is only needed for the global tracking. So it should be included in that patch and not in this one.
Changes from RFC V3 Standardize on 'irq_state' variable name Per Dave Hansen irq_save_pkrs() -> irq_save_set_pkrs() Rebased based on clean up patch by Thomas Gleixner This includes moving irq_[save_set|restore]_pkrs() to the core as well. --- arch/x86/entry/common.c | 38 +++++++++++++++++++++++++++++ arch/x86/include/asm/pkeys_common.h | 5 ++-- arch/x86/mm/pkeys.c | 2 +- include/linux/entry-common.h | 13 ++++++++++ kernel/entry/common.c | 14 +++++++++-- 5 files changed, 67 insertions(+), 5 deletions(-)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 87dea56a15d2..1b6a419a6fac 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -19,6 +19,7 @@ #include <linux/nospec.h> #include <linux/syscalls.h> #include <linux/uaccess.h> +#include <linux/pkeys.h>
#ifdef CONFIG_XEN_PV #include <xen/xen-ops.h> @@ -209,6 +210,41 @@ SYSCALL_DEFINE0(ni_syscall) return -ENOSYS; }
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +/* + * PKRS is a per-logical-processor MSR which overlays additional protection for + * pages which have been mapped with a protection key. + * + * The register is not maintained with XSAVE so we have to maintain the MSR + * value in software during context switch and exception handling. + * + * Context switches save the MSR in the task struct thus taking that value to + * other processors if necessary. + * + * To protect against exceptions having access to this memory we save the + * current running value and set the PKRS value for the duration of the + * exception. Thus preventing exception handlers from having the elevated + * access of the interrupted task. + */ +noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val) +{ + if (!cpu_feature_enabled(X86_FEATURE_PKS)) + return; + + irq_state->thread_pkrs = current->thread.saved_pkrs; + write_pkrs(INIT_PKRS_VALUE); +} + +noinstr void irq_restore_pkrs(irqentry_state_t *irq_state) +{ + if (!cpu_feature_enabled(X86_FEATURE_PKS)) + return; + + write_pkrs(irq_state->thread_pkrs); + current->thread.saved_pkrs = irq_state->thread_pkrs; +} +#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */ + #ifdef CONFIG_XEN_PV #ifndef CONFIG_PREEMPTION /* @@ -272,6 +308,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
inhcall = get_and_clear_inhcall(); if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) { + /* Normally called by irqentry_exit, we must restore pkrs here */ + irq_restore_pkrs(&irq_state); instrumentation_begin(); irqentry_exit_cond_resched(); instrumentation_end(); diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h index 801a75615209..11a95e6efd2d 100644 --- a/arch/x86/include/asm/pkeys_common.h +++ b/arch/x86/include/asm/pkeys_common.h @@ -27,9 +27,10 @@ PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS -void write_pkrs(u32 new_pkrs); +DECLARE_PER_CPU(u32, pkrs_cache); +noinstr void write_pkrs(u32 new_pkrs); #else -static inline void write_pkrs(u32 new_pkrs) { } +static __always_inline void write_pkrs(u32 new_pkrs) { } #endif
#endif /*_ASM_X86_PKEYS_INTERNAL_H */ diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 76a62419c446..6892d4524868 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -248,7 +248,7 @@ DEFINE_PER_CPU(u32, pkrs_cache); * until all prior executions of WRPKRU have completed execution * and updated the PKRU register. */ -void write_pkrs(u32 new_pkrs) +noinstr void write_pkrs(u32 new_pkrs) { u32 *pkrs;
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 1193a70bcf1b..21eae007061d 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -348,6 +348,8 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs); #ifndef irqentry_state /** * struct irqentry_state - Opaque object for exception state storage + * @thread_pkrs: Thread Supervisor Pkey value to be restored when exception is + * complete. * @exit_rcu: Used exclusively in the irqentry_*() calls; signals whether the * exit path has to invoke rcu_irq_exit(). * @lockdep: Used exclusively in the irqentry_nmi_*() calls; ensures that @@ -362,6 +364,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs); * the maintenance of the irqentry_*() functions. */ typedef struct irqentry_state { +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS + u32 thread_pkrs; +#endif union { bool exit_rcu; bool lockdep; @@ -369,6 +374,14 @@ typedef struct irqentry_state { } irqentry_state_t; #endif
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val); +noinstr void irq_restore_pkrs(irqentry_state_t *irq_state); +#else +static __always_inline void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val) { } +static __always_inline void irq_restore_pkrs(irqentry_state_t *irq_state) { } +#endif + /** * irqentry_enter - Handle state tracking on ordinary interrupt entries * @regs: Pointer to pt_regs of interrupted context diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 5ed9d45fb41b..3590f5bb5870 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -334,7 +334,7 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state) instrumentation_end();
irq_state->exit_rcu = true; - return; + goto done; }
/* @@ -348,6 +348,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state) rcu_irq_enter_check_tick(); trace_hardirqs_off_finish(); instrumentation_end(); + +done: + irq_save_set_pkrs(irq_state, INIT_PKRS_VALUE); }
void irqentry_exit_cond_resched(void) @@ -369,7 +372,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state) /* Check whether this returns to user mode */ if (user_mode(regs)) { irqentry_exit_to_user_mode(regs); - } else if (!regs_irqs_disabled(regs)) { + return; + } + + irq_restore_pkrs(irq_state); + + if (!regs_irqs_disabled(regs)) { /* * If RCU was not watching on entry this needs to be done * carefully and needs the same ordering of lockdep/tracing @@ -415,10 +423,12 @@ void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_stat trace_hardirqs_off_finish(); ftrace_nmi_enter(); instrumentation_end(); + irq_save_set_pkrs(irq_state, INIT_PKRS_VALUE); }
void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state) { + irq_restore_pkrs(irq_state); instrumentation_begin(); ftrace_nmi_exit(); if (irq_state->lockdep) {
From: Ira Weiny ira.weiny@intel.com
When only user space pkeys are enabled faulting within the kernel was an unexpected condition which should never happen. Therefore a WARN_ON in the kernel fault handler would detect if it ever did. Now this is no longer the case if PKS is enabled and supported.
Report a Pkey fault with a normal splat and add the PKRS state to the fault splat text. Note the PKS register is reset during an exception therefore the saved PKRS value from before the beginning of the exception is passed down.
If PKS is not enabled, or not active, maintain the WARN_ON_ONCE() from before.
Because each fault has its own state the pkrs information will be correctly reported even if a fault 'faults'.
Suggested-by: Andy Lutomirski luto@kernel.org Signed-off-by: Ira Weiny ira.weiny@intel.com
--- Changes from V2 Fix compilation error
Changes from RFC V3 Update commit message Per Dave Hansen Don't print PKRS if !cpu_feature_enabled(X86_FEATURE_PKS) Fix comment Remove check on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS in favor of disabled-features.h --- arch/x86/mm/fault.c | 58 ++++++++++++++++++++++++++------------------- 1 file changed, 33 insertions(+), 25 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 8d20c4c13abf..90029ce9b0da 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -504,7 +504,8 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index) }
static void -show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address) +show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address, + irqentry_state_t *irq_state) { if (!oops_may_print()) return; @@ -548,6 +549,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad (error_code & X86_PF_PK) ? "protection keys violation" : "permissions violation");
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS + if (cpu_feature_enabled(X86_FEATURE_PKS) && irq_state && (error_code & X86_PF_PK)) + pr_alert("PKRS: 0x%x\n", irq_state->thread_pkrs); +#endif + if (!(error_code & X86_PF_USER) && user_mode(regs)) { struct desc_ptr idt, gdt; u16 ldtr, tr; @@ -626,7 +632,8 @@ static void set_signal_archinfo(unsigned long address,
static noinline void no_context(struct pt_regs *regs, unsigned long error_code, - unsigned long address, int signal, int si_code) + unsigned long address, int signal, int si_code, + irqentry_state_t *irq_state) { struct task_struct *tsk = current; unsigned long flags; @@ -732,7 +739,7 @@ no_context(struct pt_regs *regs, unsigned long error_code, */ flags = oops_begin();
- show_fault_oops(regs, error_code, address); + show_fault_oops(regs, error_code, address, irq_state);
if (task_stack_end_corrupted(tsk)) printk(KERN_EMERG "Thread overran stack, or stack corrupted\n"); @@ -785,7 +792,8 @@ static bool is_vsyscall_vaddr(unsigned long vaddr)
static void __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, - unsigned long address, u32 pkey, int si_code) + unsigned long address, u32 pkey, int si_code, + irqentry_state_t *irq_state) { struct task_struct *tsk = current;
@@ -832,14 +840,14 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, if (is_f00f_bug(regs, address)) return;
- no_context(regs, error_code, address, SIGSEGV, si_code); + no_context(regs, error_code, address, SIGSEGV, si_code, irq_state); }
static noinline void bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, - unsigned long address) + unsigned long address, irqentry_state_t *irq_state) { - __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR); + __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR, irq_state); }
static void @@ -853,7 +861,7 @@ __bad_area(struct pt_regs *regs, unsigned long error_code, */ mmap_read_unlock(mm);
- __bad_area_nosemaphore(regs, error_code, address, pkey, si_code); + __bad_area_nosemaphore(regs, error_code, address, pkey, si_code, NULL); }
static noinline void @@ -923,7 +931,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, { /* Kernel mode? Handle exceptions or die: */ if (!(error_code & X86_PF_USER)) { - no_context(regs, error_code, address, SIGBUS, BUS_ADRERR); + no_context(regs, error_code, address, SIGBUS, BUS_ADRERR, NULL); return; }
@@ -957,7 +965,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, vm_fault_t fault) { if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) { - no_context(regs, error_code, address, 0, 0); + no_context(regs, error_code, address, 0, 0, NULL); return; }
@@ -965,7 +973,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, /* Kernel mode? Handle exceptions or die: */ if (!(error_code & X86_PF_USER)) { no_context(regs, error_code, address, - SIGSEGV, SEGV_MAPERR); + SIGSEGV, SEGV_MAPERR, NULL); return; }
@@ -980,7 +988,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, VM_FAULT_HWPOISON_LARGE)) do_sigbus(regs, error_code, address, fault); else if (fault & VM_FAULT_SIGSEGV) - bad_area_nosemaphore(regs, error_code, address); + bad_area_nosemaphore(regs, error_code, address, NULL); else BUG(); } @@ -1148,14 +1156,14 @@ bool fault_in_kernel_space(unsigned long address) */ static void do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, - unsigned long address) + unsigned long address, irqentry_state_t *irq_state) { /* - * Protection keys exceptions only happen on user pages. We - * have no user pages in the kernel portion of the address - * space, so do not expect them here. + * PF_PK is only expected on kernel addresses when supervisor pkeys are + * enabled. */ - WARN_ON_ONCE(hw_error_code & X86_PF_PK); + if (!cpu_feature_enabled(X86_FEATURE_PKS)) + WARN_ON_ONCE(hw_error_code & X86_PF_PK);
#ifdef CONFIG_X86_32 /* @@ -1204,7 +1212,7 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, * Don't take the mm semaphore here. If we fixup a prefetch * fault we could otherwise deadlock: */ - bad_area_nosemaphore(regs, hw_error_code, address); + bad_area_nosemaphore(regs, hw_error_code, address, irq_state); } NOKPROBE_SYMBOL(do_kern_addr_fault);
@@ -1245,7 +1253,7 @@ void do_user_addr_fault(struct pt_regs *regs, !(hw_error_code & X86_PF_USER) && !(regs->flags & X86_EFLAGS_AC))) { - bad_area_nosemaphore(regs, hw_error_code, address); + bad_area_nosemaphore(regs, hw_error_code, address, NULL); return; }
@@ -1254,7 +1262,7 @@ void do_user_addr_fault(struct pt_regs *regs, * in a region with pagefaults disabled then we must not take the fault */ if (unlikely(faulthandler_disabled() || !mm)) { - bad_area_nosemaphore(regs, hw_error_code, address); + bad_area_nosemaphore(regs, hw_error_code, address, NULL); return; }
@@ -1316,7 +1324,7 @@ void do_user_addr_fault(struct pt_regs *regs, * Fault from code in kernel from * which we do not expect faults. */ - bad_area_nosemaphore(regs, hw_error_code, address); + bad_area_nosemaphore(regs, hw_error_code, address, NULL); return; } retry: @@ -1375,7 +1383,7 @@ void do_user_addr_fault(struct pt_regs *regs, if (fault_signal_pending(fault, regs)) { if (!user_mode(regs)) no_context(regs, hw_error_code, address, SIGBUS, - BUS_ADRERR); + BUS_ADRERR, NULL); return; }
@@ -1415,7 +1423,7 @@ trace_page_fault_entries(struct pt_regs *regs, unsigned long error_code,
static __always_inline void handle_page_fault(struct pt_regs *regs, unsigned long error_code, - unsigned long address) + unsigned long address, irqentry_state_t *irq_state) { trace_page_fault_entries(regs, error_code, address);
@@ -1424,7 +1432,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
/* Was the fault on kernel-controlled part of the address space? */ if (unlikely(fault_in_kernel_space(address))) { - do_kern_addr_fault(regs, error_code, address); + do_kern_addr_fault(regs, error_code, address, irq_state); } else { do_user_addr_fault(regs, error_code, address); /* @@ -1482,7 +1490,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) irqentry_enter(regs, &irq_state);
instrumentation_begin(); - handle_page_fault(regs, error_code, address); + handle_page_fault(regs, error_code, address, &irq_state); instrumentation_end();
irqentry_exit(regs, &irq_state);
From: Fenghua Yu fenghua.yu@intel.com
PKS allows kernel users to define domains of page mappings which have additional protections beyond the paging protections.
Add an API to allocate, use, and free a protection key which identifies such a domain. Export 5 new symbols pks_key_alloc(), pks_mknoaccess(), pks_mkread(), pks_mkrdwr(), and pks_key_free(). Add 2 new macros; PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey).
Update the protection key documentation to cover pkeys on supervisor pages.
Co-developed-by: Ira Weiny ira.weiny@intel.com Signed-off-by: Ira Weiny ira.weiny@intel.com Signed-off-by: Fenghua Yu fenghua.yu@intel.com
--- Changes from V2 From Greg KH Replace all WARN_ON_ONCE() uses with pr_err() From Dan Williams Add __must_check to pks_key_alloc() to help ensure users are using the API correctly
Changes from V1 Per Dave Hansen Add flags to pks_key_alloc() to help future proof the interface if/when the key space is exhausted.
Changes from RFC V3 Per Dave Hansen Put WARN_ON_ONCE in pks_key_free() s/pks_mknoaccess/pks_mk_noaccess/ s/pks_mkread/pks_mk_readonly/ s/pks_mkrdwr/pks_mk_readwrite/ Change return pks_key_alloc() to EOPNOTSUPP when not supported or configured Per Peter Zijlstra Remove unneeded preempt disable/enable --- Documentation/core-api/protection-keys.rst | 102 +++++++++++++--- arch/x86/include/asm/pgtable_types.h | 12 ++ arch/x86/include/asm/pkeys.h | 11 ++ arch/x86/include/asm/pkeys_common.h | 4 + arch/x86/mm/pkeys.c | 128 +++++++++++++++++++++ include/linux/pgtable.h | 4 + include/linux/pkeys.h | 24 ++++ 7 files changed, 267 insertions(+), 18 deletions(-)
diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst index ec575e72d0b2..c4e6c480562f 100644 --- a/Documentation/core-api/protection-keys.rst +++ b/Documentation/core-api/protection-keys.rst @@ -4,25 +4,33 @@ Memory Protection Keys ======================
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature -which is found on Intel's Skylake (and later) "Scalable Processor" -Server CPUs. It will be available in future non-server Intel parts -and future AMD processors. - -For anyone wishing to test or use this feature, it is available in -Amazon's EC2 C5 instances and is known to work there using an Ubuntu -17.04 image. - Memory Protection Keys provides a mechanism for enforcing page-based protections, but without requiring modification of the page tables -when an application changes protection domains. It works by -dedicating 4 previously ignored bits in each page table entry to a -"protection key", giving 16 possible keys. +when an application changes protection domains. + +PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable +Processor" Server CPUs and later. And It will be available in future +non-server Intel parts and future AMD processors. + +Future Intel processors will support Protection Keys for Supervisor pages +(PKS). + +For anyone wishing to test or use user space pkeys, it is available in Amazon's +EC2 C5 instances and is known to work there using an Ubuntu 17.04 image. + +pkeys work by dedicating 4 previously Reserved bits in each page table entry to +a "protection key", giving 16 possible keys. User and Supervisor pages are +treated separately. + +Protections for each page are controlled with per CPU registers for each type +of page User and Supervisor. Each of these 32 bit register stores two separate +bits (Access Disable and Write Disable) for each key.
-There is also a new user-accessible register (PKRU) with two separate -bits (Access Disable and Write Disable) for each key. Being a CPU -register, PKRU is inherently thread-local, potentially giving each -thread a different set of protections from every other thread. +For Userspace the register is user-accessible (rdpkru/wrpkru). For +Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel. + +Being a CPU register, pkeys are inherently thread-local, potentially giving +each thread an independent set of protections from every other thread.
There are two new instructions (RDPKRU/WRPKRU) for reading and writing to the new register. The feature is only available in 64-bit mode, @@ -30,8 +38,11 @@ even though there is theoretically space in the PAE PTEs. These permissions are enforced on data access only and have no effect on instruction fetches.
-Syscalls -======== +For kernel space rdmsr/wrmsr are used to access the kernel MSRs. + + +Syscalls for user space keys +============================
There are 3 system calls which directly interact with pkeys::
@@ -98,3 +109,58 @@ with a read():: The kernel will send a SIGSEGV in both cases, but si_code will be set to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when the plain mprotect() permissions are violated. + + +Kernel API for PKS support +========================== + +The following interface is used to allocate, use, and free a pkey which defines +a 'protection domain' within the kernel. Setting a pkey value in a supervisor +mapping adds that mapping to the protection domain. + + int pks_key_alloc(const char * const pkey_user, int flags); + #define PAGE_KERNEL_PKEY(pkey) + #define _PAGE_KEY(pkey) + void pks_mk_noaccess(int pkey); + void pks_mk_readonly(int pkey); + void pks_mk_readwrite(int pkey); + void pks_key_free(int pkey); + +pks_key_alloc() allocates keys dynamically to allow better use of the limited +key space. 'flags' alter the allocation based on the users need. Currently +they can request an exclusive key. + +Callers of pks_key_alloc() _must_ be prepared for it to fail and take +appropriate action. This is due mainly to the fact that PKS may not be +available on all arch's. Failure to check the return of pks_key_alloc() and +using any of the rest of the API is undefined. + +Kernel users must set the PTE permissions in the page table entries for the +mappings they want to protect. This can be done with PAGE_KERNEL_PKEY() or +_PAGE_KEY(). + +The pks_mk*() family of calls allows kernel users the ability to change the +protections for the domain identified by the pkey specified. 3 states are +available pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite() which +set the access to none, read, and read/write respectively. + +Finally, pks_key_free() allows a user to return the key to the allocator for +use by others. + +The interface maintains pks_mk_noaccess() (Access Disabled (AD=1)) for all keys +not currently allocated. Therefore, the user can depend on access being +disabled when pks_key_alloc() returns a key and the user should remove mappings +from the domain (remove the pkey from the PTE) prior to calling pks_key_free(). + +It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing +but still maintains ordering properties similar to WRPKRU. Thus it is safe to +immediately use a mapping when the pks_mk*() functions returns. + +The current SDM section on PKRS needs updating but should be the same as that +of WRPKRU. So to quote from the WRPKRU text: + + WRPKRU will never execute transiently. Memory accesses + affected by PKRU register will not execute (even transiently) + until all prior executions of WRPKRU have completed execution + and updated the PKRU register. + diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 816b31c68550..c9fdfbdcbbfb 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -73,6 +73,12 @@ _PAGE_PKEY_BIT2 | \ _PAGE_PKEY_BIT3)
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +#define _PAGE_PKEY(pkey) (_AT(pteval_t, pkey) << _PAGE_BIT_PKEY_BIT0) +#else +#define _PAGE_PKEY(pkey) (_AT(pteval_t, 0)) +#endif + #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED) #else @@ -229,6 +235,12 @@ enum page_cache_mode { #define PAGE_KERNEL_IO __pgprot_mask(__PAGE_KERNEL_IO) #define PAGE_KERNEL_IO_NOCACHE __pgprot_mask(__PAGE_KERNEL_IO_NOCACHE)
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +#define PAGE_KERNEL_PKEY(pkey) __pgprot_mask(__PAGE_KERNEL | _PAGE_PKEY(pkey)) +#else +#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL +#endif + #endif /* __ASSEMBLY__ */
/* xwr */ diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h index 4526245b03e5..990fe9c4787c 100644 --- a/arch/x86/include/asm/pkeys.h +++ b/arch/x86/include/asm/pkeys.h @@ -3,6 +3,7 @@ #define _ASM_X86_PKEYS_H
#include <asm/pkeys_common.h> +#include <asm-generic/mman-common.h>
#define ARCH_DEFAULT_PKEY 0
@@ -138,4 +139,14 @@ static inline int vma_pkey(struct vm_area_struct *vma)
u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags);
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +__must_check int pks_key_alloc(const char *const pkey_user, int flags); +void pks_key_free(int pkey); + +void pks_mk_noaccess(int pkey); +void pks_mk_readonly(int pkey); +void pks_mk_readwrite(int pkey); + +#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */ + #endif /*_ASM_X86_PKEYS_H */ diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h index 11a95e6efd2d..f921c58793f9 100644 --- a/arch/x86/include/asm/pkeys_common.h +++ b/arch/x86/include/asm/pkeys_common.h @@ -26,6 +26,10 @@ PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \ PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
+/* PKS supports 16 keys. Key 0 is reserved for the kernel. */ +#define PKS_KERN_DEFAULT_KEY 0 +#define PKS_NUM_KEYS 16 + #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS DECLARE_PER_CPU(u32, pkrs_cache); noinstr void write_pkrs(u32 new_pkrs); diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 6892d4524868..57718716cc70 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -3,6 +3,9 @@ * Intel Memory Protection Keys management * Copyright (c) 2015, Intel Corporation. */ +#undef pr_fmt +#define pr_fmt(fmt) "x86/pkeys: " fmt + #include <linux/debugfs.h> /* debugfs_create_u32() */ #include <linux/mm_types.h> /* mm_struct, vma, etc... */ #include <linux/pkeys.h> /* PKEY_* */ @@ -231,6 +234,7 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
return pk_reg; } +EXPORT_SYMBOL_GPL(update_pkey_val);
DEFINE_PER_CPU(u32, pkrs_cache);
@@ -262,3 +266,127 @@ noinstr void write_pkrs(u32 new_pkrs) } put_cpu_ptr(pkrs); } +EXPORT_SYMBOL_GPL(write_pkrs); + +/** + * Do not call this directly, see pks_mk*() below. + * + * @pkey: Key for the domain to change + * @protection: protection bits to be used + * + * Protection utilizes the same protection bits specified for User pkeys + * PKEY_DISABLE_ACCESS + * PKEY_DISABLE_WRITE + * + */ +static inline void pks_update_protection(int pkey, unsigned long protection) +{ + current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs, + pkey, protection); + write_pkrs(current->thread.saved_pkrs); +} + +/** + * PKS access control functions + * + * Change the access of the domain specified by the pkey. These are global + * updates. They only affects the current running thread. It is undefined and + * a bug for users to call this without having allocated a pkey and using it as + * pkey here. + * + * pks_mk_noaccess() + * Disable all access to the domain + * pks_mk_readonly() + * Make the domain Read only + * pks_mk_readwrite() + * Make the domain Read/Write + * + * @pkey the pkey for which the access should change. + * + */ +void pks_mk_noaccess(int pkey) +{ + pks_update_protection(pkey, PKEY_DISABLE_ACCESS); +} +EXPORT_SYMBOL_GPL(pks_mk_noaccess); + +void pks_mk_readonly(int pkey) +{ + pks_update_protection(pkey, PKEY_DISABLE_WRITE); +} +EXPORT_SYMBOL_GPL(pks_mk_readonly); + +void pks_mk_readwrite(int pkey) +{ + pks_update_protection(pkey, 0); +} +EXPORT_SYMBOL_GPL(pks_mk_readwrite); + +static const char pks_key_user0[] = "kernel"; + +/* Store names of allocated keys for debug. Key 0 is reserved for the kernel. */ +static const char *pks_key_users[PKS_NUM_KEYS] = { + pks_key_user0 +}; + +/* + * Each key is represented by a bit. Bit 0 is set for key 0 and reserved for + * its use. We use ulong for the bit operations but only 16 bits are used. + */ +static unsigned long pks_key_allocation_map = 1 << PKS_KERN_DEFAULT_KEY; + +/* + * pks_key_alloc - Allocate a PKS key + * @pkey_user: String stored for debugging of key exhaustion. The caller is + * responsible to maintain this memory until pks_key_free(). + * @flags: Flags to modify behavior: see pks_alloc_flags + * + * Returns: pkey if success + * -EOPNOTSUPP if pks is not supported or not enabled + * -ENOSPC if no keys are available (even for sharing) + */ +__must_check int pks_key_alloc(const char * const pkey_user, int flags) +{ + int nr; + + if (!cpu_feature_enabled(X86_FEATURE_PKS)) + return -EOPNOTSUPP; + + while (1) { + nr = find_first_zero_bit(&pks_key_allocation_map, PKS_NUM_KEYS); + if (nr >= PKS_NUM_KEYS) { + pr_info("Cannot allocate supervisor key for %s.\n", + pkey_user); + return -ENOSPC; + } + if (!test_and_set_bit_lock(nr, &pks_key_allocation_map)) + break; + } + + /* for debugging key exhaustion */ + pks_key_users[nr] = pkey_user; + + return nr; +} +EXPORT_SYMBOL_GPL(pks_key_alloc); + +/* + * pks_key_free - Free a previously allocate PKS key + * @pkey: Key to be free'ed + */ +void pks_key_free(int pkey) +{ + if (!cpu_feature_enabled(X86_FEATURE_PKS)) + return; + + if (pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY) { + pr_err("Invalid PKey value: %d\n", pkey); + return; + } + + /* Restore to default of no access */ + pks_mk_noaccess(pkey); + pks_key_users[pkey] = NULL; + __clear_bit(pkey, &pks_key_allocation_map); +} +EXPORT_SYMBOL_GPL(pks_key_free); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 38c33eabea89..cd72d73e8e1c 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1402,6 +1402,10 @@ static inline bool arch_has_pfn_modify_check(void) # define PAGE_KERNEL_EXEC PAGE_KERNEL #endif
+#ifndef PAGE_KERNEL_PKEY +#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL +#endif + /* * Page Table Modification bits for pgtbl_mod_mask. * diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h index 2955ba976048..bed0e293f13b 100644 --- a/include/linux/pkeys.h +++ b/include/linux/pkeys.h @@ -50,4 +50,28 @@ static inline void copy_init_pkru_to_fpregs(void)
#endif /* ! CONFIG_ARCH_HAS_PKEYS */
+#define PKS_FLAG_EXCLUSIVE 0x00 + +#ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +static inline __must_check int pks_key_alloc(const char * const pkey_user, int flags) +{ + return -EOPNOTSUPP; +} +static inline void pks_key_free(int pkey) +{ +} +static inline void pks_mk_noaccess(int pkey) +{ + pr_err("%s is not valid without PKS support\n", __func__); +} +static inline void pks_mk_readonly(int pkey) +{ + pr_err("%s is not valid without PKS support\n", __func__); +} +static inline void pks_mk_readwrite(int pkey) +{ + pr_err("%s is not valid without PKS support\n", __func__); +} +#endif /* ! CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */ + #endif /* _LINUX_PKEYS_H */
From: Fenghua Yu fenghua.yu@intel.com
Protection Keys for Supervisor pages (PKS) enables fast, hardware thread specific, manipulation of permission restrictions on supervisor page mappings. It uses the same mechanism of Protection Keys as those on User mappings but applies that mechanism to supervisor mappings using a supervisor specific MSR.
Kernel users can thus defines 'domains' of page mappings which have an extra level of protection beyond those specified in the supervisor page table entries.
Enable PKS on supported CPUS.
Co-developed-by: Ira Weiny ira.weiny@intel.com Signed-off-by: Ira Weiny ira.weiny@intel.com Signed-off-by: Fenghua Yu fenghua.yu@intel.com
--- Changes from V2 From Thomas: Make this patch last so PKS is not enabled until all the PKS mechanisms are in place. Specifically: 1) Modify setup_pks() to call write_pkrs() to properly set up the initial value when enabled.
2) Split this patch into two. 1) a precursor patch with the required defines/config options and 2) this patch which actually enables feature on CPUs which support it.
Changes since RFC V3 Per Dave Hansen Update comment Add X86_FEATURE_PKS to disabled-features.h Rebase based on latest TIP tree --- arch/x86/include/asm/disabled-features.h | 6 +++++- arch/x86/kernel/cpu/common.c | 15 +++++++++++++++ 2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h index 164587177152..82540f0c5b6c 100644 --- a/arch/x86/include/asm/disabled-features.h +++ b/arch/x86/include/asm/disabled-features.h @@ -44,7 +44,11 @@ # define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31)) #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
-#define DISABLE_PKS (1<<(X86_FEATURE_PKS & 31)) +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +# define DISABLE_PKS 0 +#else +# define DISABLE_PKS (1<<(X86_FEATURE_PKS & 31)) +#endif
#ifdef CONFIG_X86_5LEVEL # define DISABLE_LA57 0 diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 35ad8480c464..f8929a557d72 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -58,6 +58,7 @@ #include <asm/intel-family.h> #include <asm/cpu_device_id.h> #include <asm/uv/uv.h> +#include <linux/pkeys.h>
#include "cpu.h"
@@ -1494,6 +1495,19 @@ static void validate_apic_and_package_id(struct cpuinfo_x86 *c) #endif }
+/* + * PKS is independent of PKU and either or both may be supported on a CPU. + * Configure PKS if the CPU supports the feature. + */ +static void setup_pks(void) +{ + if (!cpu_feature_enabled(X86_FEATURE_PKS)) + return; + + write_pkrs(INIT_PKRS_VALUE); + cr4_set_bits(X86_CR4_PKS); +} + /* * This does the hard work of actually picking apart the CPU stuff... */ @@ -1591,6 +1605,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
x86_init_rdrand(c); setup_pku(c); + setup_pks();
/* * Clear/Set all flags overridden by options, need do it
From: Ira Weiny ira.weiny@intel.com
The core PKS functionality provides an interface for kernel users to reserve keys to their domains set up the page tables with those keys and control access to those domains when needed.
Define test code which exercises the core functionality of PKS via a debugfs entry. Basic checks can be triggered on boot with a kernel command line option while both basic and preemption checks can be triggered with separate debugfs values.
debugfs controls are:
'0' -- Run access tests with a single pkey '1' -- Set up the pkey register with no access for the pkey allocated to this fd '2' -- Check that the pkey register updated in '1' is still the same. (To be used after a forced context switch.) '3' -- Allocate all pkeys possible and run tests on each pkey allocated. DEFAULT when run at boot.
Closing the fd will cleanup and release the pkey, therefore to exercise context switch testing a user space program is provided in:
.../tools/testing/selftests/x86/test_pks.c
Reviewed-by: Dave Hansen dave.hansen@linux.intel.com Co-developed-by: Fenghua Yu fenghua.yu@intel.com Signed-off-by: Fenghua Yu fenghua.yu@intel.com Signed-off-by: Ira Weiny ira.weiny@intel.com
--- Changes for V2 Fix compilation errors
Changes for V1 Update for new pks_key_alloc()
Changes from RFC V3 Comments from Dave Hansen clean up whitespace dmanage Clean up Kconfig help Clean up user test error output s/pks_mknoaccess/pks_mk_noaccess/ s/pks_mkread/pks_mk_readonly/ s/pks_mkrdwr/pks_mk_readwrite/ Comments from Jing Han Remove duplicate stdio.h --- Documentation/core-api/protection-keys.rst | 1 + arch/x86/mm/fault.c | 23 + lib/Kconfig.debug | 12 + lib/Makefile | 3 + lib/pks/Makefile | 3 + lib/pks/pks_test.c | 692 +++++++++++++++++++++ tools/testing/selftests/x86/Makefile | 3 +- tools/testing/selftests/x86/test_pks.c | 66 ++ 8 files changed, 802 insertions(+), 1 deletion(-) create mode 100644 lib/pks/Makefile create mode 100644 lib/pks/pks_test.c create mode 100644 tools/testing/selftests/x86/test_pks.c
diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst index c4e6c480562f..8ffdfbff013c 100644 --- a/Documentation/core-api/protection-keys.rst +++ b/Documentation/core-api/protection-keys.rst @@ -164,3 +164,4 @@ of WRPKRU. So to quote from the WRPKRU text: until all prior executions of WRPKRU have completed execution and updated the PKRU register.
+Example code can be found in lib/pks/pks_test.c diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 90029ce9b0da..916b2d18ed57 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -18,6 +18,7 @@ #include <linux/uaccess.h> /* faulthandler_disabled() */ #include <linux/efi.h> /* efi_recover_from_page_fault()*/ #include <linux/mm_types.h> +#include <linux/pkeys.h>
#include <asm/cpufeature.h> /* boot_cpu_has, ... */ #include <asm/traps.h> /* dotraplinkage, ... */ @@ -1149,6 +1150,25 @@ bool fault_in_kernel_space(unsigned long address) return address >= TASK_SIZE_MAX; }
+#ifdef CONFIG_PKS_TESTING +bool pks_test_callback(irqentry_state_t *irq_state); +static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t *irq_state) +{ + /* + * If we get a protection key exception it could be because we + * are running the PKS test. If so, pks_test_callback() will + * clear the protection mechanism and return true to indicate + * the fault was handled. + */ + return (hw_error_code & X86_PF_PK) && pks_test_callback(irq_state); +} +#else +static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t *irq_state) +{ + return false; +} +#endif + /* * Called for all faults where 'address' is part of the kernel address * space. Might get called for faults that originate from *code* that @@ -1165,6 +1185,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, if (!cpu_feature_enabled(X86_FEATURE_PKS)) WARN_ON_ONCE(hw_error_code & X86_PF_PK);
+ if (handle_pks_testing(hw_error_code, irq_state)) + return; + #ifdef CONFIG_X86_32 /* * We can fault-in kernel-space virtual memory on-demand. The diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index c789b39ed527..e90e06f5a3b9 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2444,6 +2444,18 @@ config HYPERV_TESTING help Select this option to enable Hyper-V vmbus testing.
+config PKS_TESTING + bool "PKey (S)upervisor testing" + default n + depends on ARCH_HAS_SUPERVISOR_PKEYS + help + Select this option to enable testing of PKS core software and + hardware. The PKS core provides a mechanism to allocate keys as well + as maintain the protection settings across context switches. + Answer N if you don't know what supervisor keys are. + + If unsure, say N. + endmenu # "Kernel Testing and Coverage"
source "Documentation/Kconfig" diff --git a/lib/Makefile b/lib/Makefile index ce45af50983a..6a402bc1b9a0 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -352,3 +352,6 @@ obj-$(CONFIG_BITFIELD_KUNIT) += bitfield_kunit.o obj-$(CONFIG_LIST_KUNIT_TEST) += list-test.o obj-$(CONFIG_LINEAR_RANGES_TEST) += test_linear_ranges.o obj-$(CONFIG_BITS_TEST) += test_bits.o + +# PKS test +obj-y += pks/ diff --git a/lib/pks/Makefile b/lib/pks/Makefile new file mode 100644 index 000000000000..7d1df7563db9 --- /dev/null +++ b/lib/pks/Makefile @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-$(CONFIG_PKS_TESTING) += pks_test.o diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c new file mode 100644 index 000000000000..0666be260169 --- /dev/null +++ b/lib/pks/pks_test.c @@ -0,0 +1,692 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright(c) 2020 Intel Corporation. All rights reserved. + * + * Implement PKS testing + * Access to run this test can be with a command line parameter + * ("pks-test-on-boot") or more detailed tests can be triggered through: + * + * /sys/kernel/debug/x86/run_pks + * + * debugfs controls are: + * + * '0' -- Run access tests with a single pkey + * + * '1' -- Set up the pkey register with no access for the pkey allocated to + * this fd + * '2' -- Check that the pkey register updated in '1' is still the same. (To + * be used after a forced context switch.) + * + * '3' -- Allocate all pkeys possible and run tests on each pkey allocated. + * DEFAULT when run at boot. + * + * Closing the fd will cleanup and release the pkey. + * + * A companion user space program is provided in: + * + * .../tools/testing/selftests/x86/test_pks.c + * + * which will better test the context switching. + * + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/debugfs.h> +#include <linux/delay.h> +#include <linux/entry-common.h> +#include <linux/fs.h> +#include <linux/list.h> +#include <linux/mman.h> +#include <linux/module.h> +#include <linux/mutex.h> +#include <linux/percpu-defs.h> +#include <linux/pgtable.h> +#include <linux/pkeys.h> +#include <linux/slab.h> +#include <linux/vmalloc.h> + +#define PKS_TEST_MEM_SIZE (PAGE_SIZE) + +/* + * run_on_boot default '= false' which checkpatch complains about initializing; + * so we don't + */ +static bool run_on_boot; +static struct dentry *pks_test_dentry; +static bool run_9; + +/* + * We must lock the following globals for brief periods while the fault handler + * checks/updates them. + */ +static DEFINE_SPINLOCK(test_lock); +static int test_armed_key; +static unsigned long prev_cnt; +static unsigned long fault_cnt; + +struct pks_test_ctx { + bool pass; + bool pks_cpu_enabled; + int pkey; + char data[64]; +}; +static struct pks_test_ctx *test_exception_ctx; + +static pte_t *walk_table(void *ptr) +{ + struct page *page = NULL; + pgd_t *pgdp; + p4d_t *p4dp; + pud_t *pudp; + pmd_t *pmdp; + pte_t *ret = NULL; + + pgdp = pgd_offset_k((unsigned long)ptr); + if (pgd_none(*pgdp) || pgd_bad(*pgdp)) + goto error; + + p4dp = p4d_offset(pgdp, (unsigned long)ptr); + if (p4d_none(*p4dp) || p4d_bad(*p4dp)) + goto error; + + pudp = pud_offset(p4dp, (unsigned long)ptr); + if (pud_none(*pudp) || pud_bad(*pudp)) + goto error; + + pmdp = pmd_offset(pudp, (unsigned long)ptr); + if (pmd_none(*pmdp) || pmd_bad(*pmdp)) + goto error; + + ret = pte_offset_map(pmdp, (unsigned long)ptr); + if (pte_present(*ret)) { + page = pte_page(*ret); + if (!page) { + pte_unmap(ret); + goto error; + } + pr_info("page 0x%lx; flags 0x%lx\n", + (unsigned long)page, page->flags); + } + +error: + return ret; +} + +static bool check_pkey_val(u32 pk_reg, int pkey, u32 expected) +{ + u32 pkey_shift = pkey * PKR_BITS_PER_PKEY; + u32 pkey_mask = ((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift; + + pk_reg = (pk_reg & pkey_mask) >> pkey_shift; + return (pk_reg == expected); +} + +/* + * Check if the register @pkey value matches @expected value + * + * Both the cached and actual MSR must match. + */ +static bool check_pkrs(int pkey, u32 expected) +{ + bool ret = true; + u64 pkrs; + u32 *tmp_cache; + + tmp_cache = get_cpu_ptr(&pkrs_cache); + if (!check_pkey_val(*tmp_cache, pkey, expected)) + ret = false; + put_cpu_ptr(tmp_cache); + + rdmsrl(MSR_IA32_PKRS, pkrs); + if (!check_pkey_val(pkrs, pkey, expected)) + ret = false; + + return ret; +} + +static void check_exception(irqentry_state_t *irq_state) +{ + /* Check the thread saved state */ + if (!check_pkey_val(irq_state->thread_pkrs, test_armed_key, PKEY_DISABLE_WRITE)) { + pr_err(" FAIL: checking irq_state->thread_pkrs\n"); + test_exception_ctx->pass = false; + } + + /* Check the exception state */ + if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS)) { + pr_err(" FAIL: PKRS cache and MSR\n"); + test_exception_ctx->pass = false; + } + + /* + * Check we can update the value during exception without affecting the + * calling thread. The calling thread is checked after exception... + */ + pks_mk_readwrite(test_armed_key); + if (!check_pkrs(test_armed_key, 0)) { + pr_err(" FAIL: exception did not change register to 0\n"); + test_exception_ctx->pass = false; + } + pks_mk_noaccess(test_armed_key); + if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS)) { + pr_err(" FAIL: exception did not change register to 0x%x\n", + PKEY_DISABLE_ACCESS); + test_exception_ctx->pass = false; + } +} + +/* Silence prototype warning */ +bool pks_test_callback(irqentry_state_t *irq_state); + +/** + * pks_test_callback() is exported so that the fault handler can detect + * and report back status of intentional faults. + * + * NOTE: It clears the protection key from the page such that the fault handler + * will not re-trigger. + */ +bool pks_test_callback(irqentry_state_t *irq_state) +{ + bool armed = (test_armed_key != 0); + + if (test_exception_ctx) { + check_exception(irq_state); + /* + * We stop this check within the exception because the + * fault handler clean up code will call us 2x while checking + * the PMD entry and we don't need to check this again + */ + test_exception_ctx = NULL; + } + + if (armed) { + /* Enable read and write to stop faults */ + irq_state->thread_pkrs = update_pkey_val(irq_state->thread_pkrs, + test_armed_key, 0); + fault_cnt++; + } + + return armed; +} +EXPORT_SYMBOL(pks_test_callback); + +static bool exception_caught(void) +{ + bool ret = (fault_cnt != prev_cnt); + + prev_cnt = fault_cnt; + return ret; +} + +static void report_pkey_settings(void *unused) +{ + u8 pkey; + unsigned long long msr = 0; + unsigned int cpu = smp_processor_id(); + + rdmsrl(MSR_IA32_PKRS, msr); + + pr_info("for CPU %d : 0x%llx\n", cpu, msr); + for (pkey = 0; pkey < PKS_NUM_KEYS; pkey++) { + int ad, wd; + + ad = (msr >> (pkey * PKR_BITS_PER_PKEY)) & PKEY_DISABLE_ACCESS; + wd = (msr >> (pkey * PKR_BITS_PER_PKEY)) & PKEY_DISABLE_WRITE; + pr_info(" %u: A:%d W:%d\n", pkey, ad, wd); + } +} + +enum pks_access_mode { + PKS_TEST_NO_ACCESS, + PKS_TEST_RDWR, + PKS_TEST_RDONLY +}; + +static char *get_mode_str(enum pks_access_mode mode) +{ + switch (mode) { + case PKS_TEST_NO_ACCESS: + return "No Access"; + case PKS_TEST_RDWR: + return "Read Write"; + case PKS_TEST_RDONLY: + return "Read Only"; + default: + pr_err("BUG in test invalid mode\n"); + break; + } + + return ""; +} + +struct pks_access_test { + enum pks_access_mode mode; + bool write; + bool exception; +}; + +static struct pks_access_test pkey_test_ary[] = { + /* disable both */ + { PKS_TEST_NO_ACCESS, true, true }, + { PKS_TEST_NO_ACCESS, false, true }, + + /* enable both */ + { PKS_TEST_RDWR, true, false }, + { PKS_TEST_RDWR, false, false }, + + /* enable read only */ + { PKS_TEST_RDONLY, true, true }, + { PKS_TEST_RDONLY, false, false }, +}; + +static int test_it(struct pks_test_ctx *ctx, struct pks_access_test *test, void *ptr) +{ + bool exception; + int ret = 0; + + spin_lock(&test_lock); + WRITE_ONCE(test_armed_key, ctx->pkey); + + if (test->write) + memcpy(ptr, ctx->data, 8); + else + memcpy(ctx->data, ptr, 8); + + exception = exception_caught(); + + WRITE_ONCE(test_armed_key, 0); + spin_unlock(&test_lock); + + if (test->exception != exception) { + pr_err("pkey test FAILED: mode %s; write %s; exception %s != %s\n", + get_mode_str(test->mode), + test->write ? "TRUE" : "FALSE", + test->exception ? "TRUE" : "FALSE", + exception ? "TRUE" : "FALSE"); + ret = -EFAULT; + } + + return ret; +} + +static int run_access_test(struct pks_test_ctx *ctx, + struct pks_access_test *test, + void *ptr) +{ + switch (test->mode) { + case PKS_TEST_NO_ACCESS: + pks_mk_noaccess(ctx->pkey); + break; + case PKS_TEST_RDWR: + pks_mk_readwrite(ctx->pkey); + break; + case PKS_TEST_RDONLY: + pks_mk_readonly(ctx->pkey); + break; + default: + pr_err("BUG in test invalid mode\n"); + break; + } + + return test_it(ctx, test, ptr); +} + +static void *alloc_test_page(int pkey) +{ + return __vmalloc_node_range(PKS_TEST_MEM_SIZE, 1, VMALLOC_START, VMALLOC_END, + GFP_KERNEL, PAGE_KERNEL_PKEY(pkey), 0, + NUMA_NO_NODE, __builtin_return_address(0)); +} + +static void test_mem_access(struct pks_test_ctx *ctx) +{ + int i, rc; + u8 pkey; + void *ptr = NULL; + pte_t *ptep; + + ptr = alloc_test_page(ctx->pkey); + if (!ptr) { + pr_err("Failed to vmalloc page???\n"); + ctx->pass = false; + return; + } + + ptep = walk_table(ptr); + if (!ptep) { + pr_err("Failed to walk table???\n"); + ctx->pass = false; + goto done; + } + + pkey = pte_flags_pkey(ptep->pte); + pr_info("ptep flags 0x%lx pkey %u\n", + (unsigned long)ptep->pte, pkey); + + if (pkey != ctx->pkey) { + pr_err("invalid pkey found: %u, test_pkey: %u\n", + pkey, ctx->pkey); + ctx->pass = false; + goto unmap; + } + + if (!ctx->pks_cpu_enabled) { + pr_err("not CPU enabled; skipping access tests...\n"); + ctx->pass = true; + goto unmap; + } + + for (i = 0; i < ARRAY_SIZE(pkey_test_ary); i++) { + rc = run_access_test(ctx, &pkey_test_ary[i], ptr); + + /* only save last error is fine */ + if (rc) + ctx->pass = false; + } + +unmap: + pte_unmap(ptep); +done: + vfree(ptr); +} + +static void pks_run_test(struct pks_test_ctx *ctx) +{ + ctx->pass = true; + + pr_info("\n"); + pr_info("\n"); + pr_info(" ***** BEGIN: Testing (CPU enabled : %s) *****\n", + ctx->pks_cpu_enabled ? "TRUE" : "FALSE"); + + if (ctx->pks_cpu_enabled) + on_each_cpu(report_pkey_settings, NULL, 1); + + pr_info(" BEGIN: pkey %d Testing\n", ctx->pkey); + test_mem_access(ctx); + pr_info(" END: PAGE_KERNEL_PKEY Testing : %s\n", + ctx->pass ? "PASS" : "FAIL"); + + pr_info(" ***** END: Testing *****\n"); + pr_info("\n"); + pr_info("\n"); +} + +static ssize_t pks_read_file(struct file *file, char __user *user_buf, + size_t count, loff_t *ppos) +{ + struct pks_test_ctx *ctx = file->private_data; + char buf[32]; + unsigned int len; + + if (!ctx) + len = sprintf(buf, "not run\n"); + else + len = sprintf(buf, "%s\n", ctx->pass ? "PASS" : "FAIL"); + + return simple_read_from_buffer(user_buf, count, ppos, buf, len); +} + +static struct pks_test_ctx *alloc_ctx(const char *name) +{ + struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + + if (!ctx) { + pr_err("Failed to allocate memory for test context\n"); + return ERR_PTR(-ENOMEM); + } + + ctx->pkey = pks_key_alloc(name, PKS_FLAG_EXCLUSIVE); + if (ctx->pkey <= 0) { + pr_err("Failed to allocate memory for test context\n"); + kfree(ctx); + return ERR_PTR(-ENOMEM); + } + + ctx->pks_cpu_enabled = cpu_feature_enabled(X86_FEATURE_PKS); + sprintf(ctx->data, "%s", "DEADBEEF"); + return ctx; +} + +static void free_ctx(struct pks_test_ctx *ctx) +{ + pks_key_free(ctx->pkey); + kfree(ctx); +} + +static void run_exception_test(void) +{ + void *ptr = NULL; + bool pass = true; + struct pks_test_ctx *ctx; + + pr_info(" ***** BEGIN: exception checking\n"); + + ctx = alloc_ctx("Exception test"); + if (IS_ERR(ctx)) { + pr_err(" FAIL: no context\n"); + pass = false; + goto result; + } + ctx->pass = true; + + ptr = alloc_test_page(ctx->pkey); + if (!ptr) { + pr_err(" FAIL: no vmalloc page\n"); + pass = false; + goto free_context; + } + + pks_mk_readonly(ctx->pkey); + + spin_lock(&test_lock); + WRITE_ONCE(test_exception_ctx, ctx); + WRITE_ONCE(test_armed_key, ctx->pkey); + + memcpy(ptr, ctx->data, 8); + + if (!exception_caught()) { + pr_err(" FAIL: did not get an exception\n"); + pass = false; + } + + /* + * NOTE The exception code has to enable access (b00) to keep the + * fault from looping forever. So we don't see the write disabled + * restored but rather full access restored. Also note that as part + * of this test the exception callback attempted to disable access + * completely (b11) and so we ensure that we are seeing the proper + * thread value restored here. + */ + if (!check_pkrs(test_armed_key, 0)) { + pr_err(" FAIL: PKRS not restored\n"); + pass = false; + } + + if (!ctx->pass) + pass = false; + + WRITE_ONCE(test_armed_key, 0); + spin_unlock(&test_lock); + + vfree(ptr); +free_context: + free_ctx(ctx); +result: + pr_info(" ***** END: exception checking : %s\n", + pass ? "PASS" : "FAIL"); +} + +static void run_all(void) +{ + struct pks_test_ctx *ctx[PKS_NUM_KEYS]; + static char name[PKS_NUM_KEYS][64]; + int i; + + for (i = 1; i < PKS_NUM_KEYS; i++) { + sprintf(name[i], "pks ctx %d", i); + ctx[i] = alloc_ctx((const char *)name[i]); + } + + for (i = 1; i < PKS_NUM_KEYS; i++) { + if (!IS_ERR(ctx[i])) + pks_run_test(ctx[i]); + } + + for (i = 1; i < PKS_NUM_KEYS; i++) { + if (!IS_ERR(ctx[i])) + free_ctx(ctx[i]); + } + + run_exception_test(); +} + +static void crash_it(void) +{ + struct pks_test_ctx *ctx; + void *ptr; + + pr_warn(" ***** BEGIN: Unhandled fault test *****\n"); + + ctx = alloc_ctx("crashing kernel\n"); + + ptr = alloc_test_page(ctx->pkey); + if (!ptr) { + pr_err("Failed to vmalloc page???\n"); + ctx->pass = false; + return; + } + + pks_mk_noaccess(ctx->pkey); + + spin_lock(&test_lock); + WRITE_ONCE(test_armed_key, 0); + /* This purposely faults */ + memcpy(ptr, ctx->data, 8); + spin_unlock(&test_lock); + + vfree(ptr); + free_ctx(ctx); +} + +static ssize_t pks_write_file(struct file *file, const char __user *user_buf, + size_t count, loff_t *ppos) +{ + char buf[2]; + struct pks_test_ctx *ctx = file->private_data; + + if (copy_from_user(buf, user_buf, 1)) + return -EFAULT; + buf[1] = '\0'; + + /* + * WARNING: Test "9" will crash the kernel. + * + * So we arm the test and print a warning. A second "9" will run the + * test. + */ + if (!strcmp(buf, "9")) { + if (run_9) { + crash_it(); + run_9 = false; + } else { + pr_warn("CAUTION: Test 9 will crash in the kernel.\n"); + pr_warn(" Specify 9 a second time to run\n"); + pr_warn(" run any other test to clear\n"); + run_9 = true; + } + } else { + run_9 = false; + } + + /* + * Test "3" will test allocating all keys. Do it first without + * using "ctx". + */ + if (!strcmp(buf, "3")) + run_all(); + + if (!ctx) { + ctx = alloc_ctx("pks test"); + if (IS_ERR(ctx)) + return -ENOMEM; + file->private_data = ctx; + } + + if (!strcmp(buf, "0")) + pks_run_test(ctx); + + /* start of context switch test */ + if (!strcmp(buf, "1")) { + /* Ensure a known state to test context switch */ + pks_mk_noaccess(ctx->pkey); + } + + /* After context switch msr should be restored */ + if (!strcmp(buf, "2") && ctx->pks_cpu_enabled) { + unsigned long reg_pkrs; + int access; + + rdmsrl(MSR_IA32_PKRS, reg_pkrs); + + access = (reg_pkrs >> (ctx->pkey * PKR_BITS_PER_PKEY)) & + PKEY_ACCESS_MASK; + if (access != (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)) { + ctx->pass = false; + pr_err("Context switch check failed\n"); + } + } + + return count; +} + +static int pks_release_file(struct inode *inode, struct file *file) +{ + struct pks_test_ctx *ctx = file->private_data; + + if (!ctx) + return 0; + + free_ctx(ctx); + return 0; +} + +static const struct file_operations fops_init_pks = { + .read = pks_read_file, + .write = pks_write_file, + .llseek = default_llseek, + .release = pks_release_file, +}; + +static int __init parse_pks_test_options(char *str) +{ + run_on_boot = true; + + return 0; +} +early_param("pks-test-on-boot", parse_pks_test_options); + +static int __init pks_test_init(void) +{ + if (cpu_feature_enabled(X86_FEATURE_PKS)) { + if (run_on_boot) + run_all(); + + pks_test_dentry = debugfs_create_file("run_pks", 0600, arch_debugfs_dir, + NULL, &fops_init_pks); + } + + return 0; +} +late_initcall(pks_test_init); + +static void __exit pks_test_exit(void) +{ + debugfs_remove(pks_test_dentry); + pr_info("test exit\n"); +} +module_exit(pks_test_exit); + +MODULE_AUTHOR("Intel Corporation"); +MODULE_LICENSE("GPL v2"); diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile index 6703c7906b71..f5c80f952eab 100644 --- a/tools/testing/selftests/x86/Makefile +++ b/tools/testing/selftests/x86/Makefile @@ -13,7 +13,8 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh $(CC) trivial_program.c -no-pie) TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \ check_initial_reg_state sigreturn iopl ioperm \ test_vdso test_vsyscall mov_ss_trap \ - syscall_arg_fault fsgsbase_restore + syscall_arg_fault fsgsbase_restore test_pks + TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \ test_FCMOV test_FCOMI test_FISTTP \ vdso_restorer diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c new file mode 100644 index 000000000000..cd40f930b172 --- /dev/null +++ b/tools/testing/selftests/x86/test_pks.c @@ -0,0 +1,66 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#define _GNU_SOURCE +#include <sched.h> +#include <stdlib.h> +#include <unistd.h> +#include <assert.h> +#include <stdio.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> + +#define PKS_TEST_FILE "/sys/kernel/debug/x86/run_pks" + +int main(void) +{ + cpu_set_t cpuset; + char result[32]; + pid_t pid; + int fd; + + CPU_ZERO(&cpuset); + CPU_SET(0, &cpuset); + /* Two processes run on CPU 0 so that they go through context switch. */ + sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset); + + pid = fork(); + if (pid == 0) { + fd = open(PKS_TEST_FILE, O_RDWR); + if (fd < 0) { + printf("cannot open %s\n", PKS_TEST_FILE); + return -1; + } + + /* Allocate test_pkey1 and run test. */ + write(fd, "0", 1); + + /* Arm for context switch test */ + write(fd, "1", 1); + + /* Context switch out... */ + sleep(4); + + /* Check msr restored */ + write(fd, "2", 1); + } else { + sleep(2); + + fd = open(PKS_TEST_FILE, O_RDWR); + if (fd < 0) { + printf("cannot open %s\n", PKS_TEST_FILE); + return -1; + } + + /* run test with alternate pkey */ + write(fd, "0", 1); + } + + read(fd, result, 10); + printf("#PF, context switch, pkey allocation and free tests: %s\n", + result); + + close(fd); + + return 0; +}
Ira,
On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
Subject prefix wants to 'entry:'. This changes generic code and the x86 part is just required to fix the generic code change.
Currently struct irqentry_state_t only contains a single bool value which makes passing it by value is reasonable. However, future patches propose to add information to this struct, for example the PKRS register/thread state.
Adding information to irqentry_state_t makes passing by value less efficient. Therefore, change the entry/exit calls to pass irq_state by reference.
The PKRS muck needs to add an u32 to that struct. So how is that a problem?
The resulting struct still fits into 64bit which is by far more efficiently passed by value than by reference. So which problem are you solving here?
Thanks
tglx
On Sun, Nov 15, 2020 at 07:58:52PM +0100, Thomas Gleixner wrote:
Ira,
On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
Subject prefix wants to 'entry:'. This changes generic code and the x86 part is just required to fix the generic code change.
Sorry, yes that was carried incorrectly from earlier versions.
Currently struct irqentry_state_t only contains a single bool value which makes passing it by value is reasonable. However, future patches propose to add information to this struct, for example the PKRS register/thread state.
Adding information to irqentry_state_t makes passing by value less efficient. Therefore, change the entry/exit calls to pass irq_state by reference.
The PKRS muck needs to add an u32 to that struct. So how is that a problem?
There are more fields to be added for the kmap/pmem support. So this will be needed eventually. Even though it is not strictly necessary in the next patch.
The resulting struct still fits into 64bit which is by far more efficiently passed by value than by reference. So which problem are you solving here?
I'm getting ahead of myself a bit. I will be adding more fields for the kmap/pmem tracking.
Would you accept just a clean up for the variable names in this patch? I could then add the pass by reference when I add the new fields later. Or would an update to the commit message be ok to land this now?
Ira
Thanks
tglx
Ira,
On Mon, Nov 16 2020 at 10:49, Ira Weiny wrote:
On Sun, Nov 15, 2020 at 07:58:52PM +0100, Thomas Gleixner wrote:
Currently struct irqentry_state_t only contains a single bool value which makes passing it by value is reasonable. However, future patches propose to add information to this struct, for example the PKRS register/thread state.
Adding information to irqentry_state_t makes passing by value less efficient. Therefore, change the entry/exit calls to pass irq_state by reference.
The PKRS muck needs to add an u32 to that struct. So how is that a problem?
There are more fields to be added for the kmap/pmem support. So this will be needed eventually. Even though it is not strictly necessary in the next patch.
if you folks could start to write coherent changelogs with proper explanations why something is needed and why it's going beyond the immediate use case in the next patch, then getting stuff sorted would be way easier.
Sorry my crystalball got lost years ago and I'm just not able to keep up with the flurry of patch sets which have dependencies in one way or the other.
The resulting struct still fits into 64bit which is by far more efficiently passed by value than by reference. So which problem are you solving here?
I'm getting ahead of myself a bit. I will be adding more fields for the kmap/pmem tracking.
Would you accept just a clean up for the variable names in this patch? I could then add the pass by reference when I add the new fields later. Or would an update to the commit message be ok to land this now?
Can you provide a coherent explanation for the full set of things which needs to be added here first please?
Thanks,
tglx
From: Ira Weiny ira.weiny@intel.com
Currently struct irqentry_state_t only contains a single bool value which makes passing it by value is reasonable. However, future patches add information to this struct. This includes the PKRS thread state, included in this series, as well as information to store kmap reference tracking and PKS global state outside this series. In total, we anticipate 2 new 32 bit fields and an integer field to be added to the struct beyond the existing bool value.
Adding information to irqentry_state_t makes passing by value less efficient. Therefore, change the entry/exit calls to pass irq_state by reference in preparation for the changes which follow.
While at it, make the code easier to follow by changing all the usage sites to consistently use the variable name 'irq_state'.
Signed-off-by: Ira Weiny ira.weiny@intel.com
--- Changes from V3 Clarify commit message regarding the need for this patch
Changes from V1 From Thomas: Update commit message Further clean up Kernel doc and comments Missed some 'return' comments which are no longer valid
Changes from RFC V3 Clean up @irq_state comments Standardize on 'irq_state' for the state variable name Refactor based on new patch from Thomas Gleixner Also addresses Peter Zijlstra's comment --- arch/x86/entry/common.c | 8 ++++---- arch/x86/include/asm/idtentry.h | 25 ++++++++++++++---------- arch/x86/kernel/cpu/mce/core.c | 4 ++-- arch/x86/kernel/kvm.c | 6 +++--- arch/x86/kernel/nmi.c | 4 ++-- arch/x86/kernel/traps.c | 21 ++++++++++++-------- arch/x86/mm/fault.c | 6 +++--- include/linux/entry-common.h | 18 +++++++++-------- kernel/entry/common.c | 34 +++++++++++++-------------------- 9 files changed, 65 insertions(+), 61 deletions(-)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 18d8f17f755c..87dea56a15d2 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -259,9 +259,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs) { struct pt_regs *old_regs; bool inhcall; - irqentry_state_t state; + irqentry_state_t irq_state;
- state = irqentry_enter(regs); + irqentry_enter(regs, &irq_state); old_regs = set_irq_regs(regs);
instrumentation_begin(); @@ -271,13 +271,13 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs) set_irq_regs(old_regs);
inhcall = get_and_clear_inhcall(); - if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) { + if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) { instrumentation_begin(); irqentry_exit_cond_resched(); instrumentation_end(); restore_inhcall(inhcall); } else { - irqentry_exit(regs, state); + irqentry_exit(regs, &irq_state); } } #endif /* CONFIG_XEN_PV */ diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 247a60a47331..282d2413b6a1 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -49,12 +49,13 @@ static __always_inline void __##func(struct pt_regs *regs); \ \ __visible noinstr void func(struct pt_regs *regs) \ { \ - irqentry_state_t state = irqentry_enter(regs); \ + irqentry_state_t irq_state; \ \ + irqentry_enter(regs, &irq_state); \ instrumentation_begin(); \ __##func (regs); \ instrumentation_end(); \ - irqentry_exit(regs, state); \ + irqentry_exit(regs, &irq_state); \ } \ \ static __always_inline void __##func(struct pt_regs *regs) @@ -96,12 +97,13 @@ static __always_inline void __##func(struct pt_regs *regs, \ __visible noinstr void func(struct pt_regs *regs, \ unsigned long error_code) \ { \ - irqentry_state_t state = irqentry_enter(regs); \ + irqentry_state_t irq_state; \ \ + irqentry_enter(regs, &irq_state); \ instrumentation_begin(); \ __##func (regs, error_code); \ instrumentation_end(); \ - irqentry_exit(regs, state); \ + irqentry_exit(regs, &irq_state); \ } \ \ static __always_inline void __##func(struct pt_regs *regs, \ @@ -192,15 +194,16 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector); \ __visible noinstr void func(struct pt_regs *regs, \ unsigned long error_code) \ { \ - irqentry_state_t state = irqentry_enter(regs); \ + irqentry_state_t irq_state; \ \ + irqentry_enter(regs, &irq_state); \ instrumentation_begin(); \ irq_enter_rcu(); \ kvm_set_cpu_l1tf_flush_l1d(); \ __##func (regs, (u8)error_code); \ irq_exit_rcu(); \ instrumentation_end(); \ - irqentry_exit(regs, state); \ + irqentry_exit(regs, &irq_state); \ } \ \ static __always_inline void __##func(struct pt_regs *regs, u8 vector) @@ -234,15 +237,16 @@ static void __##func(struct pt_regs *regs); \ \ __visible noinstr void func(struct pt_regs *regs) \ { \ - irqentry_state_t state = irqentry_enter(regs); \ + irqentry_state_t irq_state; \ \ + irqentry_enter(regs, &irq_state); \ instrumentation_begin(); \ irq_enter_rcu(); \ kvm_set_cpu_l1tf_flush_l1d(); \ run_sysvec_on_irqstack_cond(__##func, regs); \ irq_exit_rcu(); \ instrumentation_end(); \ - irqentry_exit(regs, state); \ + irqentry_exit(regs, &irq_state); \ } \ \ static noinline void __##func(struct pt_regs *regs) @@ -263,15 +267,16 @@ static __always_inline void __##func(struct pt_regs *regs); \ \ __visible noinstr void func(struct pt_regs *regs) \ { \ - irqentry_state_t state = irqentry_enter(regs); \ + irqentry_state_t irq_state; \ \ + irqentry_enter(regs, &irq_state); \ instrumentation_begin(); \ __irq_enter_raw(); \ kvm_set_cpu_l1tf_flush_l1d(); \ __##func (regs); \ __irq_exit_raw(); \ instrumentation_end(); \ - irqentry_exit(regs, state); \ + irqentry_exit(regs, &irq_state); \ } \ \ static __always_inline void __##func(struct pt_regs *regs) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index f5c860b1a50b..6ed2fa2ea321 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1995,7 +1995,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs) mce_check_crashing_cpu()) return;
- irq_state = irqentry_nmi_enter(regs); + irqentry_nmi_enter(regs, &irq_state); /* * The call targets are marked noinstr, but objtool can't figure * that out because it's an indirect call. Annotate it. @@ -2006,7 +2006,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs) if (regs->flags & X86_EFLAGS_IF) trace_hardirqs_on_prepare(); instrumentation_end(); - irqentry_nmi_exit(regs, irq_state); + irqentry_nmi_exit(regs, &irq_state); }
static __always_inline void exc_machine_check_user(struct pt_regs *regs) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 7f57ede3cb8e..ed7427c6e74d 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -238,12 +238,12 @@ EXPORT_SYMBOL_GPL(kvm_read_and_reset_apf_flags); noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token) { u32 flags = kvm_read_and_reset_apf_flags(); - irqentry_state_t state; + irqentry_state_t irq_state;
if (!flags) return false;
- state = irqentry_enter(regs); + irqentry_enter(regs, &irq_state); instrumentation_begin();
/* @@ -264,7 +264,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token) }
instrumentation_end(); - irqentry_exit(regs, state); + irqentry_exit(regs, &irq_state); return true; }
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index bf250a339655..1fd7780e99dd 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
this_cpu_write(nmi_dr7, local_db_save());
- irq_state = irqentry_nmi_enter(regs); + irqentry_nmi_enter(regs, &irq_state);
inc_irq_stat(__nmi_count);
if (!ignore_nmis) default_do_nmi(regs);
- irqentry_nmi_exit(regs, irq_state); + irqentry_nmi_exit(regs, &irq_state);
local_db_restore(this_cpu_read(nmi_dr7));
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index e1b78829d909..8481cc373794 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -245,7 +245,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
DEFINE_IDTENTRY_RAW(exc_invalid_op) { - irqentry_state_t state; + irqentry_state_t irq_state;
/* * We use UD2 as a short encoding for 'CALL __WARN', as such @@ -255,11 +255,11 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op) if (!user_mode(regs) && handle_bug(regs)) return;
- state = irqentry_enter(regs); + irqentry_enter(regs, &irq_state); instrumentation_begin(); handle_invalid_op(regs); instrumentation_end(); - irqentry_exit(regs, state); + irqentry_exit(regs, &irq_state); }
DEFINE_IDTENTRY(exc_coproc_segment_overrun) @@ -343,6 +343,7 @@ __visible void __noreturn handle_stack_overflow(const char *message, */ DEFINE_IDTENTRY_DF(exc_double_fault) { + irqentry_state_t irq_state; static const char str[] = "double fault"; struct task_struct *tsk = current;
@@ -405,7 +406,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault) } #endif
- irqentry_nmi_enter(regs); + irqentry_nmi_enter(regs, &irq_state); instrumentation_begin(); notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
@@ -651,13 +652,15 @@ DEFINE_IDTENTRY_RAW(exc_int3) instrumentation_end(); irqentry_exit_to_user_mode(regs); } else { - irqentry_state_t irq_state = irqentry_nmi_enter(regs); + irqentry_state_t irq_state; + + irqentry_nmi_enter(regs, &irq_state);
instrumentation_begin(); if (!do_int3(regs)) die("int3", regs, 0); instrumentation_end(); - irqentry_nmi_exit(regs, irq_state); + irqentry_nmi_exit(regs, &irq_state); } }
@@ -852,7 +855,9 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs, * includes the entry stack is excluded for everything. */ unsigned long dr7 = local_db_save(); - irqentry_state_t irq_state = irqentry_nmi_enter(regs); + irqentry_state_t irq_state; + + irqentry_nmi_enter(regs, &irq_state); instrumentation_begin();
/* @@ -909,7 +914,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs, regs->flags &= ~X86_EFLAGS_TF; out: instrumentation_end(); - irqentry_nmi_exit(regs, irq_state); + irqentry_nmi_exit(regs, &irq_state);
local_db_restore(dr7); } diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 82bf37a5c9ec..8d20c4c13abf 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1441,7 +1441,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code, DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) { unsigned long address = read_cr2(); - irqentry_state_t state; + irqentry_state_t irq_state;
prefetchw(¤t->mm->mmap_lock);
@@ -1479,11 +1479,11 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) * code reenabled RCU to avoid subsequent wreckage which helps * debugability. */ - state = irqentry_enter(regs); + irqentry_enter(regs, &irq_state);
instrumentation_begin(); handle_page_fault(regs, error_code, address); instrumentation_end();
- irqentry_exit(regs, state); + irqentry_exit(regs, &irq_state); } diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 49b26b216e4e..dd473137e728 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -358,6 +358,8 @@ typedef struct irqentry_state { /** * irqentry_enter - Handle state tracking on ordinary interrupt entries * @regs: Pointer to pt_regs of interrupted context + * @irq_state: Pointer to an opaque object to store state information; to be + * passed back to irqentry_exit() * * Invokes: * - lockdep irqflag state tracking as low level ASM entry disabled @@ -383,10 +385,8 @@ typedef struct irqentry_state { * For user mode entries irqentry_enter_from_user_mode() is invoked to * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit * would not be possible. - * - * Returns: An opaque object that must be passed to idtentry_exit() */ -irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs); +void noinstr irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state);
/** * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt @@ -398,7 +398,7 @@ void irqentry_exit_cond_resched(void); /** * irqentry_exit - Handle return from exception that used irqentry_enter() * @regs: Pointer to pt_regs (exception entry regs) - * @state: Return value from matching call to irqentry_enter() + * @irq_state: Pointer to state information passed to irqentry_enter() * * Depending on the return target (kernel/user) this runs the necessary * preemption and work checks if possible and required and returns to @@ -409,25 +409,27 @@ void irqentry_exit_cond_resched(void); * * Counterpart to irqentry_enter(). */ -void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state); +void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state);
/** * irqentry_nmi_enter - Handle NMI entry * @regs: Pointer to currents pt_regs + * @irq_state: Pointer to an opaque object to store state information; to be + * passed back to irqentry_nmi_exit() * * Similar to irqentry_enter() but taking care of the NMI constraints. */ -irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs); +void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state);
/** * irqentry_nmi_exit - Handle return from NMI handling * @regs: Pointer to pt_regs (NMI entry regs) - * @irq_state: Return value from matching call to irqentry_nmi_enter() + * @irq_state: Pointer to state information passed to irqentry_nmi_enter() * * Last action before returning to the low level assembly code. * * Counterpart to irqentry_nmi_enter(). */ -void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state); +void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state);
#endif diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 91e8fd50adf4..c5d586d5cf5b 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -290,15 +290,13 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) exit_to_user_mode(); }
-noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) +noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state) { - irqentry_state_t ret = { - .exit_rcu = false, - }; + irq_state->exit_rcu = false;
if (user_mode(regs)) { irqentry_enter_from_user_mode(regs); - return ret; + return; }
/* @@ -336,8 +334,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) trace_hardirqs_off_finish(); instrumentation_end();
- ret.exit_rcu = true; - return ret; + irq_state->exit_rcu = true; + return; }
/* @@ -351,8 +349,6 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) rcu_irq_enter_check_tick(); trace_hardirqs_off_finish(); instrumentation_end(); - - return ret; }
void irqentry_exit_cond_resched(void) @@ -367,7 +363,7 @@ void irqentry_exit_cond_resched(void) } }
-noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) +noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state) { lockdep_assert_irqs_disabled();
@@ -380,7 +376,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) * carefully and needs the same ordering of lockdep/tracing * and RCU as the return to user mode path. */ - if (state.exit_rcu) { + if (irq_state->exit_rcu) { instrumentation_begin(); /* Tell the tracer that IRET will enable interrupts */ trace_hardirqs_on_prepare(); @@ -402,16 +398,14 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) * IRQ flags state is correct already. Just tell RCU if it * was not watching on entry. */ - if (state.exit_rcu) + if (irq_state->exit_rcu) rcu_irq_exit(); } }
-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs) +void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state) { - irqentry_state_t irq_state; - - irq_state.lockdep = lockdep_hardirqs_enabled(); + irq_state->lockdep = lockdep_hardirqs_enabled();
__nmi_enter(); lockdep_hardirqs_off(CALLER_ADDR0); @@ -422,15 +416,13 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs) trace_hardirqs_off_finish(); ftrace_nmi_enter(); instrumentation_end(); - - return irq_state; }
-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state) +void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state) { instrumentation_begin(); ftrace_nmi_exit(); - if (irq_state.lockdep) { + if (irq_state->lockdep) { trace_hardirqs_on_prepare(); lockdep_hardirqs_on_prepare(CALLER_ADDR0); } @@ -438,7 +430,7 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
rcu_nmi_exit(); lockdep_hardirq_exit(); - if (irq_state.lockdep) + if (irq_state->lockdep) lockdep_hardirqs_on(CALLER_ADDR0); __nmi_exit(); }
Thomas,
Is there any chance of this landing before the kmap stuff gets sorted out?
It would be nice to have this in 5.11 to build off of.
Ira
On Fri, Nov 06, 2020 at 03:28:58PM -0800, 'Ira Weiny' wrote:
From: Ira Weiny ira.weiny@intel.com
Changes from V2 [4] Rebased on tip-tree/core/entry From Thomas Gleixner Address bisectability Drop Patch: x86/entry: Move nmi entry/exit into common code From Greg KH Remove WARN_ON's From Dan Williams Add __must_check to pks_key_alloc() New patch: x86/pks: Add PKS defines and config options Split from Enable patch to build on through the series Fix compile errors
Changes from V1 Rebase to TIP master; resolve conflicts and test Clean up some kernel docs updates missed in V1 Add irqentry_state_t kernel doc for PKRS field Removed redundant irq_state->pkrs This is only needed when we add the global state and somehow ended up in this patch series. That will come back when we add the global functionality in. From Thomas Gleixner Update commit messages Add kernel doc for struct irqentry_state_t From Dave Hansen add flags to pks_key_alloc()
Changes from RFC V3[3] Rebase to TIP master Update test error output Standardize on 'irq_state' for state variables From Dave Hansen Update commit messages Add/clean up comments Add X86_FEATURE_PKS to disabled-features.h and remove some explicit CONFIG checks Move saved_pkrs member of thread_struct Remove superfluous preempt_disable() s/irq_save_pks/irq_save_set_pks/ Ensure PKRS is not seen in faults if not configured or not supported s/pks_mknoaccess/pks_mk_noaccess/ s/pks_mkread/pks_mk_readonly/ s/pks_mkrdwr/pks_mk_readwrite/ Change pks_key_alloc return to -EOPNOTSUPP when not supported From Peter Zijlstra Clean up Attribution Remove superfluous preempt_disable() Add union to differentiate exit_rcu/lockdep use in irqentry_state_t From Thomas Gleixner Add preliminary clean up patch and adjust series as needed
Introduce a new page protection mechanism for supervisor pages, Protection Key Supervisor (PKS).
2 use cases for PKS are being developed, trusted keys and PMEM. Trusted keys is a newer use case which is still being explored. PMEM was submitted as part of the RFC (v2) series[1]. However, since then it was found that some callers of kmap() require a global implementation of PKS. Specifically some users of kmap() expect mappings to be available to all kernel threads. While global use of PKS is rare it needs to be included for correctness. Unfortunately the kmap() updates required a large patch series to make the needed changes at the various kmap() call sites so that patch set has been split out. Because the global PKS feature is only required for that use case it will be deferred to that set as well.[2] This patch set is being submitted as a precursor to both of the use cases.
For an overview of the entire PKS ecosystem, a git tree including this series and 2 proposed use cases can be found here:
https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/ https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.weiny@intel.com/
PKS enables protections on 'domains' of supervisor pages to limit supervisor mode access to those pages beyond the normal paging protections. PKS works in a similar fashion to user space pkeys, PKU. As with PKU, supervisor pkeys are checked in addition to normal paging protections and Access or Writes can be disabled via a MSR update without TLB flushes when permissions change. Also like PKU, a page mapping is assigned to a domain by setting pkey bits in the page table entry for that mapping.
Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.
XSAVE is not supported for the PKRS MSR. Therefore the implementation saves/restores the MSR across context switches and during exceptions. Nested exceptions are supported by each exception getting a new PKS state.
For consistent behavior with current paging protections, pkey 0 is reserved and configured to allow full access via the pkey mechanism, thus preserving the default paging protections on mappings with the default pkey value of 0.
Other keys, (1-15) are allocated by an allocator which prepares us for key contention from day one. Kernel users should be prepared for the allocator to fail either because of key exhaustion or due to PKS not being supported on the arch and/or CPU instance.
The following are key attributes of PKS.
- Fast switching of permissions
1a) Prevents access without page table manipulations 1b) No TLB flushes required 2) Works on a per thread basis
PKS is available with 4 and 5 level paging. Like PKRU it consumes 4 bits from the PTE to store the pkey within the entry.
[1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/ [2] https://lore.kernel.org/lkml/20201009195033.3208459-2-ira.weiny@intel.com/ [3] https://lore.kernel.org/lkml/20201009194258.3207172-1-ira.weiny@intel.com/ [4] https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
Fenghua Yu (2): x86/pks: Add PKS kernel API x86/pks: Enable Protection Keys Supervisor (PKS)
Ira Weiny (8): x86/pkeys: Create pkeys_common.h x86/fpu: Refactor arch_set_user_pkey_access() for PKS support x86/pks: Add PKS defines and Kconfig options x86/pks: Preserve the PKRS MSR on context switch x86/entry: Pass irqentry_state_t by reference x86/entry: Preserve PKRS MSR across exceptions x86/fault: Report the PKRS state on fault x86/pks: Add PKS test code
Documentation/core-api/protection-keys.rst | 103 ++- arch/x86/Kconfig | 1 + arch/x86/entry/common.c | 46 +- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/disabled-features.h | 8 +- arch/x86/include/asm/idtentry.h | 25 +- arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/asm/pgtable.h | 13 +- arch/x86/include/asm/pgtable_types.h | 12 + arch/x86/include/asm/pkeys.h | 15 + arch/x86/include/asm/pkeys_common.h | 40 ++ arch/x86/include/asm/processor.h | 18 +- arch/x86/include/uapi/asm/processor-flags.h | 2 + arch/x86/kernel/cpu/common.c | 15 + arch/x86/kernel/cpu/mce/core.c | 4 +- arch/x86/kernel/fpu/xstate.c | 22 +- arch/x86/kernel/kvm.c | 6 +- arch/x86/kernel/nmi.c | 4 +- arch/x86/kernel/process.c | 26 + arch/x86/kernel/traps.c | 21 +- arch/x86/mm/fault.c | 87 ++- arch/x86/mm/pkeys.c | 196 +++++- include/linux/entry-common.h | 31 +- include/linux/pgtable.h | 4 + include/linux/pkeys.h | 24 + kernel/entry/common.c | 44 +- lib/Kconfig.debug | 12 + lib/Makefile | 3 + lib/pks/Makefile | 3 + lib/pks/pks_test.c | 692 ++++++++++++++++++++ mm/Kconfig | 2 + tools/testing/selftests/x86/Makefile | 3 +- tools/testing/selftests/x86/test_pks.c | 66 ++ 33 files changed, 1410 insertions(+), 140 deletions(-) create mode 100644 arch/x86/include/asm/pkeys_common.h create mode 100644 lib/pks/Makefile create mode 100644 lib/pks/pks_test.c create mode 100644 tools/testing/selftests/x86/test_pks.c
-- 2.28.0.rc0.12.gb6a658bd00c9
Ira,
On Mon, Dec 07 2020 at 14:14, Ira Weiny wrote:
Is there any chance of this landing before the kmap stuff gets sorted out?
I have marked this as needs an update because the change log of 5/10 sucks. https://lore.kernel.org/r/87lff1xcmv.fsf@nanos.tec.linutronix.de
It would be nice to have this in 5.11 to build off of.
It would be nice if people follow up on review request :)
Thanks,
tglx
On Tue, Dec 08, 2020 at 04:55:54PM +0100, Thomas Gleixner wrote:
Ira,
On Mon, Dec 07 2020 at 14:14, Ira Weiny wrote:
Is there any chance of this landing before the kmap stuff gets sorted out?
I have marked this as needs an update because the change log of 5/10 sucks. https://lore.kernel.org/r/87lff1xcmv.fsf@nanos.tec.linutronix.de
It would be nice to have this in 5.11 to build off of.
It would be nice if people follow up on review request :)
I did, but just as an update to that patch.[1] Sorry if this caused you to miss the response. It would have been better for me to ping you on that patch. :-/
I was trying to avoid a whole new series just for that single commit message. Is that generally ok?
Is that commit message still lacking?
Ira
[1] https://lore.kernel.org/linux-doc/20201124060956.1405768-1-ira.weiny@intel.c...
On Mon, Nov 23, 2020 at 10:10 PM ira.weiny@intel.com wrote:
From: Ira Weiny ira.weiny@intel.com
Currently struct irqentry_state_t only contains a single bool value which makes passing it by value is reasonable. However, future patches add information to this struct. This includes the PKRS thread state, included in this series, as well as information to store kmap reference tracking and PKS global state outside this series. In total, we anticipate 2 new 32 bit fields and an integer field to be added to the struct beyond the existing bool value.
Adding information to irqentry_state_t makes passing by value less efficient. Therefore, change the entry/exit calls to pass irq_state by reference in preparation for the changes which follow.
While at it, make the code easier to follow by changing all the usage sites to consistently use the variable name 'irq_state'.
After contemplating this for a bit, I think this isn't really the right approach. It *works*, but we've mostly just created a bit of an unfortunate situation. Our stack, on a (possibly nested) entry looks like:
previous frame (or empty if we came from usermode) --- SS RSP FLAGS CS RIP rest of pt_regs
C frame
irqentry_state_t (maybe -- the compiler is within its rights to play almost arbitrary games here)
more C stuff
So what we've accomplished is having two distinct arch register regions, one called pt_regs and the other stuck in irqentry_state_t. This is annoying because it means that, if we want to access this thing without passing a pointer around or access it at all from outer frames, we need to do something terrible with the unwinder, and we don't want to go there.
So I propose a somewhat different solution: lay out the stack like this.
SS RSP FLAGS CS RIP rest of pt_regs PKS ^^^^^^^^ extended_pt_regs points here
C frame more C stuff ...
IOW we have:
struct extended_pt_regs { bool rcu_whatever; other generic fields here; struct arch_extended_pt_regs arch_regs; struct pt_regs regs; };
and arch_extended_pt_regs has unsigned long pks;
and instead of passing a pointer to irqentry_state_t to the generic entry/exit code, we just pass a pt_regs pointer. And we have a little accessor like:
struct extended_pt_regs *extended_regs(struct pt_regs *) { return container_of(...); }
And we tell eBPF that extended_pt_regs is NOT ABI, and we will change it whenever we feel like just to keep you on your toes, thank you very much.
Does this seem reasonable? You get to drop patch 7 and instead modify the show_regs() stuff to just display one extra register.
On Fri, Dec 11, 2020 at 02:14:28PM -0800, Andy Lutomirski wrote:
On Mon, Nov 23, 2020 at 10:10 PM ira.weiny@intel.com wrote:
From: Ira Weiny ira.weiny@intel.com
Currently struct irqentry_state_t only contains a single bool value which makes passing it by value is reasonable. However, future patches add information to this struct. This includes the PKRS thread state, included in this series, as well as information to store kmap reference tracking and PKS global state outside this series. In total, we anticipate 2 new 32 bit fields and an integer field to be added to the struct beyond the existing bool value.
Adding information to irqentry_state_t makes passing by value less efficient. Therefore, change the entry/exit calls to pass irq_state by reference in preparation for the changes which follow.
While at it, make the code easier to follow by changing all the usage sites to consistently use the variable name 'irq_state'.
After contemplating this for a bit, I think this isn't really the right approach. It *works*, but we've mostly just created a bit of an unfortunate situation.
First off please forgive my ignorance on how this code works.
Our stack, on a (possibly nested) entry looks like:
previous frame (or empty if we came from usermode)
SS RSP FLAGS CS RIP rest of pt_regs
C frame
irqentry_state_t (maybe -- the compiler is within its rights to play almost arbitrary games here)
more C stuff
So what we've accomplished is having two distinct arch register regions, one called pt_regs and the other stuck in irqentry_state_t. This is annoying because it means that, if we want to access this thing without passing a pointer around or access it at all from outer frames, we need to do something terrible with the unwinder, and we don't want to go there.
So I propose a somewhat different solution: lay out the stack like this.
SS RSP FLAGS CS RIP rest of pt_regs PKS ^^^^^^^^ extended_pt_regs points here
C frame more C stuff ...
IOW we have:
struct extended_pt_regs { bool rcu_whatever; other generic fields here; struct arch_extended_pt_regs arch_regs; struct pt_regs regs; };
and arch_extended_pt_regs has unsigned long pks;
and instead of passing a pointer to irqentry_state_t to the generic entry/exit code, we just pass a pt_regs pointer. And we have a little accessor like:
struct extended_pt_regs *extended_regs(struct pt_regs *) { return container_of(...); }
And we tell eBPF that extended_pt_regs is NOT ABI, and we will change it whenever we feel like just to keep you on your toes, thank you very much.
Does this seem reasonable?
Conceptually yes. But I'm failing to see how this implementation can be made generic for the generic fields. The pks fields, assuming they stay x86 specific, would be reasonable to add in PUSH_AND_CLEAR_REGS. But the rcu/lockdep field is generic. Wouldn't we have to modify every architecture to add space for the rcu/lockdep bool?
If not, where is a generic place that could be done? Basically I'm missing how the effective stack structure can look like this:
struct extended_pt_regs { bool rcu_whatever; other generic fields here; struct arch_extended_pt_regs arch_regs; struct pt_regs regs; };
It seems more reasonable to make it look like:
#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS struct extended_pt_regs { unsigned long pkrs; struct pt_regs regs; }; #endif
And leave the rcu/lockdep bool passed by value as before (still in C).
Is that what you mean? Or am I missing something with the way pt_regs is set up? Which is entirely possible because I'm pretty ignorant about how this code works... :-/
Ira
On Tue, Dec 15, 2020 at 5:32 PM Ira Weiny ira.weiny@intel.com wrote:
On Fri, Dec 11, 2020 at 02:14:28PM -0800, Andy Lutomirski wrote:
On Mon, Nov 23, 2020 at 10:10 PM ira.weiny@intel.com wrote:
IOW we have:
struct extended_pt_regs { bool rcu_whatever; other generic fields here; struct arch_extended_pt_regs arch_regs; struct pt_regs regs; };
and arch_extended_pt_regs has unsigned long pks;
and instead of passing a pointer to irqentry_state_t to the generic entry/exit code, we just pass a pt_regs pointer. And we have a little accessor like:
struct extended_pt_regs *extended_regs(struct pt_regs *) { return container_of(...); }
And we tell eBPF that extended_pt_regs is NOT ABI, and we will change it whenever we feel like just to keep you on your toes, thank you very much.
Does this seem reasonable?
Conceptually yes. But I'm failing to see how this implementation can be made generic for the generic fields. The pks fields, assuming they stay x86 specific, would be reasonable to add in PUSH_AND_CLEAR_REGS. But the rcu/lockdep field is generic. Wouldn't we have to modify every architecture to add space for the rcu/lockdep bool?
If not, where is a generic place that could be done? Basically I'm missing how the effective stack structure can look like this:
struct extended_pt_regs { bool rcu_whatever; other generic fields here; struct arch_extended_pt_regs arch_regs; struct pt_regs regs; };
It seems more reasonable to make it look like:
#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS struct extended_pt_regs { unsigned long pkrs; struct pt_regs regs; }; #endif
And leave the rcu/lockdep bool passed by value as before (still in C).
We could certainly do this, but we could also allocate some generic space. PUSH_AND_CLEAR_REGS would get an extra instruction like:
subq %rsp, $GENERIC_PTREGS_SIZE
or however this should be written. That field would be defined in asm-offsets.c. And yes, all the generic-entry architectures would need to get onboard.
If we wanted to be fancy, we could split the generic area into initialize-to-zero and uninitialized for debugging purposes, but that might be more complication than is worthwhile.
On Tue, Dec 15, 2020 at 06:09:02PM -0800, Andy Lutomirski wrote:
On Tue, Dec 15, 2020 at 5:32 PM Ira Weiny ira.weiny@intel.com wrote:
On Fri, Dec 11, 2020 at 02:14:28PM -0800, Andy Lutomirski wrote:
On Mon, Nov 23, 2020 at 10:10 PM ira.weiny@intel.com wrote:
IOW we have:
struct extended_pt_regs { bool rcu_whatever; other generic fields here; struct arch_extended_pt_regs arch_regs; struct pt_regs regs; };
and arch_extended_pt_regs has unsigned long pks;
and instead of passing a pointer to irqentry_state_t to the generic entry/exit code, we just pass a pt_regs pointer. And we have a little accessor like:
struct extended_pt_regs *extended_regs(struct pt_regs *) { return container_of(...); }
And we tell eBPF that extended_pt_regs is NOT ABI, and we will change it whenever we feel like just to keep you on your toes, thank you very much.
Does this seem reasonable?
Conceptually yes. But I'm failing to see how this implementation can be made generic for the generic fields. The pks fields, assuming they stay x86 specific, would be reasonable to add in PUSH_AND_CLEAR_REGS. But the rcu/lockdep field is generic. Wouldn't we have to modify every architecture to add space for the rcu/lockdep bool?
If not, where is a generic place that could be done? Basically I'm missing how the effective stack structure can look like this:
struct extended_pt_regs { bool rcu_whatever; other generic fields here; struct arch_extended_pt_regs arch_regs; struct pt_regs regs; };
It seems more reasonable to make it look like:
#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS struct extended_pt_regs { unsigned long pkrs; struct pt_regs regs; }; #endif
And leave the rcu/lockdep bool passed by value as before (still in C).
We could certainly do this,
I'm going to start with this basic support. Because I have 0 experience in most of these architectures.
but we could also allocate some generic space. PUSH_AND_CLEAR_REGS would get an extra instruction like:
subq %rsp, $GENERIC_PTREGS_SIZE
or however this should be written. That field would be defined in asm-offsets.c. And yes, all the generic-entry architectures would need to get onboard.
What do you mean by 'generic-entry' architectures? I thought they all used the generic entry code?
Regardless I would need to start another thread on this topic with any of those architecture maintainers to see what the work load would be for this. I don't think I can do it on my own.
FWIW I think it is a bit unfair to hold up the PKS support in x86 for making these generic fields part of the stack frame. So perhaps that could be made a follow on to the PKS series?
If we wanted to be fancy, we could split the generic area into initialize-to-zero and uninitialized for debugging purposes, but that might be more complication than is worthwhile.
Ok, agreed, but this is step 3 or 4 at the earliest.
Ira
On Fri, Dec 11 2020 at 14:14, Andy Lutomirski wrote:
On Mon, Nov 23, 2020 at 10:10 PM ira.weiny@intel.com wrote: After contemplating this for a bit, I think this isn't really the right approach. It *works*, but we've mostly just created a bit of an unfortunate situation. Our stack, on a (possibly nested) entry looks like:
previous frame (or empty if we came from usermode)
SS RSP FLAGS CS RIP rest of pt_regs
C frame
irqentry_state_t (maybe -- the compiler is within its rights to play almost arbitrary games here)
more C stuff
So what we've accomplished is having two distinct arch register regions, one called pt_regs and the other stuck in irqentry_state_t. This is annoying because it means that, if we want to access this thing without passing a pointer around or access it at all from outer frames, we need to do something terrible with the unwinder, and we don't want to go there.
So I propose a somewhat different solution: lay out the stack like this.
SS RSP FLAGS CS RIP rest of pt_regs PKS ^^^^^^^^ extended_pt_regs points here
C frame more C stuff ...
IOW we have:
struct extended_pt_regs { bool rcu_whatever; other generic fields here; struct arch_extended_pt_regs arch_regs; struct pt_regs regs; };
and arch_extended_pt_regs has unsigned long pks;
and instead of passing a pointer to irqentry_state_t to the generic entry/exit code, we just pass a pt_regs pointer.
While I agree vs. PKS which is architecture specific state and needed in other places e.g. #PF, I'm not convinced that sticking the existing state into the same area buys us anything more than an indirect access.
Peter?
Thanks,
tglx
On Thu, Dec 17, 2020 at 02:07:01PM +0100, Thomas Gleixner wrote:
On Fri, Dec 11 2020 at 14:14, Andy Lutomirski wrote:
On Mon, Nov 23, 2020 at 10:10 PM ira.weiny@intel.com wrote: After contemplating this for a bit, I think this isn't really the right approach. It *works*, but we've mostly just created a bit of an unfortunate situation. Our stack, on a (possibly nested) entry looks like:
previous frame (or empty if we came from usermode)
SS RSP FLAGS CS RIP rest of pt_regs
C frame
irqentry_state_t (maybe -- the compiler is within its rights to play almost arbitrary games here)
more C stuff
So what we've accomplished is having two distinct arch register regions, one called pt_regs and the other stuck in irqentry_state_t. This is annoying because it means that, if we want to access this thing without passing a pointer around or access it at all from outer frames, we need to do something terrible with the unwinder, and we don't want to go there.
So I propose a somewhat different solution: lay out the stack like this.
SS RSP FLAGS CS RIP rest of pt_regs PKS ^^^^^^^^ extended_pt_regs points here
C frame more C stuff ...
IOW we have:
struct extended_pt_regs { bool rcu_whatever; other generic fields here; struct arch_extended_pt_regs arch_regs; struct pt_regs regs; };
and arch_extended_pt_regs has unsigned long pks;
and instead of passing a pointer to irqentry_state_t to the generic entry/exit code, we just pass a pt_regs pointer.
While I agree vs. PKS which is architecture specific state and needed in other places e.g. #PF, I'm not convinced that sticking the existing state into the same area buys us anything more than an indirect access.
Peter?
Agreed; that immediately solves the confusion Ira had as well. While extending pt_regs sounds scary, I think we've isolated our pt_regs implementation from actual ABI pretty well, but of course, that would need an audit. We don't want to leak this into signals for example.
On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
--- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -43,6 +43,7 @@ #include <asm/io_bitmap.h> #include <asm/proto.h> #include <asm/frame.h> +#include <asm/pkeys_common.h> #include "process.h" @@ -187,6 +188,27 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg, return ret; } +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +DECLARE_PER_CPU(u32, pkrs_cache); +static inline void pks_init_task(struct task_struct *tsk)
First of all. I asked several times now not to glue stuff onto a function without a newline inbetween. It's unreadable.
But what's worse is that the declaration of pkrs_cache which is global is in a C file and not in a header. And pkrs_cache is not even used in this file. So what?
+{
- /* New tasks get the most restrictive PKRS value */
- tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
+} +static inline void pks_sched_in(void)
Newline between functions. It's fine for stubs, but not for a real implementation.
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index d1dfe743e79f..76a62419c446 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -231,3 +231,34 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags) return pk_reg; }
+DEFINE_PER_CPU(u32, pkrs_cache);
Again, why is this global?
+void write_pkrs(u32 new_pkrs) +{
- u32 *pkrs;
- if (!static_cpu_has(X86_FEATURE_PKS))
return;
- pkrs = get_cpu_ptr(&pkrs_cache);
So this is called from various places including schedule and also from the low level entry/exit code. Why do we need to have an extra preempt_disable/enable() there via get/put_cpu_ptr()?
Just because performance in those code paths does not matter?
- if (*pkrs != new_pkrs) {
*pkrs = new_pkrs;
wrmsrl(MSR_IA32_PKRS, new_pkrs);
- }
- put_cpu_ptr(pkrs);
Now back to the context switch:
@@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
if ((tifp ^ tifn) & _TIF_SLD) switch_to_sld(tifn);
- pks_sched_in();
}
How is this supposed to work?
switch_to() { .... switch_to_extra() { .... if (unlikely(next_tif & _TIF_WORK_CTXSW_NEXT || prev_tif & _TIF_WORK_CTXSW_PREV)) __switch_to_xtra(prev, next);
I.e. __switch_to_xtra() is only invoked when the above condition is true, which is not guaranteed at all.
While I have to admit that I dropped the ball on the update for the entry patch, I'm not too sorry about it anymore when looking at this.
Are you still sure that this is ready for merging?
Thanks,
tglx
On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +/*
- PKRS is a per-logical-processor MSR which overlays additional protection for
- pages which have been mapped with a protection key.
- The register is not maintained with XSAVE so we have to maintain the MSR
- value in software during context switch and exception handling.
- Context switches save the MSR in the task struct thus taking that value to
- other processors if necessary.
- To protect against exceptions having access to this memory we save the
- current running value and set the PKRS value for the duration of the
- exception. Thus preventing exception handlers from having the elevated
- access of the interrupted task.
- */
+noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val) +{
- if (!cpu_feature_enabled(X86_FEATURE_PKS))
return;
- irq_state->thread_pkrs = current->thread.saved_pkrs;
- write_pkrs(INIT_PKRS_VALUE);
Why is this noinstr? Just because it's called from a noinstr function?
Of course the function itself violates the noinstr constraints:
vmlinux.o: warning: objtool: write_pkrs()+0x36: call to do_trace_write_msr() leaves .noinstr.text section
There is absolutely no reason to have this marked noinstr.
Thanks,
tglx
On Dec 17, 2020, at 5:19 AM, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Dec 17, 2020 at 02:07:01PM +0100, Thomas Gleixner wrote:
On Fri, Dec 11 2020 at 14:14, Andy Lutomirski wrote:
On Mon, Nov 23, 2020 at 10:10 PM ira.weiny@intel.com wrote:
After contemplating this for a bit, I think this isn't really the right approach. It *works*, but we've mostly just created a bit of an unfortunate situation. Our stack, on a (possibly nested) entry looks like:
previous frame (or empty if we came from usermode)
SS RSP FLAGS CS RIP rest of pt_regs
C frame
irqentry_state_t (maybe -- the compiler is within its rights to play almost arbitrary games here)
more C stuff
So what we've accomplished is having two distinct arch register regions, one called pt_regs and the other stuck in irqentry_state_t. This is annoying because it means that, if we want to access this thing without passing a pointer around or access it at all from outer frames, we need to do something terrible with the unwinder, and we don't want to go there.
So I propose a somewhat different solution: lay out the stack like this.
SS RSP FLAGS CS RIP rest of pt_regs PKS ^^^^^^^^ extended_pt_regs points here
C frame more C stuff ...
IOW we have:
struct extended_pt_regs { bool rcu_whatever; other generic fields here; struct arch_extended_pt_regs arch_regs; struct pt_regs regs; };
and arch_extended_pt_regs has unsigned long pks;
and instead of passing a pointer to irqentry_state_t to the generic entry/exit code, we just pass a pt_regs pointer.
While I agree vs. PKS which is architecture specific state and needed in other places e.g. #PF, I'm not convinced that sticking the existing state into the same area buys us anything more than an indirect access.
Peter?
Agreed; that immediately solves the confusion Ira had as well. While extending pt_regs sounds scary, I think we've isolated our pt_regs implementation from actual ABI pretty well, but of course, that would need an audit. We don't want to leak this into signals for example.
I’m okay with this.
My suggestion for having an extended pt_regs that contains pt_regs is to keep extensions like this invisible to unsuspecting parts of the kernel. In particular, BPF seems to pass around struct pt_regs *, and I don’t know what the implications of effectively offsetting all the registers relative to the pointer would be.
Anything that actually broke the signal regs ABI should be noticed by the x86 selftests — the tests read and write registers through ucontext.
On Mon, Nov 23 2020 at 22:09, ira weiny wrote:
From: Ira Weiny ira.weiny@intel.com
Currently struct irqentry_state_t only contains a single bool value which makes passing it by value is reasonable. However, future patches add information to this struct. This includes the PKRS thread state, included in this series, as well as information to store kmap reference tracking and PKS global state outside this series. In total, we anticipate 2 new 32 bit fields and an integer field to be added to the struct beyond the existing bool value.
Well yes, but why can't you provide at least in the comment section below the '---' a pointer to the latest version of this reference muck and PKS global state if you can't explain at least the concept of the two things here?
It's one thing that you anticipate something but a different thing whether it's the right thing to do.
Thanks,
tglx
On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
void disable_TSC(void) @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p) if ((tifp ^ tifn) & _TIF_SLD) switch_to_sld(tifn);
- pks_sched_in();
}
Does the selftest for this ever actually schedule()?
I see it talking about context switching, but I don't immediately see how it would.
On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
/* Arm for context switch test */
write(fd, "1", 1);
/* Context switch out... */
sleep(4);
/* Check msr restored */
write(fd, "2", 1);
These are always tricky. What you ideally want here is:
1. Switch away from this task to a non-PKS task, or 2. Switch from this task to a PKS-using task, but one which has a different PKS value
then, switch back to this task and make sure PKS maintained its value.
*But*, there's no absolute guarantee that another task will run. It would not be totally unreasonable to have the kernel just sit in a loop without context switching here if no other tasks can run.
The only way you *know* there is a context switch is by having two tasks bound to the same logical CPU and make sure they run one after another. This just gets itself into a state where it *CAN* context switch and prays that one will happen.
You can also run a bunch of these in parallel bound to a single CPU. That would also give you higher levels of assurance that *some* context switch happens at sleep().
One critical thing with these tests is to sabotage the kernel and then run them and make *sure* they fail. Basically, if you screw up, do they actually work to catch it?
On Thu, Dec 17 2020 at 15:50, Thomas Gleixner wrote:
On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
+void write_pkrs(u32 new_pkrs) +{
- u32 *pkrs;
- if (!static_cpu_has(X86_FEATURE_PKS))
return;
- pkrs = get_cpu_ptr(&pkrs_cache);
So this is called from various places including schedule and also from the low level entry/exit code. Why do we need to have an extra preempt_disable/enable() there via get/put_cpu_ptr()?
Just because performance in those code paths does not matter?
- if (*pkrs != new_pkrs) {
*pkrs = new_pkrs;
wrmsrl(MSR_IA32_PKRS, new_pkrs);
- }
- put_cpu_ptr(pkrs);
Which made me look at the other branch of your git repo just because I wanted to know about the 'other' storage requirements and I found this gem:
update_global_pkrs() ... /* * If we are preventing access from the old value. Force the * update on all running threads. */ if (((old_val == 0) && protection) || ((old_val & PKR_WD_BIT) && (protection & PKEY_DISABLE_ACCESS))) { int cpu;
for_each_online_cpu(cpu) { u32 *ptr = per_cpu_ptr(&pkrs_cache, cpu); *ptr = update_pkey_val(*ptr, pkey, protection); wrmsrl_on_cpu(cpu, MSR_IA32_PKRS, *ptr); put_cpu_ptr(ptr);
1) per_cpu_ptr() -> put_cpu_ptr() is broken as per_cpu_ptr() is not disabling preemption while put_cpu_ptr() enables it which wreckages the preemption count.
How was that ever tested at all with any debug option enabled?
Answer: Not at all
2) How is that sequence:
ptr = per_cpu_ptr(&pkrs_cache, cpu); *ptr = update_pkey_val(*ptr, pkey, protection); wrmsrl_on_cpu(cpu, MSR_IA32_PKRS, *ptr);
supposed to be correct vs. a concurrent modification of the pkrs_cache of the remote CPU?
Answer: Not at all
Also doing a wrmsrl_on_cpu() on _each_ online CPU is insane at best.
A smp function call on a remote CPU takes ~3-5us when the remote CPU is not idle and can immediately respond. If the remote CPU is deep in idle it can take up to 100us depending on C-State it is in.
Even if the remote CPU is not not idle and just has interrupts disabled for a few dozen of microseconds this adds up.
So on a 256 CPU system depending on the state of the remote CPUs this stalls the CPU doing the update for anything between 1 and 25ms worst case.
Of course that also violates _all_ CPU isolation mechanisms.
What for?
Just for the theoretical chance that _all_ remote CPUs have seen that global permission and have it still active?
You're not serious about that, right?
The only use case for this in your tree is: kmap() and the possible usage of that mapping outside of the thread context which sets it up.
The only hint for doing this at all is:
Some users, such as kmap(), sometimes requires PKS to be global.
'sometime requires' is really _not_ a technical explanation.
Where is the explanation why kmap() usage 'sometimes' requires this global trainwreck in the first place and where is the analysis why this can't be solved differently?
Detailed use case analysis please.
Thanks,
tglx
On Thu, Dec 17, 2020 at 12:55:39PM -0800, Dave Hansen wrote:
On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
/* Arm for context switch test */
write(fd, "1", 1);
/* Context switch out... */
sleep(4);
/* Check msr restored */
write(fd, "2", 1);
These are always tricky. What you ideally want here is:
- Switch away from this task to a non-PKS task, or
- Switch from this task to a PKS-using task, but one which has a different PKS value
Or both...
then, switch back to this task and make sure PKS maintained its value.
*But*, there's no absolute guarantee that another task will run. It would not be totally unreasonable to have the kernel just sit in a loop without context switching here if no other tasks can run.
The only way you *know* there is a context switch is by having two tasks bound to the same logical CPU and make sure they run one after another.
Ah... We do that.
... + CPU_ZERO(&cpuset); + CPU_SET(0, &cpuset); + /* Two processes run on CPU 0 so that they go through context switch. */ + sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset); ...
I think this should be ensuring that both the parent and the child are running on CPU 0. At least according to the man page they should be.
<man> A child created via fork(2) inherits its parent's CPU affinity mask. </man>
Perhaps a better method would be to synchronize the 2 threads more to ensure that we are really running at the 'same time' and forcing the context switch.
This just gets itself into a state where it *CAN* context switch and prays that one will happen.
Not sure what you mean by 'This'? Do you mean that running on the same CPU will sometimes not force a context switch? Or do you mean that the sleeps could be badly timed and the 2 threads could run 1 after the other on the same CPU? The latter is AFAICT the most likely case.
You can also run a bunch of these in parallel bound to a single CPU. That would also give you higher levels of assurance that *some* context switch happens at sleep().
I think more cycles is a good idea for sure. But I'm more comfortable with forcing the test to be more synchronized so that it is actually running in the order we think/want it to be.
One critical thing with these tests is to sabotage the kernel and then run them and make *sure* they fail. Basically, if you screw up, do they actually work to catch it?
I'll try and come up with a more stressful test.
Ira
On Thu, Dec 17, 2020 at 03:50:55PM +0100, Thomas Gleixner wrote:
On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
--- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -43,6 +43,7 @@ #include <asm/io_bitmap.h> #include <asm/proto.h> #include <asm/frame.h> +#include <asm/pkeys_common.h> #include "process.h" @@ -187,6 +188,27 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg, return ret; } +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +DECLARE_PER_CPU(u32, pkrs_cache); +static inline void pks_init_task(struct task_struct *tsk)
First of all. I asked several times now not to glue stuff onto a function without a newline inbetween. It's unreadable.
Fixed.
But what's worse is that the declaration of pkrs_cache which is global is in a C file and not in a header. And pkrs_cache is not even used in this file. So what?
OK, this was just a complete rebase/refactor mess up on my part. The global'ness is not required until we need a global update of the pkrs which was not part of this series.
I've removed it from this patch. And cleaned it up in patch 6/10 as well. And cleaned it up in the global pkrs patch which you found in my git tree.
+{
- /* New tasks get the most restrictive PKRS value */
- tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
+} +static inline void pks_sched_in(void)
Newline between functions. It's fine for stubs, but not for a real implementation.
Again my apologies.
Fixed.
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index d1dfe743e79f..76a62419c446 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -231,3 +231,34 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags) return pk_reg; }
+DEFINE_PER_CPU(u32, pkrs_cache);
Again, why is this global?
In this patch it does not need to be. I've changed it to static.
+void write_pkrs(u32 new_pkrs) +{
- u32 *pkrs;
- if (!static_cpu_has(X86_FEATURE_PKS))
return;
- pkrs = get_cpu_ptr(&pkrs_cache);
So this is called from various places including schedule and also from the low level entry/exit code. Why do we need to have an extra preempt_disable/enable() there via get/put_cpu_ptr()?
Just because performance in those code paths does not matter?
Honestly I don't recall the full history at this point. The preempt_disable/enable() is required when this is called from pks_update_protection() AKA when a user is trying to update the protections of their key. What I do remember is that this was originally not preempt safe and we had a comment to that effect in the early patches.[1]
Somewhere along the line the preempt discussion lead us to make write_pkrs() 'self contained' with the preemption protection here. I just did not think about any performance issues. It is safe to call preempt_disable() from a preempt disabled region, correct? I seem to recall asking that and the answer was 'yes'.
I will audit the calls again and adjust the preemption disable as needed.
[1] https://lore.kernel.org/lkml/20200717072056.73134-5-ira.weiny@intel.com/#t
- if (*pkrs != new_pkrs) {
*pkrs = new_pkrs;
wrmsrl(MSR_IA32_PKRS, new_pkrs);
- }
- put_cpu_ptr(pkrs);
Now back to the context switch:
@@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
if ((tifp ^ tifn) & _TIF_SLD) switch_to_sld(tifn);
- pks_sched_in();
}
How is this supposed to work?
switch_to() { .... switch_to_extra() { .... if (unlikely(next_tif & _TIF_WORK_CTXSW_NEXT || prev_tif & _TIF_WORK_CTXSW_PREV)) __switch_to_xtra(prev, next);
I.e. __switch_to_xtra() is only invoked when the above condition is true, which is not guaranteed at all.
I did not know that. I completely missunderstood what __switch_to_xtra() meant. I thought it was arch specific 'extra' stuff so it seemed reasonable to me.
Also, our test seemed to work. I'm still investigating what may be wrong.
While I have to admit that I dropped the ball on the update for the entry patch, I'm not too sorry about it anymore when looking at this.
Are you still sure that this is ready for merging?
Nope...
Thanks for the review, Ira
Thanks,
tglx
On Thu, Dec 17, 2020 at 12:41:50PM -0800, Dave Hansen wrote:
On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
void disable_TSC(void) @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p) if ((tifp ^ tifn) & _TIF_SLD) switch_to_sld(tifn);
- pks_sched_in();
}
Does the selftest for this ever actually schedule()?
At this point I'm not sure. This code has been in since the beginning. So its seen a lot of soak time.
I see it talking about context switching, but I don't immediately see how it would.
We were trying to force parent and child to run on the same CPU. I suspect something is wrong in the timing of that test.
Ira
On Thu, Dec 17 2020 at 23:43, Thomas Gleixner wrote:
The only use case for this in your tree is: kmap() and the possible usage of that mapping outside of the thread context which sets it up.
The only hint for doing this at all is:
Some users, such as kmap(), sometimes requires PKS to be global.
'sometime requires' is really _not_ a technical explanation.
Where is the explanation why kmap() usage 'sometimes' requires this global trainwreck in the first place and where is the analysis why this can't be solved differently?
Detailed use case analysis please.
A lengthy conversation with Dan and Dave over IRC confirmed what I was suspecting.
The approach of this whole PKS thing is to make _all_ existing code magically "work". That means aside of the obvious thread local mappings, the kmap() part is needed to solve the problem of async handling where the mapping is handed to some other context which then uses it and notifies the context which created the mapping when done. That's the principle which was used to make highmem work long time ago.
IMO that was a mistake back then. The right thing would have been to change the code so that it does not rely on a temporary mapping created by the initiator. Instead let the initiator hand the page over to the other context which then creates a temporary mapping for fiddling with it. Water under the bridge...
Glueing PKS on to that kmap() thing is horrible and global PKS is pretty much the opposite of what PKS wants to achieve. It's disabling protection systemwide for an unspecified amount of time and for all contexts.
So instead of trying to make global PKS "work" we really should go and take a smarter approach.
1) Many kmap() use cases are strictly thread local and the mapped address is never handed to some other context, which means this can be replaced with kmap_local() now, which preserves the mapping accross preemption. PKS just works nicely on top of that.
2) Modify kmap() so that it marks the to be mapped page as 'globaly unprotected' instead of doing this global unprotect PKS dance. kunmap() undoes that. That obviously needs some thought vs. refcounting if there are concurrent users, but that's a solvable problem either as part of struct page itself or stored in some global hash.
3) Have PKS modes:
- STRICT: No pardon
- RELAXED: Warn and unprotect temporary for the current context
- SILENT: Like RELAXED, but w/o warning to make sysadmins happy. Default should be RELAXED.
- OFF: Disable the whole PKS thing
4) Have a smart #PF mechanism which does:
if (error_code & X86_PF_PK) { page = virt_to_page(address);
if (!page || !page_is_globaly_unprotected(page)) goto die;
if (pks_mode == PKS_MODE_STRICT) goto die;
WARN_ONCE(pks_mode == PKS_MODE_RELAXED, "Useful info ...");
temporary_unprotect(page, regs); return; }
temporary_unprotect(page, regs) { key = page_to_key(page);
/* Return from #PF will establish this for the faulting context */ extended_state(regs)->pks &= ~PKS_MASK(key); }
This temporary unprotect is undone when the context is left, so depending on the context (thread, interrupt, softirq) the unprotected section might be way wider than actually needed, but that's still orders of magnitudes better than having this fully unrestricted global PKS mode which is completely scopeless.
The above is at least restricted to the pages which are in use for a particular operation. Stray pointers during that time are obviously not caught, but that's not any different from that proposed global thingy.
The warning allows to find the non-obvious places so they can be analyzed and worked on.
5) The DAX case which you made "work" with dev_access_enable() and dev_access_disable(), i.e. with yet another lazy approach of avoiding to change a handful of usage sites.
The use cases are strictly context local which means the global magic is not used at all. Why does it exist in the first place?
Aside of that this global thing would never work at all because the refcounting is per thread and not global.
So that DAX use case is just a matter of:
grant/revoke_access(DEV_PKS_KEY, READ/WRITE)
which is effective for the current execution context and really wants to be a distinct READ/WRITE protection and not the magic global thing which just has on/off. All usage sites know whether they want to read or write.
That leaves the question about the refcount. AFAICT, nothing nests in that use case for a given execution context. I'm surely missing something subtle here.
Hmm?
Thanks,
tglx
On 12/17/20 8:10 PM, Ira Weiny wrote:
On Thu, Dec 17, 2020 at 12:41:50PM -0800, Dave Hansen wrote:
On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
void disable_TSC(void) @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p) if ((tifp ^ tifn) & _TIF_SLD) switch_to_sld(tifn);
- pks_sched_in();
}
Does the selftest for this ever actually schedule()?
At this point I'm not sure. This code has been in since the beginning. So its seen a lot of soak time.
Think about it another way. Let's say this didn't get called on the first context switch away from the PKS-using task. Would anyone notice? How likely is this to happen?
The function tracers or kprobes tend to be a great tool for this, at least for testing whether the code path you expect to hit is getting hit.
On Thu, Dec 17, 2020 at 8:05 PM Ira Weiny ira.weiny@intel.com wrote:
On Thu, Dec 17, 2020 at 12:55:39PM -0800, Dave Hansen wrote:
On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
/* Arm for context switch test */
write(fd, "1", 1);
/* Context switch out... */
sleep(4);
/* Check msr restored */
write(fd, "2", 1);
These are always tricky. What you ideally want here is:
- Switch away from this task to a non-PKS task, or
- Switch from this task to a PKS-using task, but one which has a different PKS value
Or both...
then, switch back to this task and make sure PKS maintained its value.
*But*, there's no absolute guarantee that another task will run. It would not be totally unreasonable to have the kernel just sit in a loop without context switching here if no other tasks can run.
The only way you *know* there is a context switch is by having two tasks bound to the same logical CPU and make sure they run one after another.
Ah... We do that.
...
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);
/* Two processes run on CPU 0 so that they go through context switch. */
sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
...
I think this should be ensuring that both the parent and the child are running on CPU 0. At least according to the man page they should be.
<man> A child created via fork(2) inherits its parent's CPU affinity mask. </man>
Perhaps a better method would be to synchronize the 2 threads more to ensure that we are really running at the 'same time' and forcing the context switch.
This just gets itself into a state where it *CAN* context switch and prays that one will happen.
Not sure what you mean by 'This'? Do you mean that running on the same CPU will sometimes not force a context switch? Or do you mean that the sleeps could be badly timed and the 2 threads could run 1 after the other on the same CPU? The latter is AFAICT the most likely case.
One way to guarantee that both threads run is to just pass a message between them over a pipe and wait for the submitter to receive an ack from the other end.
On Fri, Dec 18, 2020 at 5:58 AM Thomas Gleixner tglx@linutronix.de wrote: [..]
The DAX case which you made "work" with dev_access_enable() and dev_access_disable(), i.e. with yet another lazy approach of avoiding to change a handful of usage sites.
The use cases are strictly context local which means the global magic is not used at all. Why does it exist in the first place?
Aside of that this global thing would never work at all because the refcounting is per thread and not global.
So that DAX use case is just a matter of:
grant/revoke_access(DEV_PKS_KEY, READ/WRITE)
which is effective for the current execution context and really wants to be a distinct READ/WRITE protection and not the magic global thing which just has on/off. All usage sites know whether they want to read or write.
I was tracking and nodding until this point. Yes, kill the global / kmap() support, but if grant/revoke_access is not integrated behind kmap_{local,atomic}() then it's not a "handful" of sites that need to be instrumented it's 100s. Are you suggesting that "relaxed" mode enforcement is a way to distribute the work of teaching driver writers that they need to incorporate explicit grant/revoke-read/write in addition to kmap? The entire reason PTE_DEVMAP exists was to allow get_user_pages() for PMEM and not require every downstream-GUP code path to specifically consider whether it was talking to PMEM or RAM pages, and certainly not whether they were reading or writing to it.
On Fri, Dec 18, 2020 at 02:57:51PM +0100, Thomas Gleixner wrote:
On Thu, Dec 17 2020 at 23:43, Thomas Gleixner wrote:
The only use case for this in your tree is: kmap() and the possible usage of that mapping outside of the thread context which sets it up.
The only hint for doing this at all is:
Some users, such as kmap(), sometimes requires PKS to be global.
'sometime requires' is really _not_ a technical explanation.
Where is the explanation why kmap() usage 'sometimes' requires this global trainwreck in the first place and where is the analysis why this can't be solved differently?
Detailed use case analysis please.
A lengthy conversation with Dan and Dave over IRC confirmed what I was suspecting.
The approach of this whole PKS thing is to make _all_ existing code magically "work". That means aside of the obvious thread local mappings, the kmap() part is needed to solve the problem of async handling where the mapping is handed to some other context which then uses it and notifies the context which created the mapping when done. That's the principle which was used to make highmem work long time ago.
IMO that was a mistake back then. The right thing would have been to change the code so that it does not rely on a temporary mapping created by the initiator. Instead let the initiator hand the page over to the other context which then creates a temporary mapping for fiddling with it. Water under the bridge...
But maybe not. We are getting rid of a lot of the kmaps and once the bulk are gone perhaps we can change this and remove kmap completely?
Glueing PKS on to that kmap() thing is horrible and global PKS is pretty much the opposite of what PKS wants to achieve. It's disabling protection systemwide for an unspecified amount of time and for all contexts.
I agree. This is why I have been working on converting kmap() call sites to kmap_local_page().[1]
So instead of trying to make global PKS "work" we really should go and take a smarter approach.
- Many kmap() use cases are strictly thread local and the mapped address is never handed to some other context, which means this can be replaced with kmap_local() now, which preserves the mapping accross preemption. PKS just works nicely on top of that.
Yes hence the massive kmap->kmap_thread patch set which is now becoming kmap_local_page().[2]
- Modify kmap() so that it marks the to be mapped page as 'globaly unprotected' instead of doing this global unprotect PKS dance. kunmap() undoes that. That obviously needs some thought vs. refcounting if there are concurrent users, but that's a solvable problem either as part of struct page itself or stored in some global hash.
How would this globally unprotected flag work? I suppose if kmap created a new PTE we could make that PTE non-PKS protected then we don't have to fiddle with the register... I think I like that idea.
Have PKS modes:
- STRICT: No pardon
- RELAXED: Warn and unprotect temporary for the current context - SILENT: Like RELAXED, but w/o warning to make sysadmins happy. Default should be RELAXED. - OFF: Disable the whole PKS thing
I'm not really sure how this solves the global problem but it is probably worth having in general.
Have a smart #PF mechanism which does:
if (error_code & X86_PF_PK) { page = virt_to_page(address);
if (!page || !page_is_globaly_unprotected(page)) goto die; if (pks_mode == PKS_MODE_STRICT) goto die; WARN_ONCE(pks_mode == PKS_MODE_RELAXED, "Useful info ..."); temporary_unprotect(page, regs); return;
}
I feel like this is very similar to what I had in the global patch you found in my git tree with the exception of the RELAXED mode. I simply had globally unprotected or die.
global_pkey_is_enabled() handles the page_is_globaly_unprotected() and temporary_unprotect().[3]
Anyway, I'm sorry (but not sorry) that you found it. I've been trying to get 0-day and other testing on it and my public tree was the easiest way to do that. Anyway...
The patch as a whole needs work. You are 100% correct that if a mapping is handed to another context it is going to suck performance wise. It has had some internal review but not much.
Regardless I think unprotecting a global context is the easy part. The code you had a problem with (and I see is fully broken) was the restriction of access. A failure to update in that direction would only result in a wider window of access. I contemplated not doing a global update at all and just leave the access open until the next context switch. But the code as it stands tries to force an update for a couple of reasons:
1) kmap_local_page() removes most of the need for global pks. So I was thinking that global PKS could be a slow path.
2) kmap()'s that are handed to other contexts they are likely to be 'long term' and should not need to be updated 'too' often. I will admit that I don't know how often 'too often' is.
But IMO these questions are best left to after the kmaps are converted. Thus this patch set was just basic support. Other uses cases beyond pmem such as trusted keys or secret mem don't need a global pks feature and could build on the patch set submitted. I was trying to break the problem down.
temporary_unprotect(page, regs) { key = page_to_key(page);
/* Return from #PF will establish this for the faulting context */ extended_state(regs)->pks &= ~PKS_MASK(key); }
This temporary unprotect is undone when the context is left, so depending on the context (thread, interrupt, softirq) the unprotected section might be way wider than actually needed, but that's still orders of magnitudes better than having this fully unrestricted global PKS mode which is completely scopeless.
I'm not sure I follow you. How would we know when the context is left?
The above is at least restricted to the pages which are in use for a particular operation. Stray pointers during that time are obviously not caught, but that's not any different from that proposed global thingy. The warning allows to find the non-obvious places so they can be analyzed and worked on.
I could add the warning for sure.
The DAX case which you made "work" with dev_access_enable() and dev_access_disable(), i.e. with yet another lazy approach of avoiding to change a handful of usage sites.
The use cases are strictly context local which means the global magic is not used at all. Why does it exist in the first place?
I'm not following. What is 'it'?
Aside of that this global thing would never work at all because the refcounting is per thread and not global. So that DAX use case is just a matter of: grant/revoke_access(DEV_PKS_KEY, READ/WRITE) which is effective for the current execution context and really wants to be a distinct READ/WRITE protection and not the magic global thing which just has on/off. All usage sites know whether they want to read or write.
That leaves the question about the refcount. AFAICT, nothing nests in that use case for a given execution context. I'm surely missing something subtle here.
The refcount is needed for non-global pks as well as global. I've not resolved if anything needs to be done with the refcount on the global update since the following is legal.
kmap() kmap_local_page() kunmap() kunmap_local()
Which would be a problem. But I don't think it is ever actually done.
Another problem would be if the kmap and kunmap happened in different contexts... :-/ I don't think that is done either but I don't know for certain.
Frankly, my main focus before any of this global support has been to get rid of as many kmaps as possible.[1] Once that is done I think more of these questions can be answered better.
Ira
[1] https://lore.kernel.org/lkml/20201210171834.2472353-1-ira.weiny@intel.com/ [2] https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/ [3] Latest untested patch pushed for reference here because I can't find exactly the branch you found. https://github.com/weiny2/linux-kernel/commit/37439e91e141be58c13ccc4462f778...
Hmm?
Thanks,
tglx
On 12/18/20 11:42 AM, Ira Weiny wrote:
Another problem would be if the kmap and kunmap happened in different contexts... :-/ I don't think that is done either but I don't know for certain.
It would be really nice to put together some surveillance patches to help become more certain about these things. Even a per-task counter would be better than nothing.
On kmap: current->kmaps++; On kunmap: current->kmaps--; WARN_ON(current->kmaps < 0); On exit: WARN_ON(current->kmaps);
That would at least find imbalances. You could take it even further by having a little array, say:
struct one_kmap { struct page *page; depot_stack_handle_t handle; };
Then:
struct task_struct { ... + struct one_kmap kmaps[10]; };
On kmap() you make a new entry in current->kmaps[], and on kunmap() you try to find the corresponding entry. If you can't find one, in the current task you can even go search all the other tasks and see who might be responsible. If something goes and does more than 10 simultaneous kmap()s in one thread, dump a warning and give up. Or, dynamically allocate the kmaps[] array.
Then you can dump out the stack of the kmap() culprit if it exits after a kmap() but without a corresponding kfree().
Something like that should be low overhead enough to get it into things like the 0day debug kernel. It should be way cheaper than something like lockdep.
On Fri, Dec 18 2020 at 11:20, Dan Williams wrote:
On Fri, Dec 18, 2020 at 5:58 AM Thomas Gleixner tglx@linutronix.de wrote: [..]
The DAX case which you made "work" with dev_access_enable() and dev_access_disable(), i.e. with yet another lazy approach of avoiding to change a handful of usage sites.
The use cases are strictly context local which means the global magic is not used at all. Why does it exist in the first place?
Aside of that this global thing would never work at all because the refcounting is per thread and not global.
So that DAX use case is just a matter of:
grant/revoke_access(DEV_PKS_KEY, READ/WRITE)
which is effective for the current execution context and really wants to be a distinct READ/WRITE protection and not the magic global thing which just has on/off. All usage sites know whether they want to read or write.
I was tracking and nodding until this point. Yes, kill the global / kmap() support, but if grant/revoke_access is not integrated behind kmap_{local,atomic}() then it's not a "handful" of sites that need to be instrumented it's 100s. Are you suggesting that "relaxed" mode enforcement is a way to distribute the work of teaching driver writers that they need to incorporate explicit grant/revoke-read/write in addition to kmap? The entire reason PTE_DEVMAP exists was to allow get_user_pages() for PMEM and not require every downstream-GUP code path to specifically consider whether it was talking to PMEM or RAM pages, and certainly not whether they were reading or writing to it.
kmap_local() is fine. That can work automatically because it's strict local to the context which does the mapping.
kmap() is dubious because it's a 'global' mapping as dictated per HIGHMEM. So doing the RELAXED mode for kmap() is sensible I think to identify cases where the mapped address is really handed to a different execution context. We want to see those cases and analyse whether this can't be solved in a different way. That's why I suggested to do a warning in that case.
Also vs. the DAX use case I really meant the code in fs/dax and drivers/dax/ itself which is handling this via dax_read_[un]lock.
Does that make more sense?
Thanks,
tglx
On Fri, Dec 18 2020 at 11:42, Ira Weiny wrote:
On Fri, Dec 18, 2020 at 02:57:51PM +0100, Thomas Gleixner wrote:
- Modify kmap() so that it marks the to be mapped page as 'globaly unprotected' instead of doing this global unprotect PKS dance. kunmap() undoes that. That obviously needs some thought vs. refcounting if there are concurrent users, but that's a solvable problem either as part of struct page itself or stored in some global hash.
How would this globally unprotected flag work? I suppose if kmap created a new PTE we could make that PTE non-PKS protected then we don't have to fiddle with the register... I think I like that idea.
No. Look at the highmem implementation of kmap(). It's a terrible idea, really. Don't even think about that.
There is _no_ global flag. The point is that the kmap is strictly bound to a particular struct page. So you can simply do:
kmap(page) if (page_is_access_protected(page)) atomic_inc(&page->unprotect);
kunmap(page) if (page_is_access_protected(page)) atomic_dec(&page->unprotect);
and in the #PF handler:
if (!page->unprotect) goto die;
The reason why I said: either in struct page itself or in a global hash is that struct page is already packed and people are not really happy about increasing it's size. But the principle is roughly the same.
Have a smart #PF mechanism which does:
if (error_code & X86_PF_PK) { page = virt_to_page(address);
if (!page || !page_is_globaly_unprotected(page)) goto die; if (pks_mode == PKS_MODE_STRICT) goto die; WARN_ONCE(pks_mode == PKS_MODE_RELAXED, "Useful info ..."); temporary_unprotect(page, regs); return;
}
I feel like this is very similar to what I had in the global patch you found in my git tree with the exception of the RELAXED mode. I simply had globally unprotected or die.
Your stuff depends on that global_pks_state which is not maintainable especially not the teardown side. This depends on per page state which is clearly way simpler and more focussed.
Regardless I think unprotecting a global context is the easy part. The code you had a problem with (and I see is fully broken) was the restriction of access. A failure to update in that direction would only result in a wider window of access. I contemplated not doing a global update at all and just leave the access open until the next context switch. But the code as it stands tries to force an update for a couple of reasons:
kmap_local_page() removes most of the need for global pks. So I was thinking that global PKS could be a slow path.
kmap()'s that are handed to other contexts they are likely to be 'long term' and should not need to be updated 'too' often. I will admit that I don't know how often 'too often' is.
Even once in while is not a justification for stopping the world for N milliseconds.
temporary_unprotect(page, regs) { key = page_to_key(page);
/* Return from #PF will establish this for the faulting context */ extended_state(regs)->pks &= ~PKS_MASK(key); }
This temporary unprotect is undone when the context is left, so depending on the context (thread, interrupt, softirq) the unprotected section might be way wider than actually needed, but that's still orders of magnitudes better than having this fully unrestricted global PKS mode which is completely scopeless.
I'm not sure I follow you. How would we know when the context is left?
The context goes away on it's own. Either context switch or return from interrupt. As I said there is an extended window where the external context still might have unprotected access even if the initiating context has called kunmap() already. It's not pretty, but it's not the end of the world either.
That's why I suggested to have that WARN_ONCE() so we can actually see why and where that happens and think about solutions to make this go into local context, e.g. by changing the vaddr pointer to a struct page pointer for these particular use cases and then the other context can do kmap/unmap_local().
The DAX case which you made "work" with dev_access_enable() and dev_access_disable(), i.e. with yet another lazy approach of avoiding to change a handful of usage sites.
The use cases are strictly context local which means the global magic is not used at all. Why does it exist in the first place?
I'm not following. What is 'it'?
That global argument to dev_access_enable()/disable().
That leaves the question about the refcount. AFAICT, nothing nests in that use case for a given execution context. I'm surely missing something subtle here.
The refcount is needed for non-global pks as well as global. I've not resolved if anything needs to be done with the refcount on the global update since the following is legal.
kmap() kmap_local_page() kunmap() kunmap_local()
Which would be a problem. But I don't think it is ever actually done.
If it does not exist why would we support it in the first place? We can have some warning there to catch that case.
Another problem would be if the kmap and kunmap happened in different contexts... :-/ I don't think that is done either but I don't know for certain.
Frankly, my main focus before any of this global support has been to get rid of as many kmaps as possible.[1] Once that is done I think more of these questions can be answered better.
I was expecting that you could answer these questions :)
Thanks,
tglx
On Fri, Dec 18, 2020 at 1:06 PM Thomas Gleixner tglx@linutronix.de wrote:
On Fri, Dec 18 2020 at 11:20, Dan Williams wrote:
On Fri, Dec 18, 2020 at 5:58 AM Thomas Gleixner tglx@linutronix.de wrote: [..]
The DAX case which you made "work" with dev_access_enable() and dev_access_disable(), i.e. with yet another lazy approach of avoiding to change a handful of usage sites.
The use cases are strictly context local which means the global magic is not used at all. Why does it exist in the first place?
Aside of that this global thing would never work at all because the refcounting is per thread and not global.
So that DAX use case is just a matter of:
grant/revoke_access(DEV_PKS_KEY, READ/WRITE)
which is effective for the current execution context and really wants to be a distinct READ/WRITE protection and not the magic global thing which just has on/off. All usage sites know whether they want to read or write.
I was tracking and nodding until this point. Yes, kill the global / kmap() support, but if grant/revoke_access is not integrated behind kmap_{local,atomic}() then it's not a "handful" of sites that need to be instrumented it's 100s. Are you suggesting that "relaxed" mode enforcement is a way to distribute the work of teaching driver writers that they need to incorporate explicit grant/revoke-read/write in addition to kmap? The entire reason PTE_DEVMAP exists was to allow get_user_pages() for PMEM and not require every downstream-GUP code path to specifically consider whether it was talking to PMEM or RAM pages, and certainly not whether they were reading or writing to it.
kmap_local() is fine. That can work automatically because it's strict local to the context which does the mapping.
kmap() is dubious because it's a 'global' mapping as dictated per HIGHMEM. So doing the RELAXED mode for kmap() is sensible I think to identify cases where the mapped address is really handed to a different execution context. We want to see those cases and analyse whether this can't be solved in a different way. That's why I suggested to do a warning in that case.
Also vs. the DAX use case I really meant the code in fs/dax and drivers/dax/ itself which is handling this via dax_read_[un]lock.
Does that make more sense?
Yup, got it. The dax code can be precise wrt to PKS in a way that kmap_local() cannot.
On Fri, Dec 18 2020 at 13:58, Dan Williams wrote:
On Fri, Dec 18, 2020 at 1:06 PM Thomas Gleixner tglx@linutronix.de wrote:
kmap_local() is fine. That can work automatically because it's strict local to the context which does the mapping.
kmap() is dubious because it's a 'global' mapping as dictated per HIGHMEM. So doing the RELAXED mode for kmap() is sensible I think to identify cases where the mapped address is really handed to a different execution context. We want to see those cases and analyse whether this can't be solved in a different way. That's why I suggested to do a warning in that case.
Also vs. the DAX use case I really meant the code in fs/dax and drivers/dax/ itself which is handling this via dax_read_[un]lock.
Does that make more sense?
Yup, got it. The dax code can be precise wrt to PKS in a way that kmap_local() cannot.
Which makes me wonder whether we should have kmap_local_for_read() or something like that, which could be obviously only be RO enforced for the real HIGHMEM case or the (for now x86 only) enforced kmap_local() debug mechanics on 64bit.
So for the !highmem case it would not magically make the existing kernel mapping RO, but this could be forwarded to the PKS protection. Aside of that it's a nice annotation in the code.
That could be used right away for all the kmap[_atomic] -> kmap_local conversions.
Thanks,
tglx --- include/linux/highmem-internal.h | 14 ++++++++++++++ 1 file changed, 14 insertions(+)
--- a/include/linux/highmem-internal.h +++ b/include/linux/highmem-internal.h @@ -32,6 +32,10 @@ static inline void kmap_flush_tlb(unsign #define kmap_prot PAGE_KERNEL #endif
+#ifndef kmap_prot_to +#define kmap_prot PAGE_KERNEL_RO +#endif + void *kmap_high(struct page *page); void kunmap_high(struct page *page); void __kmap_flush_unused(void); @@ -73,6 +77,11 @@ static inline void *kmap_local_page(stru return __kmap_local_page_prot(page, kmap_prot); }
+static inline void *kmap_local_page_for_read(struct page *page) +{ + return __kmap_local_page_prot(page, kmap_prot_ro); +} + static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot) { return __kmap_local_page_prot(page, prot); @@ -169,6 +178,11 @@ static inline void *kmap_local_page_prot { return kmap_local_page(page); } + +static inline void *kmap_local_page_for_read(struct page *page) +{ + return kmap_local_page(page); +}
static inline void *kmap_local_pfn(unsigned long pfn) {
On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
From: Fenghua Yu fenghua.yu@intel.com
PKS allows kernel users to define domains of page mappings which have additional protections beyond the paging protections.
Add an API to allocate, use, and free a protection key which identifies such a domain. Export 5 new symbols pks_key_alloc(), pks_mknoaccess(), pks_mkread(), pks_mkrdwr(), and pks_key_free(). Add 2 new macros; PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey).
Update the protection key documentation to cover pkeys on supervisor pages.
Co-developed-by: Ira Weiny ira.weiny@intel.com Signed-off-by: Ira Weiny ira.weiny@intel.com Signed-off-by: Fenghua Yu fenghua.yu@intel.com
Documentation/core-api/protection-keys.rst | 102 +++++++++++++--- arch/x86/include/asm/pgtable_types.h | 12 ++ arch/x86/include/asm/pkeys.h | 11 ++ arch/x86/include/asm/pkeys_common.h | 4 + arch/x86/mm/pkeys.c | 128 +++++++++++++++++++++ include/linux/pgtable.h | 4 + include/linux/pkeys.h | 24 ++++ 7 files changed, 267 insertions(+), 18 deletions(-)
diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst index ec575e72d0b2..c4e6c480562f 100644 --- a/Documentation/core-api/protection-keys.rst +++ b/Documentation/core-api/protection-keys.rst @@ -4,25 +4,33 @@ Memory Protection Keys ====================== -Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature -which is found on Intel's Skylake (and later) "Scalable Processor" -Server CPUs. It will be available in future non-server Intel parts -and future AMD processors.
-For anyone wishing to test or use this feature, it is available in -Amazon's EC2 C5 instances and is known to work there using an Ubuntu -17.04 image.
Memory Protection Keys provides a mechanism for enforcing page-based
provide
protections, but without requiring modification of the page tables -when an application changes protection domains. It works by -dedicating 4 previously ignored bits in each page table entry to a -"protection key", giving 16 possible keys. +when an application changes protection domains.
+PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable +Processor" Server CPUs and later. And It will be available in future
it
+non-server Intel parts and future AMD processors.
+Future Intel processors will support Protection Keys for Supervisor pages +(PKS).
+For anyone wishing to test or use user space pkeys, it is available in Amazon's +EC2 C5 instances and is known to work there using an Ubuntu 17.04 image.
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to +a "protection key", giving 16 possible keys. User and Supervisor pages are +treated separately.
+Protections for each page are controlled with per CPU registers for each type
per-CPU
+of page User and Supervisor. Each of these 32 bit register stores two separate
32-bit registers
+bits (Access Disable and Write Disable) for each key. -There is also a new user-accessible register (PKRU) with two separate -bits (Access Disable and Write Disable) for each key. Being a CPU -register, PKRU is inherently thread-local, potentially giving each -thread a different set of protections from every other thread. +For Userspace the register is user-accessible (rdpkru/wrpkru). For +Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+Being a CPU register, pkeys are inherently thread-local, potentially giving +each thread an independent set of protections from every other thread. There are two new instructions (RDPKRU/WRPKRU) for reading and writing to the new register. The feature is only available in 64-bit mode, @@ -30,8 +38,11 @@ even though there is theoretically space in the PAE PTEs. These permissions are enforced on data access only and have no effect on instruction fetches.
-Syscalls
+For kernel space rdmsr/wrmsr are used to access the kernel MSRs.
+Syscalls for user space keys +============================ There are 3 system calls which directly interact with pkeys:: @@ -98,3 +109,58 @@ with a read():: The kernel will send a SIGSEGV in both cases, but si_code will be set to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when the plain mprotect() permissions are violated.
+Kernel API for PKS support +==========================
+The following interface is used to allocate, use, and free a pkey which defines +a 'protection domain' within the kernel. Setting a pkey value in a supervisor +mapping adds that mapping to the protection domain.
int pks_key_alloc(const char * const pkey_user, int flags);
#define PAGE_KERNEL_PKEY(pkey)
#define _PAGE_KEY(pkey)
void pks_mk_noaccess(int pkey);
void pks_mk_readonly(int pkey);
void pks_mk_readwrite(int pkey);
void pks_key_free(int pkey);
+pks_key_alloc() allocates keys dynamically to allow better use of the limited +key space. 'flags' alter the allocation based on the users need. Currently
user's or maybe users'
+they can request an exclusive key.
+Callers of pks_key_alloc() _must_ be prepared for it to fail and take +appropriate action. This is due mainly to the fact that PKS may not be +available on all arch's. Failure to check the return of pks_key_alloc() and +using any of the rest of the API is undefined.
+Kernel users must set the PTE permissions in the page table entries for the +mappings they want to protect. This can be done with PAGE_KERNEL_PKEY() or +_PAGE_KEY().
+The pks_mk*() family of calls allows kernel users the ability to change the +protections for the domain identified by the pkey specified. 3 states are +available pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite() which
available:
+set the access to none, read, and read/write respectively.
+Finally, pks_key_free() allows a user to return the key to the allocator for +use by others.
+The interface maintains pks_mk_noaccess() (Access Disabled (AD=1)) for all keys +not currently allocated. Therefore, the user can depend on access being +disabled when pks_key_alloc() returns a key and the user should remove mappings +from the domain (remove the pkey from the PTE) prior to calling pks_key_free().
+It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing +but still maintains ordering properties similar to WRPKRU. Thus it is safe to +immediately use a mapping when the pks_mk*() functions returns.
return.
+The current SDM section on PKRS needs updating but should be the same as that +of WRPKRU. So to quote from the WRPKRU text:
- WRPKRU will never execute transiently. Memory accesses
- affected by PKRU register will not execute (even transiently)
- until all prior executions of WRPKRU have completed execution
- and updated the PKRU register.
linux-kselftest-mirror@lists.linaro.org