This patch set enables the Intel flexible return and event delivery
(FRED) architecture with KVM VMX to allow guests to utilize FRED.
The FRED architecture defines simple new transitions that change
privilege level (ring transitions). The FRED architecture was
designed with the following goals:
1) Improve overall performance and response time by replacing event
delivery through the interrupt descriptor table (IDT event
delivery) and event return by the IRET instruction with lower
latency transitions.
2) Improve software robustness by ensuring that event delivery
establishes the full supervisor context and that event return
establishes the full user context.
The new transitions defined by the FRED architecture are FRED event
delivery and, for returning from events, two FRED return instructions.
FRED event delivery can effect a transition from ring 3 to ring 0, but
it is used also to deliver events incident to ring 0. One FRED
instruction (ERETU) effects a return from ring 0 to ring 3, while the
other (ERETS) returns while remaining in ring 0. Collectively, FRED
event delivery and the FRED return instructions are FRED transitions.
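As an illustration only, the ring changes described above can be written as a tiny lookup. This is a sketch of the prose, not of any kernel or hardware interface; the names FredTransition and ring_after are made up for this example.

```python
from enum import Enum


class FredTransition(Enum):
    """The three FRED transitions named in the text (enum is illustrative)."""
    EVENT_DELIVERY = "FRED event delivery"
    ERETU = "ERETU"
    ERETS = "ERETS"


def ring_after(transition: FredTransition, current_ring: int) -> int:
    """Return the ring the CPU is in after the given FRED transition."""
    if transition is FredTransition.EVENT_DELIVERY:
        # Delivery always ends in ring 0, whether the event arrived
        # in ring 3 or was incident to ring 0.
        if current_ring not in (0, 3):
            raise ValueError("event delivery is described for ring 3 and ring 0")
        return 0
    # Both return instructions are executed from ring 0.
    if current_ring != 0:
        raise ValueError(f"{transition.value} returns from ring 0")
    # ERETU returns to ring 3, ERETS stays in ring 0.
    return 3 if transition is FredTransition.ERETU else 0
```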
Intel VMX architecture is extended to run FRED guests, and the major
changes are:
1) New VMCS fields for FRED context management, which includes two new
event data VMCS fields, eight new guest FRED context VMCS fields and
eight new host FRED context VMCS fields.
2) VMX nested-exception support for proper virtualization of stack
levels introduced with FRED architecture.
Search for the latest FRED spec in most search engines with this search
pattern:
site:intel.com FRED (flexible return and event delivery) specification
As the native FRED patches are committed in the tip tree "x86/fred"
branch:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=x86/fred,
and we have received a good amount of review comments for v1, it's time
to send out v2 based on this branch for further help from the community.
Patch 1-2 are cleanups to VMX basic and misc MSRs, which were sent
out earlier as a preparation for FRED changes:
https://lore.kernel.org/kvm/20240206182032.1596-1-xin3.li@intel.com/T/#u
Patch 3-15 add FRED support to VMX.
Patch 16-21 add FRED support to nested VMX.
Patch 22 exposes FRED and its baseline features to KVM guests.
Patch 23-25 add FRED selftests.
There is also a counterpart qemu patch set for FRED at:
https://lore.kernel.org/qemu-devel/20231109072012.8078-1-xin3.li@intel.com/…,
which works with this patch set to allow KVM to run FRED guests.
Changes since v1:
* Always load the secondary VM exit controls (Sean Christopherson).
* Remove FRED VM entry/exit controls consistency checks in
setup_vmcs_config() (Sean Christopherson).
* Clear FRED VM entry/exit controls if FRED is not enumerated (Chao Gao).
* Use guest_can_use() to trace FRED enumeration in a vcpu (Chao Gao).
* Enable FRED MSRs intercept if FRED is no longer enumerated in CPUID
(Chao Gao).
* Move guest FRED states init into __vmx_vcpu_reset() (Chao Gao).
* Don't use guest_cpuid_has() in vmx_prepare_switch_to_{host,guest}(),
which are called from IRQ-disabled context (Chao Gao).
* Reset msr_guest_fred_rsp0 in __vmx_vcpu_reset() (Chao Gao).
* Fail host requested FRED MSRs access if KVM cannot virtualize FRED
(Chao Gao).
* Handle the case where FRED MSRs are valid but KVM cannot virtualize FRED
(Chao Gao).
* Add sanity checks when writing to FRED MSRs.
* Explain why it is ok to only check CR4.FRED in kvm_is_fred_enabled()
(Chao Gao).
* Document that event data should be equal to CR2/DR6/IA32_XFD_ERR instead
of using WARN_ON() (Chao Gao).
* Zero event data if a #NM was not caused by extended feature disable
(Chao Gao).
* Set the nested flag when there is an original interrupt (Chao Gao).
* Dump guest FRED states only if guest has FRED enabled (Nikolay Borisov).
* Add a prerequisite to SHADOW_FIELD_R[OW] macros
* Remove hyperv TLFS related changes (Jeremi Piotrowski).
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() to decouple
KVM's capability to virtualize a feature and host's enabling of a
feature (Chao Gao).
Xin Li (25):
KVM: VMX: Cleanup VMX basic information defines and usages
KVM: VMX: Cleanup VMX misc information defines and usages
KVM: VMX: Add support for the secondary VM exit controls
KVM: x86: Mark CR4.FRED as not reserved
KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID
KVM: VMX: Set intercept for FRED MSRs
KVM: VMX: Initialize VMCS FRED fields
KVM: VMX: Switch FRED RSP0 between host and guest
KVM: VMX: Add support for FRED context save/restore
KVM: x86: Add kvm_is_fred_enabled()
KVM: VMX: Handle FRED event data
KVM: VMX: Handle VMX nested exception for FRED
KVM: VMX: Disable FRED if FRED consistency checks fail
KVM: VMX: Dump FRED context in dump_vmcs()
KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup
KVM: nVMX: Add support for the secondary VM exit controls
KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros
KVM: nVMX: Add FRED VMCS fields
KVM: nVMX: Add support for VMX FRED controls
KVM: nVMX: Add VMCS FRED states checking
KVM: x86: Allow FRED/LKGS/WRMSRNS to be exposed to guests
KVM: selftests: Run debug_regs test with FRED enabled
KVM: selftests: Add a new VM guest mode to run user level code
KVM: selftests: Add fred exception tests
Documentation/virt/kvm/x86/nested-vmx.rst | 19 +
arch/x86/include/asm/kvm_host.h | 8 +-
arch/x86/include/asm/msr-index.h | 15 +-
arch/x86/include/asm/vmx.h | 59 ++-
arch/x86/kvm/cpuid.c | 4 +-
arch/x86/kvm/governed_features.h | 1 +
arch/x86/kvm/kvm_cache_regs.h | 17 +
arch/x86/kvm/svm/svm.c | 4 +-
arch/x86/kvm/vmx/capabilities.h | 30 +-
arch/x86/kvm/vmx/nested.c | 329 ++++++++++++---
arch/x86/kvm/vmx/nested.h | 2 +-
arch/x86/kvm/vmx/vmcs.h | 1 +
arch/x86/kvm/vmx/vmcs12.c | 19 +
arch/x86/kvm/vmx/vmcs12.h | 38 ++
arch/x86/kvm/vmx/vmcs_shadow_fields.h | 80 ++--
arch/x86/kvm/vmx/vmx.c | 385 +++++++++++++++---
arch/x86/kvm/vmx/vmx.h | 15 +-
arch/x86/kvm/x86.c | 103 ++++-
arch/x86/kvm/x86.h | 5 +-
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/kvm_util_base.h | 1 +
.../selftests/kvm/include/x86_64/processor.h | 36 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 5 +-
.../selftests/kvm/lib/x86_64/processor.c | 15 +-
tools/testing/selftests/kvm/lib/x86_64/vmx.c | 4 +-
.../testing/selftests/kvm/x86_64/debug_regs.c | 50 ++-
.../testing/selftests/kvm/x86_64/fred_test.c | 297 ++++++++++++++
27 files changed, 1320 insertions(+), 223 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/fred_test.c
base-commit: e13841907b8fda0ae0ce1ec03684665f578416a8
--
2.43.0
Malicious guests can cause bus locks to degrade the performance of a
system. Non-WB (write-back) and misaligned locked RMW
(read-modify-write) instructions are referred to as "bus locks" and
require system-wide synchronization among all processors to guarantee
atomicity. Bus locks can impose notable performance penalties
on all processors within the system.
Support for the Bus Lock Threshold is indicated by CPUID
Fn8000_000A_EDX[29] BusLockThreshold=1. The VMCB provides a Bus Lock
Threshold enable bit and an unsigned 16-bit Bus Lock Threshold count.
VMCB intercept bit
    VMCB Offset    Bits    Function
    14h            5       Intercept bus lock operations
Bus lock threshold count
    VMCB Offset    Bits    Function
    120h           15:0    Bus lock counter
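To make the layout above concrete, here is a minimal sketch that tests the CPUID bit and sets the two VMCB fields in a plain byte buffer. It is not the KVM implementation: the helper names are hypothetical, and treating the word at offset 14h as a 32-bit little-endian intercept vector is an assumption made for this example.

```python
import struct

BUS_LOCK_THRESHOLD_CPUID_BIT = 29  # CPUID Fn8000_000A_EDX[29]
INTERCEPT_OFFSET = 0x14            # VMCB offset 14h
BUS_LOCK_INTERCEPT_BIT = 5         # bit 5: intercept bus lock operations
BUS_LOCK_COUNTER_OFFSET = 0x120    # VMCB offset 120h, bits 15:0


def has_bus_lock_threshold(edx: int) -> bool:
    """True if the given CPUID Fn8000_000A EDX value advertises the feature."""
    return bool((edx >> BUS_LOCK_THRESHOLD_CPUID_BIT) & 1)


def enable_bus_lock_threshold(vmcb: bytearray, count: int) -> None:
    """Set the intercept bit and the 16-bit threshold count in a VMCB image."""
    if not 0 <= count <= 0xFFFF:
        raise ValueError("the threshold count is an unsigned 16-bit value")
    # Assumption: the intercept word at 14h is 32 bits wide, little-endian.
    (vector,) = struct.unpack_from("<I", vmcb, INTERCEPT_OFFSET)
    vector |= 1 << BUS_LOCK_INTERCEPT_BIT
    struct.pack_into("<I", vmcb, INTERCEPT_OFFSET, vector)
    struct.pack_into("<H", vmcb, BUS_LOCK_COUNTER_OFFSET, count)
```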
During VMRUN, the bus lock threshold count is fetched and stored in an
internal count register. Prior to executing a bus lock within the
guest, the processor verifies the count in the bus lock register. If
the count is greater than zero, the processor executes the bus lock,
reducing the count. However, if the count is zero, the bus lock
operation is not performed, and instead, a Bus Lock Threshold #VMEXIT
is triggered to transfer control to the Virtual Machine Monitor (VMM).
A Bus Lock Threshold #VMEXIT is reported to the VMM with VMEXIT code
0xA5, VMEXIT_BUSLOCK. EXITINFO1 and EXITINFO2 are set to 0 on
a VMEXIT_BUSLOCK. On a #VMEXIT, the processor writes the current
value of the Bus Lock Threshold Counter to the VMCB.
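The counter behaviour described above can be modelled in a few lines. This is a toy software model of the prose, not hardware or KVM code; the class and method names are invented for illustration.

```python
from typing import Optional

VMEXIT_BUSLOCK = 0xA5  # exit code quoted in the text


class BusLockThresholdModel:
    """Toy model of the bus lock threshold counter (illustrative only)."""

    def __init__(self, vmcb_count: int) -> None:
        self.vmcb_count = vmcb_count & 0xFFFF  # 16-bit count held in the VMCB
        self._internal = 0                     # internal count register

    def vmrun(self) -> None:
        # On VMRUN the count is fetched from the VMCB into the internal register.
        self._internal = self.vmcb_count

    def guest_bus_lock(self) -> Optional[int]:
        """Return None if the bus lock executes, or the exit code on #VMEXIT."""
        if self._internal > 0:
            self._internal -= 1                # lock executes, count decreases
            return None
        # Count is zero: the lock is not performed and the current counter
        # value is written back to the VMCB on the exit.
        self.vmcb_count = self._internal
        return VMEXIT_BUSLOCK
```

With a count of 2, the first two bus locks run and the third one exits to the VMM, which can then refill the count before the next VMRUN.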
More details about the Bus Lock Threshold feature can be found in AMD
APM [1].
Patches are prepared on kvm-x86/svm (704ec48fc2fb)
Testing done:
- Added a selftest for the Bus Lock Threshold functionality.
- Tested the Bus Lock Threshold functionality on SEV and SEV-ES guests.
- Tested the Bus Lock Threshold functionality on nested guests.
QEMU changes can be found at:
Repo: https://github.com/AMDESE/qemu.git
Branch: buslock_threshold
QEMU command line to use the bus lock threshold functionality:
qemu-system-x86_64 -enable-kvm -cpu EPYC-Turin,+svm -M q35,bus-lock-ratelimit=10 \ ..
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024,
Vol 2, 15.14.5 Bus Lock Threshold.
https://bugzilla.kernel.org/attachment.cgi?id=306250
Manali Shukla (2):
x86/cpufeatures: Add CPUID feature bit for the Bus Lock Threshold
KVM: x86: nSVM: Implement support for nested Bus Lock Threshold
Nikunj A Dadhania (2):
KVM: SVM: Enable Bus lock threshold exit
KVM: selftests: Add bus lock exit test
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/svm.h | 5 +-
arch/x86/include/uapi/asm/svm.h | 2 +
arch/x86/kvm/governed_features.h | 1 +
arch/x86/kvm/svm/nested.c | 25 ++++
arch/x86/kvm/svm/svm.c | 48 ++++++++
arch/x86/kvm/svm/svm.h | 1 +
arch/x86/kvm/x86.h | 1 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/svm_buslock_test.c | 114 ++++++++++++++++++
10 files changed, 198 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/svm_buslock_test.c
base-commit: 704ec48fc2fbd4e41ec982662ad5bf1eee33eeb2
--
2.34.1
Changes v4:
- Printing SNC warnings at the start of every test.
- Printing SNC warnings at the end of every relevant test.
- Remove global snc_mode variable, consolidate SNC detection functions
into one.
- Correct minor mistakes.
Changes v3:
- Reworked patch 2.
- Changed minor things in patch 1 like function name and made
corrections to the patch message.
Changes v2:
- Removed patches 2 and 3 since now this part will be supported by the
kernel.
Sub-NUMA Clustering (SNC) allows splitting CPU cores, caches and memory
into multiple NUMA nodes. When enabled, NUMA-aware applications can
achieve better performance on bigger server platforms.
SNC support in the kernel was merged into x86/cache [1]. With SNC enabled
and kernel support in place all the tests will function normally (aside
from effective cache size). There might be a problem when SNC is enabled
but the system is still using an older kernel version without SNC
support. Currently the only message displayed in that situation is a
guess that SNC might be enabled and is causing issues. That message is
also displayed whenever the test fails on an Intel platform.
Add a mechanism to discover kernel support for SNC which will add more
meaning and certainty to the error message.
Add runtime SNC mode detection and verify how reliable that information
is.
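One way runtime SNC detection could work is to compare how many CPUs share an L3 cache with how many CPUs sit in one NUMA node. The sketch below shows that idea under stated assumptions: the function names are hypothetical, the heuristic is not taken from this cover letter, and the two CPU-list strings are expected to come from sysfs files such as /sys/devices/system/node/node0/cpulist and /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list.

```python
from typing import Set


def parse_cpu_list(text: str) -> Set[int]:
    """Parse a sysfs CPU list such as '0-3,8-11' into a set of CPU numbers."""
    cpus: Set[int] = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, _, last = chunk.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def snc_nodes_per_l3(node_cpulist: str, l3_shared_cpulist: str) -> int:
    """Estimate how many SNC nodes share one L3 cache (1 means SNC is off)."""
    node_cpus = len(parse_cpu_list(node_cpulist))
    cache_cpus = len(parse_cpu_list(l3_shared_cpulist))
    if node_cpus == 0 or cache_cpus == 0:
        return 1  # not enough information; assume SNC is disabled
    return max(1, round(cache_cpus / node_cpus))


def effective_l3_size(l3_size: int, snc_nodes: int) -> int:
    """With SNC enabled, each node effectively sees a share of the L3 cache."""
    return l3_size // snc_nodes
```

A ratio like this is only as reliable as the inputs: offline CPUs can skew the counts, which is one reason the detected mode needs to be verified.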
The series was tested on Ice Lake server platforms with SNC disabled, SNC-2
and SNC-4. The tests were also run with and without kernel support for
SNC.
Series applies cleanly on kselftest/next.
[1] https://lore.kernel.org/all/20240628215619.76401-1-tony.luck@intel.com/
Previous versions of this series:
[v1] https://lore.kernel.org/all/cover.1709721159.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1715769576.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1719842207.git.maciej.wieczor-retman@inte…
Maciej Wieczor-Retman (2):
selftests/resctrl: Adjust effective L3 cache size with SNC enabled
selftests/resctrl: Adjust SNC support messages
tools/testing/selftests/resctrl/cat_test.c | 8 ++
tools/testing/selftests/resctrl/cmt_test.c | 10 +-
tools/testing/selftests/resctrl/mba_test.c | 7 +
tools/testing/selftests/resctrl/mbm_test.c | 9 +-
tools/testing/selftests/resctrl/resctrl.h | 7 +
.../testing/selftests/resctrl/resctrl_tests.c | 8 +-
tools/testing/selftests/resctrl/resctrlfs.c | 130 ++++++++++++++++++
7 files changed, 174 insertions(+), 5 deletions(-)
--
2.45.2
compile_commands.json is used by clangd[1] to provide code navigation
and completion functionality to editors. See [2] for an example
configuration that includes this functionality for VSCode.
It can currently be built manually when using kunit.py, by running:
./scripts/clang-tools/gen_compile_commands.py -d .kunit
With this change however, it's built automatically so you don't need to
manually keep it up to date.
Unlike the manual approach, having make build the compile_commands.json
means that it appears in the build output tree instead of at the root of
the source tree, so you'll need to add --compile-commands-dir= to your
clangd args for it to be found.
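As a standalone sketch of the effect described here, the following builds the same kind of make command list and the matching clangd argument. The function names build_make_command and clangd_args are hypothetical helpers for illustration, not part of kunit.py.

```python
from typing import List, Optional


def build_make_command(linux_arch: str, build_dir: str, jobs: int,
                       make_options: Optional[List[str]] = None) -> List[str]:
    """Build a make invocation that also generates compile_commands.json."""
    # 'all' is listed explicitly so the default build still happens alongside
    # the extra compile_commands.json target.
    command = ['make', 'all', 'compile_commands.json', 'ARCH=' + linux_arch,
               'O=' + build_dir, '--jobs=' + str(jobs)]
    if make_options:
        command.extend(make_options)
    return command


def clangd_args(build_dir: str) -> List[str]:
    """clangd argument pointing at the build output tree."""
    return ['--compile-commands-dir=' + build_dir]
```

For a build directory of .kunit, clangd would then be started with --compile-commands-dir=.kunit.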
[1] https://clangd.llvm.org/
[2] https://github.com/FlorentRevest/linux-kernel-vscode
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
---
tools/testing/kunit/kunit_kernel.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/kunit/kunit_kernel.py b/tools/testing/kunit/kunit_kernel.py
index 7254c110ff23..61931c4926fd 100644
--- a/tools/testing/kunit/kunit_kernel.py
+++ b/tools/testing/kunit/kunit_kernel.py
@@ -72,7 +72,8 @@ class LinuxSourceTreeOperations:
 			raise ConfigError(e.output.decode())

 	def make(self, jobs: int, build_dir: str, make_options: Optional[List[str]]) -> None:
-		command = ['make', 'ARCH=' + self._linux_arch, 'O=' + build_dir, '--jobs=' + str(jobs)]
+		command = ['make', 'all', 'compile_commands.json', 'ARCH=' + self._linux_arch,
+			   'O=' + build_dir, '--jobs=' + str(jobs)]
 		if make_options:
 			command.extend(make_options)
 		if self._cross_compile:
---
base-commit: 3c999d1ae3c75991902a1a7dad0cb62c2a3008b4
change-id: 20240516-kunit-compile-commands-d994074fc2be
Best regards,
--
Brendan Jackman <jackmanb(a)google.com>
PASID (Process Address Space ID) is a PCIe extension to tag the DMA
transactions out of a physical device, and most modern IOMMU hardware
supports PASID-granular address translation. A PASID-capable
device can therefore be attached to multiple hwpts (a.k.a. domains), with each
attachment tagged with a pasid.
This series is based on a preparation series [1]. It first adds a missing
iommu API to replace the domain for a pasid. Based on the iommu pasid attach/
replace/detach APIs, this series adds iommufd APIs for device drivers to
attach/replace/detach a pasid to/from a hwpt per userspace's request, and adds
a selftest to validate the iommufd APIs.
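The attach/replace/detach semantics can be pictured with a small bookkeeping model: one hwpt per pasid, with a failed replacement leaving the old attachment in place. This is a toy sketch, not the iommufd API; the class, its methods and the set_dev_pasid callback parameter are all hypothetical names used for illustration.

```python
from typing import Callable, Dict, Optional


class PasidAttachments:
    """Toy model: at most one hwpt (domain) attached per pasid of a device."""

    def __init__(self, max_pasids: int) -> None:
        self.max_pasids = max_pasids
        self._hwpt_by_pasid: Dict[int, str] = {}

    def _check(self, pasid: int) -> None:
        if not 0 <= pasid < self.max_pasids:
            raise ValueError("pasid out of range for this device")

    def current(self, pasid: int) -> Optional[str]:
        return self._hwpt_by_pasid.get(pasid)

    def attach(self, pasid: int, hwpt: str) -> None:
        self._check(pasid)
        if pasid in self._hwpt_by_pasid:
            raise ValueError("pasid already attached; use replace()")
        self._hwpt_by_pasid[pasid] = hwpt

    def replace(self, pasid: int, new_hwpt: str,
                set_dev_pasid: Callable[[int, str], bool] = lambda p, h: True) -> bool:
        """Swap the hwpt for a pasid; on failure the old attachment is kept."""
        self._check(pasid)
        if pasid not in self._hwpt_by_pasid:
            raise KeyError("pasid is not attached")
        if not set_dev_pasid(pasid, new_hwpt):
            return False  # driver callback failed: old config stays in place
        self._hwpt_by_pasid[pasid] = new_hwpt
        return True

    def detach(self, pasid: int) -> str:
        return self._hwpt_by_pasid.pop(pasid)
```

Keeping the old attachment on failure means the caller never has to re-attach the previous domain itself.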
The complete code can be found at the link below [2]. Heads up! The existing
iommufd selftest is broken; there is a fix [3] for it, but it has not been
upstreamed yet. If you want to run the iommufd selftest, please apply that fix.
Sorry for the inconvenience.
[1] https://lore.kernel.org/linux-iommu/20240628085538.47049-1-yi.l.liu@intel.c…
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid
[3] https://lore.kernel.org/linux-iommu/20240111073213.180020-1-baolu.lu@linux.…
Change log:
v3:
- Split the set_dev_pasid op enhancements for domain replacement to be a
separate series "Make set_dev_pasid op supportting domain replacement" [1].
The below changes are made in the separate series.
*) set_dev_pasid() callback should keep the old config if it fails to attach to
a domain. This simplifies the caller a lot, as the caller does not need to attach
it back to the old domain explicitly. This also avoids some corner cases in which
the core may do duplicated domain attachment, as described in the link below (Jason)
https://lore.kernel.org/linux-iommu/BN9PR11MB52768C98314A95AFCD2FA6478C0F2@…
*) Drop patch 10 of v2 as it's a bug fix and can be submitted separately (Kevin)
*) Rebase on top of Baolu's domain_alloc_paging refactor series (Jason)
- Drop the attach_data, which includes attach_fn and pasid, and instead pass the
pasid through the device attach path. (Jason)
- Add a pasid-num-bits property to mock dev to make pasid selftest work (Kevin)
v2: https://lore.kernel.org/linux-iommu/20240412081516.31168-1-yi.l.liu@intel.c…
- Domain replacement for a pasid should be handled in set_dev_pasid() callbacks
instead of calling remove_dev_pasid and then set_dev_pasid afterward in the iommu
layer (Jason)
- Make xarray operations more self-contained in iommufd pasid attach/replace/detach
(Jason)
- Tweak the dev_iommu_get_max_pasids() to allow iommu driver to populate the
max_pasids. This makes the iommufd selftest simpler to meet the max_pasids
check in iommu_attach_device_pasid() (Jason)
v1: https://lore.kernel.org/kvm/20231127063428.127436-1-yi.l.liu@intel.com/#r
- Implement iommu_replace_device_pasid() to fall back to the original domain
if this replacement failed (Kevin)
- Add check in do_attach() to check corresponding attach_fn per the pasid value.
rfc: https://lore.kernel.org/linux-iommu/20230926092651.17041-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Yi Liu (7):
iommu: Introduce a replace API for device pasid
iommufd: Pass pasid through the device attach/replace path
iommufd: Support attach/replace hwpt per pasid
iommufd/selftest: Add set_dev_pasid and remove_dev_pasid in mock iommu
iommufd/selftest: Add a helper to get test device
iommufd/selftest: Add test ops to test pasid attach/detach
iommufd/selftest: Add coverage for iommufd pasid attach/detach
drivers/iommu/iommu-priv.h | 3 +
drivers/iommu/iommu.c | 80 ++++++-
drivers/iommu/iommufd/Makefile | 1 +
drivers/iommu/iommufd/device.c | 31 +--
drivers/iommu/iommufd/iommufd_private.h | 15 ++
drivers/iommu/iommufd/iommufd_test.h | 30 +++
drivers/iommu/iommufd/pasid.c | 157 +++++++++++++
drivers/iommu/iommufd/selftest.c | 206 ++++++++++++++++-
include/linux/iommufd.h | 6 +
tools/testing/selftests/iommu/iommufd.c | 207 ++++++++++++++++++
.../selftests/iommu/iommufd_fail_nth.c | 28 ++-
tools/testing/selftests/iommu/iommufd_utils.h | 78 +++++++
12 files changed, 808 insertions(+), 34 deletions(-)
create mode 100644 drivers/iommu/iommufd/pasid.c
--
2.34.1
Filter out nodes that have an ancestor disabled, as they aren't
expected to probe.
This removes the following false-positive failures on the
sc7180-trogdor-lazor-limozeen-nots-r5 platform:
/soc@0/geniqup@8c0000/i2c@894000/proximity@28
/soc@0/geniqup@ac0000/spi@a90000/ec@0
/soc@0/remoteproc@62400000/glink-edge/apr
/soc@0/remoteproc@62400000/glink-edge/apr/service@3
/soc@0/remoteproc@62400000/glink-edge/apr/service@4
/soc@0/remoteproc@62400000/glink-edge/apr/service@4/clock-controller
/soc@0/remoteproc@62400000/glink-edge/apr/service@4/dais
/soc@0/remoteproc@62400000/glink-edge/apr/service@7
/soc@0/remoteproc@62400000/glink-edge/apr/service@7/dais
/soc@0/remoteproc@62400000/glink-edge/apr/service@8
/soc@0/remoteproc@62400000/glink-edge/apr/service@8/routing
/soc@0/remoteproc@62400000/glink-edge/fastrpc
/soc@0/remoteproc@62400000/glink-edge/fastrpc/compute-cb@3
/soc@0/remoteproc@62400000/glink-edge/fastrpc/compute-cb@4
/soc@0/remoteproc@62400000/glink-edge/fastrpc/compute-cb@5
/soc@0/spmi@c440000/pmic@0/pon@800/pwrkey
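The filtering idea can be sketched outside the shell script as follows. This is an illustration under assumptions, not the patch itself: the function name and the node paths in the usage note are hypothetical, and it deliberately uses a path-prefix comparison in place of the script's grep -E alternation, so that a disabled node only hides nodes below it in the tree.

```python
from typing import Dict, List, Optional


def probe_candidates(status_by_node: Dict[str, Optional[str]]) -> List[str]:
    """Return nodes that are enabled and have no disabled ancestor.

    status_by_node maps a node path to its 'status' property value,
    or None when the node has no status property (treated as enabled).
    """
    disabled: List[str] = []
    kept: List[str] = []
    # Sorting visits a parent before its children, because a path that is
    # a prefix of another always sorts first.
    for node in sorted(status_by_node):
        status = status_by_node[node]
        if status is not None and status not in ("okay", "ok"):
            disabled.append(node)
            continue
        # Skip the node if any disabled node is a path ancestor of it.
        if any(node.startswith(parent + "/") for parent in disabled):
            continue
        kept.append(node)
    return kept
```

For example, with a disabled /soc@0/bus@1000, an enabled /soc@0/bus@1000/dev@0 is dropped, while an unrelated /soc@0/bus@10000 is still reported.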
Fixes: 14571ab1ad21 ("kselftest: Add new test for detecting unprobed Devicetree devices")
Signed-off-by: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
---
Changes in v2:
- Rebased on v6.11-rc1
- Link to v1: https://lore.kernel.org/r/20240619-dt-kselftest-parent-disabled-v1-1-b8f7a8…
---
tools/testing/selftests/dt/test_unprobed_devices.sh | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/dt/test_unprobed_devices.sh b/tools/testing/selftests/dt/test_unprobed_devices.sh
index 2d7e70c5ad2d..5e3f42ef249e 100755
--- a/tools/testing/selftests/dt/test_unprobed_devices.sh
+++ b/tools/testing/selftests/dt/test_unprobed_devices.sh
@@ -34,8 +34,21 @@ nodes_compatible=$(
# Check if node is available
if [[ -e "${node}"/status ]]; then
status=$(tr -d '\000' < "${node}"/status)
- [[ "${status}" != "okay" && "${status}" != "ok" ]] && continue
+ if [[ "${status}" != "okay" && "${status}" != "ok" ]]; then
+ if [ -n "${disabled_nodes_regex}" ]; then
+ disabled_nodes_regex="${disabled_nodes_regex}|${node}"
+ else
+ disabled_nodes_regex="${node}"
+ fi
+ continue
+ fi
fi
+
+ # Ignore this node if one of its ancestors was disabled
+ if [ -n "${disabled_nodes_regex}" ]; then
+ echo "${node}" | grep -q -E "${disabled_nodes_regex}" && continue
+ fi
+
echo "${node}" | sed -e 's|\/proc\/device-tree||'
done | sort
)
---
base-commit: 8400291e289ee6b2bf9779ff1c83a291501f017b
change-id: 20240619-dt-kselftest-parent-disabled-2282a7223d26
Best regards,
--
Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
There are no maintainers specified for tools/testing/selftests/x86.
Shuah has mentioned [1] that the patches should go through the x86 tree or,
in special cases, directly to Shuah's tree after getting acked by the x86
maintainers. Different people have been confused when sending patches, as
the correct maintainers aren't found by the get_maintainer.pl script. Fix
this by adding an entry to the MAINTAINERS file.
[1] https://lore.kernel.org/all/90dc0dfc-4c67-4ea1-b705-0585d6e2ec47@linuxfound…
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
MAINTAINERS | 1 +
1 file changed, 1 insertion(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 523d84b2d6139..f3a17e5d954a3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -24378,6 +24378,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/core
F: Documentation/arch/x86/
F: Documentation/devicetree/bindings/x86/
F: arch/x86/
+F: tools/testing/selftests/x86
X86 ENTRY CODE
M: Andy Lutomirski <luto(a)kernel.org>
--
2.39.2
In a series posted a few years ago [1], a proposal was put forward to allow the
kernel to allocate memory local to a mm and thus push it out of reach for
current and future speculation-based cross-process attacks. We still believe
this is a nice thing to have.
However, in the time passed since that post Linux mm has grown quite a few new
goodies, so we'd like to explore possibilities to implement this functionality
with less effort and churn leveraging the now available facilities.
Specifically, this is a proof-of-concept attempt to implement mm-local
allocations piggy-backing on memfd_secret(), using regular user addresses but
pinning the pages and flipping the user/supervisor flag on the respective PTEs
to make them directly accessible from the kernel, and sealing the VMA to prevent
userland from taking over the address range. The approach allowed us to delegate
all the heavy lifting -- address management, interactions with the direct map,
cleanup on mm teardown -- to the existing infrastructure, and required zero
architecture-specific code.
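To picture what flipping the user/supervisor flag means, here is a toy bit-level sketch. It uses the x86 PTE bit positions purely as an illustration (the series itself needs no architecture-specific code), and the helper names are hypothetical.

```python
# x86 page-table entry flag bits, used here only for illustration.
PAGE_PRESENT = 1 << 0
PAGE_RW = 1 << 1
PAGE_USER = 1 << 2  # U/S bit: set means user-accessible


def make_supervisor_only(pte: int) -> int:
    """Clear the user bit so the mapping is no longer reachable from userland."""
    return pte & ~PAGE_USER


def is_user_accessible(pte: int) -> bool:
    """True if the entry is present and marked user-accessible."""
    return bool(pte & PAGE_PRESENT) and bool(pte & PAGE_USER)
```

The page frame and the other flags are untouched; only the privilege level required to use the mapping changes.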
Compared to the approach used in the original series, where a dedicated kernel
address range and thus a dedicated PGD was used for mm-local allocations, the
one proposed here may have certain drawbacks, in particular
- using user addresses for kernel memory may violate assumptions in various
parts of kernel code which we may not have identified with smoke tests we did
- the allocated addresses are guessable by the userland (ATM they are even
visible in /proc/PID/maps but that's fixable) which may weaken the security
posture
Also included is a simple test driver and selftest to smoke test and showcase
the feature.
The code is PoC RFC and lacks a lot of checks and special case handling, but
demonstrates the idea. We'd appreciate any feedback on whether it's a viable
approach or it should better be abandoned in favor of the one with dedicated
PGD / kernel address range or yet something else.
[1] https://lore.kernel.org/lkml/20190612170834.14855-1-mhillenb@amazon.de/
Fares Mehanna (2):
mseal: expose interface to seal / unseal user memory ranges
mm/secretmem: implement mm-local kernel allocations
Roman Kagan (1):
drivers/misc: add test driver and selftest for proclocal allocator
drivers/misc/Makefile | 1 +
tools/testing/selftests/proclocal/Makefile | 6 +
include/linux/secretmem.h | 8 +
mm/internal.h | 7 +
drivers/misc/proclocal-test.c | 200 +++++++++++++++++
mm/gup.c | 4 +-
mm/mseal.c | 81 ++++---
mm/secretmem.c | 208 ++++++++++++++++++
.../selftests/proclocal/proclocal-test.c | 150 +++++++++++++
drivers/misc/Kconfig | 15 ++
tools/testing/selftests/proclocal/.gitignore | 1 +
11 files changed, 649 insertions(+), 32 deletions(-)
create mode 100644 tools/testing/selftests/proclocal/Makefile
create mode 100644 drivers/misc/proclocal-test.c
create mode 100644 tools/testing/selftests/proclocal/proclocal-test.c
create mode 100644 tools/testing/selftests/proclocal/.gitignore
--
2.34.1
Based on feedback from Linus[1] and follow-up discussions, change the
suggested file naming for KUnit tests.
Link: https://lore.kernel.org/lkml/CAHk-=wgim6pNiGTBMhP8Kd3tsB7_JTAuvNJ=XYd3wPvvk… [1]
Reviewed-by: John Hubbard <jhubbard(a)nvidia.com>
Signed-off-by: Kees Cook <kees(a)kernel.org>
---
v3: additional clarification
v2: https://lore.kernel.org/all/20240720165441.it.320-kees@kernel.org/
Cc: David Gow <davidgow(a)google.com>
Cc: Brendan Higgins <brendan.higgins(a)linux.dev>
Cc: Rae Moar <rmoar(a)google.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: linux-kselftest(a)vger.kernel.org
Cc: kunit-dev(a)googlegroups.com
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-hardening(a)vger.kernel.org
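For the common case, the suggested naming (a tests/ subdirectory next to the code under test, with a _kunit suffix) amounts to a simple path mapping. The helpers below are a hypothetical illustration of that rule, not a tool shipped with KUnit.

```python
import posixpath


def kunit_test_path(source_path: str) -> str:
    """Map a source file to the suggested location of its KUnit test file."""
    directory, filename = posixpath.split(source_path)
    suite, _ = posixpath.splitext(filename)
    # tests/ sits at the same level as the code under test.
    return posixpath.join(directory, "tests", suite + "_kunit.c")


def kunit_module_name(core_module: str) -> str:
    """Name of the KUnit test module for a given core module."""
    return core_module + "_kunit"
```

The mapping does not cover the redundancy case, where a suite such as foo_firmware may drop the part of its name already given by the parent directory.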
---
Documentation/dev-tools/kunit/style.rst | 29 +++++++++++++++++--------
1 file changed, 20 insertions(+), 9 deletions(-)
diff --git a/Documentation/dev-tools/kunit/style.rst b/Documentation/dev-tools/kunit/style.rst
index b6d0d7359f00..eac81a714a29 100644
--- a/Documentation/dev-tools/kunit/style.rst
+++ b/Documentation/dev-tools/kunit/style.rst
@@ -188,15 +188,26 @@ For example, a Kconfig entry might look like:
Test File and Module Names
==========================
-KUnit tests can often be compiled as a module. These modules should be named
-after the test suite, followed by ``_test``. If this is likely to conflict with
-non-KUnit tests, the suffix ``_kunit`` can also be used.
+KUnit tests are often compiled as a separate module. To avoid conflicting
+with regular modules, KUnit modules should be named after the test suite,
+followed by ``_kunit`` (e.g. if "foobar" is the core module, then
+"foobar_kunit" is the KUnit test module).
-The easiest way of achieving this is to name the file containing the test suite
-``<suite>_test.c`` (or, as above, ``<suite>_kunit.c``). This file should be
-placed next to the code under test.
+Test source files, whether compiled as a separate module or an
+``#include`` in another source file, are best kept in a ``tests/``
+subdirectory to not conflict with other source files (e.g. for
+tab-completion).
+
+Note that the ``_test`` suffix has also been used in some existing
+tests. The ``_kunit`` suffix is preferred, as it makes the distinction
+between KUnit and non-KUnit tests clearer.
+
+So for the common case, name the file containing the test suite
+``tests/<suite>_kunit.c``. The ``tests`` directory should be placed at
+the same level as the code under test. For example, tests for
+``lib/string.c`` live in ``lib/tests/string_kunit.c``.
If the suite name contains some or all of the name of the test's parent
-directory, it may make sense to modify the source filename to reduce redundancy.
-For example, a ``foo_firmware`` suite could be in the ``foo/firmware_test.c``
-file.
+directory, it may make sense to modify the source filename to reduce
+redundancy. For example, a ``foo_firmware`` suite could be in the
+``foo/tests/firmware_kunit.c`` file.
--
2.34.1