Hello,
This patch series implements IOCTL on the pagemap procfs file to get the
information about the page table entries (PTEs). The following operations
are supported in this ioctl:
- Get the information if the pages are soft-dirty, file mapped, present
or swapped.
- Clear the soft-dirty PTE bit of the pages.
- Get and clear the soft-dirty PTE bit of the pages atomically.
Soft-dirty PTE bit of the memory pages can be read by using the pagemap
procfs file. The soft-dirty PTE bit for the whole memory range of the
process can be cleared by writing to the clear_refs file. There are other
methods to mimic this information entirely in userspace with poor
performance:
- The mprotect syscall and SIGSEGV handler for bookkeeping
- The userfaultfd syscall with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get soft-dirty PTE bit status and clear operation
possible.
- The soft-dirty PTE bit of only a part of memory cannot be cleared.
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
getWriteWatch() syscall of Windows. This syscall is used by games to
keep track of dirty pages to process only the dirty pages.
The information related to pages if the page is file mapped, present and
swapped is required for the CRIU project[2][3]. The addition of the
required mask, any mask, excluded mask and return masks are also required
for the CRIU project[2].
The IOCTL returns the addresses of the pages which match the specific masks.
The page addresses are returned in struct page_region in a compact form.
The max_pages is needed to support a use case where user only wants to get
a specific number of pages. So there is no need to find all the pages of
interest in the range when max_pages is specified. The IOCTL returns when
the maximum number of the pages are found. The max_pages is optional. If
max_pages is specified, it must be equal or greater than the vec_size.
This restriction is needed to handle worse case when one page_region only
contains info of one page and it cannot be compacted. This is needed to
emulate the Windows getWriteWatch() syscall.
Some non-dirty pages get marked as dirty because of the kernel's
internal activity (such as VMA merging as soft-dirty bit difference isn't
considered while deciding to merge VMAs). The dirty bit of the pages is
stored in the VMA flags and in the per page flags. If any of these two bits
are set, the page is considered to be soft dirty. Suppose you have cleared
the soft dirty bit of half of VMA which will be done by splitting the VMA
and clearing soft dirty bit flag in the half VMA and the pages in it. Now
kernel may decide to merge the VMAs again. So the half VMA becomes dirty
again. This splitting/merging costs performance. The application receives
a lot of pages which aren't dirty in reality but marked as dirty.
Performance is lost again here. Also sometimes user doesn't want the newly
allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag
solves both the problems. It is used to not depend on the soft dirty flag
in the VMA flags. So VMA splitting and merging doesn't happen. It only
depends on the soft dirty bit of the individual pages. Thus by using this
flag, there may be a scenerio such that the new memory regions which are
just created, doesn't look dirty when seen with the IOCTL, but look dirty
when seen from procfs. This seems okay as the user of this flag know the
implication of using it.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[3] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (3):
fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about
PTEs
selftests: vm: add pagemap ioctl tests
fs/proc/task_mmu.c | 400 +++++++++++-
include/uapi/linux/fs.h | 53 ++
tools/include/uapi/linux/fs.h | 53 ++
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 5 +-
tools/testing/selftests/vm/pagemap_ioctl.c | 681 +++++++++++++++++++++
6 files changed, 1160 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
--
2.30.2
Hi All,
Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. VM guest with TDX support is called
as a TDX Guest.
In TDX guest, attestation process is used to verify the TDX guest
trustworthiness to other entities before provisioning secrets to the
guest. For example, a key server may request for attestation before
releasing the encryption keys to mount the encrypted rootfs or
secondary drive.
This patch set adds attestation support for the TDX guest. Details
about the TDX attestation process and the steps involved are explained
in Documentation/x86/tdx.rst (added by patch 2/3).
Following are the details of the patch set:
Patch 1/3 -> Preparatory patch for adding attestation support.
Patch 2/3 -> Adds user interface driver to support attestation.
Patch 3/3 -> Adds selftest support for TDREPORT feature.
Commit log history is maintained in the individual patches.
Current overall status of this series is, it has no pending issues
and can be considered for the upcoming merge cycle.
Kuppuswamy Sathyanarayanan (3):
x86/tdx: Add a wrapper to get TDREPORT from the TDX Module
virt: Add TDX guest driver
selftests: tdx: Test TDX attestation GetReport support
Documentation/virt/coco/tdx-guest.rst | 42 +++++
Documentation/virt/index.rst | 1 +
Documentation/x86/tdx.rst | 43 +++++
arch/x86/coco/tdx/tdx.c | 31 ++++
arch/x86/include/asm/tdx.h | 2 +
drivers/virt/Kconfig | 2 +
drivers/virt/Makefile | 1 +
drivers/virt/coco/tdx-guest/Kconfig | 10 ++
drivers/virt/coco/tdx-guest/Makefile | 2 +
drivers/virt/coco/tdx-guest/tdx-guest.c | 121 +++++++++++++
include/uapi/linux/tdx-guest.h | 55 ++++++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/tdx/Makefile | 7 +
tools/testing/selftests/tdx/config | 1 +
tools/testing/selftests/tdx/tdx_guest_test.c | 175 +++++++++++++++++++
15 files changed, 494 insertions(+)
create mode 100644 Documentation/virt/coco/tdx-guest.rst
create mode 100644 drivers/virt/coco/tdx-guest/Kconfig
create mode 100644 drivers/virt/coco/tdx-guest/Makefile
create mode 100644 drivers/virt/coco/tdx-guest/tdx-guest.c
create mode 100644 include/uapi/linux/tdx-guest.h
create mode 100644 tools/testing/selftests/tdx/Makefile
create mode 100644 tools/testing/selftests/tdx/config
create mode 100644 tools/testing/selftests/tdx/tdx_guest_test.c
--
2.34.1
From: Jeff Xu <jeffxu(a)chromium.org>
Hi,
This v2 series MFD_NOEXEC, this series includes:
1> address comments in V1
2> add sysctl (vm.mfd_noexec) to change the default file permissions
of memfd_create to be non-executable.
Below are cover-level for v1:
The default file permissions on a memfd include execute bits, which
means that such a memfd can be filled with a executable and passed to
the exec() family of functions. This is undesirable on systems where all
code is verified and all filesystems are intended to be mounted noexec,
since an attacker may be able to use a memfd to load unverified code and
execute it.
Additionally, execution via memfd is a common way to avoid scrutiny for
malicious code, since it allows execution of a program without a file
ever appearing on disk. This attack vector is not totally mitigated with
this new flag, since the default memfd file permissions must remain
executable to avoid breaking existing legitimate uses, but it should be
possible to use other security mechanisms to prevent memfd_create calls
without MFD_NOEXEC on systems where it is known that executable memfds
are not necessary.
This patch series adds a new MFD_NOEXEC flag for memfd_create(), which
allows creation of non-executable memfds, and as part of the
implementation of this new flag, it also adds a new F_SEAL_EXEC seal,
which will prevent modification of any of the execute bits of a sealed
memfd.
I am not sure if this is the best way to implement the desired behavior
(for example, the F_SEAL_EXEC seal is really more of an implementation
detail and feels a bit clunky to expose), so suggestions are welcome
for alternate approaches.
v1: https://lwn.net/Articles/890096/
Daniel Verkamp (4):
mm/memfd: add F_SEAL_EXEC
mm/memfd: add MFD_NOEXEC flag to memfd_create
selftests/memfd: add tests for F_SEAL_EXEC
selftests/memfd: add tests for MFD_NOEXEC
Jeff Xu (1):
sysctl: add support for mfd_noexec
include/linux/mm.h | 4 +
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/memfd.h | 1 +
kernel/sysctl.c | 9 ++
mm/memfd.c | 39 ++++-
mm/shmem.c | 6 +
tools/testing/selftests/memfd/memfd_test.c | 163 ++++++++++++++++++++-
7 files changed, 221 insertions(+), 2 deletions(-)
base-commit: 9e2f40233670c70c25e0681cb66d50d1e2742829
--
2.37.1.559.g78731f0fdb-goog
Hi all:
First, rename amd-pstate-ut.sh to basic.sh as a basic test, mainly for
AMD P-State kernel drivers. The purpose of this modification is to
facilitate the subsequent addition of gitsource, tbench and other tests.
Second, split basic.sh into run.sh and basic.sh.
The modification makes basic.sh more pure, just for test basic kernel
functions. The file of run.sh mainly contains functions such as test
entry, parameter check, prerequisite and log clearing etc.
Third, add tbench.sh trigger the tbench testing and monitor the cpu.
Fourth, add gitsource.sh trigger the gitsource testing and monitor the cpu
information.
Finally, modify rst document to introduce test steps and results etc.
See patch series in below git repo:
V1:https://lore.kernel.org/lkml/20220706073622.672135-1-li.meng@amd.com/V2:https://lore.kernel.org/lkml/20220804054414.1510764-1-li.meng@amd.com/V3:https://lore.kernel.org/lkml/20220914061105.1982477-1-li.meng@amd.com/V4:https://lore.kernel.org/lkml/20221024013356.1639489-1-li.meng@amd.com/
Changes from V1->V2:
- selftests: amd-pstate: basic
- - delete main.sh and merge funtions into run.sh
- selftests: amd-pstate: tbench
- - modify ppw to performance per watt for tbench.
- - add comments for performance per watt for tbench.
- - add comparative test on acpi-cpufreq for tbench.
- - calculate drop between amd-pstate and acpi-cpufreq etc.
- - plot images about perfrmance,energy and ppw for tbench.
- selftests: amd-pstate: gitsource
- - modify ppw to performance per watt for gitsource.
- - add comments for performance per watt for gitsource.
- - add comparative test on acpi-cpufreq for gitsource.
- - calculate drop between amd-pstate and acpi-cpufreq etc.
- - plot images about perfrmance,energy and ppw for gitsource.
- Documentation: amd-pstate:
- - modify rst doc, introduce comparative test etc.
Changes from V2->V3:
- selftests: amd-pstate:
- - reduce print logs for governor.
- - add a check to see if tbench and the perf tools are already installed.
- - install tbench package from apt or yum.
- - correct spelling errors from comprison to comparison.
Changes from V3->V4:
- selftests: amd-pstate:
- - modify cover letter and commit logs.
- Documentation: amd-pstate:
- - modify some format questions.
Changes from V4->V5:
- selftests: amd-pstate:
- - rename amd-pstate-ut.sh to basic.sh.
- - split basic.sh into run.sh and basic.sh.
- - modify tbench.sh to prompt to install tbench.
- - modify commit messages and description informations of shell files.
- Documentation: amd-pstate:
- - correct spell errors.
Thanks,
Jasmine
Meng Li (5):
selftests: amd-pstate: Rename amd-pstate-ut.sh to basic.sh.
selftests: amd-pstate: Split basic.sh into run.sh and basic.sh.
selftests: amd-pstate: Trigger tbench benchmark and test cpus
selftests: amd-pstate: Trigger gitsource benchmark and test cpus
Documentation: amd-pstate: Add tbench and gitsource test introduction
Documentation/admin-guide/pm/amd-pstate.rst | 194 ++++++++-
tools/testing/selftests/amd-pstate/Makefile | 11 +-
.../selftests/amd-pstate/amd-pstate-ut.sh | 56 ---
tools/testing/selftests/amd-pstate/basic.sh | 38 ++
.../testing/selftests/amd-pstate/gitsource.sh | 354 ++++++++++++++++
tools/testing/selftests/amd-pstate/run.sh | 387 ++++++++++++++++++
tools/testing/selftests/amd-pstate/tbench.sh | 339 +++++++++++++++
7 files changed, 1302 insertions(+), 77 deletions(-)
delete mode 100755 tools/testing/selftests/amd-pstate/amd-pstate-ut.sh
create mode 100755 tools/testing/selftests/amd-pstate/basic.sh
create mode 100755 tools/testing/selftests/amd-pstate/gitsource.sh
create mode 100755 tools/testing/selftests/amd-pstate/run.sh
create mode 100755 tools/testing/selftests/amd-pstate/tbench.sh
--
2.34.1
Changes from RFC
(https://lore.kernel.org/damon/20221019001317.104270-1-sj@kernel.org/):
- Split out cleanup/refactoring parts[1]
[1] https://lore.kernel.org/damon/20221026225943.100429-1-sj@kernel.org/
-----------------------------------------------------------------------
DAMON users can retrieve the monitoring results via 'after_aggregation'
callbacks if the user is using the kernel API, or 'damon_aggregated'
tracepoint if the user is in the user space. Those are useful if full
monitoring results are necessary. However, if the user has interest in
only a snapshot of the results for some regions having specific access
pattern, the interfaces could be inefficient. For example, some users
only want to know which memory regions are not accessed for more than a
specific time at the moment.
Also, some DAMOS users would want to know exactly to what memory regions
the schemes' actions tried to be applied, for a debugging or a tuning.
As DAMOS has its internal mechanism for quota and regions
prioritization, the users would need to simulate DAMOS' mechanism
against the monitoring results. That's unnecessarily complex.
This patchset implements DAMON kernel API callbacks and sysfs directory
for efficient exposure of the information for the use cases. The new
callback will be called for each region when a DAMOS action is gonna
tried to be applied to it. The sysfs directory will be called
'tried_regions' and placed under each scheme sysfs directory. Users can
write a special keyworkd, 'update_schemes_regions', to the 'state' file
of a kdamond sysfs directory. Then, DAMON sysfs interface will fill the
directory with the information of regions that corresponding scheme
action was tried to be applied for next one aggregation interval.
Patches Sequence
----------------
The first one (patch 1) implements the callback for the kernel space
users. Following two patches (patches 2 and 3) implements sysfs
directories for the information and its sub directories. Two patches
(patches 4 and 5) for implementing the special keywords for filling the
data to and cleaning up the directories follow. Patch 6 adds a selftest
for the new sysfs directory. Finally, two patches (patches 7 and 8)
document the new feature in the administrator guide and the ABI
document.
Assembled Tree
--------------
This patchset is based on the latest mm-unstable tree[1]. Assembled
tree is also available at the damon/next tree[2].
[1] https://git.kernel.org/akpm/mm/h/mm-unstable
[2] https://git.kernel.org/sj/h/damon/next
SeongJae Park (8):
mm/damon/core: add a callback for scheme target regions check
mm/damon/sysfs-schemes: implement schemes/tried_regions directory
mm/damon/sysfs-schemes: implement scheme region directory
mm/damon/sysfs: implement DAMOS tried regions update command
mm/damon/sysfs-schemes: implement DAMOS-tried regions clear command
tools/selftets/damon/sysfs: test tried_regions directory existence
Docs/admin-guide/mm/damon/usage: document schemes/<s>/tried_regions
sysfs directory
Docs/ABI/damon: document 'schemes/<s>/tried_regions' sysfs directory
.../ABI/testing/sysfs-kernel-mm-damon | 32 +++
Documentation/admin-guide/mm/damon/usage.rst | 45 ++-
include/linux/damon.h | 5 +
mm/damon/core.c | 6 +-
mm/damon/sysfs-common.h | 10 +
mm/damon/sysfs-schemes.c | 261 ++++++++++++++++++
mm/damon/sysfs.c | 77 +++++-
tools/testing/selftests/damon/sysfs.sh | 7 +
8 files changed, 437 insertions(+), 6 deletions(-)
--
2.25.1
Currently, if you run
$ ./tools/testing/kunit/kunit_tool_test.py
you'll see a lot of output from the parser as we feed it testdata.
This makes the output hard to read and fairly confusing, esp. since our
testdata includes example failures, which get printed out in red.
Silence that output so real failures are easier to see.
Signed-off-by: Daniel Latypov <dlatypov(a)google.com>
---
tools/testing/kunit/kunit_tool_test.py | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/kunit/kunit_tool_test.py b/tools/testing/kunit/kunit_tool_test.py
index e2cd2cc2e98f..a6e53945656e 100755
--- a/tools/testing/kunit/kunit_tool_test.py
+++ b/tools/testing/kunit/kunit_tool_test.py
@@ -80,6 +80,9 @@ class KconfigTest(unittest.TestCase):
self.assertEqual(actual_kconfig, expected_kconfig)
class KUnitParserTest(unittest.TestCase):
+ def setUp(self):
+ self.print_mock = mock.patch('kunit_printer.Printer.print').start()
+ self.addCleanup(mock.patch.stopall)
def assertContains(self, needle: str, haystack: kunit_parser.LineStream):
# Clone the iterator so we can print the contents on failure.
@@ -485,6 +488,9 @@ class LinuxSourceTreeTest(unittest.TestCase):
class KUnitJsonTest(unittest.TestCase):
+ def setUp(self):
+ self.print_mock = mock.patch('kunit_printer.Printer.print').start()
+ self.addCleanup(mock.patch.stopall)
def _json_for(self, log_file):
with open(test_data_path(log_file)) as file:
base-commit: 8f8b51f7d5c8bd3a89e7ea87aed2cdaa52ca5ba4
--
2.38.1.273.g43a17bfeac-goog
Hi!
> On Mon, Oct 31, 2022 at 07:22:11AM +0000, zhaogongyi wrote:
> > Hi!
> >
> > +to linux-fsdevel(a)vger.kernel.org
> > +cc willy(a)infradead.org
> >
> > Regards,
> > Gongyi
>
> what?
I have submitted tow patches reference to the testing of page cache, please see: https://patchwork.kernel.org/project/linux-kselftest/patch/20221021071052.1…
The patches have not responded for a while, so I'm guessing that's the reason for my lack of cc to linux-fsdevel or page cache maintainer?
Best Regards,
Gongyi
0Day/LKP observed that the kselftest blocks forever since one of the
pidfd_wait doesn't terminate in 1 of 30 runs. After digging into
the source, we found that it blocks at:
ASSERT_EQ(sys_waitid(P_PIDFD, pidfd, &info, WCONTINUED, NULL), 0);
wait_states has below testing flow:
CHILD PARENT
---------------+--------------
1 STOP itself
2 WAIT for CHILD STOPPED
3 SIGNAL CHILD to CONT
4 CONT
5 STOP itself
5' WAIT for CHILD CONT
6 WAIT for CHILD STOPPED
The problem is that the kernel cannot ensure the order of 5 and 5', once
5 goes first, the test will fail.
we can reproduce it by:
$ while true; do make run_tests -C pidfd; done
Introduce a blocking read in child process to make sure the parent can
check its WCONTINUED.
CC: Philip Li <philip.li(a)intel.com>
Reported-by: kernel test robot <lkp(a)intel.com>
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner(a)kernel.org>
---
I have almost forgotten this patch since the former version post over 6 months
ago. This time I just do a rebase and update the comments.
V3: fixes description and add review tag
V2: rewrite with pipe to avoid usleep
---
tools/testing/selftests/pidfd/pidfd_wait.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/tools/testing/selftests/pidfd/pidfd_wait.c b/tools/testing/selftests/pidfd/pidfd_wait.c
index 070c1c876df1..c3e2a3041f55 100644
--- a/tools/testing/selftests/pidfd/pidfd_wait.c
+++ b/tools/testing/selftests/pidfd/pidfd_wait.c
@@ -95,20 +95,28 @@ static int sys_waitid(int which, pid_t pid, siginfo_t *info, int options,
.flags = CLONE_PIDFD | CLONE_PARENT_SETTID,
.exit_signal = SIGCHLD,
};
+ int pfd[2];
pid_t pid;
siginfo_t info = {
.si_signo = 0,
};
+ ASSERT_EQ(pipe(pfd), 0);
pid = sys_clone3(&args);
ASSERT_GE(pid, 0);
if (pid == 0) {
+ char buf[2];
+
+ close(pfd[1]);
kill(getpid(), SIGSTOP);
+ ASSERT_EQ(read(pfd[0], buf, 1), 1);
+ close(pfd[0]);
kill(getpid(), SIGSTOP);
exit(EXIT_SUCCESS);
}
+ close(pfd[0]);
ASSERT_EQ(sys_waitid(P_PIDFD, pidfd, &info, WSTOPPED, NULL), 0);
ASSERT_EQ(info.si_signo, SIGCHLD);
ASSERT_EQ(info.si_code, CLD_STOPPED);
@@ -117,6 +125,8 @@ static int sys_waitid(int which, pid_t pid, siginfo_t *info, int options,
ASSERT_EQ(sys_pidfd_send_signal(pidfd, SIGCONT, NULL, 0), 0);
ASSERT_EQ(sys_waitid(P_PIDFD, pidfd, &info, WCONTINUED, NULL), 0);
+ ASSERT_EQ(write(pfd[1], "C", 1), 1);
+ close(pfd[1]);
ASSERT_EQ(info.si_signo, SIGCHLD);
ASSERT_EQ(info.si_code, CLD_CONTINUED);
ASSERT_EQ(info.si_pid, parent_tid);
--
1.8.3.1
This patch series is a result of long debug work to find out why
sometimes guests with win11 secure boot
were failing during boot.
During writing a unit test I found another bug, turns out
that on rsm emulation, if the rsm instruction was done in real
or 32 bit mode, KVM would truncate the restored RIP to 32 bit.
I also refactored the way we write SMRAM so it is easier
now to understand what is going on.
The main bug in this series which I fixed is that we
allowed #SMI to happen during the STI interrupt shadow,
and we did nothing to both reset it on #SMI handler
entry and restore it on RSM.
V4:
- rebased on top of patch series from Paolo which
allows smm support to be disabled by Kconfig option.
- addressed review feedback.
I included these patches in the series for reference.
Best regards,
Maxim Levitsky
Maxim Levitsky (15):
bug: introduce ASSERT_STRUCT_OFFSET
KVM: x86: emulator: em_sysexit should update ctxt->mode
KVM: x86: emulator: introduce emulator_recalc_and_set_mode
KVM: x86: emulator: update the emulation mode after rsm
KVM: x86: emulator: update the emulation mode after CR0 write
KVM: x86: smm: number of GPRs in the SMRAM image depends on the image
format
KVM: x86: smm: check for failures on smm entry
KVM: x86: smm: add structs for KVM's smram layout
KVM: x86: smm: use smram structs in the common code
KVM: x86: smm: use smram struct for 32 bit smram load/restore
KVM: x86: smm: use smram struct for 64 bit smram load/restore
KVM: svm: drop explicit return value of kvm_vcpu_map
KVM: x86: SVM: use smram structs
KVM: x86: SVM: don't save SVM state to SMRAM when VM is not long mode
capable
KVM: x86: smm: preserve interrupt shadow in SMRAM
Paolo Bonzini (8):
KVM: x86: start moving SMM-related functions to new files
KVM: x86: move SMM entry to a new file
KVM: x86: move SMM exit to a new file
KVM: x86: do not go through ctxt->ops when emulating rsm
KVM: allow compiling out SMM support
KVM: x86: compile out vendor-specific code if SMM is disabled
KVM: x86: remove SMRAM address space if SMM is not supported
KVM: x86: do not define KVM_REQ_SMI if SMM disabled
arch/x86/include/asm/kvm-x86-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 29 +-
arch/x86/kvm/Kconfig | 11 +
arch/x86/kvm/Makefile | 1 +
arch/x86/kvm/emulate.c | 458 +++----------
arch/x86/kvm/kvm_cache_regs.h | 5 -
arch/x86/kvm/kvm_emulate.h | 47 +-
arch/x86/kvm/lapic.c | 14 +-
arch/x86/kvm/lapic.h | 7 +-
arch/x86/kvm/mmu/mmu.c | 1 +
arch/x86/kvm/smm.c | 637 ++++++++++++++++++
arch/x86/kvm/smm.h | 160 +++++
arch/x86/kvm/svm/nested.c | 3 +
arch/x86/kvm/svm/svm.c | 43 +-
arch/x86/kvm/vmx/nested.c | 1 +
arch/x86/kvm/vmx/vmcs12.h | 5 +-
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/kvm/x86.c | 353 +---------
include/linux/build_bug.h | 9 +
tools/testing/selftests/kvm/x86_64/smm_test.c | 2 +
20 files changed, 1031 insertions(+), 768 deletions(-)
create mode 100644 arch/x86/kvm/smm.c
create mode 100644 arch/x86/kvm/smm.h
--
2.34.3
The XSAVE feature set supports the saving and restoring of xstate components.
XSAVE feature has been used for process context switching. XSAVE components
include x87 state for FP execution environment, SSE state, AVX state and so on.
In order to ensure that XSAVE works correctly, add XSAVE most basic test for
XSAVE architecture functionality.
This patch tests "FP, SSE(XMM), AVX2(YMM), AVX512_OPMASK/AVX512_ZMM_Hi256/
AVX512_Hi16_ZMM and PKRU parts" xstates with following cases:
1. The contents of these xstates in the process should not change after the
signal handling.
2. The contents of these xstates in the child process should be the same as
the contents of the xstate in the parent process after the fork syscall.
3. The contents of xstates in the parent process should not change after
the context switch.
As stated in the ABI(Application Binary Interface) specification:
https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
Xstate like XMM is not preserved across function calls, so fork() function
which provided from libc could not be used in the xsave test, and the libc
function is replaced with an inline function of the assembly code only.
To prevent GCC from generating any FP/SSE(XMM)/AVX/PKRU code by mistake, add
"-mno-sse -mno-mmx -mno-sse2 -mno-avx -mno-pku" compiler arguments. stdlib.h
can not be used because of the "-mno-sse" option.
Thanks Dave, Hansen for the above suggestion!
Thanks Chen Yu; Shuah Khan; Chatre Reinette and Tony Luck's comments!
Thanks to Bae, Chang Seok for a bunch of comments!
========
- Change from v12 to v13
- Improve the comments of CPUID.(EAX=0DH, ECX=0H):EBX.
- Change from v11 to v12
- Remove useless rbx register stuffing in assembly syscall functions.
(Zhang, Li)
- Change from v10 to v11
- Remove the small function like cpu_has_pkru(), get_xstate_size() and so
on. (Shuah Khan)
- Unify xfeature_num type to uint32_t.
- Change from v9 to v10
- Remove the small function if the function will be called once and there
is no good reason. (Shuah Khan)
- Change from v8 to v9
- Use function pointers to make it more structured. (Hansen, Dave)
- Improve the function name: xstate_tested -> xstate_in_test. (Chang S. Bae)
- Break this test up into two pieces: keep the xstate key test steps with
"-mno-sse" and no stdlib.h, keep others in xstate.c file. (Hansen, Dave)
- Use kselftest infrastructure for xstate.c file. (Hansen, Dave)
- Use instruction back to populate fp xstate buffer. (Hansen, Dave)
- Will skip the test if cpu could not support xsave. (Chang S. Bae)
- Use __cpuid_count() helper in kselftest.h. (Reinette, Chatre)
- Change from v7 to v8
Many thanks to Bae, Chang Seok for a bunch of comments as follow:
- Use the filling buffer way to prepare the xstate buffer, and use xrstor
instruction way to load the tested xstates.
- Remove useless dump_buffer, compare_buffer functions.
- Improve the struct of xstate_info.
- Added AVX512_ZMM_Hi256 and AVX512_Hi16_ZMM components in xstate test.
- Remove redundant xstate_info.xstate_mask, xstate_flag[], and
xfeature_test_mask, use xstate_info.mask instead.
- Check if xfeature is supported outside of fill_xstate_buf() , this change
is easier to read and understand.
- Remove useless wrpkru, only use filling all tested xstate buffer in
fill_xstates_buf().
- Improve a bunch of function names and variable names.
- Improve test steps flow for readability.
- Change from v6 to v7:
- Added the error number and error description of the reason for the
failure, thanks Shuah Khan's suggestion.
- Added a description of what these tests are doing in the head comments.
- Added changes update in the head comments.
- Added description of the purpose of the function. thanks Shuah Khan.
- Change from v5 to v6:
- In order to prevent GCC from generating any FP code by mistake,
"-mno-sse -mno-mmx -mno-sse2 -mno-avx -mno-pku" compiler parameter was
added, it's referred to the parameters for compiling the x86 kernel. Thanks
Dave Hansen's suggestion.
- Removed the use of "kselftest.h", because kselftest.h included <stdlib.h>,
and "stdlib.h" would use sse instructions in it's libc, and this *XSAVE*
test needed to be compiled without libc sse instructions(-mno-sse).
- Improved the description in commit header, thanks Chen Yu's suggestion.
- Becasue test code could not use buildin xsave64 in libc without sse, added
xsave function by instruction way.
- Every key test action would not use libc(like printf) except syscall until
it's failed or done. If it's failed, then it would print the failed reason.
- Used __cpuid_count() instead of native_cpuid(), becasue __cpuid_count()
was a macro definition function with one instruction in libc and did not
change xstate. Thanks Chatre Reinette, Shuah Khan.
https://lore.kernel.org/linux-sgx/8b7c98f4-f050-bc1c-5699-fa598ecc66a2@linu…
- Change from v4 to v5:
- Moved code files into tools/testing/selftests/x86.
- Delete xsave instruction test, becaue it's not related to kernel.
- Improved case description.
- Added AVX512 opmask change and related XSAVE content verification.
- Added PKRU part xstate test into instruction and signal handling test.
- Added XSAVE process swich test for FPU, AVX2, AVX512 opmask and PKRU part.
- Change from v3 to v4:
- Improve the comment in patch 1.
- Change from v2 to v3:
- Improve the description of patch 2 git log.
- Change from v1 to v2:
- Improve the cover-letter. Thanks Dave Hansen's suggestion.
Pengfei Xu (2):
selftests/x86/xstate: Add xstate signal handling test for XSAVE
feature
selftests/x86/xstate: Add xstate fork test for XSAVE feature
tools/testing/selftests/x86/.gitignore | 1 +
tools/testing/selftests/x86/Makefile | 11 +-
tools/testing/selftests/x86/xstate.c | 214 +++++++++++++++++
tools/testing/selftests/x86/xstate.h | 228 +++++++++++++++++++
tools/testing/selftests/x86/xstate_helpers.c | 209 +++++++++++++++++
tools/testing/selftests/x86/xstate_helpers.h | 9 +
6 files changed, 670 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/x86/xstate.c
create mode 100644 tools/testing/selftests/x86/xstate.h
create mode 100644 tools/testing/selftests/x86/xstate_helpers.c
create mode 100644 tools/testing/selftests/x86/xstate_helpers.h
--
2.31.1
Remove the repeated word "and" in comments.
Signed-off-by: Shaomin Deng <dengshaomin(a)cdjrlc.com>
---
tools/testing/selftests/core/close_range_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
index 749239930ca8..4db5ec73d016 100644
--- a/tools/testing/selftests/core/close_range_test.c
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -476,7 +476,7 @@ TEST(close_range_cloexec_unshare_syzbot)
/*
* Create a huge gap in the fd table. When we now call
- * CLOSE_RANGE_UNSHARE with a shared fd table and and with ~0U as upper
+ * CLOSE_RANGE_UNSHARE with a shared fd table and with ~0U as upper
* bound the kernel will only copy up to fd1 file descriptors into the
* new fd table. If the kernel is buggy and doesn't handle
* CLOSE_RANGE_CLOEXEC correctly it will not have copied all file
--
2.35.1
Paul and myself got trapped a few times by not seeing the effects of
applying a patch to the nolibc source code until a "make clean" was
issued in the nolibc directory. It's particularly annoying when trying
to confirm that a proposed patch really solves a problem (or that
reverting it reintroduces the problem).
The reason for the sysroot not being rebuilt was that it can be quite
slow. But in fact it's only slow after a "make clean" issued at the
kernel's topdir, because it's the main "make headers" that can take a
tens of seconds; as long as "usr/include" still contains headers, the
"headers_install" phase is only a quick "rsync", and rebuilding the
whole nolibc sysroot takes a bit less than one second, which is perfectly
acceptable for a test, even more once the time lost caused by misleading
results if factored in.
This patch marks the sysroot target as phony and starts by clearing
the previous sysroot for the current architecture before reinstalling
it. Thanks to this, applying a patch to nolibc makes the effect
immediately visible to "make nolibc-test":
$ time make -j -C tools/testing/selftests/nolibc nolibc-test
make: Entering directory '/k/tools/testing/selftests/nolibc'
MKDIR sysroot/x86/include
make[1]: Entering directory '/k/tools/include/nolibc'
make[2]: Entering directory '/k'
make[2]: Leaving directory '/k'
make[2]: Entering directory '/k'
INSTALL /k/tools/testing/selftests/nolibc/sysroot/sysroot/include
make[2]: Leaving directory '/k'
make[1]: Leaving directory '/k/tools/include/nolibc'
CC nolibc-test
make: Leaving directory '/k/tools/testing/selftests/nolibc'
real 0m0.869s
user 0m0.716s
sys 0m0.149s
Cc: "Paul E. McKenney" <paulmck(a)kernel.org>
Link: https://lore.kernel.org/all/20221021155645.GK5600@paulmck-ThinkPad-P17-Gen-…
Signed-off-by: Willy Tarreau <w(a)1wt.eu>
---
tools/testing/selftests/nolibc/Makefile | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/nolibc/Makefile b/tools/testing/selftests/nolibc/Makefile
index 69ea659caca9..22f1e1d73fa8 100644
--- a/tools/testing/selftests/nolibc/Makefile
+++ b/tools/testing/selftests/nolibc/Makefile
@@ -95,6 +95,7 @@ all: run
sysroot: sysroot/$(ARCH)/include
sysroot/$(ARCH)/include:
+ $(Q)rm -rf sysroot/$(ARCH) sysroot/sysroot
$(QUIET_MKDIR)mkdir -p sysroot
$(Q)$(MAKE) -C ../../../include/nolibc ARCH=$(ARCH) OUTPUT=$(CURDIR)/sysroot/ headers_standalone
$(Q)mv sysroot/sysroot sysroot/$(ARCH)
@@ -133,3 +134,5 @@ clean:
$(Q)rm -rf initramfs
$(call QUIET_CLEAN, run.out)
$(Q)rm -rf run.out
+
+.PHONY: sysroot/$(ARCH)/include
--
2.35.3
From: Roberto Sassu <roberto.sassu(a)huawei.com>
include/linux/lsm_hooks.h reports the result of the LSM infrastructure to
the callers, not what LSMs should return to the LSM infrastructure.
Clarify that and add that returning 1 from the LSMs means calling
__vm_enough_memory() with cap_sys_admin set, 0 without.
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
---
include/linux/lsm_hooks.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 4ec80b96c22e..f40b82ca91e7 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1411,7 +1411,9 @@
* Check permissions for allocating a new virtual mapping.
* @mm contains the mm struct it is being added to.
* @pages contains the number of pages.
- * Return 0 if permission is granted.
+ * Return 0 if permission is granted by LSMs to the caller. LSMs should
+ * return 1 if __vm_enough_memory() should be called with
+ * cap_sys_admin set, 0 if not.
*
* @ismaclabel:
* Check if the extended attribute specified by @name
--
2.25.1
Show for each node if every memory descriptor in that node has the
EFI_MEMORY_CPU_CRYPTO attribute.
fwupd project plans to use it as part of a check to see if the users
have properly configured memory hardware encryption
capabilities. fwupd's people have seen cases where it seems like there
is memory encryption because all the hardware is capable of doing it,
but on a closer look there is not, either because of system firmware
or because some component requires updating to enable the feature.
The MKTME/TME spec says that it will only encrypt those memory regions
which are flagged with the EFI_MEMORY_CPU_CRYPTO attribute.
If all nodes are capable of encryption and if the system have tme/sme
on we can pretty confidently say that the device is actively
encrypting all its memory.
It's planned to make this check part of an specification that can be
passed to people purchasing hardware
These checks will run at every boot. The specification is called Host
Security ID: https://fwupd.github.io/libfwupdplugin/hsi.html.
We choosed to do it a per-node basis because although an ABI that
shows that the whole system memory is capable of encryption would be
useful for the fwupd use case, doing it in a per-node basis would make
the path easier to give the capability to the user to target
allocations from applications to NUMA nodes which have encryption
capabilities in the future.
Changes since v8:
Add unit tests to e820_range_* functions
Changes since v7:
Less kerneldocs
Less verbosity in the e820 code
Changes since v6:
Fixes in __e820__handle_range_update
Const correctness in e820.c
Correct alignment in memblock.h
Rework memblock_overlaps_region
Changes since v5:
Refactor e820__range_{update, remove, set_crypto_capable} in order to
avoid code duplication.
Warn the user when a node has both encryptable and non-encryptable
regions.
Check that e820_table has enough size to store both current e820_table
and EFI memmap.
Changes since v4:
Add enum to represent the cryptographic capabilities in e820:
e820_crypto_capabilities.
Revert __e820__range_update, only adding the new argument for
__e820__range_add about crypto capabilities.
Add a function __e820__range_update_crypto similar to
__e820__range_update but to only update this new field.
Changes since v3:
Update date in Doc/ABI file.
More information about the fwupd usecase and the rationale behind
doing it in a per-NUMA-node.
Changes since v2:
e820__range_mark_crypto -> e820__range_mark_crypto_capable.
In e820__range_remove: Create a region with crypto capabilities
instead of creating one without it and then mark it.
Changes since v1:
Modify __e820__range_update to update the crypto capabilities of a
range; now this function will change the crypto capability of a range
if it's called with the same old_type and new_type. Rework
efi_mark_e820_regions_as_crypto_capable based on this.
Update do_add_efi_memmap to mark the regions as it creates them.
Change the type of crypto_capable in e820_entry from bool to u8.
Fix e820__update_table changes.
Remove memblock_add_crypto_capable. Now you have to add the region and
mark it then.
Better place for crypto_capable in pglist_data.
Martin Fernandez (9):
mm/memblock: Tag memblocks with crypto capabilities
mm/mmzone: Tag pg_data_t with crypto capabilities
x86/e820: Add infrastructure to refactor e820__range_{update,remove}
x86/e820: Refactor __e820__range_update
x86/e820: Refactor e820__range_remove
x86/e820: Tag e820_entry with crypto capabilities
x86/e820: Add unit tests for e820_range_* functions
x86/efi: Mark e820_entries as crypto capable from EFI memmap
drivers/node: Show in sysfs node's crypto capabilities
Documentation/ABI/testing/sysfs-devices-node | 10 +
arch/x86/Kconfig.debug | 10 +
arch/x86/include/asm/e820/api.h | 1 +
arch/x86/include/asm/e820/types.h | 12 +-
arch/x86/kernel/e820.c | 393 ++++++++++++++-----
arch/x86/kernel/e820_test.c | 249 ++++++++++++
arch/x86/platform/efi/efi.c | 37 ++
drivers/base/node.c | 10 +
include/linux/memblock.h | 5 +
include/linux/mmzone.h | 3 +
mm/memblock.c | 62 +++
mm/page_alloc.c | 1 +
12 files changed, 695 insertions(+), 98 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-devices-node
create mode 100644 arch/x86/kernel/e820_test.c
--
2.30.2