Hello,
This patch series implements IOCTL on the pagemap procfs file to get the
information about the page table entries (PTEs). The following operations
are supported in this ioctl:
- Get the information if the pages are soft-dirty, file mapped, present
or swapped.
- Clear the soft-dirty PTE bit of the pages.
- Get and clear the soft-dirty PTE bit of the pages atomically.
Soft-dirty PTE bit of the memory pages can be read by using the pagemap
procfs file. The soft-dirty PTE bit for the whole memory range of the
process can be cleared by writing to the clear_refs file. There are other
methods to mimic this information entirely in userspace with poor
performance:
- The mprotect syscall and SIGSEGV handler for bookkeeping
- The userfaultfd syscall with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get soft-dirty PTE bit status and clear operation
possible.
- The soft-dirty PTE bit of only a part of memory cannot be cleared.
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
getWriteWatch() syscall of Windows. This syscall is used by games to
keep track of dirty pages to process only the dirty pages.
The information related to pages if the page is file mapped, present and
swapped is required for the CRIU project[2][3]. The addition of the
required mask, any mask, excluded mask and return masks are also required
for the CRIU project[2].
The IOCTL returns the addresses of the pages which match the specific masks.
The page addresses are returned in struct page_region in a compact form.
The max_pages is needed to support a use case where user only wants to get
a specific number of pages. So there is no need to find all the pages of
interest in the range when max_pages is specified. The IOCTL returns when
the maximum number of the pages are found. The max_pages is optional. If
max_pages is specified, it must be equal or greater than the vec_size.
This restriction is needed to handle worse case when one page_region only
contains info of one page and it cannot be compacted. This is needed to
emulate the Windows getWriteWatch() syscall.
Some non-dirty pages get marked as dirty because of the kernel's
internal activity (such as VMA merging as soft-dirty bit difference isn't
considered while deciding to merge VMAs). The dirty bit of the pages is
stored in the VMA flags and in the per page flags. If any of these two bits
are set, the page is considered to be soft dirty. Suppose you have cleared
the soft dirty bit of half of VMA which will be done by splitting the VMA
and clearing soft dirty bit flag in the half VMA and the pages in it. Now
kernel may decide to merge the VMAs again. So the half VMA becomes dirty
again. This splitting/merging costs performance. The application receives
a lot of pages which aren't dirty in reality but marked as dirty.
Performance is lost again here. Also sometimes user doesn't want the newly
allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag
solves both the problems. It is used to not depend on the soft dirty flag
in the VMA flags. So VMA splitting and merging doesn't happen. It only
depends on the soft dirty bit of the individual pages. Thus by using this
flag, there may be a scenerio such that the new memory regions which are
just created, doesn't look dirty when seen with the IOCTL, but look dirty
when seen from procfs. This seems okay as the user of this flag know the
implication of using it.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[3] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (3):
fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about
PTEs
selftests: vm: add pagemap ioctl tests
fs/proc/task_mmu.c | 400 +++++++++++-
include/uapi/linux/fs.h | 53 ++
tools/include/uapi/linux/fs.h | 53 ++
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 5 +-
tools/testing/selftests/vm/pagemap_ioctl.c | 681 +++++++++++++++++++++
6 files changed, 1160 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
--
2.30.2
Hi All,
Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. VM guest with TDX support is called
as a TDX Guest.
In TDX guest, attestation process is used to verify the TDX guest
trustworthiness to other entities before provisioning secrets to the
guest. For example, a key server may request for attestation before
releasing the encryption keys to mount the encrypted rootfs or
secondary drive.
This patch set adds attestation support for the TDX guest. Details
about the TDX attestation process and the steps involved are explained
in Documentation/x86/tdx.rst (added by patch 2/3).
Following are the details of the patch set:
Patch 1/3 -> Preparatory patch for adding attestation support.
Patch 2/3 -> Adds user interface driver to support attestation.
Patch 3/3 -> Adds selftest support for TDREPORT feature.
Commit log history is maintained in the individual patches.
Current overall status of this series is, it has no pending issues
and can be considered for the upcoming merge cycle.
Kuppuswamy Sathyanarayanan (3):
x86/tdx: Add a wrapper to get TDREPORT from the TDX Module
virt: Add TDX guest driver
selftests: tdx: Test TDX attestation GetReport support
Documentation/virt/coco/tdx-guest.rst | 42 +++++
Documentation/virt/index.rst | 1 +
Documentation/x86/tdx.rst | 43 +++++
arch/x86/coco/tdx/tdx.c | 31 ++++
arch/x86/include/asm/tdx.h | 2 +
drivers/virt/Kconfig | 2 +
drivers/virt/Makefile | 1 +
drivers/virt/coco/tdx-guest/Kconfig | 10 ++
drivers/virt/coco/tdx-guest/Makefile | 2 +
drivers/virt/coco/tdx-guest/tdx-guest.c | 121 +++++++++++++
include/uapi/linux/tdx-guest.h | 55 ++++++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/tdx/Makefile | 7 +
tools/testing/selftests/tdx/config | 1 +
tools/testing/selftests/tdx/tdx_guest_test.c | 175 +++++++++++++++++++
15 files changed, 494 insertions(+)
create mode 100644 Documentation/virt/coco/tdx-guest.rst
create mode 100644 drivers/virt/coco/tdx-guest/Kconfig
create mode 100644 drivers/virt/coco/tdx-guest/Makefile
create mode 100644 drivers/virt/coco/tdx-guest/tdx-guest.c
create mode 100644 include/uapi/linux/tdx-guest.h
create mode 100644 tools/testing/selftests/tdx/Makefile
create mode 100644 tools/testing/selftests/tdx/config
create mode 100644 tools/testing/selftests/tdx/tdx_guest_test.c
--
2.34.1
From: Jeff Xu <jeffxu(a)chromium.org>
Hi,
This v2 series MFD_NOEXEC, this series includes:
1> address comments in V1
2> add sysctl (vm.mfd_noexec) to change the default file permissions
of memfd_create to be non-executable.
Below are cover-level for v1:
The default file permissions on a memfd include execute bits, which
means that such a memfd can be filled with a executable and passed to
the exec() family of functions. This is undesirable on systems where all
code is verified and all filesystems are intended to be mounted noexec,
since an attacker may be able to use a memfd to load unverified code and
execute it.
Additionally, execution via memfd is a common way to avoid scrutiny for
malicious code, since it allows execution of a program without a file
ever appearing on disk. This attack vector is not totally mitigated with
this new flag, since the default memfd file permissions must remain
executable to avoid breaking existing legitimate uses, but it should be
possible to use other security mechanisms to prevent memfd_create calls
without MFD_NOEXEC on systems where it is known that executable memfds
are not necessary.
This patch series adds a new MFD_NOEXEC flag for memfd_create(), which
allows creation of non-executable memfds, and as part of the
implementation of this new flag, it also adds a new F_SEAL_EXEC seal,
which will prevent modification of any of the execute bits of a sealed
memfd.
I am not sure if this is the best way to implement the desired behavior
(for example, the F_SEAL_EXEC seal is really more of an implementation
detail and feels a bit clunky to expose), so suggestions are welcome
for alternate approaches.
v1: https://lwn.net/Articles/890096/
Daniel Verkamp (4):
mm/memfd: add F_SEAL_EXEC
mm/memfd: add MFD_NOEXEC flag to memfd_create
selftests/memfd: add tests for F_SEAL_EXEC
selftests/memfd: add tests for MFD_NOEXEC
Jeff Xu (1):
sysctl: add support for mfd_noexec
include/linux/mm.h | 4 +
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/memfd.h | 1 +
kernel/sysctl.c | 9 ++
mm/memfd.c | 39 ++++-
mm/shmem.c | 6 +
tools/testing/selftests/memfd/memfd_test.c | 163 ++++++++++++++++++++-
7 files changed, 221 insertions(+), 2 deletions(-)
base-commit: 9e2f40233670c70c25e0681cb66d50d1e2742829
--
2.37.1.559.g78731f0fdb-goog