Hello,
The aim of this patch series is to improve the resctrl selftest.
Without these fixes, some unnecessary processing will be executed
and test results will be confusing.
There is no behavior change in test themselves.
[patch 1] Make write_schemata() run to set up shemata with 100% allocation
on first run in MBM test.
[patch 2] The MBA test result message is always output as "ok",
make output message to be "not ok" if MBA check result is failed.
[patch 3] When a child process is created by fork(), the buffer of the
parent process is also copied. Flush the buffer before
executing fork().
[patch 4] Add a signal handler to cleanup properly before exiting the
parent process if there is an error occurs after creating
a child process with fork() in the CAT test, and unregister
signal handler when each test finished.
[patch 5] Before exiting each test CMT/CAT/MBM/MBA, clear test result
files function cat/cmt/mbm/mba_test_cleanup() are called
twice. Delete once.
This patch series is based on Linux v6.2-rc6.
Difference from v5:
[patch 4]
- If an error occurs in signal_handler_register() return -1,
and if an error occurs in signal_handler_unregister() does
not return any value.
- If signal_handler_register() fails, stop the running
parents&child process.
- Ignore the result of signal_handler_unregister()
so as not to overwrite earlier value of ret.
- Fix change log.
Shaopeng Tan (5):
selftests/resctrl: Fix set up schemata with 100% allocation on first
run in MBM test
selftests/resctrl: Return MBA check result and make it to output
message
selftests/resctrl: Flush stdout file buffer before executing fork()
selftests/resctrl: Cleanup properly when an error occurs in CAT test
selftests/resctrl: Remove duplicate codes that clear each test result
file
tools/testing/selftests/resctrl/cat_test.c | 29 ++++----
tools/testing/selftests/resctrl/cmt_test.c | 7 +-
tools/testing/selftests/resctrl/fill_buf.c | 14 ----
tools/testing/selftests/resctrl/mba_test.c | 23 +++---
tools/testing/selftests/resctrl/mbm_test.c | 20 +++---
tools/testing/selftests/resctrl/resctrl.h | 2 +
.../testing/selftests/resctrl/resctrl_tests.c | 4 --
tools/testing/selftests/resctrl/resctrl_val.c | 71 +++++++++++++------
tools/testing/selftests/resctrl/resctrlfs.c | 5 +-
9 files changed, 98 insertions(+), 77 deletions(-)
--
2.27.0
Arm have recently released versions 2 and 2.1 of the SME extension.
Among the features introduced by SME 2 is some new architectural state,
the ZT0 register. This series adds support for this and all the other
features of the new SME versions.
Since the architecture has been designed with the possibility of adding
further ZTn registers in mind the interfaces added for ZT0 are done with
this possibility in mind. As ZT0 is a simple fixed size register these
interfaces are all fairly simple, the main complication is that ZT0 is
only accessible when PSTATE.ZA is enabled. The memory allocation that we
already do for PSTATE.ZA is extended to include space for ZT0.
Due to textual collisions especially around the addition of hwcaps this
is based on the recently merged series "arm64: Support for 2022 data
processing instructions" but there is no meaningful interaction. There
will be collisions with "arm64/signal: Signal handling cleanups" if that
is applied but again not super substantial.
v4:
- Rebase onto v6.2-rc3.
- Add SME2 value to ID_AA64PFR1_EL1.SME and move cpufeature to key off
it.
- Fix cut'n'paste errors and missing capability in hwcap table.
- Fix bitrot in za-test program.
- Typo and cut'n'paste fixes.
v3:
- Rebase onto merged series for the 2022 architectur extensions.
- Clarifications and typo fixes in the ABI documentation.
v2:
- Add missing initialisation of user->zt in signal context parsing.
- Change the magic for ZT signal frames to 0x5a544e01 (ZTN0).
To: Catalin Marinas <catalin.marinas(a)arm.com>
To: Will Deacon <will(a)kernel.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Marc Zyngier <maz(a)kernel.org>
To: James Morse <james.morse(a)arm.com>
To: Alexandru Elisei <alexandru.elisei(a)arm.com>
To: Suzuki K Poulose <suzuki.poulose(a)arm.com>
To: Oliver Upton <oliver.upton(a)linux.dev>
Cc: Alan Hayward <alan.hayward(a)arm.com>
Cc: Luis Machado <luis.machado(a)arm.com>,
Cc: Szabolcs Nagy <szabolcs.nagy(a)arm.com>
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: linux-kernel(a)vger.kernel.org
Cc: kvmarm(a)lists.linux.dev
To: Shuah Khan <shuah(a)kernel.org>
Cc: linux-kselftest(a)vger.kernel.org
---
Mark Brown (21):
arm64/sme: Rename za_state to sme_state
arm64: Document boot requirements for SME 2
arm64/sysreg: Update system registers for SME 2 and 2.1
arm64/sme: Document SME 2 and SME 2.1 ABI
arm64/esr: Document ISS for ZT0 being disabled
arm64/sme: Manually encode ZT0 load and store instructions
arm64/sme: Enable host kernel to access ZT0
arm64/sme: Add basic enumeration for SME2
arm64/sme: Provide storage for ZT0
arm64/sme: Implement context switching for ZT0
arm64/sme: Implement signal handling for ZT
arm64/sme: Implement ZT0 ptrace support
arm64/sme: Add hwcaps for SME 2 and 2.1 features
kselftest/arm64: Add a stress test program for ZT0
kselftest/arm64: Cover ZT in the FP stress test
kselftest/arm64: Enumerate SME2 in the signal test utility code
kselftest/arm64: Teach the generic signal context validation about ZT
kselftest/arm64: Add test coverage for ZT register signal frames
kselftest/arm64: Add SME2 coverage to syscall-abi
kselftest/arm64: Add coverage of the ZT ptrace regset
kselftest/arm64: Add coverage of SME 2 and 2.1 hwcaps
Documentation/arm64/booting.rst | 10 +
Documentation/arm64/elf_hwcaps.rst | 18 +
Documentation/arm64/sme.rst | 52 ++-
arch/arm64/include/asm/cpufeature.h | 6 +
arch/arm64/include/asm/esr.h | 1 +
arch/arm64/include/asm/fpsimd.h | 30 +-
arch/arm64/include/asm/fpsimdmacros.h | 22 ++
arch/arm64/include/asm/hwcap.h | 6 +
arch/arm64/include/asm/processor.h | 2 +-
arch/arm64/include/uapi/asm/hwcap.h | 6 +
arch/arm64/include/uapi/asm/sigcontext.h | 19 ++
arch/arm64/kernel/cpufeature.c | 28 ++
arch/arm64/kernel/cpuinfo.c | 6 +
arch/arm64/kernel/entry-fpsimd.S | 30 +-
arch/arm64/kernel/fpsimd.c | 47 ++-
arch/arm64/kernel/hyp-stub.S | 6 +
arch/arm64/kernel/idreg-override.c | 1 +
arch/arm64/kernel/process.c | 21 +-
arch/arm64/kernel/ptrace.c | 60 +++-
arch/arm64/kernel/signal.c | 113 ++++++-
arch/arm64/kvm/fpsimd.c | 2 +-
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/sysreg | 27 +-
include/uapi/linux/elf.h | 1 +
tools/testing/selftests/arm64/abi/hwcap.c | 115 +++++++
.../testing/selftests/arm64/abi/syscall-abi-asm.S | 43 ++-
tools/testing/selftests/arm64/abi/syscall-abi.c | 40 ++-
tools/testing/selftests/arm64/fp/.gitignore | 2 +
tools/testing/selftests/arm64/fp/Makefile | 5 +
tools/testing/selftests/arm64/fp/fp-stress.c | 29 +-
tools/testing/selftests/arm64/fp/sme-inst.h | 20 ++
tools/testing/selftests/arm64/fp/zt-ptrace.c | 365 +++++++++++++++++++++
tools/testing/selftests/arm64/fp/zt-test.S | 317 ++++++++++++++++++
tools/testing/selftests/arm64/signal/.gitignore | 1 +
.../testing/selftests/arm64/signal/test_signals.h | 2 +
.../selftests/arm64/signal/test_signals_utils.c | 3 +
.../selftests/arm64/signal/testcases/testcases.c | 36 ++
.../selftests/arm64/signal/testcases/testcases.h | 1 +
.../selftests/arm64/signal/testcases/zt_no_regs.c | 51 +++
.../selftests/arm64/signal/testcases/zt_regs.c | 85 +++++
40 files changed, 1558 insertions(+), 72 deletions(-)
---
base-commit: b7bfaa761d760e72a969d116517eaa12e404c262
change-id: 20221208-arm64-sme2-363c27227a32
Best regards,
--
Mark Brown <broonie(a)kernel.org>
From: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Since commit a1d6cd88c897 ("selftests/ftrace: event_triggers: wait
longer for test_event_enable") introduced bash specific "=="
comparation operator, that test will fail when we run it on a
posix-shell. `checkbashisms` warned it as below.
possible bashism in ftrace/func_event_triggers.tc line 45 (should be 'b = a'):
if [ "$e" == $val ]; then
This replaces it with "=".
Fixes: a1d6cd88c897 ("selftests/ftrace: event_triggers: wait longer for test_event_enable")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
---
.../ftrace/test.d/ftrace/func_event_triggers.tc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/func_event_triggers.tc b/tools/testing/selftests/ftrace/test.d/ftrace/func_event_triggers.tc
index 3eea2abf68f9..2ad7d4b501cc 100644
--- a/tools/testing/selftests/ftrace/test.d/ftrace/func_event_triggers.tc
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/func_event_triggers.tc
@@ -42,7 +42,7 @@ test_event_enabled() {
while [ $check_times -ne 0 ]; do
e=`cat $EVENT_ENABLE`
- if [ "$e" == $val ]; then
+ if [ "$e" = $val ]; then
return 0
fi
sleep $SLEEP_TIME
User space can use the MEM_OP ioctl to make storage key checked reads
and writes to the guest, however, it has no way of performing atomic,
key checked, accesses to the guest.
Extend the MEM_OP ioctl in order to allow for this, by adding a cmpxchg
operation. For now, support this operation for absolute accesses only.
This operation can be use, for example, to set the device-state-change
indicator and the adapter-local-summary indicator atomically.
Also contains some fixes/changes for the memop selftest independent of
the cmpxchg changes.
v5 -> v6
* move memop selftest fixes/refactoring to front of series so they can
be picked independently from the rest
* use op instead of flag to indicate cmpxchg
* no longer indicate success of cmpxchg to user space, which can infer
it by observing a change in the old value instead
* refactor functions implementing the ioctl
* adjust documentation (drop R-b)
* adjust selftest
* rebase
v4 -> v5
* refuse cmpxchg if not write (thanks Thomas)
* minor doc changes (thanks Claudio)
* picked up R-b's (thanks Thomas & Claudio)
* memop selftest fixes
* rebased
v3 -> v4
* no functional change intended
* rework documentation a bit
* name extension cap cmpxchg bit
* picked up R-b (thanks Thomas)
* various changes (rename variable, comments, ...) see range-diff below
v2 -> v3
* rebase onto the wip/cmpxchg_user_key branch in the s390 kernel repo
* use __uint128_t instead of unsigned __int128
* put moving of testlist into main into separate patch
* pick up R-b's (thanks Nico)
v1 -> v2
* get rid of xrk instruction for cmpxchg byte and short implementation
* pass old parameter via pointer instead of in mem_op struct
* indicate failure of cmpxchg due to wrong old value by special return
code
* picked up R-b's (thanks Thomas)
Janis Schoetterl-Glausch (14):
KVM: s390: selftest: memop: Pass mop_desc via pointer
KVM: s390: selftest: memop: Replace macros by functions
KVM: s390: selftest: memop: Move testlist into main
KVM: s390: selftest: memop: Add bad address test
KVM: s390: selftest: memop: Fix typo
KVM: s390: selftest: memop: Fix wrong address being used in test
KVM: s390: selftest: memop: Fix integer literal
KVM: s390: Move common code of mem_op functions into functions
KVM: s390: Dispatch to implementing function at top level of vm mem_op
KVM: s390: Refactor absolute vm mem_op function
KVM: s390: Refactor absolute vcpu mem_op function
KVM: s390: Extend MEM_OP ioctl by storage key checked cmpxchg
Documentation: KVM: s390: Describe KVM_S390_MEMOP_F_CMPXCHG
KVM: s390: selftest: memop: Add cmpxchg tests
Documentation/virt/kvm/api.rst | 29 +-
include/uapi/linux/kvm.h | 8 +
arch/s390/kvm/gaccess.h | 3 +
arch/s390/kvm/gaccess.c | 103 ++++
arch/s390/kvm/kvm-s390.c | 249 ++++----
tools/testing/selftests/kvm/s390x/memop.c | 675 +++++++++++++++++-----
6 files changed, 819 insertions(+), 248 deletions(-)
Range-diff against v5:
3: 94c1165ae24a = 1: 512e1a3e0ae5 KVM: s390: selftest: memop: Pass mop_desc via pointer
4: 027c87eee0ac = 2: 47328ea64f80 KVM: s390: selftest: memop: Replace macros by functions
5: 16ac410ecc0f = 3: 224fe37eeec7 KVM: s390: selftest: memop: Move testlist into main
7: 2d6776733e64 = 4: f622d3413cf0 KVM: s390: selftest: memop: Add bad address test
8: 8c49eafd2881 = 5: 431f191a8a57 KVM: s390: selftest: memop: Fix typo
9: 0af907110b34 = 6: 3122187435fb KVM: s390: selftest: memop: Fix wrong address being used in test
10: 886c80b2bdce = 7: 401f51f3ef55 KVM: s390: selftest: memop: Fix integer literal
-: ------------ > 8: df09794e0794 KVM: s390: Move common code of mem_op functions into functions
-: ------------ > 9: 5cbae63357ed KVM: s390: Dispatch to implementing function at top level of vm mem_op
-: ------------ > 10: 76ba77b63a26 KVM: s390: Refactor absolute vm mem_op function
-: ------------ > 11: c848e772e22a KVM: s390: Refactor absolute vcpu mem_op function
1: 6adc166ee141 ! 12: 6ccb200ad85c KVM: s390: Extend MEM_OP ioctl by storage key checked cmpxchg
@@ Commit message
and writes to the guest, however, it has no way of performing atomic,
key checked, accesses to the guest.
Extend the MEM_OP ioctl in order to allow for this, by adding a cmpxchg
- mode. For now, support this mode for absolute accesses only.
+ op. For now, support this op for absolute accesses only.
- This mode can be use, for example, to set the device-state-change
+ This op can be use, for example, to set the device-state-change
indicator and the adapter-local-summary indicator atomically.
Signed-off-by: Janis Schoetterl-Glausch <scgl(a)linux.ibm.com>
@@ include/uapi/linux/kvm.h: struct kvm_s390_mem_op {
__u8 ar; /* the access register number */
__u8 key; /* access key, ignored if flag unset */
+ __u8 pad1[6]; /* ignored */
-+ __u64 old_addr; /* ignored if flag unset */
++ __u64 old_addr; /* ignored if cmpxchg flag unset */
};
__u32 sida_offset; /* offset into the sida */
__u8 reserved[32]; /* ignored */
@@ include/uapi/linux/kvm.h: struct kvm_s390_mem_op {
+ #define KVM_S390_MEMOP_SIDA_WRITE 3
+ #define KVM_S390_MEMOP_ABSOLUTE_READ 4
+ #define KVM_S390_MEMOP_ABSOLUTE_WRITE 5
++#define KVM_S390_MEMOP_ABSOLUTE_CMPXCHG 6
++
+ /* flags for kvm_s390_mem_op->flags */
#define KVM_S390_MEMOP_F_CHECK_ONLY (1ULL << 0)
#define KVM_S390_MEMOP_F_INJECT_EXCEPTION (1ULL << 1)
#define KVM_S390_MEMOP_F_SKEY_PROTECTION (1ULL << 2)
-+#define KVM_S390_MEMOP_F_CMPXCHG (1ULL << 3)
-+/* flags specifying extension support */
-+#define KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG 0x2
-+/* Non program exception return codes (pgm codes are 16 bit) */
-+#define KVM_S390_MEMOP_R_NO_XCHG (1 << 16)
++/* flags specifying extension support via KVM_CAP_S390_MEM_OP_EXTENSION */
++#define KVM_S390_MEMOP_EXTENSION_CAP_BASE (1 << 0)
++#define KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG (1 << 1)
++
/* for KVM_INTERRUPT */
struct kvm_interrupt {
+ /* in */
## arch/s390/kvm/gaccess.h ##
@@ arch/s390/kvm/gaccess.h: int access_guest_with_key(struct kvm_vcpu *vcpu, unsigned long ga, u8 ar,
int access_guest_real(struct kvm_vcpu *vcpu, unsigned long gra,
void *data, unsigned long len, enum gacc_mode mode);
-+int cmpxchg_guest_abs_with_key(struct kvm *kvm, gpa_t gpa, int len,
-+ __uint128_t *old, __uint128_t new, u8 access_key);
++int cmpxchg_guest_abs_with_key(struct kvm *kvm, gpa_t gpa, int len, __uint128_t *old,
++ __uint128_t new, u8 access_key, bool *success);
+
/**
* write_guest_with_key - copy data from kernel space to guest space
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ * @gpa: Absolute guest address of the location to be changed.
+ * @len: Operand length of the cmpxchg, required: 1 <= len <= 16. Providing a
+ * non power of two will result in failure.
-+ * @old_addr: Pointer to old value. If the location at @gpa contains this value, the
-+ * exchange will succeed. After calling cmpxchg_guest_abs_with_key() *@old
-+ * contains the value at @gpa before the attempt to exchange the value.
++ * @old_addr: Pointer to old value. If the location at @gpa contains this value,
++ * the exchange will succeed. After calling cmpxchg_guest_abs_with_key()
++ * *@old_addr contains the value at @gpa before the attempt to
++ * exchange the value.
+ * @new: The value to place at @gpa.
+ * @access_key: The access key to use for the guest access.
++ * @success: output value indicating if an exchange occurred.
+ *
+ * Atomically exchange the value at @gpa by @new, if it contains *@old.
+ * Honors storage keys.
+ *
+ * Return: * 0: successful exchange
-+ * * 1: exchange unsuccessful
+ * * a program interruption code indicating the reason cmpxchg could
+ * not be attempted
+ * * -EINVAL: address misaligned or len not power of two
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ */
+int cmpxchg_guest_abs_with_key(struct kvm *kvm, gpa_t gpa, int len,
+ __uint128_t *old_addr, __uint128_t new,
-+ u8 access_key)
++ u8 access_key, bool *success)
+{
+ gfn_t gfn = gpa >> PAGE_SHIFT;
+ struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ u8 old;
+
+ ret = cmpxchg_user_key((u8 *)hva, &old, *old_addr, new, access_key);
-+ ret = ret < 0 ? ret : old != *old_addr;
++ *success = !ret && old == *old_addr;
+ *old_addr = old;
+ break;
+ }
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ u16 old;
+
+ ret = cmpxchg_user_key((u16 *)hva, &old, *old_addr, new, access_key);
-+ ret = ret < 0 ? ret : old != *old_addr;
++ *success = !ret && old == *old_addr;
+ *old_addr = old;
+ break;
+ }
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ u32 old;
+
+ ret = cmpxchg_user_key((u32 *)hva, &old, *old_addr, new, access_key);
-+ ret = ret < 0 ? ret : old != *old_addr;
++ *success = !ret && old == *old_addr;
+ *old_addr = old;
+ break;
+ }
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ u64 old;
+
+ ret = cmpxchg_user_key((u64 *)hva, &old, *old_addr, new, access_key);
-+ ret = ret < 0 ? ret : old != *old_addr;
++ *success = !ret && old == *old_addr;
+ *old_addr = old;
+ break;
+ }
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ __uint128_t old;
+
+ ret = cmpxchg_user_key((__uint128_t *)hva, &old, *old_addr, new, access_key);
-+ ret = ret < 0 ? ret : old != *old_addr;
++ *success = !ret && old == *old_addr;
+ *old_addr = old;
+ break;
+ }
@@ arch/s390/kvm/kvm-s390.c: int kvm_vm_ioctl_check_extension(struct kvm *kvm, long
+ case KVM_CAP_S390_MEM_OP_EXTENSION:
+ /*
+ * Flag bits indicating which extensions are supported.
-+ * The first extension doesn't use a flag, but pretend it does,
-+ * this way that can be changed in the future.
++ * If r > 0, the base extension must also be supported/indicated,
++ * in order to maintain backwards compatibility.
+ */
-+ r = KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG | 1;
++ r = KVM_S390_MEMOP_EXTENSION_CAP_BASE |
++ KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG;
+ break;
case KVM_CAP_NR_VCPUS:
case KVM_CAP_MAX_VCPUS:
case KVM_CAP_MAX_VCPU_ID:
-@@ arch/s390/kvm/kvm-s390.c: static bool access_key_invalid(u8 access_key)
- static int kvm_s390_vm_mem_op(struct kvm *kvm, struct kvm_s390_mem_op *mop)
- {
- void __user *uaddr = (void __user *)mop->buf;
+@@ arch/s390/kvm/kvm-s390.c: static int kvm_s390_vm_mem_op_abs(struct kvm *kvm, struct kvm_s390_mem_op *mop)
+ return r;
+ }
+
++static int kvm_s390_vm_mem_op_cmpxchg(struct kvm *kvm, struct kvm_s390_mem_op *mop)
++{
++ void __user *uaddr = (void __user *)mop->buf;
+ void __user *old_addr = (void __user *)mop->old_addr;
+ union {
+ __uint128_t quad;
+ char raw[sizeof(__uint128_t)];
+ } old = { .quad = 0}, new = { .quad = 0 };
+ unsigned int off_in_quad = sizeof(new) - mop->size;
- u64 supported_flags;
- void *tmpbuf = NULL;
- int r, srcu_idx;
-
- supported_flags = KVM_S390_MEMOP_F_SKEY_PROTECTION
-- | KVM_S390_MEMOP_F_CHECK_ONLY;
-+ | KVM_S390_MEMOP_F_CHECK_ONLY
-+ | KVM_S390_MEMOP_F_CMPXCHG;
- if (mop->flags & ~supported_flags || !mop->size)
- return -EINVAL;
- if (mop->size > MEM_OP_MAX_SIZE)
-@@ arch/s390/kvm/kvm-s390.c: static int kvm_s390_vm_mem_op(struct kvm *kvm, struct kvm_s390_mem_op *mop)
- } else {
- mop->key = 0;
- }
-+ if (mop->flags & KVM_S390_MEMOP_F_CMPXCHG) {
-+ /*
-+ * This validates off_in_quad. Checking that size is a power
-+ * of two is not necessary, as cmpxchg_guest_abs_with_key
-+ * takes care of that
-+ */
-+ if (mop->size > sizeof(new))
-+ return -EINVAL;
-+ if (mop->op != KVM_S390_MEMOP_ABSOLUTE_WRITE)
-+ return -EINVAL;
-+ if (copy_from_user(&new.raw[off_in_quad], uaddr, mop->size))
-+ return -EFAULT;
-+ if (copy_from_user(&old.raw[off_in_quad], old_addr, mop->size))
-+ return -EFAULT;
++ int r, srcu_idx;
++ bool success;
++
++ r = mem_op_validate_common(mop, KVM_S390_MEMOP_F_SKEY_PROTECTION);
++ if (r)
++ return r;
++ /*
++ * This validates off_in_quad. Checking that size is a power
++ * of two is not necessary, as cmpxchg_guest_abs_with_key
++ * takes care of that
++ */
++ if (mop->size > sizeof(new))
++ return -EINVAL;
++ if (copy_from_user(&new.raw[off_in_quad], uaddr, mop->size))
++ return -EFAULT;
++ if (copy_from_user(&old.raw[off_in_quad], old_addr, mop->size))
++ return -EFAULT;
++
++ srcu_idx = srcu_read_lock(&kvm->srcu);
++
++ if (kvm_is_error_gpa(kvm, mop->gaddr)) {
++ r = PGM_ADDRESSING;
++ goto out_unlock;
+ }
- if (!(mop->flags & KVM_S390_MEMOP_F_CHECK_ONLY)) {
- tmpbuf = vmalloc(mop->size);
- if (!tmpbuf)
++
++ r = cmpxchg_guest_abs_with_key(kvm, mop->gaddr, mop->size, &old.quad,
++ new.quad, mop->key, &success);
++ if (!success && copy_to_user(old_addr, &old.raw[off_in_quad], mop->size))
++ r = -EFAULT;
++
++out_unlock:
++ srcu_read_unlock(&kvm->srcu, srcu_idx);
++ return r;
++}
++
+ static int kvm_s390_vm_mem_op(struct kvm *kvm, struct kvm_s390_mem_op *mop)
+ {
+ /*
@@ arch/s390/kvm/kvm-s390.c: static int kvm_s390_vm_mem_op(struct kvm *kvm, struct kvm_s390_mem_op *mop)
- case KVM_S390_MEMOP_ABSOLUTE_WRITE: {
- if (mop->flags & KVM_S390_MEMOP_F_CHECK_ONLY) {
- r = check_gpa_range(kvm, mop->gaddr, mop->size, GACC_STORE, mop->key);
-+ } else if (mop->flags & KVM_S390_MEMOP_F_CMPXCHG) {
-+ r = cmpxchg_guest_abs_with_key(kvm, mop->gaddr, mop->size,
-+ &old.quad, new.quad, mop->key);
-+ if (r == 1) {
-+ r = KVM_S390_MEMOP_R_NO_XCHG;
-+ if (copy_to_user(old_addr, &old.raw[off_in_quad], mop->size))
-+ r = -EFAULT;
-+ }
- } else {
- if (copy_from_user(tmpbuf, uaddr, mop->size)) {
- r = -EFAULT;
+ case KVM_S390_MEMOP_ABSOLUTE_READ:
+ case KVM_S390_MEMOP_ABSOLUTE_WRITE:
+ return kvm_s390_vm_mem_op_abs(kvm, mop);
++ case KVM_S390_MEMOP_ABSOLUTE_CMPXCHG:
++ return kvm_s390_vm_mem_op_cmpxchg(kvm, mop);
+ default:
+ return -EINVAL;
+ }
2: fce9a063ab70 ! 13: 4d983d179903 Documentation: KVM: s390: Describe KVM_S390_MEMOP_F_CMPXCHG
@@ Commit message
checked) cmpxchg operations on guest memory.
Signed-off-by: Janis Schoetterl-Glausch <scgl(a)linux.ibm.com>
- Reviewed-by: Claudio Imbrenda <imbrenda(a)linux.ibm.com>
## Documentation/virt/kvm/api.rst ##
@@ Documentation/virt/kvm/api.rst: The fields in each entry are defined as follows:
@@ Documentation/virt/kvm/api.rst: Parameters are specified via the following struc
};
__u32 sida_offset; /* offset into the sida */
__u8 reserved[32]; /* ignored */
-@@ Documentation/virt/kvm/api.rst: Absolute accesses are permitted for non-protected guests only.
- Supported flags:
+@@ Documentation/virt/kvm/api.rst: Possible operations are:
+ * ``KVM_S390_MEMOP_ABSOLUTE_WRITE``
+ * ``KVM_S390_MEMOP_SIDA_READ``
+ * ``KVM_S390_MEMOP_SIDA_WRITE``
++ * ``KVM_S390_MEMOP_ABSOLUTE_CMPXCHG``
+
+ Logical read/write:
+ ^^^^^^^^^^^^^^^^^^^
+@@ Documentation/virt/kvm/api.rst: the checks required for storage key protection as one operation (as opposed to
+ user space getting the storage keys, performing the checks, and accessing
+ memory thereafter, which could lead to a delay between check and access).
+ Absolute accesses are permitted for the VM ioctl if KVM_CAP_S390_MEM_OP_EXTENSION
+-is > 0.
++has the KVM_S390_MEMOP_EXTENSION_CAP_BASE bit set.
+ Currently absolute accesses are not permitted for VCPU ioctls.
+ Absolute accesses are permitted for non-protected guests only.
+
+@@ Documentation/virt/kvm/api.rst: Supported flags:
* ``KVM_S390_MEMOP_F_CHECK_ONLY``
* ``KVM_S390_MEMOP_F_SKEY_PROTECTION``
-+ * ``KVM_S390_MEMOP_F_CMPXCHG``
-+
+
+-The semantics of the flags are as for logical accesses.
+The semantics of the flags common with logical accesses are as for logical
+accesses.
+
-+For write accesses, the KVM_S390_MEMOP_F_CMPXCHG flag is supported if
-+KVM_CAP_S390_MEM_OP_EXTENSION has flag KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG set.
-+In this case, instead of doing an unconditional write, the access occurs
-+only if the target location contains the value pointed to by "old_addr".
++Absolute cmpxchg:
++^^^^^^^^^^^^^^^^^
++
++Perform cmpxchg on absolute guest memory. Intended for use with the
++KVM_S390_MEMOP_F_SKEY_PROTECTION flag.
++Instead of doing an unconditional write, the access occurs only if the target
++location contains the value pointed to by "old_addr".
+This is performed as an atomic cmpxchg with the length specified by the "size"
+parameter. "size" must be a power of two up to and including 16.
+If the exchange did not take place because the target value doesn't match the
-+old value, KVM_S390_MEMOP_R_NO_XCHG is returned.
-+In this case the value "old_addr" points to is replaced by the target value.
-
--The semantics of the flags are as for logical accesses.
++old value, the value "old_addr" points to is replaced by the target value.
++User space can tell if an exchange took place by checking if this replacement
++occurred. The cmpxchg op is permitted for the VM ioctl if
++KVM_CAP_S390_MEM_OP_EXTENSION has flag KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG set.
++
++Supported flags:
++ * ``KVM_S390_MEMOP_F_SKEY_PROTECTION``
SIDA read/write:
^^^^^^^^^^^^^^^^
6: 214281b6eb96 ! 14: 5250be3dd58b KVM: s390: selftest: memop: Add cmpxchg tests
@@ tools/testing/selftests/kvm/s390x/memop.c
#include <linux/bits.h>
+@@ tools/testing/selftests/kvm/s390x/memop.c: enum mop_target {
+ enum mop_access_mode {
+ READ,
+ WRITE,
++ CMPXCHG,
+ };
+
+ struct mop_desc {
@@ tools/testing/selftests/kvm/s390x/memop.c: struct mop_desc {
enum mop_access_mode mode;
void *buf;
uint32_t sida_offset;
+ void *old;
++ uint8_t old_value[16];
+ bool *cmpxchg_success;
uint8_t ar;
uint8_t key;
};
+
+ const uint8_t NO_KEY = 0xff;
+
+-static struct kvm_s390_mem_op ksmo_from_desc(const struct mop_desc *desc)
++static struct kvm_s390_mem_op ksmo_from_desc(struct mop_desc *desc)
+ {
+ struct kvm_s390_mem_op ksmo = {
+ .gaddr = (uintptr_t)desc->gaddr,
@@ tools/testing/selftests/kvm/s390x/memop.c: static struct kvm_s390_mem_op ksmo_from_desc(const struct mop_desc *desc)
- ksmo.flags |= KVM_S390_MEMOP_F_SKEY_PROTECTION;
- ksmo.key = desc->key;
- }
-+ if (desc->old) {
-+ ksmo.flags |= KVM_S390_MEMOP_F_CMPXCHG;
-+ ksmo.old_addr = (uint64_t)desc->old;
-+ }
- if (desc->_ar)
- ksmo.ar = desc->ar;
- else
+ ksmo.op = KVM_S390_MEMOP_ABSOLUTE_READ;
+ if (desc->mode == WRITE)
+ ksmo.op = KVM_S390_MEMOP_ABSOLUTE_WRITE;
++ if (desc->mode == CMPXCHG) {
++ ksmo.op = KVM_S390_MEMOP_ABSOLUTE_CMPXCHG;
++ ksmo.old_addr = (uint64_t)desc->old;
++ memcpy(desc->old_value, desc->old, desc->size);
++ }
+ break;
+ case INVALID:
+ ksmo.op = -1;
@@ tools/testing/selftests/kvm/s390x/memop.c: static void print_memop(struct kvm_vcpu *vcpu, const struct kvm_s390_mem_op *ksm
+ case KVM_S390_MEMOP_ABSOLUTE_WRITE:
printf("ABSOLUTE, WRITE, ");
break;
++ case KVM_S390_MEMOP_ABSOLUTE_CMPXCHG:
++ printf("ABSOLUTE, CMPXCHG, ");
++ break;
}
- printf("gaddr=%llu, size=%u, buf=%llu, ar=%u, key=%u",
- ksmo->gaddr, ksmo->size, ksmo->buf, ksmo->ar, ksmo->key);
@@ tools/testing/selftests/kvm/s390x/memop.c: static void print_memop(struct kvm_vc
if (ksmo->flags & KVM_S390_MEMOP_F_CHECK_ONLY)
printf(", CHECK_ONLY");
if (ksmo->flags & KVM_S390_MEMOP_F_INJECT_EXCEPTION)
- printf(", INJECT_EXCEPTION");
- if (ksmo->flags & KVM_S390_MEMOP_F_SKEY_PROTECTION)
- printf(", SKEY_PROTECTION");
-+ if (ksmo->flags & KVM_S390_MEMOP_F_CMPXCHG)
-+ printf(", CMPXCHG");
+@@ tools/testing/selftests/kvm/s390x/memop.c: static void print_memop(struct kvm_vcpu *vcpu, const struct kvm_s390_mem_op *ksm
puts(")");
}
@@ tools/testing/selftests/kvm/s390x/memop.c: static void print_memop(struct kvm_vc
+ int r;
+
+ r = err_memop_ioctl(info, ksmo, desc);
-+ if (ksmo->flags & KVM_S390_MEMOP_F_CMPXCHG) {
-+ if (desc->cmpxchg_success)
-+ *desc->cmpxchg_success = !r;
-+ if (r == KVM_S390_MEMOP_R_NO_XCHG)
-+ r = 0;
++ if (ksmo->op == KVM_S390_MEMOP_ABSOLUTE_CMPXCHG) {
++ if (desc->cmpxchg_success) {
++ int diff = memcmp(desc->old_value, desc->old, desc->size);
++ *desc->cmpxchg_success = !diff;
++ }
+ }
+ TEST_ASSERT(!r, __KVM_IOCTL_ERROR("KVM_S390_MEM_OP", r));
@@ tools/testing/selftests/kvm/s390x/memop.c: static void default_read(struct test_
+ default_write_read(test->vcpu, test->vcpu, LOGICAL, 16, NO_KEY);
+
+ memcpy(&old, mem1, 16);
-+ CHECK_N_DO(MOP, test->vm, ABSOLUTE, WRITE, new + offset,
-+ size, GADDR_V(mem1 + offset),
-+ CMPXCHG_OLD(old + offset),
-+ CMPXCHG_SUCCESS(&succ), KEY(key));
++ MOP(test->vm, ABSOLUTE, CMPXCHG, new + offset,
++ size, GADDR_V(mem1 + offset),
++ CMPXCHG_OLD(old + offset),
++ CMPXCHG_SUCCESS(&succ), KEY(key));
+ HOST_SYNC(test->vcpu, STAGE_COPIED);
+ MOP(test->vm, ABSOLUTE, READ, mem2, 16, GADDR_V(mem2));
+ TEST_ASSERT(succ, "exchange of values should succeed");
@@ tools/testing/selftests/kvm/s390x/memop.c: static void default_read(struct test_
+ memcpy(&old, mem1, 16);
+ new[offset]++;
+ old[offset]++;
-+ CHECK_N_DO(MOP, test->vm, ABSOLUTE, WRITE, new + offset,
-+ size, GADDR_V(mem1 + offset),
-+ CMPXCHG_OLD(old + offset),
-+ CMPXCHG_SUCCESS(&succ), KEY(key));
++ MOP(test->vm, ABSOLUTE, CMPXCHG, new + offset,
++ size, GADDR_V(mem1 + offset),
++ CMPXCHG_OLD(old + offset),
++ CMPXCHG_SUCCESS(&succ), KEY(key));
+ HOST_SYNC(test->vcpu, STAGE_COPIED);
+ MOP(test->vm, ABSOLUTE, READ, mem2, 16, GADDR_V(mem2));
+ TEST_ASSERT(!succ, "exchange of values should not succeed");
@@ tools/testing/selftests/kvm/s390x/memop.c: static void test_copy_key(void)
+ do {
+ old = 0;
+ new = 1;
-+ MOP(t.vm, ABSOLUTE, WRITE, &new,
++ MOP(t.vm, ABSOLUTE, CMPXCHG, &new,
+ sizeof(new), GADDR_V(mem1),
+ CMPXCHG_OLD(&old),
+ CMPXCHG_SUCCESS(&success), KEY(1));
@@ tools/testing/selftests/kvm/s390x/memop.c: static void test_copy_key(void)
+ choose_block(false, i + j, &size, &offset);
+ do {
+ new = permutate_bits(false, i + j, size, old);
-+ MOP(t.vm, ABSOLUTE, WRITE, quad_to_char(&new, size),
++ MOP(t.vm, ABSOLUTE, CMPXCHG, quad_to_char(&new, size),
+ size, GADDR_V(mem2 + offset),
+ CMPXCHG_OLD(quad_to_char(&old, size)),
+ CMPXCHG_SUCCESS(&success), KEY(1));
@@ tools/testing/selftests/kvm/s390x/memop.c: static void test_errors_key(void)
+ for (i = 1; i <= 16; i *= 2) {
+ __uint128_t old = 0;
+
-+ CHECK_N_DO(ERR_PROT_MOP, t.vm, ABSOLUTE, WRITE, mem2, i, GADDR_V(mem2),
-+ CMPXCHG_OLD(&old), KEY(2));
++ ERR_PROT_MOP(t.vm, ABSOLUTE, CMPXCHG, mem2, i, GADDR_V(mem2),
++ CMPXCHG_OLD(&old), KEY(2));
+ }
+
+ kvm_vm_free(t.kvm_vm);
@@ tools/testing/selftests/kvm/s390x/memop.c: static void test_errors(void)
+ power *= 2;
+ continue;
+ }
-+ rv = ERR_MOP(t.vm, ABSOLUTE, WRITE, mem1, i, GADDR_V(mem1),
++ rv = ERR_MOP(t.vm, ABSOLUTE, CMPXCHG, mem1, i, GADDR_V(mem1),
+ CMPXCHG_OLD(&old));
+ TEST_ASSERT(rv == -1 && errno == EINVAL,
+ "ioctl allows bad size for cmpxchg");
+ }
+ for (i = 1; i <= 16; i *= 2) {
-+ rv = ERR_MOP(t.vm, ABSOLUTE, WRITE, mem1, i, GADDR((void *)~0xfffUL),
++ rv = ERR_MOP(t.vm, ABSOLUTE, CMPXCHG, mem1, i, GADDR((void *)~0xfffUL),
+ CMPXCHG_OLD(&old));
+ TEST_ASSERT(rv > 0, "ioctl allows bad guest address for cmpxchg");
-+ rv = ERR_MOP(t.vm, ABSOLUTE, READ, mem1, i, GADDR_V(mem1),
-+ CMPXCHG_OLD(&old));
-+ TEST_ASSERT(rv == -1 && errno == EINVAL,
-+ "ioctl allows read cmpxchg call");
+ }
+ for (i = 2; i <= 16; i *= 2) {
-+ rv = ERR_MOP(t.vm, ABSOLUTE, WRITE, mem1, i, GADDR_V(mem1 + 1),
++ rv = ERR_MOP(t.vm, ABSOLUTE, CMPXCHG, mem1, i, GADDR_V(mem1 + 1),
+ CMPXCHG_OLD(&old));
+ TEST_ASSERT(rv == -1 && errno == EINVAL,
+ "ioctl allows bad alignment for cmpxchg");
--
2.34.1
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As
kernel already has soft-dirty feature inside which we have given up to
use, we are using written-to terminology while using UFFD async WP under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get soft-dirty/Written-to status and clear present in
the kernel.
- The pages which have been written-to can not be found in accurate way.
(Kernel's soft-dirty PTE bit + sof_dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
getWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirtyi feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [2][3] as this is how the feature was written years ago.
It shouldn't be updated to changed behaviour. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
The information related to pages if the page is file mapped, present and
swapped is required for the CRIU project [5][6]. The addition of the
required mask, any mask, excluded mask and return masks are also required
for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specific masks.
The page addresses are returned in struct page_region in a compact form.
The max_pages is needed to support a use case where user only wants to get
a specific number of pages. So there is no need to find all the pages of
interest in the range when max_pages is specified. The IOCTL returns when
the maximum number of the pages are found. The max_pages is optional. If
max_pages is specified, it must be equal or greater than the vec_size.
This restriction is needed to handle worse case when one page_region only
contains info of one page and it cannot be compacted. This is needed to
emulate the Windows getWriteWatch() syscall.
The patch series include the detailed selftest which can be used as an example
for the uffd async wp test and PAGEMAP_IOCTL. It shows the interface usages as
well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (3):
userfaultfd: Add UFFD WP Async support
fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about
PTEs
selftests: vm: add pagemap ioctl tests
fs/proc/task_mmu.c | 290 +++++++
fs/userfaultfd.c | 11 +
include/linux/userfaultfd_k.h | 6 +
include/uapi/linux/fs.h | 50 ++
include/uapi/linux/userfaultfd.h | 8 +-
mm/memory.c | 23 +-
tools/include/uapi/linux/fs.h | 50 ++
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 5 +-
tools/testing/selftests/vm/pagemap_ioctl.c | 881 +++++++++++++++++++++
10 files changed, 1319 insertions(+), 6 deletions(-)
create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
--
2.30.2