User space can use the MEM_OP ioctl to make storage key checked reads
and writes to the guest, however, it has no way of performing atomic,
key checked accesses to the guest.
Extend the MEM_OP ioctl to allow for this by adding a cmpxchg
operation. For now, support this operation for absolute accesses only.
This operation can be used, for example, to set the device-state-change
indicator and the adapter-local-summary indicator atomically.
Also contains some fixes/changes for the memop selftest independent of
the cmpxchg changes.
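As a rough illustration (not part of the series itself), user space could
drive the new op roughly as in the minimal sketch below. It assumes the
UAPI additions from this series (KVM_S390_MEMOP_ABSOLUTE_CMPXCHG and the
old_addr field) plus a VM fd obtained elsewhere; the helper name is made
up for the example.

	/*
	 * Minimal sketch: storage key checked cmpxchg of one byte of
	 * absolute guest storage via the VM KVM_S390_MEM_OP ioctl.
	 */
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int guest_cmpxchg_byte(int vm_fd, uint64_t gaddr,
				      uint8_t expected, uint8_t new_val,
				      uint8_t key)
	{
		uint8_t old = expected;
		struct kvm_s390_mem_op ksmo = {
			.gaddr = gaddr,
			.size = sizeof(new_val),
			.op = KVM_S390_MEMOP_ABSOLUTE_CMPXCHG,
			.buf = (uint64_t)(uintptr_t)&new_val,
			.old_addr = (uint64_t)(uintptr_t)&old,
			.flags = KVM_S390_MEMOP_F_SKEY_PROTECTION,
			.key = key,
		};
		int r = ioctl(vm_fd, KVM_S390_MEM_OP, &ksmo);

		if (r)
			return r;	/* -1/errno or a program interruption code */
		/*
		 * Since v6 success is not reported explicitly; the exchange
		 * took place iff the value at old_addr was left unchanged.
		 */
		return old == expected ? 0 : 1;
	}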
v5 -> v6
* move memop selftest fixes/refactoring to front of series so they can
be picked independently from the rest
* use op instead of flag to indicate cmpxchg
* no longer indicate success of cmpxchg to user space, which can infer
  it by checking whether the old value changed instead
* refactor functions implementing the ioctl
* adjust documentation (drop R-b)
* adjust selftest
* rebase
v4 -> v5
* refuse cmpxchg if not write (thanks Thomas)
* minor doc changes (thanks Claudio)
* picked up R-b's (thanks Thomas & Claudio)
* memop selftest fixes
* rebased
v3 -> v4
* no functional change intended
* rework documentation a bit
* name extension cap cmpxchg bit
* picked up R-b (thanks Thomas)
* various changes (rename variable, comments, ...) see range-diff below
v2 -> v3
* rebase onto the wip/cmpxchg_user_key branch in the s390 kernel repo
* use __uint128_t instead of unsigned __int128
* put moving of testlist into main into separate patch
* pick up R-b's (thanks Nico)
v1 -> v2
* get rid of xrk instruction for cmpxchg byte and short implementation
* pass old parameter via pointer instead of in mem_op struct
* indicate failure of cmpxchg due to wrong old value by special return
code
* picked up R-b's (thanks Thomas)
Janis Schoetterl-Glausch (14):
KVM: s390: selftest: memop: Pass mop_desc via pointer
KVM: s390: selftest: memop: Replace macros by functions
KVM: s390: selftest: memop: Move testlist into main
KVM: s390: selftest: memop: Add bad address test
KVM: s390: selftest: memop: Fix typo
KVM: s390: selftest: memop: Fix wrong address being used in test
KVM: s390: selftest: memop: Fix integer literal
KVM: s390: Move common code of mem_op functions into functions
KVM: s390: Dispatch to implementing function at top level of vm mem_op
KVM: s390: Refactor absolute vm mem_op function
KVM: s390: Refactor absolute vcpu mem_op function
KVM: s390: Extend MEM_OP ioctl by storage key checked cmpxchg
Documentation: KVM: s390: Describe KVM_S390_MEMOP_F_CMPXCHG
KVM: s390: selftest: memop: Add cmpxchg tests
Documentation/virt/kvm/api.rst | 29 +-
include/uapi/linux/kvm.h | 8 +
arch/s390/kvm/gaccess.h | 3 +
arch/s390/kvm/gaccess.c | 103 ++++
arch/s390/kvm/kvm-s390.c | 249 ++++----
tools/testing/selftests/kvm/s390x/memop.c | 675 +++++++++++++++++-----
6 files changed, 819 insertions(+), 248 deletions(-)
Range-diff against v5:
3: 94c1165ae24a = 1: 512e1a3e0ae5 KVM: s390: selftest: memop: Pass mop_desc via pointer
4: 027c87eee0ac = 2: 47328ea64f80 KVM: s390: selftest: memop: Replace macros by functions
5: 16ac410ecc0f = 3: 224fe37eeec7 KVM: s390: selftest: memop: Move testlist into main
7: 2d6776733e64 = 4: f622d3413cf0 KVM: s390: selftest: memop: Add bad address test
8: 8c49eafd2881 = 5: 431f191a8a57 KVM: s390: selftest: memop: Fix typo
9: 0af907110b34 = 6: 3122187435fb KVM: s390: selftest: memop: Fix wrong address being used in test
10: 886c80b2bdce = 7: 401f51f3ef55 KVM: s390: selftest: memop: Fix integer literal
-: ------------ > 8: df09794e0794 KVM: s390: Move common code of mem_op functions into functions
-: ------------ > 9: 5cbae63357ed KVM: s390: Dispatch to implementing function at top level of vm mem_op
-: ------------ > 10: 76ba77b63a26 KVM: s390: Refactor absolute vm mem_op function
-: ------------ > 11: c848e772e22a KVM: s390: Refactor absolute vcpu mem_op function
1: 6adc166ee141 ! 12: 6ccb200ad85c KVM: s390: Extend MEM_OP ioctl by storage key checked cmpxchg
@@ Commit message
and writes to the guest, however, it has no way of performing atomic,
key checked, accesses to the guest.
Extend the MEM_OP ioctl in order to allow for this, by adding a cmpxchg
- mode. For now, support this mode for absolute accesses only.
+ op. For now, support this op for absolute accesses only.
- This mode can be use, for example, to set the device-state-change
+ This op can be use, for example, to set the device-state-change
indicator and the adapter-local-summary indicator atomically.
Signed-off-by: Janis Schoetterl-Glausch <scgl(a)linux.ibm.com>
@@ include/uapi/linux/kvm.h: struct kvm_s390_mem_op {
__u8 ar; /* the access register number */
__u8 key; /* access key, ignored if flag unset */
+ __u8 pad1[6]; /* ignored */
-+ __u64 old_addr; /* ignored if flag unset */
++ __u64 old_addr; /* ignored if cmpxchg flag unset */
};
__u32 sida_offset; /* offset into the sida */
__u8 reserved[32]; /* ignored */
@@ include/uapi/linux/kvm.h: struct kvm_s390_mem_op {
+ #define KVM_S390_MEMOP_SIDA_WRITE 3
+ #define KVM_S390_MEMOP_ABSOLUTE_READ 4
+ #define KVM_S390_MEMOP_ABSOLUTE_WRITE 5
++#define KVM_S390_MEMOP_ABSOLUTE_CMPXCHG 6
++
+ /* flags for kvm_s390_mem_op->flags */
#define KVM_S390_MEMOP_F_CHECK_ONLY (1ULL << 0)
#define KVM_S390_MEMOP_F_INJECT_EXCEPTION (1ULL << 1)
#define KVM_S390_MEMOP_F_SKEY_PROTECTION (1ULL << 2)
-+#define KVM_S390_MEMOP_F_CMPXCHG (1ULL << 3)
-+/* flags specifying extension support */
-+#define KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG 0x2
-+/* Non program exception return codes (pgm codes are 16 bit) */
-+#define KVM_S390_MEMOP_R_NO_XCHG (1 << 16)
++/* flags specifying extension support via KVM_CAP_S390_MEM_OP_EXTENSION */
++#define KVM_S390_MEMOP_EXTENSION_CAP_BASE (1 << 0)
++#define KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG (1 << 1)
++
/* for KVM_INTERRUPT */
struct kvm_interrupt {
+ /* in */
## arch/s390/kvm/gaccess.h ##
@@ arch/s390/kvm/gaccess.h: int access_guest_with_key(struct kvm_vcpu *vcpu, unsigned long ga, u8 ar,
int access_guest_real(struct kvm_vcpu *vcpu, unsigned long gra,
void *data, unsigned long len, enum gacc_mode mode);
-+int cmpxchg_guest_abs_with_key(struct kvm *kvm, gpa_t gpa, int len,
-+ __uint128_t *old, __uint128_t new, u8 access_key);
++int cmpxchg_guest_abs_with_key(struct kvm *kvm, gpa_t gpa, int len, __uint128_t *old,
++ __uint128_t new, u8 access_key, bool *success);
+
/**
* write_guest_with_key - copy data from kernel space to guest space
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ * @gpa: Absolute guest address of the location to be changed.
+ * @len: Operand length of the cmpxchg, required: 1 <= len <= 16. Providing a
+ * non power of two will result in failure.
-+ * @old_addr: Pointer to old value. If the location at @gpa contains this value, the
-+ * exchange will succeed. After calling cmpxchg_guest_abs_with_key() *@old
-+ * contains the value at @gpa before the attempt to exchange the value.
++ * @old_addr: Pointer to old value. If the location at @gpa contains this value,
++ * the exchange will succeed. After calling cmpxchg_guest_abs_with_key()
++ * *@old_addr contains the value at @gpa before the attempt to
++ * exchange the value.
+ * @new: The value to place at @gpa.
+ * @access_key: The access key to use for the guest access.
++ * @success: output value indicating if an exchange occurred.
+ *
+ * Atomically exchange the value at @gpa by @new, if it contains *@old.
+ * Honors storage keys.
+ *
+ * Return: * 0: successful exchange
-+ * * 1: exchange unsuccessful
+ * * a program interruption code indicating the reason cmpxchg could
+ * not be attempted
+ * * -EINVAL: address misaligned or len not power of two
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ */
+int cmpxchg_guest_abs_with_key(struct kvm *kvm, gpa_t gpa, int len,
+ __uint128_t *old_addr, __uint128_t new,
-+ u8 access_key)
++ u8 access_key, bool *success)
+{
+ gfn_t gfn = gpa >> PAGE_SHIFT;
+ struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ u8 old;
+
+ ret = cmpxchg_user_key((u8 *)hva, &old, *old_addr, new, access_key);
-+ ret = ret < 0 ? ret : old != *old_addr;
++ *success = !ret && old == *old_addr;
+ *old_addr = old;
+ break;
+ }
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ u16 old;
+
+ ret = cmpxchg_user_key((u16 *)hva, &old, *old_addr, new, access_key);
-+ ret = ret < 0 ? ret : old != *old_addr;
++ *success = !ret && old == *old_addr;
+ *old_addr = old;
+ break;
+ }
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ u32 old;
+
+ ret = cmpxchg_user_key((u32 *)hva, &old, *old_addr, new, access_key);
-+ ret = ret < 0 ? ret : old != *old_addr;
++ *success = !ret && old == *old_addr;
+ *old_addr = old;
+ break;
+ }
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ u64 old;
+
+ ret = cmpxchg_user_key((u64 *)hva, &old, *old_addr, new, access_key);
-+ ret = ret < 0 ? ret : old != *old_addr;
++ *success = !ret && old == *old_addr;
+ *old_addr = old;
+ break;
+ }
@@ arch/s390/kvm/gaccess.c: int access_guest_real(struct kvm_vcpu *vcpu, unsigned l
+ __uint128_t old;
+
+ ret = cmpxchg_user_key((__uint128_t *)hva, &old, *old_addr, new, access_key);
-+ ret = ret < 0 ? ret : old != *old_addr;
++ *success = !ret && old == *old_addr;
+ *old_addr = old;
+ break;
+ }
@@ arch/s390/kvm/kvm-s390.c: int kvm_vm_ioctl_check_extension(struct kvm *kvm, long
+ case KVM_CAP_S390_MEM_OP_EXTENSION:
+ /*
+ * Flag bits indicating which extensions are supported.
-+ * The first extension doesn't use a flag, but pretend it does,
-+ * this way that can be changed in the future.
++ * If r > 0, the base extension must also be supported/indicated,
++ * in order to maintain backwards compatibility.
+ */
-+ r = KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG | 1;
++ r = KVM_S390_MEMOP_EXTENSION_CAP_BASE |
++ KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG;
+ break;
case KVM_CAP_NR_VCPUS:
case KVM_CAP_MAX_VCPUS:
case KVM_CAP_MAX_VCPU_ID:
-@@ arch/s390/kvm/kvm-s390.c: static bool access_key_invalid(u8 access_key)
- static int kvm_s390_vm_mem_op(struct kvm *kvm, struct kvm_s390_mem_op *mop)
- {
- void __user *uaddr = (void __user *)mop->buf;
+@@ arch/s390/kvm/kvm-s390.c: static int kvm_s390_vm_mem_op_abs(struct kvm *kvm, struct kvm_s390_mem_op *mop)
+ return r;
+ }
+
++static int kvm_s390_vm_mem_op_cmpxchg(struct kvm *kvm, struct kvm_s390_mem_op *mop)
++{
++ void __user *uaddr = (void __user *)mop->buf;
+ void __user *old_addr = (void __user *)mop->old_addr;
+ union {
+ __uint128_t quad;
+ char raw[sizeof(__uint128_t)];
+ } old = { .quad = 0}, new = { .quad = 0 };
+ unsigned int off_in_quad = sizeof(new) - mop->size;
- u64 supported_flags;
- void *tmpbuf = NULL;
- int r, srcu_idx;
-
- supported_flags = KVM_S390_MEMOP_F_SKEY_PROTECTION
-- | KVM_S390_MEMOP_F_CHECK_ONLY;
-+ | KVM_S390_MEMOP_F_CHECK_ONLY
-+ | KVM_S390_MEMOP_F_CMPXCHG;
- if (mop->flags & ~supported_flags || !mop->size)
- return -EINVAL;
- if (mop->size > MEM_OP_MAX_SIZE)
-@@ arch/s390/kvm/kvm-s390.c: static int kvm_s390_vm_mem_op(struct kvm *kvm, struct kvm_s390_mem_op *mop)
- } else {
- mop->key = 0;
- }
-+ if (mop->flags & KVM_S390_MEMOP_F_CMPXCHG) {
-+ /*
-+ * This validates off_in_quad. Checking that size is a power
-+ * of two is not necessary, as cmpxchg_guest_abs_with_key
-+ * takes care of that
-+ */
-+ if (mop->size > sizeof(new))
-+ return -EINVAL;
-+ if (mop->op != KVM_S390_MEMOP_ABSOLUTE_WRITE)
-+ return -EINVAL;
-+ if (copy_from_user(&new.raw[off_in_quad], uaddr, mop->size))
-+ return -EFAULT;
-+ if (copy_from_user(&old.raw[off_in_quad], old_addr, mop->size))
-+ return -EFAULT;
++ int r, srcu_idx;
++ bool success;
++
++ r = mem_op_validate_common(mop, KVM_S390_MEMOP_F_SKEY_PROTECTION);
++ if (r)
++ return r;
++ /*
++ * This validates off_in_quad. Checking that size is a power
++ * of two is not necessary, as cmpxchg_guest_abs_with_key
++ * takes care of that
++ */
++ if (mop->size > sizeof(new))
++ return -EINVAL;
++ if (copy_from_user(&new.raw[off_in_quad], uaddr, mop->size))
++ return -EFAULT;
++ if (copy_from_user(&old.raw[off_in_quad], old_addr, mop->size))
++ return -EFAULT;
++
++ srcu_idx = srcu_read_lock(&kvm->srcu);
++
++ if (kvm_is_error_gpa(kvm, mop->gaddr)) {
++ r = PGM_ADDRESSING;
++ goto out_unlock;
+ }
- if (!(mop->flags & KVM_S390_MEMOP_F_CHECK_ONLY)) {
- tmpbuf = vmalloc(mop->size);
- if (!tmpbuf)
++
++ r = cmpxchg_guest_abs_with_key(kvm, mop->gaddr, mop->size, &old.quad,
++ new.quad, mop->key, &success);
++ if (!success && copy_to_user(old_addr, &old.raw[off_in_quad], mop->size))
++ r = -EFAULT;
++
++out_unlock:
++ srcu_read_unlock(&kvm->srcu, srcu_idx);
++ return r;
++}
++
+ static int kvm_s390_vm_mem_op(struct kvm *kvm, struct kvm_s390_mem_op *mop)
+ {
+ /*
@@ arch/s390/kvm/kvm-s390.c: static int kvm_s390_vm_mem_op(struct kvm *kvm, struct kvm_s390_mem_op *mop)
- case KVM_S390_MEMOP_ABSOLUTE_WRITE: {
- if (mop->flags & KVM_S390_MEMOP_F_CHECK_ONLY) {
- r = check_gpa_range(kvm, mop->gaddr, mop->size, GACC_STORE, mop->key);
-+ } else if (mop->flags & KVM_S390_MEMOP_F_CMPXCHG) {
-+ r = cmpxchg_guest_abs_with_key(kvm, mop->gaddr, mop->size,
-+ &old.quad, new.quad, mop->key);
-+ if (r == 1) {
-+ r = KVM_S390_MEMOP_R_NO_XCHG;
-+ if (copy_to_user(old_addr, &old.raw[off_in_quad], mop->size))
-+ r = -EFAULT;
-+ }
- } else {
- if (copy_from_user(tmpbuf, uaddr, mop->size)) {
- r = -EFAULT;
+ case KVM_S390_MEMOP_ABSOLUTE_READ:
+ case KVM_S390_MEMOP_ABSOLUTE_WRITE:
+ return kvm_s390_vm_mem_op_abs(kvm, mop);
++ case KVM_S390_MEMOP_ABSOLUTE_CMPXCHG:
++ return kvm_s390_vm_mem_op_cmpxchg(kvm, mop);
+ default:
+ return -EINVAL;
+ }
2: fce9a063ab70 ! 13: 4d983d179903 Documentation: KVM: s390: Describe KVM_S390_MEMOP_F_CMPXCHG
@@ Commit message
checked) cmpxchg operations on guest memory.
Signed-off-by: Janis Schoetterl-Glausch <scgl(a)linux.ibm.com>
- Reviewed-by: Claudio Imbrenda <imbrenda(a)linux.ibm.com>
## Documentation/virt/kvm/api.rst ##
@@ Documentation/virt/kvm/api.rst: The fields in each entry are defined as follows:
@@ Documentation/virt/kvm/api.rst: Parameters are specified via the following struc
};
__u32 sida_offset; /* offset into the sida */
__u8 reserved[32]; /* ignored */
-@@ Documentation/virt/kvm/api.rst: Absolute accesses are permitted for non-protected guests only.
- Supported flags:
+@@ Documentation/virt/kvm/api.rst: Possible operations are:
+ * ``KVM_S390_MEMOP_ABSOLUTE_WRITE``
+ * ``KVM_S390_MEMOP_SIDA_READ``
+ * ``KVM_S390_MEMOP_SIDA_WRITE``
++ * ``KVM_S390_MEMOP_ABSOLUTE_CMPXCHG``
+
+ Logical read/write:
+ ^^^^^^^^^^^^^^^^^^^
+@@ Documentation/virt/kvm/api.rst: the checks required for storage key protection as one operation (as opposed to
+ user space getting the storage keys, performing the checks, and accessing
+ memory thereafter, which could lead to a delay between check and access).
+ Absolute accesses are permitted for the VM ioctl if KVM_CAP_S390_MEM_OP_EXTENSION
+-is > 0.
++has the KVM_S390_MEMOP_EXTENSION_CAP_BASE bit set.
+ Currently absolute accesses are not permitted for VCPU ioctls.
+ Absolute accesses are permitted for non-protected guests only.
+
+@@ Documentation/virt/kvm/api.rst: Supported flags:
* ``KVM_S390_MEMOP_F_CHECK_ONLY``
* ``KVM_S390_MEMOP_F_SKEY_PROTECTION``
-+ * ``KVM_S390_MEMOP_F_CMPXCHG``
-+
+
+-The semantics of the flags are as for logical accesses.
+The semantics of the flags common with logical accesses are as for logical
+accesses.
+
-+For write accesses, the KVM_S390_MEMOP_F_CMPXCHG flag is supported if
-+KVM_CAP_S390_MEM_OP_EXTENSION has flag KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG set.
-+In this case, instead of doing an unconditional write, the access occurs
-+only if the target location contains the value pointed to by "old_addr".
++Absolute cmpxchg:
++^^^^^^^^^^^^^^^^^
++
++Perform cmpxchg on absolute guest memory. Intended for use with the
++KVM_S390_MEMOP_F_SKEY_PROTECTION flag.
++Instead of doing an unconditional write, the access occurs only if the target
++location contains the value pointed to by "old_addr".
+This is performed as an atomic cmpxchg with the length specified by the "size"
+parameter. "size" must be a power of two up to and including 16.
+If the exchange did not take place because the target value doesn't match the
-+old value, KVM_S390_MEMOP_R_NO_XCHG is returned.
-+In this case the value "old_addr" points to is replaced by the target value.
-
--The semantics of the flags are as for logical accesses.
++old value, the value "old_addr" points to is replaced by the target value.
++User space can tell if an exchange took place by checking if this replacement
++occurred. The cmpxchg op is permitted for the VM ioctl if
++KVM_CAP_S390_MEM_OP_EXTENSION has flag KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG set.
++
++Supported flags:
++ * ``KVM_S390_MEMOP_F_SKEY_PROTECTION``
SIDA read/write:
^^^^^^^^^^^^^^^^
6: 214281b6eb96 ! 14: 5250be3dd58b KVM: s390: selftest: memop: Add cmpxchg tests
@@ tools/testing/selftests/kvm/s390x/memop.c
#include <linux/bits.h>
+@@ tools/testing/selftests/kvm/s390x/memop.c: enum mop_target {
+ enum mop_access_mode {
+ READ,
+ WRITE,
++ CMPXCHG,
+ };
+
+ struct mop_desc {
@@ tools/testing/selftests/kvm/s390x/memop.c: struct mop_desc {
enum mop_access_mode mode;
void *buf;
uint32_t sida_offset;
+ void *old;
++ uint8_t old_value[16];
+ bool *cmpxchg_success;
uint8_t ar;
uint8_t key;
};
+
+ const uint8_t NO_KEY = 0xff;
+
+-static struct kvm_s390_mem_op ksmo_from_desc(const struct mop_desc *desc)
++static struct kvm_s390_mem_op ksmo_from_desc(struct mop_desc *desc)
+ {
+ struct kvm_s390_mem_op ksmo = {
+ .gaddr = (uintptr_t)desc->gaddr,
@@ tools/testing/selftests/kvm/s390x/memop.c: static struct kvm_s390_mem_op ksmo_from_desc(const struct mop_desc *desc)
- ksmo.flags |= KVM_S390_MEMOP_F_SKEY_PROTECTION;
- ksmo.key = desc->key;
- }
-+ if (desc->old) {
-+ ksmo.flags |= KVM_S390_MEMOP_F_CMPXCHG;
-+ ksmo.old_addr = (uint64_t)desc->old;
-+ }
- if (desc->_ar)
- ksmo.ar = desc->ar;
- else
+ ksmo.op = KVM_S390_MEMOP_ABSOLUTE_READ;
+ if (desc->mode == WRITE)
+ ksmo.op = KVM_S390_MEMOP_ABSOLUTE_WRITE;
++ if (desc->mode == CMPXCHG) {
++ ksmo.op = KVM_S390_MEMOP_ABSOLUTE_CMPXCHG;
++ ksmo.old_addr = (uint64_t)desc->old;
++ memcpy(desc->old_value, desc->old, desc->size);
++ }
+ break;
+ case INVALID:
+ ksmo.op = -1;
@@ tools/testing/selftests/kvm/s390x/memop.c: static void print_memop(struct kvm_vcpu *vcpu, const struct kvm_s390_mem_op *ksm
+ case KVM_S390_MEMOP_ABSOLUTE_WRITE:
printf("ABSOLUTE, WRITE, ");
break;
++ case KVM_S390_MEMOP_ABSOLUTE_CMPXCHG:
++ printf("ABSOLUTE, CMPXCHG, ");
++ break;
}
- printf("gaddr=%llu, size=%u, buf=%llu, ar=%u, key=%u",
- ksmo->gaddr, ksmo->size, ksmo->buf, ksmo->ar, ksmo->key);
@@ tools/testing/selftests/kvm/s390x/memop.c: static void print_memop(struct kvm_vc
if (ksmo->flags & KVM_S390_MEMOP_F_CHECK_ONLY)
printf(", CHECK_ONLY");
if (ksmo->flags & KVM_S390_MEMOP_F_INJECT_EXCEPTION)
- printf(", INJECT_EXCEPTION");
- if (ksmo->flags & KVM_S390_MEMOP_F_SKEY_PROTECTION)
- printf(", SKEY_PROTECTION");
-+ if (ksmo->flags & KVM_S390_MEMOP_F_CMPXCHG)
-+ printf(", CMPXCHG");
+@@ tools/testing/selftests/kvm/s390x/memop.c: static void print_memop(struct kvm_vcpu *vcpu, const struct kvm_s390_mem_op *ksm
puts(")");
}
@@ tools/testing/selftests/kvm/s390x/memop.c: static void print_memop(struct kvm_vc
+ int r;
+
+ r = err_memop_ioctl(info, ksmo, desc);
-+ if (ksmo->flags & KVM_S390_MEMOP_F_CMPXCHG) {
-+ if (desc->cmpxchg_success)
-+ *desc->cmpxchg_success = !r;
-+ if (r == KVM_S390_MEMOP_R_NO_XCHG)
-+ r = 0;
++ if (ksmo->op == KVM_S390_MEMOP_ABSOLUTE_CMPXCHG) {
++ if (desc->cmpxchg_success) {
++ int diff = memcmp(desc->old_value, desc->old, desc->size);
++ *desc->cmpxchg_success = !diff;
++ }
+ }
+ TEST_ASSERT(!r, __KVM_IOCTL_ERROR("KVM_S390_MEM_OP", r));
@@ tools/testing/selftests/kvm/s390x/memop.c: static void default_read(struct test_
+ default_write_read(test->vcpu, test->vcpu, LOGICAL, 16, NO_KEY);
+
+ memcpy(&old, mem1, 16);
-+ CHECK_N_DO(MOP, test->vm, ABSOLUTE, WRITE, new + offset,
-+ size, GADDR_V(mem1 + offset),
-+ CMPXCHG_OLD(old + offset),
-+ CMPXCHG_SUCCESS(&succ), KEY(key));
++ MOP(test->vm, ABSOLUTE, CMPXCHG, new + offset,
++ size, GADDR_V(mem1 + offset),
++ CMPXCHG_OLD(old + offset),
++ CMPXCHG_SUCCESS(&succ), KEY(key));
+ HOST_SYNC(test->vcpu, STAGE_COPIED);
+ MOP(test->vm, ABSOLUTE, READ, mem2, 16, GADDR_V(mem2));
+ TEST_ASSERT(succ, "exchange of values should succeed");
@@ tools/testing/selftests/kvm/s390x/memop.c: static void default_read(struct test_
+ memcpy(&old, mem1, 16);
+ new[offset]++;
+ old[offset]++;
-+ CHECK_N_DO(MOP, test->vm, ABSOLUTE, WRITE, new + offset,
-+ size, GADDR_V(mem1 + offset),
-+ CMPXCHG_OLD(old + offset),
-+ CMPXCHG_SUCCESS(&succ), KEY(key));
++ MOP(test->vm, ABSOLUTE, CMPXCHG, new + offset,
++ size, GADDR_V(mem1 + offset),
++ CMPXCHG_OLD(old + offset),
++ CMPXCHG_SUCCESS(&succ), KEY(key));
+ HOST_SYNC(test->vcpu, STAGE_COPIED);
+ MOP(test->vm, ABSOLUTE, READ, mem2, 16, GADDR_V(mem2));
+ TEST_ASSERT(!succ, "exchange of values should not succeed");
@@ tools/testing/selftests/kvm/s390x/memop.c: static void test_copy_key(void)
+ do {
+ old = 0;
+ new = 1;
-+ MOP(t.vm, ABSOLUTE, WRITE, &new,
++ MOP(t.vm, ABSOLUTE, CMPXCHG, &new,
+ sizeof(new), GADDR_V(mem1),
+ CMPXCHG_OLD(&old),
+ CMPXCHG_SUCCESS(&success), KEY(1));
@@ tools/testing/selftests/kvm/s390x/memop.c: static void test_copy_key(void)
+ choose_block(false, i + j, &size, &offset);
+ do {
+ new = permutate_bits(false, i + j, size, old);
-+ MOP(t.vm, ABSOLUTE, WRITE, quad_to_char(&new, size),
++ MOP(t.vm, ABSOLUTE, CMPXCHG, quad_to_char(&new, size),
+ size, GADDR_V(mem2 + offset),
+ CMPXCHG_OLD(quad_to_char(&old, size)),
+ CMPXCHG_SUCCESS(&success), KEY(1));
@@ tools/testing/selftests/kvm/s390x/memop.c: static void test_errors_key(void)
+ for (i = 1; i <= 16; i *= 2) {
+ __uint128_t old = 0;
+
-+ CHECK_N_DO(ERR_PROT_MOP, t.vm, ABSOLUTE, WRITE, mem2, i, GADDR_V(mem2),
-+ CMPXCHG_OLD(&old), KEY(2));
++ ERR_PROT_MOP(t.vm, ABSOLUTE, CMPXCHG, mem2, i, GADDR_V(mem2),
++ CMPXCHG_OLD(&old), KEY(2));
+ }
+
+ kvm_vm_free(t.kvm_vm);
@@ tools/testing/selftests/kvm/s390x/memop.c: static void test_errors(void)
+ power *= 2;
+ continue;
+ }
-+ rv = ERR_MOP(t.vm, ABSOLUTE, WRITE, mem1, i, GADDR_V(mem1),
++ rv = ERR_MOP(t.vm, ABSOLUTE, CMPXCHG, mem1, i, GADDR_V(mem1),
+ CMPXCHG_OLD(&old));
+ TEST_ASSERT(rv == -1 && errno == EINVAL,
+ "ioctl allows bad size for cmpxchg");
+ }
+ for (i = 1; i <= 16; i *= 2) {
-+ rv = ERR_MOP(t.vm, ABSOLUTE, WRITE, mem1, i, GADDR((void *)~0xfffUL),
++ rv = ERR_MOP(t.vm, ABSOLUTE, CMPXCHG, mem1, i, GADDR((void *)~0xfffUL),
+ CMPXCHG_OLD(&old));
+ TEST_ASSERT(rv > 0, "ioctl allows bad guest address for cmpxchg");
-+ rv = ERR_MOP(t.vm, ABSOLUTE, READ, mem1, i, GADDR_V(mem1),
-+ CMPXCHG_OLD(&old));
-+ TEST_ASSERT(rv == -1 && errno == EINVAL,
-+ "ioctl allows read cmpxchg call");
+ }
+ for (i = 2; i <= 16; i *= 2) {
-+ rv = ERR_MOP(t.vm, ABSOLUTE, WRITE, mem1, i, GADDR_V(mem1 + 1),
++ rv = ERR_MOP(t.vm, ABSOLUTE, CMPXCHG, mem1, i, GADDR_V(mem1 + 1),
+ CMPXCHG_OLD(&old));
+ TEST_ASSERT(rv == -1 && errno == EINVAL,
+ "ioctl allows bad alignment for cmpxchg");
--
2.34.1
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
Hello,
Note:
Soft-dirty pages and pages which have been written to are synonyms. As
the kernel already has a soft-dirty feature, which we have given up on
using, we use the written-to terminology while using UFFD async WP
under the hood.
This IOCTL, PAGEMAP_SCAN on the pagemap file, can be used to get and/or
clear the info about page table entries. The following operations are
supported in this ioctl:
- Get the information whether the pages have been written to
  (PAGE_IS_WRITTEN), are file mapped (PAGE_IS_FILE), present
  (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
  pages have been written to.
- Find pages which have been written to and write-protect them
  (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic way in the kernel to get the soft-dirty/written-to
  status and clear it.
- The pages which have been written to cannot be found in an accurate
  way. (The kernel's soft-dirty PTE bit + soft_dirty VMA bit shows more
  soft-dirty pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have a use case where we need to track the soft-dirty PTE bit for
only specific pages on demand. We need this tracking and clearing
mechanism for a region of memory while the process is running, to
emulate the getWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of the soft-dirty feature to find pages
which have been written to, as of the v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [2][3] as this is how the feature was written years ago,
and its behaviour shouldn't be changed. Peter Xu has suggested using
the async version of the UFFD WP [4] as it is based inherently on the
PTEs.
So in this patch series, I've added a new mode to UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written to can be found by reading
the pagemap file (!PM_UFFD_WP). This feature can be used successfully
to find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
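For context (and not part of this series): PM_UFFD_WP refers to the
existing bit 57 ("pte is uffd-wp write-protected") of a 64-bit
/proc/<pid>/pagemap entry. A minimal sketch of how user space could test
that bit for a single address, assuming the range was previously write
protected with the async UFFD WP mode (the helper name is made up):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	#define PM_UFFD_WP_BIT	57	/* "pte is uffd-wp write-protected" */

	/*
	 * Returns 1 if the page containing vaddr is no longer uffd-wp write
	 * protected (i.e. it has been written to since the async protection
	 * was applied), 0 if it is still protected, -1 on error.
	 */
	static int page_written(pid_t pid, unsigned long vaddr)
	{
		char path[64];
		uint64_t entry;
		long psize = sysconf(_SC_PAGESIZE);
		int fd, ret = -1;

		snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			return -1;
		if (pread(fd, &entry, sizeof(entry),
			  (off_t)(vaddr / psize) * sizeof(entry)) == sizeof(entry))
			ret = !(entry & (1ULL << PM_UFFD_WP_BIT));
		close(fd);
		return ret;
	}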
The information about whether a page is file mapped, present or swapped
is required for the CRIU project [5][6]. The addition of the required
mask, any mask, excluded mask and return mask is also required for the
CRIU project [5].
The IOCTL returns the addresses of the pages which match the specified
masks. The page addresses are returned in struct page_region in a
compact form.
The max_pages is needed to support a use case where the user only wants
to get a specific number of pages. So there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. The max_pages
is optional. If max_pages is specified, it must be equal to or greater
than the vec_size. This restriction is needed to handle the worst case
where a page_region only contains info for one page and cannot be
compacted. This is needed to emulate the Windows getWriteWatch()
syscall.
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (3):
userfaultfd: Add UFFD WP Async support
fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about
PTEs
selftests: vm: add pagemap ioctl tests
fs/proc/task_mmu.c | 290 +++++++
fs/userfaultfd.c | 11 +
include/linux/userfaultfd_k.h | 6 +
include/uapi/linux/fs.h | 50 ++
include/uapi/linux/userfaultfd.h | 8 +-
mm/memory.c | 23 +-
tools/include/uapi/linux/fs.h | 50 ++
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 5 +-
tools/testing/selftests/vm/pagemap_ioctl.c | 881 +++++++++++++++++++++
10 files changed, 1319 insertions(+), 6 deletions(-)
create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
--
2.30.2
The root cause is kvm_lapic_set_base() failing to handle the x2APIC ->
xAPIC ID switch, which is addressed by patch 1.
Patch 2 provides a selftest to verify this behavior.
This series is an RFC because I think that commit ef40757743b47 already tries
to fix one such effect of the error made in kvm_lapic_set_base(), but I am not
sure how the error described in that commit message is triggered, nor how to
reproduce it using a selftest. I don't think one can enable/disable x2APIC using
KVM_SET_LAPIC, and kvm_lapic_set_base() in kvm_apic_set_state() just takes care
of updating apic->base_address, since value == old_value.
The test in patch 2 fails with the fix in ef40757743b47.
Thank you,
Emanuele
Emanuele Giuseppe Esposito (2):
KVM: x86: update APIC_ID also when disabling x2APIC in
kvm_lapic_set_base
KVM: selftests: APIC_ID must be correctly updated when disabling
x2apic
arch/x86/kvm/lapic.c | 8 ++-
.../selftests/kvm/x86_64/xapic_state_test.c | 64 +++++++++++++++++++
2 files changed, 70 insertions(+), 2 deletions(-)
--
2.31.1
During early development a dependency was added on having FA64
available so we could use the full FPSIMD register set in the signal
handler. Subsequently the ABI was finalised so that the handler is run
with streaming mode disabled, meaning this is redundant, but the
dependency was never removed. Do so now.
Signed-off-by: Mark Brown <broonie@kernel.org>
---
tools/testing/selftests/arm64/signal/testcases/ssve_regs.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/tools/testing/selftests/arm64/signal/testcases/ssve_regs.c b/tools/testing/selftests/arm64/signal/testcases/ssve_regs.c
index d0a178945b1a..f0985da7936e 100644
--- a/tools/testing/selftests/arm64/signal/testcases/ssve_regs.c
+++ b/tools/testing/selftests/arm64/signal/testcases/ssve_regs.c
@@ -116,12 +116,7 @@ static int sme_regs(struct tdescr *td, siginfo_t *si, ucontext_t *uc)
struct tdescr tde = {
.name = "Streaming SVE registers",
.descr = "Check that we get the right Streaming SVE registers reported",
- /*
- * We shouldn't require FA64 but things like memset() used in the
- * helpers might use unsupported instructions so for now disable
- * the test unless we've got the full instruction set.
- */
- .feats_required = FEAT_SME | FEAT_SME_FA64,
+ .feats_required = FEAT_SME,
.timeout = 3,
.init = sme_get_vls,
.run = sme_regs,
---
base-commit: b7bfaa761d760e72a969d116517eaa12e404c262
change-id: 20230131-arm64-kselfetest-ssve-fa64-cec2031da43f
Best regards,
--
Mark Brown <broonie@kernel.org>
These two patches fix a repeated error with the way we enumerate SME
VLs, the code for which is cut'n'pasted into each test. It's in two
patches because the first applies to Linus' tree and the second covers
a new test added in -next; even though they're both applied for -next
now, this should help with backporting.
It would be good to factor this code out but that's a separate issue;
I'll tackle that for the next release (along with the general fun with
the build system in these tests).
Signed-off-by: Mark Brown <broonie@kernel.org>
---
Mark Brown (2):
kselftest/arm64: Fix enumeration of systems without 128 bit SME
kselftest/arm64: Fix enumeration of systems without 128 bit SME for SSVE+ZA
tools/testing/selftests/arm64/signal/testcases/ssve_regs.c | 4 ++++
tools/testing/selftests/arm64/signal/testcases/ssve_za_regs.c | 4 ++++
tools/testing/selftests/arm64/signal/testcases/za_regs.c | 4 ++++
3 files changed, 12 insertions(+)
---
base-commit: 8154ffb7a51882c00730952ed21d80ed76f165d7
change-id: 20230131-arm64-kselftest-sig-sme-no-128-8dd219305a32
Best regards,
--
Mark Brown <broonie@kernel.org>
It was found that the check to see if a partition could use up all
the cpus from the parent cpuset in update_parent_subparts_cpumask()
was incorrect. As a result, it is possible to leave the parent with no
effective cpu left even if there are tasks in the parent cpuset. This
can lead to a system panic as reported in [1].
Fix this problem by updating the check to fail enabling the partition
if the parent's effective_cpus is a subset of the child's cpus_allowed.
Also record the error code when an error happens in update_prstate()
and add a test case where the parent partition and the child have the
same cpu list and the parent has a task. Enabling a partition in the
child will fail in this case.
[1] https://www.spinics.net/lists/cgroups/msg36254.html
Fixes: f0af1bfc27b5 ("cgroup/cpuset: Relax constraints to partition & cpus changes")
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 3 ++-
tools/testing/selftests/cgroup/test_cpuset_prs.sh | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a29c0b13706b..205dc9edcaa9 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1346,7 +1346,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
* A parent can be left with no CPU as long as there is no
* task directly associated with the parent partition.
*/
- if (!cpumask_intersects(cs->cpus_allowed, parent->effective_cpus) &&
+ if (cpumask_subset(parent->effective_cpus, cs->cpus_allowed) &&
partition_is_populated(parent, cs))
return PERR_NOCPUS;
@@ -2324,6 +2324,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
new_prs = -new_prs;
spin_lock_irq(&callback_lock);
cs->partition_root_state = new_prs;
+ WRITE_ONCE(cs->prs_err, err);
spin_unlock_irq(&callback_lock);
/*
* Update child cpusets, if present.
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 186e1c26867e..75c100de90ff 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -268,6 +268,7 @@ TEST_MATRIX=(
# Taking away all CPUs from parent or itself if there are tasks
# will make the partition invalid.
" S+ C2-3:P1:S+ C3:P1 . . T C2-3 . . 0 A1:2-3,A2:2-3 A1:P1,A2:P-1"
+ " S+ C3:P1:S+ C3 . . T P1 . . 0 A1:3,A2:3 A1:P1,A2:P-1"
" S+ $SETUP_A123_PARTITIONS . T:C2-3 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
" S+ $SETUP_A123_PARTITIONS . T:C2-3:C1-3 . . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
--
2.31.1