The upcoming new Idle HLT Intercept feature allows for the HLT
instruction execution by a vCPU to be intercepted by the hypervisor
only if there are no pending V_INTR and V_NMI events for the vCPU.
When the vCPU is expected to service the pending V_INTR and V_NMI
events, the Idle HLT intercept won’t trigger. The feature allows the
hypervisor to determine if the vCPU is actually idle and reduces
wasteful VMEXITs.
The idle HLT intercept feature is used for enlightened guests who wish
to securely handle the events. When an enlightened guest does a HLT
while an interrupt is pending, hypervisor will not have a way to
figure out whether the guest needs to be re-entered or not. The Idle
HLT intercept feature allows the HLT execution only if there are no
pending V_INTR and V_NMI events.
Presence of the Idle HLT Intercept feature is indicated via CPUID
function Fn8000_000A_EDX[30].
Document for the Idle HLT intercept feature is available at [1].
This series is based on kvm-next/next (64dbb3a771a1) + [2].
Experiments done:
----------------
kvm_amd.avic is set to '0' for this experiment.
The below numbers represent the average of 10 runs.
Normal guest (L1)
The below netperf command was run on the guest with smp = 1 (pinned).
netperf -H <host ip> -t TCP_RR -l 60
----------------------------------------------------------------
|with Idle HLT(transactions/Sec)|w/o Idle HLT(transactions/Sec)|
----------------------------------------------------------------
| 25645.7136 | 25773.2796 |
----------------------------------------------------------------
Number of transactions/sec with and without idle HLT intercept feature
are almost same.
Nested guest (L2)
The below netperf command was run on L2 guest with smp = 1 (pinned).
netperf -H <host ip> -t TCP_RR -l 60
----------------------------------------------------------------
|with Idle HLT(transactions/Sec)|w/o Idle HLT(transactions/Sec)|
----------------------------------------------------------------
| 5655.4468 | 5755.2189 |
----------------------------------------------------------------
Number of transactions/sec with and without idle HLT intercept feature
are almost same.
Testing Done:
- Tested the functionality for the Idle HLT intercept feature
using selftest svm_idle_hlt_test.
- Tested SEV and SEV-ES guest for the Idle HLT intercept functionality.
- Tested the Idle HLT intercept functionality on nested guest.
v3 -> v4
- Drop the patches to add vcpu_get_stat() into a new series [2].
- Added nested Idle HLT intercept support.
v2 -> v3
- Incorporated Andrew's suggestion to structure vcpu_stat_types in
a way that each architecture can share the generic types and also
provide its own.
v1 -> v2
- Done changes in svm_idle_hlt_test based on the review comments from Sean.
- Added an enum based approach to get binary stats in vcpu_get_stat() which
doesn't use string to get stat data based on the comments from Sean.
- Added self_halt() and cli() helpers based on the comments from Sean.
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024,
Vol 2, 15.9 Instruction Intercepts (Table 15-7: IDLE_HLT).
https://bugzilla.kernel.org/attachment.cgi?id=306250
[2]: https://lore.kernel.org/kvm/20241021062226.108657-1-manali.shukla@amd.com/T…
Manali Shukla (4):
x86/cpufeatures: Add CPUID feature bit for Idle HLT intercept
KVM: SVM: Add Idle HLT intercept support
KVM: nSVM: implement the nested idle halt intercept
KVM: selftests: KVM: SVM: Add Idle HLT intercept test
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/svm.h | 1 +
arch/x86/include/uapi/asm/svm.h | 2 +
arch/x86/kvm/governed_features.h | 1 +
arch/x86/kvm/svm/nested.c | 7 ++
arch/x86/kvm/svm/svm.c | 15 +++-
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/x86_64/processor.h | 1 +
.../selftests/kvm/x86_64/svm_idle_hlt_test.c | 89 +++++++++++++++++++
9 files changed, 115 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/svm_idle_hlt_test.c
base-commit: c8d430db8eec7d4fd13a6bea27b7086a54eda6da
prerequisite-patch-id: ca912571db5c004f77b70843b8dd35517ff1267f
prerequisite-patch-id: 164ea3b4346f9e04bc69819278d20f5e1b5df5ed
prerequisite-patch-id: 90d870f426ebc2cec43c0dd89b701ee998385455
prerequisite-patch-id: 45812b799c517a4521782a1fdbcda881237e1eda
--
2.34.1
This series was prompted by feedback given in [1].
Patch 1 : Adds safe_hlt() and cli() helpers.
Patch 2, 3: Adds an interface to read vcpu stat in selftest. Adds
a macro to generate compiler error to detect typos at
compile time while parsing vcpu and vm stats.
Patch 4 : Fix few of the selftests based on newly defined macro.
This series was split from the Idle HLT intercept support series [2]
because the series has a few changes in the vm_get_stat() interface
as suggested in [1] and a few changes in two of the self-tests
(nx_huge_pages_test.c and dirty_log_page_splitting_test.c) which use
vm_get_stat() functionality to retrieve specified VM stats. These
changes are unrelated to the Idle HLT intercept support series [2].
[1] https://lore.kernel.org/kvm/ZruDweYzQRRcJeTO@google.com/T/#m7cd7a110f0fcff9…
[2] https://lore.kernel.org/kvm/ZruDweYzQRRcJeTO@google.com/T/#m6c67ca8ccb226e5…
Manali Shukla (4):
KVM: selftests: Add safe_halt() and cli() helpers to common code
KVM: selftests: Add an interface to read the data of named vcpu stat
KVM: selftests: convert vm_get_stat to macro
KVM: selftests: Replace previously used vm_get_stat() to macro
.../testing/selftests/kvm/include/kvm_util.h | 83 +++++++++++++++++--
.../kvm/include/x86_64/kvm_util_arch.h | 52 ++++++++++++
.../selftests/kvm/include/x86_64/processor.h | 17 ++++
tools/testing/selftests/kvm/lib/kvm_util.c | 40 +++++++++
.../x86_64/dirty_log_page_splitting_test.c | 6 +-
.../selftests/kvm/x86_64/nx_huge_pages_test.c | 4 +-
6 files changed, 191 insertions(+), 11 deletions(-)
base-commit: c8d430db8eec7d4fd13a6bea27b7086a54eda6da
--
2.34.1
Hi,
Here is the v6 patch to support polling on event 'hist' file.
The previous version is here;
https://lore.kernel.org/all/172398710447.295714.4489282566285719918.stgit@d…
This version is rebased on the ftrace/for-next branch of the
linux-trace tree, and use global irq_work and wq instead of per-event
one.
Background
----------
There has been interest in allowing user programs to monitor kernel
events in real time. Ftrace provides `trace_pipe` interface to wait
on events in the ring buffer, but it is needed to wait until filling
up a page with events in the ring buffer. We can also peek the
`trace` file periodically, but that is inefficient way to monitor
a randomely happening event.
Overview
--------
This patch set allows user to `poll`(or `select`, `epoll`) on event
histogram interface. As you know each event has its own `hist` file
which shows histograms generated by trigger action. So user can set
a new hist trigger on any event you want to monitor, and poll on the
`hist` file until it is updated.
There are 2 poll events are supported, POLLIN and POLLPRI. POLLIN
means that there are any readable update on `hist` file and this
event will be flashed only when you call read(). So, this is
useful if you want to read the histogram periodically.
The other POLLPRI event is for monitoring trace event. Like the
POLLIN, this will be returned when the histogram is updated, but
you don't need to read() the file and use poll() again.
Note that this waits for histogram update (not event arrival), thus
you must set a histogram on the event at first.
Usage
-----
Here is an example usage:
----
TRACEFS=/sys/kernel/tracing
EVENT=$TRACEFS/events/sched/sched_process_free
# setup histogram trigger and enable event
echo "hist:key=comm" >> $EVENT/trigger
echo 1 > $EVENT/enable
# Wait for update
poll pri $EVENT/hist
# Event arrived.
echo "process free event is comming"
tail $TRACEFS/trace
----
The 'poll' command is in the selftest patch.
You can take this series also from here;
https://git.kernel.org/pub/scm/linux/kernel/git/mhiramat/linux.git/log/?h=t…
Thank you,
---
Masami Hiramatsu (Google) (3):
tracing/hist: Add poll(POLLIN) support on hist file
tracing/hist: Support POLLPRI event for poll on histogram
selftests/tracing: Add hist poll() support test
include/linux/trace_events.h | 14 +++
kernel/trace/trace_events.c | 14 +++
kernel/trace/trace_events_hist.c | 100 +++++++++++++++++++-
tools/testing/selftests/ftrace/Makefile | 2
tools/testing/selftests/ftrace/poll.c | 74 +++++++++++++++
.../ftrace/test.d/trigger/trigger-hist-poll.tc | 74 +++++++++++++++
6 files changed, 275 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/poll.c
create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc
--
Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Since commit e87412e621f1 ("integrate Zaamo and Zalrsc text (#1304)"),
the A extension has been described as a set of instructions provided by
Zaamo and Zalrsc. Add these two extensions.
This series is based on the Zc one [1].
Link: https://lore.kernel.org/linux-riscv/20240619113529.676940-1-cleger@rivosinc…
---
Clément Léger (5):
dt-bindings: riscv: add Zaamo and Zalrsc ISA extension description
riscv: add parsing for Zaamo and Zalrsc extensions
riscv: hwprobe: export Zaamo and Zalrsc extensions
RISC-V: KVM: Allow Zaamo/Zalrsc extensions for Guest/VM
KVM: riscv: selftests: Add Zaamo/Zalrsc extensions to get-reg-list
test
Documentation/arch/riscv/hwprobe.rst | 8 ++++++++
.../devicetree/bindings/riscv/extensions.yaml | 19 +++++++++++++++++++
arch/riscv/include/asm/hwcap.h | 2 ++
arch/riscv/include/uapi/asm/hwprobe.h | 2 ++
arch/riscv/include/uapi/asm/kvm.h | 2 ++
arch/riscv/kernel/cpufeature.c | 9 ++++++++-
arch/riscv/kernel/sys_hwprobe.c | 2 ++
arch/riscv/kvm/vcpu_onereg.c | 4 ++++
.../selftests/kvm/riscv/get-reg-list.c | 8 ++++++++
9 files changed, 55 insertions(+), 1 deletion(-)
--
2.45.2
These patches are all simple fixes with no strong dependency though,
I hope that making them a patchset will be more convenient for merge.
The patchset are based on v6.12-rc2.
Chunyan Zhang (4):
riscv: Remove unused GENERATING_ASM_OFFSETS
riscv: Remove duplicated GET_RM
selftest/mm: Fix typo in virtual_address_range
selftests/mm: skip virtual_address_range tests on riscv
arch/riscv/kernel/asm-offsets.c | 2 --
arch/riscv/kernel/traps_misaligned.c | 2 --
tools/testing/selftests/mm/Makefile | 2 ++
tools/testing/selftests/mm/run_vmtests.sh | 10 ++++++----
tools/testing/selftests/mm/virtual_address_range.c | 4 ++--
5 files changed, 10 insertions(+), 10 deletions(-)
--
2.34.1
This series introduces a new ioctl KVM_TRANSLATE2, which expands on
KVM_TRANSLATE. It is required to implement Hyper-V's
HvTranslateVirtualAddress hyper-call as part of the ongoing effort to
emulate HyperV's Virtual Secure Mode (VSM) within KVM and QEMU. The hyper-
call requires several new KVM APIs, one of which is KVM_TRANSLATE2, which
implements the core functionality of the hyper-call. The rest of the
required functionality will be implemented in subsequent series.
Other than translating guest virtual addresses, the ioctl allows the
caller to control whether the access and dirty bits are set during the
page walk. It also allows specifying an access mode instead of returning
viable access modes, which enables setting the bits up to the level that
caused a failure. Additionally, the ioctl provides more information about
why the page walk failed, and which page table is responsible. This
functionality is not available within KVM_TRANSLATE, and can't be added
without breaking backwards compatiblity, thus a new ioctl is required.
The ioctl was designed to facilitate as many other use cases as possible
apart from VSM. The error codes were intentionally chosen to be broad
enough to avoid exposing architecture specific details. Even though
HvTranslateVirtualAddress only really needs one flag to set the accessed
and dirty bits whenever possible, that was split into several flags so
that future users can chose more gradually when these bits should be set.
Furthermore, as much information as possible is provided to the caller.
The patch series includes selftests for the ioctl, as well as fuzzy
testing on random garbage guest page table entries. All previously passing
KVM selftests and KVM unit tests still pass.
Series overview:
- 1: Document the new ioctl
- 2-11: Update the page walker in preparation
- 12-14: Implement the ioctl
- 15: Implement testing
This series, alongside the series by Nicolas Saenz Julienne [1]
introducing the core building blocks for VSM and the accompanying QEMU
implementation [2], is capable of booting Windows Server 2019.
Both series are also available on GitHub [3].
[1] https://lore.kernel.org/linux-hyperv/20240609154945.55332-1-nsaenz@amazon.c…
[2] https://github.com/vianpl/qemu/tree/vsm/next
[3] https://github.com/vianpl/linux/tree/vsm/next
Best,
Nikolas
Nikolas Wipper (15):
KVM: Add API documentation for KVM_TRANSLATE2
KVM: x86/mmu: Abort page walk if permission checks fail
KVM: x86/mmu: Introduce exception flag for unmapped GPAs
KVM: x86/mmu: Store GPA in exception if applicable
KVM: x86/mmu: Introduce flags parameter to page walker
KVM: x86/mmu: Implement PWALK_SET_ACCESSED in page walker
KVM: x86/mmu: Implement PWALK_SET_DIRTY in page walker
KVM: x86/mmu: Implement PWALK_FORCE_SET_ACCESSED in page walker
KVM: x86/mmu: Introduce status parameter to page walker
KVM: x86/mmu: Implement PWALK_STATUS_READ_ONLY_PTE_GPA in page walker
KVM: x86: Introduce generic gva to gpa translation function
KVM: Introduce KVM_TRANSLATE2
KVM: Add KVM_TRANSLATE2 stub
KVM: x86: Implement KVM_TRANSLATE2
KVM: selftests: Add test for KVM_TRANSLATE2
Documentation/virt/kvm/api.rst | 131 ++++++++
arch/x86/include/asm/kvm_host.h | 18 +-
arch/x86/kvm/hyperv.c | 3 +-
arch/x86/kvm/kvm_emulate.h | 8 +
arch/x86/kvm/mmu.h | 10 +-
arch/x86/kvm/mmu/mmu.c | 7 +-
arch/x86/kvm/mmu/paging_tmpl.h | 80 +++--
arch/x86/kvm/x86.c | 123 ++++++-
include/linux/kvm_host.h | 6 +
include/uapi/linux/kvm.h | 33 ++
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/kvm_translate2.c | 310 ++++++++++++++++++
virt/kvm/kvm_main.c | 41 +++
13 files changed, 724 insertions(+), 47 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_translate2.c
--
2.40.1
Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
This is something that I've been thinking about for a while. We had a
discussion at LPC 2020 about this[1] but the proposals suggested there
never materialised.
In short, it is quite difficult for userspace to detect the feature
capability of syscalls at runtime. This is something a lot of programs
want to do, but they are forced to create elaborate scenarios to try to
figure out if a feature is supported without causing damage to the
system. For the vast majority of cases, each individual feature also
needs to be tested individually (because syscall results are
all-or-nothing), so testing even a single syscall's feature set can
easily inflate the startup time of programs.
This patchset implements the fairly minimal design I proposed in this
talk[2] and in some old LKML threads (though I can't find the exact
references ATM). The general flow looks like:
1. Userspace will indicate to the kernel that a syscall should a be
no-op by setting the top bit of the extensible struct size argument.
We will almost certainly never support exabyte sized structs, so the
top bits are free for us to use as makeshift flag bits. This is
preferable to using the per-syscall flag field inside the structure
because seccomp can easily detect the bit in the flag and allow the
probe or forcefully return -EEXTSYS_NOOP.
2. The kernel will then fill the provided structure with every valid
bit pattern that the current kernel understands.
For flags or other bitflag-like fields, this is the set of valid
flags or bits. For pointer fields or fields that take an arbitrary
value, the field has every bit set (0xFF... to fill the field) to
indicate that any value is valid in the field.
3. The syscall then returns -EEXTSYS_NOOP which is an errno that will
only ever be used for this purpose (so userspace can be sure that
the request succeeded).
On older kernels, the syscall will return a different error (usually
-E2BIG or -EFAULT) and userspace can do their old-fashioned checks.
4. Userspace can then check which flags and fields are supported by
looking at the fields in the returned structure. Flags are checked
by doing an AND with the flags field, and field support can checked
by comparing to 0. In principle you could just AND the entire
structure if you wanted to do this check generically without caring
about the structure contents (this is what libraries might consider
doing).
Userspace can even find out the internal kernel structure size by
passing a PAGE_SIZE buffer and seeing how many bytes are non-zero.
As with copy_struct_from_user(), this is designed to be forward- and
backwards- compatible.
This allows programas to get a one-shot understanding of what features a
syscall supports without having to do any elaborate setups or tricks to
detect support for destructive features. Flags can simply be ANDed to
check if they are in the supported set, and fields can just be checked
to see if they are non-zero.
This patchset is IMHO the simplest way we can add the ability to
introspect the feature set of extensible struct (copy_struct_from_user)
syscalls. It doesn't preclude the chance of a more generic mechanism
being added later.
The intended way of using this interface to get feature information
looks something like the following (imagine that openat2 has gained a
new field and a new flag in the future):
static bool openat2_no_automount_supported;
static bool openat2_cwd_fd_supported;
int check_openat2_support(void)
{
int err;
struct open_how how = {};
err = openat2(AT_FDCWD, ".", &how, CHECK_FIELDS | sizeof(how));
assert(err < 0);
switch (errno) {
case EFAULT: case E2BIG:
/* Old kernel... */
check_support_the_old_way();
break;
case EEXTSYS_NOOP:
openat2_no_automount_supported = (how.flags & RESOLVE_NO_AUTOMOUNT);
openat2_cwd_fd_supported = (how.cwd_fd != 0);
break;
}
}
This series adds CHECK_FIELDS support for the following extensible
struct syscalls, as they are quite likely to grow flags in the near
future:
* openat2
* clone3
* mount_setattr
[1]: https://lwn.net/Articles/830666/
[2]: https://youtu.be/ggD-eb3yPVs
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v3:
- Fix copy_struct_to_user() return values in case of clear_user() failure.
- v2: <https://lore.kernel.org/r/20240906-extensible-structs-check_fields-v2-0-0f4…>
Changes in v2:
- Add CHECK_FIELDS support to mount_setattr(2).
- Fix build failure on architectures with custom errno values.
- Rework selftests to use the tools/ uAPI headers rather than custom
defining EEXTSYS_NOOP.
- Make sure we return -EINVAL and -E2BIG for invalid sizes even if
CHECK_FIELDS is set, and add some tests for that.
- v1: <https://lore.kernel.org/r/20240902-extensible-structs-check_fields-v1-0-545…>
---
Aleksa Sarai (10):
uaccess: add copy_struct_to_user helper
sched_getattr: port to copy_struct_to_user
openat2: explicitly return -E2BIG for (usize > PAGE_SIZE)
openat2: add CHECK_FIELDS flag to usize argument
selftests: openat2: add 0xFF poisoned data after misaligned struct
selftests: openat2: add CHECK_FIELDS selftests
clone3: add CHECK_FIELDS flag to usize argument
selftests: clone3: add CHECK_FIELDS selftests
mount_setattr: add CHECK_FIELDS flag to usize argument
selftests: mount_setattr: add CHECK_FIELDS selftest
arch/alpha/include/uapi/asm/errno.h | 3 +
arch/mips/include/uapi/asm/errno.h | 3 +
arch/parisc/include/uapi/asm/errno.h | 3 +
arch/sparc/include/uapi/asm/errno.h | 3 +
fs/namespace.c | 17 ++
fs/open.c | 18 ++
include/linux/uaccess.h | 97 ++++++++
include/uapi/asm-generic/errno.h | 3 +
include/uapi/linux/openat2.h | 2 +
kernel/fork.c | 30 ++-
kernel/sched/syscalls.c | 42 +---
tools/arch/alpha/include/uapi/asm/errno.h | 3 +
tools/arch/mips/include/uapi/asm/errno.h | 3 +
tools/arch/parisc/include/uapi/asm/errno.h | 3 +
tools/arch/sparc/include/uapi/asm/errno.h | 3 +
tools/include/uapi/asm-generic/errno.h | 3 +
tools/include/uapi/asm-generic/posix_types.h | 101 ++++++++
tools/testing/selftests/clone3/.gitignore | 1 +
tools/testing/selftests/clone3/Makefile | 4 +-
.../testing/selftests/clone3/clone3_check_fields.c | 264 +++++++++++++++++++++
tools/testing/selftests/mount_setattr/Makefile | 2 +-
.../selftests/mount_setattr/mount_setattr_test.c | 53 ++++-
tools/testing/selftests/openat2/Makefile | 2 +
tools/testing/selftests/openat2/openat2_test.c | 165 ++++++++++++-
24 files changed, 777 insertions(+), 51 deletions(-)
---
base-commit: 98f7e32f20d28ec452afb208f9cffc08448a2652
change-id: 20240803-extensible-structs-check_fields-a47e94cef691
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>