Currently, the migration worker delays 1-10 us, assuming that one
KVM_RUN iteration only takes a few microseconds. But if C-state exit
latencies are large enough, for example, hundreds or even thousands
of microseconds on server CPUs, it may happen that it's not able to
bring the target CPU out of C-state before the migration worker starts
to migrate it to the next CPU.
If the system workload is light, most CPUs could be at a certain level
of C-state, and the vCPU thread may waste milliseconds before it can
actually migrate to a new CPU.
Thus, the tests may be inefficient in such systems, and in some cases
it may fail the migration/KVM_RUN ratio sanity check.
Since we are not able to turn off the cpuidle sub-system in run time,
this patch creates an idle thread on every CPU to prevent them from
entering C-states.
Additionally, seems it's reasonable to randomize the length of usleep(),
other than delay in a fixed pattern.
Signed-off-by: Zide Chen <zide.chen(a)intel.com>
---
tools/testing/selftests/kvm/rseq_test.c | 76 ++++++++++++++++++++++---
1 file changed, 69 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/kvm/rseq_test.c b/tools/testing/selftests/kvm/rseq_test.c
index 28f97fb52044..d6e8b851d29e 100644
--- a/tools/testing/selftests/kvm/rseq_test.c
+++ b/tools/testing/selftests/kvm/rseq_test.c
@@ -11,6 +11,7 @@
#include <syscall.h>
#include <sys/ioctl.h>
#include <sys/sysinfo.h>
+#include <sys/resource.h>
#include <asm/barrier.h>
#include <linux/atomic.h>
#include <linux/rseq.h>
@@ -29,9 +30,10 @@
#define NR_TASK_MIGRATIONS 100000
static pthread_t migration_thread;
+static pthread_t *idle_threads;
static cpu_set_t possible_mask;
-static int min_cpu, max_cpu;
-static bool done;
+static int min_cpu, max_cpu, nproc;
+static volatile bool done;
static atomic_t seq_cnt;
@@ -150,7 +152,7 @@ static void *migration_worker(void *__rseq_tid)
* Use usleep() for simplicity and to avoid unnecessary kernel
* dependencies.
*/
- usleep((i % 10) + 1);
+ usleep((rand() % 10) + 1);
}
done = true;
return NULL;
@@ -158,7 +160,7 @@ static void *migration_worker(void *__rseq_tid)
static void calc_min_max_cpu(void)
{
- int i, cnt, nproc;
+ int i, cnt;
TEST_REQUIRE(CPU_COUNT(&possible_mask) >= 2);
@@ -186,6 +188,61 @@ static void calc_min_max_cpu(void)
"Only one usable CPU, task migration not possible");
}
+static void *idle_thread_fn(void *__idle_cpu)
+{
+ int r, cpu = (int)(unsigned long)__idle_cpu;
+ cpu_set_t allowed_mask;
+
+ CPU_ZERO(&allowed_mask);
+ CPU_SET(cpu, &allowed_mask);
+
+ r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
+ TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
+ errno, strerror(errno));
+
+ /* lowest priority, trying to prevent it from entering C-states */
+ r = setpriority(PRIO_PROCESS, 0, 19);
+ TEST_ASSERT(!r, "setpriority failed, errno = %d (%s)",
+ errno, strerror(errno));
+
+ while(!done);
+
+ return NULL;
+}
+
+static void spawn_threads(void)
+{
+ int cpu;
+
+ /* Run a dummy thread on every CPU */
+ for (cpu = min_cpu; cpu <= max_cpu; cpu++) {
+ if (!CPU_ISSET(cpu, &possible_mask))
+ continue;
+
+ pthread_create(&idle_threads[cpu], NULL, idle_thread_fn,
+ (void *)(unsigned long)cpu);
+ }
+
+ pthread_create(&migration_thread, NULL, migration_worker,
+ (void *)(unsigned long)syscall(SYS_gettid));
+}
+
+static void join_threads(void)
+{
+ int cpu;
+
+ pthread_join(migration_thread, NULL);
+
+ for (cpu = min_cpu; cpu <= max_cpu; cpu++) {
+ if (!CPU_ISSET(cpu, &possible_mask))
+ continue;
+
+ pthread_join(idle_threads[cpu], NULL);
+ }
+
+ free(idle_threads);
+}
+
int main(int argc, char *argv[])
{
int r, i, snapshot;
@@ -199,6 +256,12 @@ int main(int argc, char *argv[])
calc_min_max_cpu();
+ srand(time(NULL));
+
+ idle_threads = malloc(sizeof(pthread_t) * nproc);
+ TEST_ASSERT(idle_threads, "malloc failed, errno = %d (%s)", errno,
+ strerror(errno));
+
r = rseq_register_current_thread();
TEST_ASSERT(!r, "rseq_register_current_thread failed, errno = %d (%s)",
errno, strerror(errno));
@@ -210,8 +273,7 @@ int main(int argc, char *argv[])
*/
vm = vm_create_with_one_vcpu(&vcpu, guest_code);
- pthread_create(&migration_thread, NULL, migration_worker,
- (void *)(unsigned long)syscall(SYS_gettid));
+ spawn_threads();
for (i = 0; !done; i++) {
vcpu_run(vcpu);
@@ -258,7 +320,7 @@ int main(int argc, char *argv[])
TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2),
"Only performed %d KVM_RUNs, task stalled too much?", i);
- pthread_join(migration_thread, NULL);
+ join_threads();
kvm_vm_free(vm);
--
2.34.1
This patch series introduces a new char misc driver, /dev/ntsync, which is used
to implement Windows NT synchronization primitives.
== Background ==
The Wine project emulates the Windows API in user space. One particular part of
that API, namely the NT synchronization primitives, have historically been
implemented via RPC to a dedicated "kernel" process. However, more recent
applications use these APIs more strenuously, and the overhead of RPC has become
a bottleneck.
The NT synchronization APIs are too complex to implement on top of existing
primitives without sacrificing correctness. Certain operations, such as
NtPulseEvent() or the "wait-for-all" mode of NtWaitForMultipleObjects(), require
direct control over the underlying wait queue, and implementing a wait queue
sufficiently robust for Wine in user space is not possible. This proposed
driver, therefore, implements the problematic interfaces directly in the Linux
kernel.
This driver was presented at Linux Plumbers Conference 2023. For those further
interested in the history of synchronization in Wine and past attempts to solve
this problem in user space, a recording of the presentation can be viewed here:
https://www.youtube.com/watch?v=NjU4nyWyhU8
== Performance ==
The gain in performance varies wildly depending on the application in question
and the user's hardware. For some games NT synchronization is not a bottleneck
and no change can be observed, but for others frame rate improvements of 50 to
150 percent are not atypical. The following table lists frame rate measurements
from a variety of games on a variety of hardware, taken by users Dmitry
Skvortsov, FuzzyQuils, OnMars, and myself:
Game Upstream ntsync improvement
===========================================================================
Anger Foot 69 99 43%
Call of Juarez 99.8 224.1 125%
Dirt 3 110.6 860.7 678%
Forza Horizon 5 108 160 48%
Lara Croft: Temple of Osiris 141 326 131%
Metro 2033 164.4 199.2 21%
Resident Evil 2 26 77 196%
The Crew 26 51 96%
Tiny Tina's Wonderlands 130 360 177%
Total War Saga: Troy 109 146 34%
===========================================================================
== Patches ==
The intended semantics of the patches are broadly intended to match those of the
corresponding Windows functions. For those not already familiar with the Windows
functions (or their undocumented behaviour), patch 31/31 provides a detailed
specification, and individual patches also include a brief description of the
API they are implementing.
The patches making use of this driver in Wine can be retrieved or browsed here:
https://repo.or.cz/wine/zf.git/shortlog/refs/heads/ntsync5
== Implementation ==
Some aspects of the implementation may deserve particular comment:
* In the interest of performance, each object is governed only by a single
spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of multiple
objects be changed as a single atomic operation. In order to achieve this, we
first take a device-wide lock ("wait_all_lock") any time we are going to lock
more than one object at a time.
The maximum number of objects that can be used in a vectored wait, and
therefore the maximum that can be locked simultaneously, is 64. This number is
NT's own limit.
The acquisition of multiple spinlocks will degrade performance. This is a
conscious choice, however. Wait-for-all is known to be a very rare operation
in practice, especially with counts that approach the maximum, and it is the
intent of the ntsync driver to optimize wait-for-any at the expense of
wait-for-all as much as possible.
* NT mutexes are tied to their threads on an OS level, and the kernel includes
builtin support for "robust" mutexes. In order to keep the ntsync driver
self-contained and avoid touching more code than necessary, it does not hook
into task exit nor use pids.
Instead, the user space emulator is expected to manage thread IDs and pass
them as an argument to any relevant functions; this is the "owner" field of
ntsync_wait_args and ntsync_mutex_args.
When the emulator detects that a thread dies, it should therefore call
NTSYNC_IOC_MUTEX_KILL on any open mutexes.
* ntsync is module-capable mostly because there was nothing preventing it, and
because it aided development. It is not a hard requirement, though.
== Previous versions ==
Changes from v2:
* Check the result of fget() for NULL.
* Squash patch 31 (introducing the NTSYNC_WAIT_REALTIME flag) into patch 4, per
Arnd Bergmann.
* Use atomic_try_cmpxchg() instead of atomic_cmpxchg(), per off-list review from
Uros Bizjak.
* Link to v2: https://lore.kernel.org/lkml/20240219223833.95710-1-zfigura@codeweavers.com/
* Link to v1: https://lore.kernel.org/lkml/20240214233645.9273-1-zfigura@codeweavers.com/
* Link to RFC v2: https://lore.kernel.org/lkml/20240131021356.10322-1-zfigura@codeweavers.com/
* Link to RFC v1: https://lore.kernel.org/lkml/20240124004028.16826-1-zfigura@codeweavers.com/
Elizabeth Figura (30):
ntsync: Introduce the ntsync driver and character device.
ntsync: Introduce NTSYNC_IOC_CREATE_SEM.
ntsync: Introduce NTSYNC_IOC_SEM_POST.
ntsync: Introduce NTSYNC_IOC_WAIT_ANY.
ntsync: Introduce NTSYNC_IOC_WAIT_ALL.
ntsync: Introduce NTSYNC_IOC_CREATE_MUTEX.
ntsync: Introduce NTSYNC_IOC_MUTEX_UNLOCK.
ntsync: Introduce NTSYNC_IOC_MUTEX_KILL.
ntsync: Introduce NTSYNC_IOC_CREATE_EVENT.
ntsync: Introduce NTSYNC_IOC_EVENT_SET.
ntsync: Introduce NTSYNC_IOC_EVENT_RESET.
ntsync: Introduce NTSYNC_IOC_EVENT_PULSE.
ntsync: Introduce NTSYNC_IOC_SEM_READ.
ntsync: Introduce NTSYNC_IOC_MUTEX_READ.
ntsync: Introduce NTSYNC_IOC_EVENT_READ.
ntsync: Introduce alertable waits.
selftests: ntsync: Add some tests for semaphore state.
selftests: ntsync: Add some tests for mutex state.
selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ANY.
selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ALL.
selftests: ntsync: Add some tests for wakeup signaling with
WINESYNC_IOC_WAIT_ANY.
selftests: ntsync: Add some tests for wakeup signaling with
WINESYNC_IOC_WAIT_ALL.
selftests: ntsync: Add some tests for manual-reset event state.
selftests: ntsync: Add some tests for auto-reset event state.
selftests: ntsync: Add some tests for wakeup signaling with events.
selftests: ntsync: Add tests for alertable waits.
selftests: ntsync: Add some tests for wakeup signaling via alerts.
selftests: ntsync: Add a stress test for contended waits.
maintainers: Add an entry for ntsync.
docs: ntsync: Add documentation for the ntsync uAPI.
Documentation/userspace-api/index.rst | 1 +
.../userspace-api/ioctl/ioctl-number.rst | 2 +
Documentation/userspace-api/ntsync.rst | 399 +++++
MAINTAINERS | 9 +
drivers/misc/Kconfig | 11 +
drivers/misc/Makefile | 1 +
drivers/misc/ntsync.c | 1166 ++++++++++++++
include/uapi/linux/ntsync.h | 62 +
tools/testing/selftests/Makefile | 1 +
.../testing/selftests/drivers/ntsync/Makefile | 8 +
tools/testing/selftests/drivers/ntsync/config | 1 +
.../testing/selftests/drivers/ntsync/ntsync.c | 1407 +++++++++++++++++
12 files changed, 3068 insertions(+)
create mode 100644 Documentation/userspace-api/ntsync.rst
create mode 100644 drivers/misc/ntsync.c
create mode 100644 include/uapi/linux/ntsync.h
create mode 100644 tools/testing/selftests/drivers/ntsync/Makefile
create mode 100644 tools/testing/selftests/drivers/ntsync/config
create mode 100644 tools/testing/selftests/drivers/ntsync/ntsync.c
base-commit: 4cece764965020c22cff7665b18a012006359095
--
2.43.0
This patch series adds support to freeze the task cgroup hierarchy
that is on a default cgroup v2 without going through kernfs interface.
For some cases we want to freeze the cgroup of a task based on some
signals, doing so from bpf is better than user space which could be
too late.
Planned users of this feature are: tetragon and systemd when freezing
a cgroup hierarchy that could be a K8s pod, container, system service
or a user session.
Patch 1: cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf
Patch 2: bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task
Patch 3: selftests/bpf: add selftest for bpf_task_freeze_cgroup
include/linux/cgroup.h | 2 ++
kernel/bpf/helpers.c | 31 ++++
kernel/cgroup/cgroup.c | 67 ++++++++
tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c | 165 +++++++++++++++++++++
tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c | 110 ++++++++++++++
5 files changed, 375 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c
create mode 100644 tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c
--
2.34.1
This series implements SBI PMU improvements done in SBI v2.0[1] i.e. PMU snapshot
and fw_read_hi() functions.
SBI v2.0 introduced PMU snapshot feature which allows the SBI implementation
to provide counter information (i.e. values/overflow status) via a shared
memory between the SBI implementation and supervisor OS. This allows to minimize
the number of traps in when perf being used inside a kvm guest as it relies on
SBI PMU + trap/emulation of the counters.
The current set of ratified RISC-V specification also doesn't allow scountovf
to be trap/emulated by the hypervisor. The SBI PMU snapshot bridges the gap
in ISA as well and enables perf sampling in the guest. However, LCOFI in the
guest only works via IRQ filtering in AIA specification. That's why, AIA
has to be enabled in the hardware (at least the Ssaia extension) in order to
use the sampling support in the perf.
Here are the patch wise implementation details.
PATCH 1,6,7 : Generic cleanups/improvements.
PATCH 2,3,10 : FW_READ_HI function implementation
PATCH 4-5: Add PMU snapshot feature in sbi pmu driver
PATCH 6-7: KVM implementation for snapshot and sampling in kvm guests
PATCH 11-15: KVM selftests for SBI PMU extension
The series is based on kvm-next and is available at:
https://github.com/atishp04/linux/tree/kvm_pmu_snapshot_v4
The series is based on kvm-riscv/queue branch + fixes suggested on the following
series
https://patchwork.kernel.org/project/kvm/cover/cover.1705916069.git.haibo1.…
The kvmtool patch is also available at:
https://github.com/atishp04/kvmtool/tree/sscofpmf
It also requires Ssaia ISA extension to be present in the hardware in order to
get perf sampling support in the guest. In Qemu virt machine, it can be done
by the following config.
```
-cpu rv64,sscofpmf=true,x-ssaia=true
```
There is no other dependencies on AIA apart from that. Thus, Ssaia must be disabled
for the guest if AIA patches are not available. Here is the example command.
```
./lkvm-static run -m 256 -c2 --console serial -p "console=ttyS0 earlycon" --disable-ssaia -k ./Image --debug
```
The series has been tested only in Qemu.
Here is the snippet of the perf running inside a kvm guest.
===================================================
$ perf record -e cycles -e instructions perf bench sched messaging -g 5
...
$ Running 'sched/messaging' benchmark:
...
[ 45.928723] perf_duration_warn: 2 callbacks suppressed
[ 45.929000] perf: interrupt took too long (484426 > 483186), lowering kernel.perf_event_max_sample_rate to 250
$ 20 sender and receiver processes per group
$ 5 groups == 200 processes run
Total time: 14.220 [sec]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.117 MB perf.data (1942 samples) ]
$ perf report --stdio
$ To display the perf.data header info, please use --header/--header-only optio>
$
$
$ Total Lost Samples: 0
$
$ Samples: 943 of event 'cycles'
$ Event count (approx.): 5128976844
$
$ Overhead Command Shared Object Symbol >
$ ........ ............... ........................... .....................>
$
7.59% sched-messaging [kernel.kallsyms] [k] memcpy
5.48% sched-messaging [kernel.kallsyms] [k] percpu_counter_ad>
5.24% sched-messaging [kernel.kallsyms] [k] __sbi_rfence_v02_>
4.00% sched-messaging [kernel.kallsyms] [k] _raw_spin_unlock_>
3.79% sched-messaging [kernel.kallsyms] [k] set_pte_range
3.72% sched-messaging [kernel.kallsyms] [k] next_uptodate_fol>
3.46% sched-messaging [kernel.kallsyms] [k] filemap_map_pages
3.31% sched-messaging [kernel.kallsyms] [k] handle_mm_fault
3.20% sched-messaging [kernel.kallsyms] [k] finish_task_switc>
3.16% sched-messaging [kernel.kallsyms] [k] clear_page
3.03% sched-messaging [kernel.kallsyms] [k] mtree_range_walk
2.42% sched-messaging [kernel.kallsyms] [k] flush_icache_pte
===================================================
[1] https://github.com/riscv-non-isa/riscv-sbi-doc
Changes from v3->v4:
1. Added selftests.
2. Fixed an issue to clear the interrupt pending bits.
3. Fixed the counter index in snapshot memory start function.
Changes from v2->v3:
1. Fixed a patchwork warning on patch6.
2. Fixed a comment formatting & nit fix in PATCH 3 & 5.
3. Moved the hvien update and sscofpmf enabling to PATCH 9 from PATCH 8.
Changes from v1->v2:
1. Fixed warning/errors from patchwork CI.
2. Rebased on top of kvm-next.
3. Added Acked-by tags.
Changes from RFC->v1:
1. Addressed all the comments on RFC series.
2. Removed PATCH2 and merged into later patches.
3. Added 2 more patches for minor fixes.
4. Fixed KVM boot issue without Ssaia and made sscofpmf in guest dependent on
Ssaia in the host.
Atish Patra (15):
RISC-V: Fix the typo in Scountovf CSR name
RISC-V: Add FIRMWARE_READ_HI definition
drivers/perf: riscv: Read upper bits of a firmware counter
RISC-V: Add SBI PMU snapshot definitions
drivers/perf: riscv: Implement SBI PMU snapshot function
RISC-V: KVM: No need to update the counter value during reset
RISC-V: KVM: No need to exit to the user space if perf event failed
RISC-V: KVM: Implement SBI PMU Snapshot feature
RISC-V: KVM: Add perf sampling support for guests
RISC-V: KVM: Support 64 bit firmware counters on RV32
KVM: riscv: selftests: Add Sscofpmf to get-reg-list test
KVM: riscv: selftests: Add SBI PMU extension definitions
KVM: riscv: selftests: Add SBI PMU selftest
KVM: riscv: selftests: Add a test for PMU snapshot functionality
KVM: riscv: selftests: Add a test for counter overflow
arch/riscv/include/asm/csr.h | 5 +-
arch/riscv/include/asm/errata_list.h | 2 +-
arch/riscv/include/asm/kvm_vcpu_pmu.h | 14 +-
arch/riscv/include/asm/sbi.h | 12 +
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kvm/aia.c | 5 +
arch/riscv/kvm/vcpu.c | 14 +-
arch/riscv/kvm/vcpu_onereg.c | 9 +-
arch/riscv/kvm/vcpu_pmu.c | 247 +++++++-
arch/riscv/kvm/vcpu_sbi_pmu.c | 15 +-
drivers/perf/riscv_pmu.c | 1 +
drivers/perf/riscv_pmu_sbi.c | 229 ++++++-
include/linux/perf/riscv_pmu.h | 6 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/riscv/processor.h | 92 +++
.../selftests/kvm/lib/riscv/processor.c | 12 +
.../selftests/kvm/riscv/get-reg-list.c | 4 +
tools/testing/selftests/kvm/riscv/sbi_pmu.c | 588 ++++++++++++++++++
18 files changed, 1212 insertions(+), 45 deletions(-)
create mode 100644 tools/testing/selftests/kvm/riscv/sbi_pmu.c
--
2.34.1
This series implements support for the 2023 dpISA extensions in KVM
guests, it was previously posted as part of a series with the host
support but that has now been merged so only the KVM portions remain.
Most of these extensions add only new instructions so the guest support
consists of adding the relevant ID registers, masking out other features
like the 2023 MTE extensions.
FEAT_FPMR introduces a new system register FPMR to the floating point
state which we enable guest access to and context switch when the ID
registers indicate that it is supported. Currently we implement
visibility for FPMR with a fpmr_visibility() function as for other
system registers, I will separately look into adding support for
specifying this in the struct sys_reg_desc.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v6:
- Rebase onto v6.9-rc1.
- The host portions of the series were merged so only the KVM guest
support remains.
- Link to v5: https://lore.kernel.org/r/20240306-arm64-2023-dpisa-v5-0-c568edc8ed7f@kerne…
Changes in v5:
- Rebase onto v6.8-rc3.
- Use u64 rather than unsigned long for storing FPMR.
- Temporarily drop KVM guest support due to issues with KVM being a
moving target.
- Link to v4: https://lore.kernel.org/r/20240122-arm64-2023-dpisa-v4-0-776e094861df@kerne…
Changes in v4:
- Rebase onto v6.8-rc1.
- Move KVM support to the end of the series.
- Link to v3: https://lore.kernel.org/r/20231205-arm64-2023-dpisa-v3-0-dbcbcd867a7f@kerne…
Changes in v3:
- Rebase onto v6.7-rc3.
- Hook up traps for FPMR in emulate-nested.c.
- Link to v2: https://lore.kernel.org/r/20231114-arm64-2023-dpisa-v2-0-47251894f6a8@kerne…
Changes in v2:
- Rebase onto v6.7-rc1.
- Link to v1: https://lore.kernel.org/r/20231026-arm64-2023-dpisa-v1-0-8470dd989bb2@kerne…
---
Mark Brown (5):
KVM: arm64: Share all userspace hardened thread data with the hypervisor
KVM: arm64: Add newly allocated ID registers to register descriptions
KVM: arm64: Support FEAT_FPMR for guests
KVM: arm64: selftests: Document feature registers added in 2023 extensions
KVM: arm64: selftests: Teach get-reg-list about FPMR
arch/arm64/include/asm/kvm_host.h | 6 ++++--
arch/arm64/include/asm/processor.h | 2 +-
arch/arm64/kvm/emulate-nested.c | 9 ++++++++
arch/arm64/kvm/fpsimd.c | 15 +++++++-------
arch/arm64/kvm/hyp/include/hyp/switch.h | 9 ++++++--
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 4 ++--
arch/arm64/kvm/sys_regs.c | 24 +++++++++++++++++++---
tools/testing/selftests/kvm/aarch64/get-reg-list.c | 11 ++++++++--
8 files changed, 60 insertions(+), 20 deletions(-)
---
base-commit: 4cece764965020c22cff7665b18a012006359095
change-id: 20231003-arm64-2023-dpisa-2f3d25746474
Best regards,
--
Mark Brown <broonie(a)kernel.org>
New version of the sleepable bpf_timer code, without BPF changes, as
they can now go through the HID tree independantly:
https://lore.kernel.org/all/20240315-hid-bpf-sleepable-v4-0-5658f2540564@ke…
For reference, the use cases I have in mind:
---
Basically, I need to be able to defer a HID-BPF program for the
following reasons (from the aforementioned patch):
1. defer an event:
Sometimes we receive an out of proximity event, but the device can not
be trusted enough, and we need to ensure that we won't receive another
one in the following n milliseconds. So we need to wait those n
milliseconds, and eventually re-inject that event in the stack.
2. inject new events in reaction to one given event:
We might want to transform one given event into several. This is the
case for macro keys where a single key press is supposed to send
a sequence of key presses. But this could also be used to patch a
faulty behavior, if a device forgets to send a release event.
3. communicate with the device in reaction to one event:
We might want to communicate back to the device after a given event.
For example a device might send us an event saying that it came back
from sleeping state and needs to be re-initialized.
Currently we can achieve that by keeping a userspace program around,
raise a bpf event, and let that userspace program inject the events and
commands.
However, we are just keeping that program alive as a daemon for just
scheduling commands. There is no logic in it, so it doesn't really justify
an actual userspace wakeup. So a kernel workqueue seems simpler to handle.
bpf_timers are currently running in a soft IRQ context, this patch
series implements a sleppable context for them.
Cheers,
Benjamin
To: Jiri Kosina <jikos(a)kernel.org>
To: Benjamin Tissoires <benjamin.tissoires(a)redhat.com>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
Cc: Benjamin Tissoires <bentiss(a)kernel.org>
Cc: <linux-input(a)vger.kernel.org>
Cc: <linux-kernel(a)vger.kernel.org>
Cc: <bpf(a)vger.kernel.org>
Cc: <linux-doc(a)vger.kernel.org>
Cc: <linux-kselftest(a)vger.kernel.org>
---
Changes in v4:
- dropped the BPF changes, they can go independently in bpf-core
- dropped the HID-BPF integration tests with the sleppable timers,
I'll re-add them once both series (this and sleepable timers) are
merged
- Link to v3: https://lore.kernel.org/r/20240221-hid-bpf-sleepable-v3-0-1fb378ca6301@kern…
Changes in v3:
- fixed the crash from v2
- changed the API to have only BPF_F_TIMER_SLEEPABLE for
bpf_timer_start()
- split the new kfuncs/verifier patch into several sub-patches, for
easier reviews
- Link to v2: https://lore.kernel.org/r/20240214-hid-bpf-sleepable-v2-0-5756b054724d@kern…
Changes in v2:
- make use of bpf_timer (and dropped the custom HID handling)
- implemented bpf_timer_set_sleepable_cb as a kfunc
- still not implemented global subprogs
- no sleepable bpf_timer selftests yet
- Link to v1: https://lore.kernel.org/r/20240209-hid-bpf-sleepable-v1-0-4cc895b5adbd@kern…
---
Benjamin Tissoires (7):
HID: bpf/dispatch: regroup kfuncs definitions
HID: bpf: export hid_hw_output_report as a BPF kfunc
selftests/hid: add KASAN to the VM tests
selftests/hid: Add test for hid_bpf_hw_output_report
HID: bpf: allow to inject HID event from BPF
selftests/hid: add tests for hid_bpf_input_report
HID: bpf: allow to use bpf_timer_set_sleepable_cb() in tracing callbacks.
Documentation/hid/hid-bpf.rst | 2 +-
drivers/hid/bpf/hid_bpf_dispatch.c | 226 ++++++++++++++-------
drivers/hid/hid-core.c | 2 +
include/linux/hid_bpf.h | 3 +
tools/testing/selftests/hid/config.common | 1 +
tools/testing/selftests/hid/hid_bpf.c | 112 +++++++++-
tools/testing/selftests/hid/progs/hid.c | 46 +++++
.../testing/selftests/hid/progs/hid_bpf_helpers.h | 6 +
8 files changed, 324 insertions(+), 74 deletions(-)
---
base-commit: 3e78a6c0d3e02e4cf881dc84c5127e9990f939d6
change-id: 20240314-b4-hid-bpf-new-funcs-ecf05d0ef870
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
Currently, the sud_test expects the emulated syscall to return the
emulated syscall number. This assumption only works on architectures
were the syscall calling convention use the same register for syscall
number/syscall return value. This is not the case for RISC-V and thus
the return value must be also emulated using the provided ucontext.
Signed-off-by: Clément Léger <cleger(a)rivosinc.com>
Reviewed-by: Palmer Dabbelt <palmer(a)rivosinc.com>
Acked-by: Palmer Dabbelt <palmer(a)rivosinc.com>
---
Changes in V2:
- Changes comment to be more explicit
- Use A7 syscall arg rather than hardcoding MAGIC_SYSCALL_1
---
.../selftests/syscall_user_dispatch/sud_test.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_test.c b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
index b5d592d4099e..d975a6767329 100644
--- a/tools/testing/selftests/syscall_user_dispatch/sud_test.c
+++ b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
@@ -158,6 +158,20 @@ static void handle_sigsys(int sig, siginfo_t *info, void *ucontext)
/* In preparation for sigreturn. */
SYSCALL_DISPATCH_OFF(glob_sel);
+
+ /*
+ * The tests for argument handling assume that `syscall(x) == x`. This
+ * is a NOP on x86 because the syscall number is passed in %rax, which
+ * happens to also be the function ABI return register. Other
+ * architectures may need to swizzle the arguments around.
+ */
+#if defined(__riscv)
+/* REG_A7 is not defined in libc headers */
+# define REG_A7 (REG_A0 + 7)
+
+ ((ucontext_t *)ucontext)->uc_mcontext.__gregs[REG_A0] =
+ ((ucontext_t *)ucontext)->uc_mcontext.__gregs[REG_A7];
+#endif
}
TEST(dispatch_and_return)
--
2.43.0
This patch series provides workingset reporting of user pages in
lruvecs, of which coldness can be tracked by accessed bits and fd
references. However, the concept of workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or the userspace. IMO, the
kernel should provide a set of workingset interfaces that should be
generic enough to accommodate the various use cases, and be extensible
to potential future use cases. The current proposed interfaces are not
sufficient in that regard, but I would like to start somewhere, solicit
feedback, and iterate.
Use cases
==========
Job scheduling
For data center machines, workingset information allows the job scheduler
to right-size each job and land more jobs on the same host or NUMA node,
and in the case of a job with increasing workingset, policy decisions
can be made to migrate other jobs off the host/NUMA node, or oom-kill
the misbehaving job. If the job shape is very different from the machine
shape, knowing the workingset per-node can also help inform page
allocation policies.
Proactive reclaim
Workingset information allows the a container manager to proactively
reclaim memory while not impacting a job's performance. While PSI may
provide a reactive measure of when a proactive reclaim has reclaimed too
much, workingset reporting enables the policy to be more accurate and
flexible.
Ballooning (similar to proactive reclaim)
While this patch series does not extend the virtio-balloon device,
balloon policies benefit from workingset to more precisely determine
the size of the memory balloon. On desktops/laptops/mobile devices where
memory is scarce and overcommitted, the balloon sizing in multiple VMs
running on the same device can be orchestrated with workingset reports
from each one.
Promotion/Demotion
Similar to proactive reclaim, a workingset report enables demotion to a
slower tier of memory.
For promotion, the workingset report interfaces need to be extended to
report hotness and gather hotness information from the devices[1].
[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements…
Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main
idea here is we break down the workingset per-node per-memcg into time
intervals (ms), e.g.
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
I realize this does not generalize well to hotness information, but I
lack the intuition for an abstraction that presents hotness in a useful
way. Based on a recent proposal for move_phys_pages[2], it seems like
userspace tiering software would like to move specific physical pages,
instead of informing the kernel "move x number of hot pages to y
device". Please advise.
[2]
https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge…
Implementation
==========
Currently, the reporting of user pages is based off of MGLRU, and
therefore requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU
generations for a more fine-grained workingset report. I will make the
generation count configurable in the next version. The workingset
reporting mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the
aging thread is behind CONFIG_WORKINGSET_REPORT_AGING.
--
Changes from RFC v2 -> RFC v3:
- Update to v6.8
- Added an aging kernel thread (gated behind config)
- Added basic selftests for sysfs interface files
- Track swapped out pages for reaccesses
- Refactoring and cleanup
- Dropped the virtio-balloon extension to make things manageable
Changes from RFC v1 -> RFC v2:
- Refactored the patchs into smaller pieces
- Renamed interfaces and functions from wss to wsr (Working Set Reporting)
- Fixed build errors when CONFIG_WSR is not set
- Changed working_set_num_bins to u8 for virtio-balloon
- Added support for per-NUMA node reporting for virtio-balloon
[rfc v1]
https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.co…
[rfc v2]
https://lore.kernel.org/linux-mm/20230621180454.973862-1-yuanchu@google.com/
Yuanchu Xie (8):
mm: multi-gen LRU: ignore non-leaf pmd_young for force_scan=true
mm: aggregate working set information into histograms
mm: use refresh interval to rate-limit workingset report aggregation
mm: report workingset during memory pressure driven scanning
mm: extend working set reporting to memcgs
mm: add per-memcg reaccess histogram
mm: add kernel aging thread for workingset reporting
mm: test system-wide workingset reporting
drivers/base/node.c | 3 +
include/linux/memcontrol.h | 5 +
include/linux/mmzone.h | 4 +
include/linux/workingset_report.h | 107 +++
mm/Kconfig | 15 +
mm/Makefile | 2 +
mm/internal.h | 45 ++
mm/memcontrol.c | 386 ++++++++-
mm/mmzone.c | 2 +
mm/vmscan.c | 95 ++-
mm/workingset.c | 9 +-
mm/workingset_report.c | 757 ++++++++++++++++++
mm/workingset_report_aging.c | 127 +++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 3 +
.../testing/selftests/mm/workingset_report.c | 315 ++++++++
.../testing/selftests/mm/workingset_report.h | 37 +
.../selftests/mm/workingset_report_test.c | 328 ++++++++
18 files changed, 2231 insertions(+), 10 deletions(-)
create mode 100644 include/linux/workingset_report.h
create mode 100644 mm/workingset_report.c
create mode 100644 mm/workingset_report_aging.c
create mode 100644 tools/testing/selftests/mm/workingset_report.c
create mode 100644 tools/testing/selftests/mm/workingset_report.h
create mode 100644 tools/testing/selftests/mm/workingset_report_test.c
--
2.44.0.396.g6e790dbe36-goog
riscv-debug-spec [1] Chapter 5: Sdtrig extension introduces
trigger CSRs which can cause a breakpoint exception,
entry into Debug Mode, or a trace action without having to
execute a special instruction.
The focus in the following patches is on the two CSRs from
the Sdtrig extension: hcontext and scontext.
These two CSRs are optional according to the spec, apart from
the Smstateen extension [2], which has bit 57 to control the
accessbility of the hcontext/scontext CSRs.
We also introduce dt-binding in the CPU DTS for the existence of
the CSRs in situations where the Smstaten extension is not available.
The hcontext/scontext CSRs can help to raise triggers with the
textra32/textra64 CSRs set up correctly. (Chapter 5.7.17/ 5.7.18 [1])
Therefore, as part of Linux awareness debugging.
We propose the scontext CSR be filled by the Linux PID,
And the hcontext CSR be filled with a self-maintained Guest OS ID.
The reason for using the self-maintained Guest OS ID instead of VMID is
that VMID might change over time, and the user setting up the trigger
might enter the previous value, invoking the wrong VM for debugging.
The tests have been done on QEMU with Sdtrig CSRs
(mcontext/hcontext/scontext implemented) [3] boot on virt machine
and also run the Guest OS as virt machine with KVM enabled,
the two hcontext/scontext CSRs can be written correctly.
This patch series is based on v6.9-rc1.
Link: https://github.com/riscv/riscv-debug-spec/releases/download/ar20231208/risc… [1]
Link: https://github.com/riscvarchive/riscv-state-enable/releases/download/v1.0.0… [2]
Link: https://github.com/sifive/qemu/tree/dev/maxh/sdtrig_ISA [3]
Signed-off-by: Max Hsu <max.hsu(a)sifive.com>
---
Max Hsu (7):
dt-bindings: riscv: Add Sdtrig ISA extension
dt-bindings: riscv: Add Sdtrig optional CSRs existence on DT
riscv: Add ISA extension parsing for Sdtrig
riscv: Add Sdtrig CSRs definition, Smstateen bit to access Sdtrig CSRs
riscv: cpufeature: Add Sdtrig optional CSRs checks
riscv: suspend: add Smstateen CSRs save/restore
riscv: Add task switch support for scontext CSR
Yong-Xuan Wang (4):
riscv: KVM: Add Sdtrig Extension Support for Guest/VM
riscv: KVM: Add scontext to ONE_REG
riscv: KVM: Add hcontext support
KVM: riscv: selftests: Add Sdtrig Extension to get-reg-list test
Documentation/devicetree/bindings/riscv/cpus.yaml | 18 +++
.../devicetree/bindings/riscv/extensions.yaml | 7 +
arch/riscv/include/asm/csr.h | 6 +
arch/riscv/include/asm/hwcap.h | 1 +
arch/riscv/include/asm/kvm_host.h | 14 ++
arch/riscv/include/asm/kvm_vcpu_debug.h | 24 +++
arch/riscv/include/asm/suspend.h | 7 +
arch/riscv/include/asm/switch_to.h | 15 ++
arch/riscv/include/uapi/asm/kvm.h | 9 ++
arch/riscv/kernel/cpufeature.c | 162 +++++++++++++++++++++
arch/riscv/kernel/suspend.c | 25 ++++
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/main.c | 4 +
arch/riscv/kvm/vcpu.c | 8 +
arch/riscv/kvm/vcpu_debug.c | 107 ++++++++++++++
arch/riscv/kvm/vcpu_onereg.c | 63 +++++++-
arch/riscv/kvm/vm.c | 4 +
tools/testing/selftests/kvm/riscv/get-reg-list.c | 27 ++++
18 files changed, 500 insertions(+), 2 deletions(-)
---
base-commit: 317c7bc0ef035d8ebfc3e55c5dde0566fd5fb171
change-id: 20240329-dev-maxh-lin-452-6-9-c6209e6db67f
Best regards,
--
Max Hsu <max.hsu(a)sifive.com>