March 2024 - Linux-kselftest-mirror

[PATCH v6 1/2] posix-timers: Prefer delivery of signals to the current thread

by Marco Elver

From: Dmitry Vyukov <dvyukov(a)google.com>

POSIX timers using the CLOCK_PROCESS_CPUTIME_ID clock prefer the main
thread of a thread group for signal delivery. However, this has a
significant downside: it requires waking up a potentially idle thread.

Instead, prefer to deliver signals to the current thread (in the same
thread group) if SIGEV_THREAD_ID is not set by the user. This does not
change guaranteed semantics, since POSIX process CPU time timers have
never guaranteed that signal delivery is to a specific thread (without
SIGEV_THREAD_ID set).

The effect is that we no longer wake up potentially idle threads, and
the kernel is no longer biased towards delivering the timer signal to
any particular thread (which better distributes the timer signals esp.
when multiple timers fire concurrently).

Signed-off-by: Dmitry Vyukov <dvyukov(a)google.com>
Suggested-by: Oleg Nesterov <oleg(a)redhat.com>
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Signed-off-by: Marco Elver <elver(a)google.com>
---
v6:
- Split test from this patch.
- Update wording on what this patch aims to improve.

v5:
- Rebased onto v6.2.

v4:
- Restructured checks in send_sigqueue() as suggested.

v3:
- Switched to a completely different (much simpler) implementation
  based on Oleg's idea.

RFC v2:
- Added additional Cc as Thomas asked.
---
 kernel/signal.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 8cb28f1df294..605445fa27d4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1003,8 +1003,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 	/*
 	 * Now find a thread we can wake up to take the signal off the queue.
 	 *
-	 * If the main thread wants the signal, it gets first crack.
-	 * Probably the least surprising to the average bear.
+	 * Try the suggested task first (may or may not be the main thread).
 	 */
 	if (wants_signal(sig, p))
 		t = p;
@@ -1970,8 +1969,23 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
 
 	ret = -1;
 	rcu_read_lock();
+	/*
+	 * This function is used by POSIX timers to deliver a timer signal.
+	 * Where type is PIDTYPE_PID (such as for timers with SIGEV_THREAD_ID
+	 * set), the signal must be delivered to the specific thread (queues
+	 * into t->pending).
+	 *
+	 * Where type is not PIDTYPE_PID, signals must just be delivered to the
+	 * current process. In this case, prefer to deliver to current if it is
+	 * in the same thread group as the target, as it avoids unnecessarily
+	 * waking up a potentially idle task.
+	 */
 	t = pid_task(pid, type);
-	if (!t || !likely(lock_task_sighand(t, &flags)))
+	if (!t)
+		goto ret;
+	if (type != PIDTYPE_PID && same_thread_group(t, current))
+		t = current;
+	if (!likely(lock_task_sighand(t, &flags)))
 		goto ret;
 
 	ret = 1; /* the signal is ignored */
@@ -1993,6 +2007,11 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
 	q->info.si_overrun = 0;
 
 	signalfd_notify(t, sig);
+	/*
+	 * If the type is not PIDTYPE_PID, we just use shared_pending, which
+	 * won't guarantee that the specified task will receive the signal, but
+	 * is sufficient if t==current in the common case.
+	 */
 	pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
 	list_add_tail(&q->list, &pending->list);
 	sigaddset(&pending->signal, sig);
-- 
2.40.0.rc1.284.g88254d51c5-goog
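
The target-selection rule that the second hunk adds to send_sigqueue() can be modelled in user space. What follows is a minimal sketch and not kernel code: `struct task_model`, `enum pid_type_model` and `pick_signal_target()` are hypothetical names introduced only for illustration, and thread-group membership is reduced to comparing a `tgid` field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's pid_type and task_struct. */
enum pid_type_model { MODEL_PIDTYPE_PID, MODEL_PIDTYPE_TGID };

struct task_model {
	int tid;	/* thread ID */
	int tgid;	/* thread group (process) ID */
};

/*
 * Mirror the selection order described in the patch:
 *  - no target task: nothing to deliver to;
 *  - thread-directed signal (PIDTYPE_PID): always the named thread;
 *  - process-directed signal: prefer the current thread when it is in
 *    the same thread group as the target, otherwise keep the target.
 */
const struct task_model *
pick_signal_target(const struct task_model *t,
		   const struct task_model *current,
		   enum pid_type_model type)
{
	assert(current != NULL);

	if (!t)
		return NULL;
	if (type != MODEL_PIDTYPE_PID && t->tgid == current->tgid)
		return current;
	return t;
}
```

In this model a process-wide timer that expires while a worker thread is running selects that worker, whereas a timer armed with SIGEV_THREAD_ID (the PIDTYPE_PID case) still selects the thread it names.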

[PATCH v3] selftests/ftrace: traceonoff_triggers: strip off names

by Yipeng Zou

The func_traceonoff_triggers.tc test sometimes fails on my board, a
Kunpeng-920.

[root@localhost]# ./ftracetest ./test.d/ftrace/func_traceonoff_triggers.tc -l fail.log
=== Ftrace unit tests ===
[1] ftrace - test for function traceon/off triggers [FAIL]
[2] (instance) ftrace - test for function traceon/off triggers [UNSUPPORTED]

Looking at the log, it shows that the md5sum differs between csum1 and
csum2.

++ cnt=611
++ sleep .1
+++ cnt_trace
+++ grep -v '^#' trace
+++ wc -l
++ cnt2=611
++ '[' 611 -ne 611 ']'
+++ cat tracing_on
++ on=0
++ '[' 0 '!=' 0 ']'
+++ md5sum trace
++ csum1='76896aa74362fff66a6a5f3cf8a8a500 trace'
++ sleep .1
+++ md5sum trace
++ csum2='ee8625a21c058818fc26e45c1ed3f6de trace'
++ '[' '76896aa74362fff66a6a5f3cf8a8a500 trace' '!=' 'ee8625a21c058818fc26e45c1ed3f6de trace' ']'
++ fail 'Tracing file is still changing'
++ echo Tracing file is still changing
Tracing file is still changing
++ exit_fail
++ exit 1

So I dumped the trace file directly before each md5sum, and the diff
shows:

[root@localhost]# diff trace_1.log trace_2.log -y --suppress-common-lines
dockerd-12285 [036] d.... 18385.510290: sched_stat | <...>-12285 [036] d.... 18385.510290: sched_stat
dockerd-12285 [036] d.... 18385.510291: sched_swit | <...>-12285 [036] d.... 18385.510291: sched_swit
<...>-740 [044] d.... 18385.602859: sched_stat | kworker/44:1-740 [044] d.... 18385.602859: sched_stat
<...>-740 [044] d.... 18385.602860: sched_swit | kworker/44:1-740 [044] d.... 18385.602860: sched_swit

We can see that the task name field changes between the two reads: a
<...> placeholder gets filled in with a name, or a name is replaced by
<...>. We can strip off the names to fix that.

After stripping off the names:

kworker/u257:0-12 [019] d..2. 2528.758910: sched_stat | -12 [019] d..2. 2528.758910: sched_stat_runtime: comm=k
kworker/u257:0-12 [019] d..2. 2528.758912: sched_swit | -12 [019] d..2. 2528.758912: sched_switch: prev_comm=kw
<idle>-0 [000] d.s5. 2528.762318: sched_waki | -0 [000] d.s5. 2528.762318: sched_waking: comm=sshd pi
<idle>-0 [037] dNh2. 2528.762326: sched_wake | -0 [037] dNh2. 2528.762326: sched_wakeup: comm=sshd pi
<idle>-0 [037] d..2. 2528.762334: sched_swit | -0 [037] d..2. 2528.762334: sched_switch: prev_comm=sw

Fixes: d87b29179aa0 ("selftests: ftrace: Use md5sum to take less time of checking logs")
Suggested-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Signed-off-by: Yipeng Zou <zouyipeng(a)huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
 .../ftrace/test.d/ftrace/func_traceonoff_triggers.tc | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc b/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc
index aee22289536b..1b57771dbfdf 100644
--- a/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc
@@ -90,9 +90,10 @@ if [ $on != "0" ]; then
     fail "Tracing is not off"
 fi
 
-csum1=`md5sum trace`
+# Cannot rely on names being around as they are only cached, strip them
+csum1=`cat trace | sed -e 's/^ *[^ ]*\(-[0-9][0-9]*\)/\1/' | md5sum`
 sleep $SLEEP_TIME
-csum2=`md5sum trace`
+csum2=`cat trace | sed -e 's/^ *[^ ]*\(-[0-9][0-9]*\)/\1/' | md5sum`
 
 if [ "$csum1" != "$csum2" ]; then
     fail "Tracing file is still changing"
-- 
2.34.1
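
The effect of the name-stripping step can be reproduced outside the test script. The sketch below applies the same kind of basic regular expression through the POSIX regcomp()/regexec() interface; `strip_comm()` is a hypothetical helper written for this illustration and is not part of the selftest.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/*
 * Replace the leading "  <comm>-<pid>" of a trace line with "-<pid>".
 * Returns 1 if the line was rewritten, 0 if it was copied unchanged,
 * and -1 if the pattern could not be compiled.
 */
int strip_comm(const char *line, char *out, size_t outlen)
{
	regex_t re;
	regmatch_t m[2];
	int rc;

	assert(line != NULL && out != NULL && outlen > 0);

	/* Basic regex: group 1 captures the "-<pid>" suffix of the task field. */
	if (regcomp(&re, "^ *[^ ]*\\(-[0-9][0-9]*\\)", 0) != 0)
		return -1;
	rc = regexec(&re, line, 2, m, 0);
	regfree(&re);

	if (rc != 0) {
		/* Header or otherwise non-matching line: keep it as it is. */
		snprintf(out, outlen, "%s", line);
		return 0;
	}

	/* Emit group 1 followed by everything after the whole match. */
	snprintf(out, outlen, "%.*s%s",
		 (int)(m[1].rm_eo - m[1].rm_so), line + m[1].rm_so,
		 line + m[0].rm_eo);
	return 1;
}
```

Passing a line that starts with `<...>-740` and one that starts with `kworker/44:1-740` through this helper yields the same text, which is why the two checksums agree once the names are removed.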

[PATCH] selftests/rseq: take large C-state exit latency into consideration

by Zide Chen

Currently, the migration worker delays 1-10 us, assuming that one
KVM_RUN iteration only takes a few microseconds. But if C-state exit
latencies are large enough, for example hundreds or even thousands of
microseconds on server CPUs, the target CPU may not have been brought
out of its C-state before the migration worker starts to migrate the
task to the next CPU.

If the system workload is light, most CPUs could be at a certain level
of C-state, and the vCPU thread may waste milliseconds before it can
actually migrate to a new CPU.

Thus, the test may be inefficient on such systems, and in some cases it
may fail the migration/KVM_RUN ratio sanity check.

Since we are not able to turn off the cpuidle sub-system at run time,
this patch creates an idle thread on every CPU to prevent them from
entering C-states.

Additionally, it seems reasonable to randomize the length of usleep(),
rather than delaying in a fixed pattern.

Signed-off-by: Zide Chen <zide.chen(a)intel.com>
---
 tools/testing/selftests/kvm/rseq_test.c | 76 ++++++++++++++++++++++---
 1 file changed, 69 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/kvm/rseq_test.c b/tools/testing/selftests/kvm/rseq_test.c
index 28f97fb52044..d6e8b851d29e 100644
--- a/tools/testing/selftests/kvm/rseq_test.c
+++ b/tools/testing/selftests/kvm/rseq_test.c
@@ -11,6 +11,7 @@
 #include <syscall.h>
 #include <sys/ioctl.h>
 #include <sys/sysinfo.h>
+#include <sys/resource.h>
 #include <asm/barrier.h>
 #include <linux/atomic.h>
 #include <linux/rseq.h>
@@ -29,9 +30,10 @@
 #define NR_TASK_MIGRATIONS 100000
 
 static pthread_t migration_thread;
+static pthread_t *idle_threads;
 static cpu_set_t possible_mask;
-static int min_cpu, max_cpu;
-static bool done;
+static int min_cpu, max_cpu, nproc;
+static volatile bool done;
 
 static atomic_t seq_cnt;
 
@@ -150,7 +152,7 @@ static void *migration_worker(void *__rseq_tid)
 		 * Use usleep() for simplicity and to avoid unnecessary kernel
 		 * dependencies.
 		 */
-		usleep((i % 10) + 1);
+		usleep((rand() % 10) + 1);
 	}
 	done = true;
 	return NULL;
@@ -158,7 +160,7 @@ static void *migration_worker(void *__rseq_tid)
 
 static void calc_min_max_cpu(void)
 {
-	int i, cnt, nproc;
+	int i, cnt;
 
 	TEST_REQUIRE(CPU_COUNT(&possible_mask) >= 2);
 
@@ -186,6 +188,61 @@ static void calc_min_max_cpu(void)
 		    "Only one usable CPU, task migration not possible");
 }
 
+static void *idle_thread_fn(void *__idle_cpu)
+{
+	int r, cpu = (int)(unsigned long)__idle_cpu;
+	cpu_set_t allowed_mask;
+
+	CPU_ZERO(&allowed_mask);
+	CPU_SET(cpu, &allowed_mask);
+
+	r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
+	TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
+		    errno, strerror(errno));
+
+	/* lowest priority, trying to prevent it from entering C-states */
+	r = setpriority(PRIO_PROCESS, 0, 19);
+	TEST_ASSERT(!r, "setpriority failed, errno = %d (%s)",
+		    errno, strerror(errno));
+
+	while(!done);
+
+	return NULL;
+}
+
+static void spawn_threads(void)
+{
+	int cpu;
+
+	/* Run a dummy thread on every CPU */
+	for (cpu = min_cpu; cpu <= max_cpu; cpu++) {
+		if (!CPU_ISSET(cpu, &possible_mask))
+			continue;
+
+		pthread_create(&idle_threads[cpu], NULL, idle_thread_fn,
+			       (void *)(unsigned long)cpu);
+	}
+
+	pthread_create(&migration_thread, NULL, migration_worker,
+		       (void *)(unsigned long)syscall(SYS_gettid));
+}
+
+static void join_threads(void)
+{
+	int cpu;
+
+	pthread_join(migration_thread, NULL);
+
+	for (cpu = min_cpu; cpu <= max_cpu; cpu++) {
+		if (!CPU_ISSET(cpu, &possible_mask))
+			continue;
+
+		pthread_join(idle_threads[cpu], NULL);
+	}
+
+	free(idle_threads);
+}
+
 int main(int argc, char *argv[])
 {
 	int r, i, snapshot;
@@ -199,6 +256,12 @@ int main(int argc, char *argv[])
 
 	calc_min_max_cpu();
 
+	srand(time(NULL));
+
+	idle_threads = malloc(sizeof(pthread_t) * nproc);
+	TEST_ASSERT(idle_threads, "malloc failed, errno = %d (%s)", errno,
+		    strerror(errno));
+
 	r = rseq_register_current_thread();
 	TEST_ASSERT(!r, "rseq_register_current_thread failed, errno = %d (%s)",
 		    errno, strerror(errno));
@@ -210,8 +273,7 @@ int main(int argc, char *argv[])
 	 */
 	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
 
-	pthread_create(&migration_thread, NULL, migration_worker,
-		       (void *)(unsigned long)syscall(SYS_gettid));
+	spawn_threads();
 
 	for (i = 0; !done; i++) {
 		vcpu_run(vcpu);
@@ -258,7 +320,7 @@ int main(int argc, char *argv[])
 	TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2),
 		    "Only performed %d KVM_RUNs, task stalled too much?", i);
 
-	pthread_join(migration_thread, NULL);
+	join_threads();
 
 	kvm_vm_free(vm);
 
-- 
2.34.1

[PATCH v3 00/30] NT synchronization primitive driver

by Elizabeth Figura

This patch series introduces a new char misc driver, /dev/ntsync, which is used
to implement Windows NT synchronization primitives.

== Background ==

The Wine project emulates the Windows API in user space. One particular part of
that API, namely the NT synchronization primitives, has historically been
implemented via RPC to a dedicated "kernel" process. However, more recent
applications use these APIs more strenuously, and the overhead of RPC has become
a bottleneck.

The NT synchronization APIs are too complex to implement on top of existing
primitives without sacrificing correctness. Certain operations, such as
NtPulseEvent() or the "wait-for-all" mode of NtWaitForMultipleObjects(), require
direct control over the underlying wait queue, and implementing a wait queue
sufficiently robust for Wine in user space is not possible. This proposed
driver, therefore, implements the problematic interfaces directly in the Linux
kernel.

This driver was presented at Linux Plumbers Conference 2023. For those further
interested in the history of synchronization in Wine and past attempts to solve
this problem in user space, a recording of the presentation can be viewed here:

    https://www.youtube.com/watch?v=NjU4nyWyhU8

== Performance ==

The gain in performance varies wildly depending on the application in question
and the user's hardware. For some games NT synchronization is not a bottleneck
and no change can be observed, but for others frame rate improvements of 50 to
150 percent are not atypical. The following table lists frame rate measurements
from a variety of games on a variety of hardware, taken by users Dmitry
Skvortsov, FuzzyQuils, OnMars, and myself:

Game                               Upstream        ntsync          improvement
===========================================================================
Anger Foot                          69              99              43%
Call of Juarez                      99.8           224.1           125%
Dirt 3                             110.6           860.7           678%
Forza Horizon 5                    108             160              48%
Lara Croft: Temple of Osiris       141             326             131%
Metro 2033                         164.4           199.2            21%
Resident Evil 2                     26              77             196%
The Crew                            26              51              96%
Tiny Tina's Wonderlands            130             360             177%
Total War Saga: Troy               109             146              34%
===========================================================================

== Patches ==

The semantics of the patches are broadly intended to match those of the
corresponding Windows functions. For those not already familiar with the Windows
functions (or their undocumented behaviour), patch 30/30 provides a detailed
specification, and individual patches also include a brief description of the
API they are implementing.

The patches making use of this driver in Wine can be retrieved or browsed here:

    https://repo.or.cz/wine/zf.git/shortlog/refs/heads/ntsync5

== Implementation ==

Some aspects of the implementation may deserve particular comment:

* In the interest of performance, each object is governed only by a single
  spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of multiple
  objects be changed as a single atomic operation. In order to achieve this, we
  first take a device-wide lock ("wait_all_lock") any time we are going to lock
  more than one object at a time.

  The maximum number of objects that can be used in a vectored wait, and
  therefore the maximum that can be locked simultaneously, is 64. This number is
  NT's own limit.

  The acquisition of multiple spinlocks will degrade performance. This is a
  conscious choice, however. Wait-for-all is known to be a very rare operation
  in practice, especially with counts that approach the maximum, and it is the
  intent of the ntsync driver to optimize wait-for-any at the expense of
  wait-for-all as much as possible.

* NT mutexes are tied to their threads on an OS level, and the kernel includes
  builtin support for "robust" mutexes. In order to keep the ntsync driver
  self-contained and avoid touching more code than necessary, it does not hook
  into task exit nor use pids.

  Instead, the user space emulator is expected to manage thread IDs and pass
  them as an argument to any relevant functions; this is the "owner" field of
  ntsync_wait_args and ntsync_mutex_args.

  When the emulator detects that a thread dies, it should therefore call
  NTSYNC_IOC_MUTEX_KILL on any open mutexes.

* ntsync is module-capable mostly because there was nothing preventing it, and
  because it aided development. It is not a hard requirement, though.

== Previous versions ==

Changes from v2:

* Check the result of fget() for NULL.

* Squash patch 31 (introducing the NTSYNC_WAIT_REALTIME flag) into patch 4, per
  Arnd Bergmann.

* Use atomic_try_cmpxchg() instead of atomic_cmpxchg(), per off-list review from
  Uros Bizjak.

* Link to v2: https://lore.kernel.org/lkml/20240219223833.95710-1-zfigura@codeweavers.com/
* Link to v1: https://lore.kernel.org/lkml/20240214233645.9273-1-zfigura@codeweavers.com/
* Link to RFC v2: https://lore.kernel.org/lkml/20240131021356.10322-1-zfigura@codeweavers.com/
* Link to RFC v1: https://lore.kernel.org/lkml/20240124004028.16826-1-zfigura@codeweavers.com/

Elizabeth Figura (30):
  ntsync: Introduce the ntsync driver and character device.
  ntsync: Introduce NTSYNC_IOC_CREATE_SEM.
  ntsync: Introduce NTSYNC_IOC_SEM_POST.
  ntsync: Introduce NTSYNC_IOC_WAIT_ANY.
  ntsync: Introduce NTSYNC_IOC_WAIT_ALL.
  ntsync: Introduce NTSYNC_IOC_CREATE_MUTEX.
  ntsync: Introduce NTSYNC_IOC_MUTEX_UNLOCK.
  ntsync: Introduce NTSYNC_IOC_MUTEX_KILL.
  ntsync: Introduce NTSYNC_IOC_CREATE_EVENT.
  ntsync: Introduce NTSYNC_IOC_EVENT_SET.
  ntsync: Introduce NTSYNC_IOC_EVENT_RESET.
  ntsync: Introduce NTSYNC_IOC_EVENT_PULSE.
  ntsync: Introduce NTSYNC_IOC_SEM_READ.
  ntsync: Introduce NTSYNC_IOC_MUTEX_READ.
  ntsync: Introduce NTSYNC_IOC_EVENT_READ.
  ntsync: Introduce alertable waits.
  selftests: ntsync: Add some tests for semaphore state.
  selftests: ntsync: Add some tests for mutex state.
  selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ANY.
  selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ALL.
  selftests: ntsync: Add some tests for wakeup signaling with WINESYNC_IOC_WAIT_ANY.
  selftests: ntsync: Add some tests for wakeup signaling with WINESYNC_IOC_WAIT_ALL.
  selftests: ntsync: Add some tests for manual-reset event state.
  selftests: ntsync: Add some tests for auto-reset event state.
  selftests: ntsync: Add some tests for wakeup signaling with events.
  selftests: ntsync: Add tests for alertable waits.
  selftests: ntsync: Add some tests for wakeup signaling via alerts.
  selftests: ntsync: Add a stress test for contended waits.
  maintainers: Add an entry for ntsync.
  docs: ntsync: Add documentation for the ntsync uAPI.

 Documentation/userspace-api/index.rst         |    1 +
 .../userspace-api/ioctl/ioctl-number.rst      |    2 +
 Documentation/userspace-api/ntsync.rst        |  399 +++++
 MAINTAINERS                                   |    9 +
 drivers/misc/Kconfig                          |   11 +
 drivers/misc/Makefile                         |    1 +
 drivers/misc/ntsync.c                         | 1166 ++++++++++++++
 include/uapi/linux/ntsync.h                   |   62 +
 tools/testing/selftests/Makefile              |    1 +
 .../testing/selftests/drivers/ntsync/Makefile |    8 +
 tools/testing/selftests/drivers/ntsync/config |    1 +
 .../testing/selftests/drivers/ntsync/ntsync.c | 1407 +++++++++++++++++
 12 files changed, 3068 insertions(+)
 create mode 100644 Documentation/userspace-api/ntsync.rst
 create mode 100644 drivers/misc/ntsync.c
 create mode 100644 include/uapi/linux/ntsync.h
 create mode 100644 tools/testing/selftests/drivers/ntsync/Makefile
 create mode 100644 tools/testing/selftests/drivers/ntsync/config
 create mode 100644 tools/testing/selftests/drivers/ntsync/ntsync.c

base-commit: 4cece764965020c22cff7665b18a012006359095
-- 
2.43.0
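
The locking rule described under "Implementation" (one lock per object, plus a device-wide lock whenever several object locks are held together) can be sketched in user space. This is only a model under stated assumptions: it uses pthread mutexes where the cover letter speaks of spinlocks, it treats every object as a simple counter, and all `model_*` names are hypothetical rather than taken from the driver.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MODEL_MAX_WAIT_COUNT 64	/* NT's limit on objects in one wait */

/* Hypothetical semaphore-like object guarded by its own lock. */
struct model_obj {
	pthread_mutex_t lock;
	unsigned int count;
};

/* Taken whenever more than one object lock will be held at a time. */
pthread_mutex_t model_wait_all_lock = PTHREAD_MUTEX_INITIALIZER;

void model_obj_init(struct model_obj *obj, unsigned int count)
{
	pthread_mutex_init(&obj->lock, NULL);
	obj->count = count;
}

/* Single-object operation: only the object's own lock is needed. */
void model_obj_post(struct model_obj *obj)
{
	pthread_mutex_lock(&obj->lock);
	obj->count++;
	pthread_mutex_unlock(&obj->lock);
}

/*
 * Non-blocking wait-for-all: consume one unit from every object, or from
 * none of them. The objects must be distinct. Returns 1 on success and
 * 0 if at least one object was not signaled.
 */
int model_try_wait_all(struct model_obj **objs, size_t n)
{
	size_t i;
	int ready = 1;

	assert(n > 0 && n <= MODEL_MAX_WAIT_COUNT);

	/* Serialize every multi-object locker behind the global lock. */
	pthread_mutex_lock(&model_wait_all_lock);
	for (i = 0; i < n; i++)
		pthread_mutex_lock(&objs[i]->lock);

	for (i = 0; i < n; i++)
		if (objs[i]->count == 0)
			ready = 0;
	if (ready)
		for (i = 0; i < n; i++)
			objs[i]->count--;

	for (i = n; i-- > 0; )
		pthread_mutex_unlock(&objs[i]->lock);
	pthread_mutex_unlock(&model_wait_all_lock);
	return ready;
}
```

Because only the wait-for-all path ever holds more than one object lock, and that path is serialized by the global lock, the order in which object locks are taken cannot deadlock in this model, while single-object operations stay on the cheap path.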

[RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

by Djalal Harouni

This patch series adds support for freezing the cgroup hierarchy of a
task on the default cgroup v2 hierarchy without going through the
kernfs interface.

In some cases we want to freeze the cgroup of a task based on some
signals; doing so from bpf is better than from user space, which could
act too late.

Planned users of this feature are tetragon and systemd, when freezing a
cgroup hierarchy that could be a K8s pod, container, system service or a
user session.

Patch 1: cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf
Patch 2: bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task
Patch 3: selftests/bpf: add selftest for bpf_task_freeze_cgroup

 include/linux/cgroup.h                                      |   2 ++
 kernel/bpf/helpers.c                                        |  31 ++++
 kernel/cgroup/cgroup.c                                      |  67 ++++++++
 tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c | 165 +++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c | 110 ++++++++++++++
 5 files changed, 375 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c

-- 
2.34.1
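
For contrast, the existing user-space route that the series wants to bypass is a plain write to the cgroup v2 `cgroup.freeze` interface file. The sketch below shows that route only; `cgroup_set_frozen()` is a hypothetical helper, the cgroup directory is whatever path the caller's cgroup2 mount uses, and nothing here reflects the proposed kfunc's signature.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write "1" (freeze) or "0" (thaw) to <cgroup_dir>/cgroup.freeze.
 * Returns 0 on success or a negative errno value on failure.
 */
int cgroup_set_frozen(const char *cgroup_dir, int freeze)
{
	char path[4096];
	ssize_t written;
	int fd, n, err = 0;

	assert(cgroup_dir != NULL);

	n = snprintf(path, sizeof(path), "%s/cgroup.freeze", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	written = write(fd, freeze ? "1" : "0", 1);
	if (written < 0)
		err = -errno;
	else if (written != 1)
		err = -EIO;

	close(fd);
	return err;
}
```

A monitoring daemon would call this after receiving an event from the kernel, which is exactly the round trip through user space that the cover letter describes as potentially too late.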

[PATCH v4 00/15] RISC-V SBI v2.0 PMU improvements and Perf sampling in KVM guest

by Atish Patra

This series implements SBI PMU improvements done in SBI v2.0[1], i.e. PMU
snapshot and fw_read_hi() functions.

SBI v2.0 introduced the PMU snapshot feature, which allows the SBI
implementation to provide counter information (i.e. values/overflow status)
via shared memory between the SBI implementation and the supervisor OS. This
minimizes the number of traps when perf is used inside a KVM guest, as it
relies on SBI PMU + trap/emulation of the counters.

The current set of ratified RISC-V specifications also doesn't allow scountovf
to be trap/emulated by the hypervisor. The SBI PMU snapshot bridges the gap in
the ISA as well and enables perf sampling in the guest. However, LCOFI in the
guest only works via IRQ filtering in the AIA specification. That's why AIA
has to be enabled in the hardware (at least the Ssaia extension) in order to
use the sampling support in perf.

Here are the patch-wise implementation details.

PATCH 1,6,7 : Generic cleanups/improvements.
PATCH 2,3,10 : FW_READ_HI function implementation
PATCH 4-5: Add PMU snapshot feature in sbi pmu driver
PATCH 8-9: KVM implementation for snapshot and sampling in kvm guests
PATCH 11-15: KVM selftests for SBI PMU extension

The series is based on kvm-next and is available at:

https://github.com/atishp04/linux/tree/kvm_pmu_snapshot_v4

The series is based on kvm-riscv/queue branch + fixes suggested on the
following series
https://patchwork.kernel.org/project/kvm/cover/cover.1705916069.git.haibo1.…

The kvmtool patch is also available at:
https://github.com/atishp04/kvmtool/tree/sscofpmf

It also requires the Ssaia ISA extension to be present in the hardware in
order to get perf sampling support in the guest. In the Qemu virt machine, it
can be done with the following config.

```
-cpu rv64,sscofpmf=true,x-ssaia=true
```

There are no other dependencies on AIA apart from that. Thus, Ssaia must be
disabled for the guest if AIA patches are not available. Here is the example
command.

```
./lkvm-static run -m 256 -c2 --console serial -p "console=ttyS0 earlycon" --disable-ssaia -k ./Image --debug
```

The series has been tested only in Qemu.
Here is a snippet of perf running inside a kvm guest.

===================================================
$ perf record -e cycles -e instructions perf bench sched messaging -g 5
...
$ Running 'sched/messaging' benchmark:
...
[ 45.928723] perf_duration_warn: 2 callbacks suppressed
[ 45.929000] perf: interrupt took too long (484426 > 483186), lowering kernel.perf_event_max_sample_rate to 250
$ 20 sender and receiver processes per group
$ 5 groups == 200 processes run

Total time: 14.220 [sec]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.117 MB perf.data (1942 samples) ]
$ perf report --stdio
$ To display the perf.data header info, please use --header/--header-only optio>
$
$
$ Total Lost Samples: 0
$
$ Samples: 943 of event 'cycles'
$ Event count (approx.): 5128976844
$
$ Overhead Command Shared Object Symbol >
$ ........ ............... ........................... .....................>
$
  7.59% sched-messaging [kernel.kallsyms] [k] memcpy
  5.48% sched-messaging [kernel.kallsyms] [k] percpu_counter_ad>
  5.24% sched-messaging [kernel.kallsyms] [k] __sbi_rfence_v02_>
  4.00% sched-messaging [kernel.kallsyms] [k] _raw_spin_unlock_>
  3.79% sched-messaging [kernel.kallsyms] [k] set_pte_range
  3.72% sched-messaging [kernel.kallsyms] [k] next_uptodate_fol>
  3.46% sched-messaging [kernel.kallsyms] [k] filemap_map_pages
  3.31% sched-messaging [kernel.kallsyms] [k] handle_mm_fault
  3.20% sched-messaging [kernel.kallsyms] [k] finish_task_switc>
  3.16% sched-messaging [kernel.kallsyms] [k] clear_page
  3.03% sched-messaging [kernel.kallsyms] [k] mtree_range_walk
  2.42% sched-messaging [kernel.kallsyms] [k] flush_icache_pte
===================================================

[1] https://github.com/riscv-non-isa/riscv-sbi-doc

Changes from v3->v4:
1. Added selftests.
2. Fixed an issue to clear the interrupt pending bits.
3. Fixed the counter index in snapshot memory start function.

Changes from v2->v3:
1. Fixed a patchwork warning on patch6.
2. Fixed a comment formatting & nit fix in PATCH 3 & 5.
3. Moved the hvien update and sscofpmf enabling to PATCH 9 from PATCH 8.

Changes from v1->v2:
1. Fixed warning/errors from patchwork CI.
2. Rebased on top of kvm-next.
3. Added Acked-by tags.

Changes from RFC->v1:
1. Addressed all the comments on RFC series.
2. Removed PATCH2 and merged into later patches.
3. Added 2 more patches for minor fixes.
4. Fixed KVM boot issue without Ssaia and made sscofpmf in guest dependent on
   Ssaia in the host.

Atish Patra (15):
  RISC-V: Fix the typo in Scountovf CSR name
  RISC-V: Add FIRMWARE_READ_HI definition
  drivers/perf: riscv: Read upper bits of a firmware counter
  RISC-V: Add SBI PMU snapshot definitions
  drivers/perf: riscv: Implement SBI PMU snapshot function
  RISC-V: KVM: No need to update the counter value during reset
  RISC-V: KVM: No need to exit to the user space if perf event failed
  RISC-V: KVM: Implement SBI PMU Snapshot feature
  RISC-V: KVM: Add perf sampling support for guests
  RISC-V: KVM: Support 64 bit firmware counters on RV32
  KVM: riscv: selftests: Add Sscofpmf to get-reg-list test
  KVM: riscv: selftests: Add SBI PMU extension definitions
  KVM: riscv: selftests: Add SBI PMU selftest
  KVM: riscv: selftests: Add a test for PMU snapshot functionality
  KVM: riscv: selftests: Add a test for counter overflow

 arch/riscv/include/asm/csr.h                  |   5 +-
 arch/riscv/include/asm/errata_list.h          |   2 +-
 arch/riscv/include/asm/kvm_vcpu_pmu.h         |  14 +-
 arch/riscv/include/asm/sbi.h                  |  12 +
 arch/riscv/include/uapi/asm/kvm.h             |   1 +
 arch/riscv/kvm/aia.c                          |   5 +
 arch/riscv/kvm/vcpu.c                         |  14 +-
 arch/riscv/kvm/vcpu_onereg.c                  |   9 +-
 arch/riscv/kvm/vcpu_pmu.c                     | 247 +++++++-
 arch/riscv/kvm/vcpu_sbi_pmu.c                 |  15 +-
 drivers/perf/riscv_pmu.c                      |   1 +
 drivers/perf/riscv_pmu_sbi.c                  | 229 ++++++-
 include/linux/perf/riscv_pmu.h                |   6 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/riscv/processor.h   |  92 +++
 .../selftests/kvm/lib/riscv/processor.c       |  12 +
 .../selftests/kvm/riscv/get-reg-list.c        |   4 +
 tools/testing/selftests/kvm/riscv/sbi_pmu.c   | 588 ++++++++++++++++++
 18 files changed, 1212 insertions(+), 45 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/riscv/sbi_pmu.c

-- 
2.34.1
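
The reason a separate fw_read_hi() call matters on RV32 can be shown with a small arithmetic sketch: a 64-bit firmware counter has to be assembled from two 32-bit reads there, whereas one read already carries the whole value on RV64. `model_fw_counter_value()` is a hypothetical helper for illustration and is not code from the series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assemble a firmware counter value from its low and high reads.
 * xlen is the register width in bits (32 or 64); lo and hi are the
 * values returned by the low-half and high-half reads respectively.
 */
uint64_t model_fw_counter_value(unsigned int xlen, uint64_t lo, uint64_t hi)
{
	assert(xlen == 32 || xlen == 64);

	if (xlen == 64)
		return lo;	/* a single read already holds all 64 bits */

	/* RV32: the low read supplies bits 31:0, the high read bits 63:32. */
	return (hi << 32) | (lo & 0xffffffffULL);
}
```

Without the high half, a counter on RV32 would appear to wrap every 2^32 events.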

[PATCH v6 0/5] KVM: arm64: Support for 2023 dpISA extensions

by Mark Brown

This series implements support for the 2023 dpISA extensions in KVM
guests. It was previously posted as part of a series with the host
support, but that has now been merged so only the KVM portions remain.

Most of these extensions add only new instructions, so the guest support
consists of adding the relevant ID registers and masking out other
features like the 2023 MTE extensions. FEAT_FPMR introduces a new system
register FPMR to the floating point state, which we enable guest access
to and context switch when the ID registers indicate that it is
supported.

Currently we implement visibility for FPMR with a fpmr_visibility()
function, as for other system registers. I will separately look into
adding support for specifying this in the struct sys_reg_desc.

Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v6:
- Rebase onto v6.9-rc1.
- The host portions of the series were merged so only the KVM guest
  support remains.
- Link to v5: https://lore.kernel.org/r/20240306-arm64-2023-dpisa-v5-0-c568edc8ed7f@kerne…

Changes in v5:
- Rebase onto v6.8-rc3.
- Use u64 rather than unsigned long for storing FPMR.
- Temporarily drop KVM guest support due to issues with KVM being a
  moving target.
- Link to v4: https://lore.kernel.org/r/20240122-arm64-2023-dpisa-v4-0-776e094861df@kerne…

Changes in v4:
- Rebase onto v6.8-rc1.
- Move KVM support to the end of the series.
- Link to v3: https://lore.kernel.org/r/20231205-arm64-2023-dpisa-v3-0-dbcbcd867a7f@kerne…

Changes in v3:
- Rebase onto v6.7-rc3.
- Hook up traps for FPMR in emulate-nested.c.
- Link to v2: https://lore.kernel.org/r/20231114-arm64-2023-dpisa-v2-0-47251894f6a8@kerne…

Changes in v2:
- Rebase onto v6.7-rc1.
- Link to v1: https://lore.kernel.org/r/20231026-arm64-2023-dpisa-v1-0-8470dd989bb2@kerne…

---
Mark Brown (5):
      KVM: arm64: Share all userspace hardened thread data with the hypervisor
      KVM: arm64: Add newly allocated ID registers to register descriptions
      KVM: arm64: Support FEAT_FPMR for guests
      KVM: arm64: selftests: Document feature registers added in 2023 extensions
      KVM: arm64: selftests: Teach get-reg-list about FPMR

 arch/arm64/include/asm/kvm_host.h                  |  6 ++++--
 arch/arm64/include/asm/processor.h                 |  2 +-
 arch/arm64/kvm/emulate-nested.c                    |  9 ++++++++
 arch/arm64/kvm/fpsimd.c                            | 15 +++++++-------
 arch/arm64/kvm/hyp/include/hyp/switch.h            |  9 ++++++--
 arch/arm64/kvm/hyp/nvhe/hyp-main.c                 |  4 ++--
 arch/arm64/kvm/sys_regs.c                          | 24 +++++++++++++++++++---
 tools/testing/selftests/kvm/aarch64/get-reg-list.c | 11 ++++++++--
 8 files changed, 60 insertions(+), 20 deletions(-)
---
base-commit: 4cece764965020c22cff7665b18a012006359095
change-id: 20231003-arm64-2023-dpisa-2f3d25746474

Best regards,
-- 
Mark Brown <broonie(a)kernel.org>

[PATCH v4 0/7] Extend HID-BPF kfuncs (was: allow HID-BPF to do device IOs)

by Benjamin Tissoires

New version of the sleepable bpf_timer code, without BPF changes, as they can now go through the HID tree independently:

https://lore.kernel.org/all/20240315-hid-bpf-sleepable-v4-0-5658f2540564@ke…

For reference, the use cases I have in mind:

---

Basically, I need to be able to defer a HID-BPF program for the following reasons (from the aforementioned patch):

1. defer an event:
   Sometimes we receive an out of proximity event, but the device can not be trusted enough, and we need to ensure that we won't receive another one in the following n milliseconds. So we need to wait those n milliseconds, and eventually re-inject that event in the stack.

2. inject new events in reaction to one given event:
   We might want to transform one given event into several. This is the case for macro keys where a single key press is supposed to send a sequence of key presses. But this could also be used to patch a faulty behavior, if a device forgets to send a release event.

3. communicate with the device in reaction to one event:
   We might want to communicate back to the device after a given event. For example a device might send us an event saying that it came back from sleeping state and needs to be re-initialized.

Currently we can achieve that by keeping a userspace program around, raising a bpf event, and letting that userspace program inject the events and commands. However, we are keeping that program alive as a daemon just for scheduling commands. There is no logic in it, so it doesn't really justify an actual userspace wakeup. So a kernel workqueue seems simpler to handle.

bpf_timers currently run in a soft IRQ context; this patch series implements a sleepable context for them.

Cheers,
Benjamin

To: Jiri Kosina <jikos(a)kernel.org>
To: Benjamin Tissoires <benjamin.tissoires(a)redhat.com>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
Cc: Benjamin Tissoires <bentiss(a)kernel.org>
Cc: <linux-input(a)vger.kernel.org>
Cc: <linux-kernel(a)vger.kernel.org>
Cc: <bpf(a)vger.kernel.org>
Cc: <linux-doc(a)vger.kernel.org>
Cc: <linux-kselftest(a)vger.kernel.org>

---
Changes in v4:
- dropped the BPF changes, they can go independently in bpf-core
- dropped the HID-BPF integration tests with the sleepable timers, I'll re-add them once both series (this and sleepable timers) are merged
- Link to v3: https://lore.kernel.org/r/20240221-hid-bpf-sleepable-v3-0-1fb378ca6301@kern…

Changes in v3:
- fixed the crash from v2
- changed the API to have only BPF_F_TIMER_SLEEPABLE for bpf_timer_start()
- split the new kfuncs/verifier patch into several sub-patches, for easier reviews
- Link to v2: https://lore.kernel.org/r/20240214-hid-bpf-sleepable-v2-0-5756b054724d@kern…

Changes in v2:
- make use of bpf_timer (and dropped the custom HID handling)
- implemented bpf_timer_set_sleepable_cb as a kfunc
- still not implemented global subprogs
- no sleepable bpf_timer selftests yet
- Link to v1: https://lore.kernel.org/r/20240209-hid-bpf-sleepable-v1-0-4cc895b5adbd@kern…

---
Benjamin Tissoires (7):
      HID: bpf/dispatch: regroup kfuncs definitions
      HID: bpf: export hid_hw_output_report as a BPF kfunc
      selftests/hid: add KASAN to the VM tests
      selftests/hid: Add test for hid_bpf_hw_output_report
      HID: bpf: allow to inject HID event from BPF
      selftests/hid: add tests for hid_bpf_input_report
      HID: bpf: allow to use bpf_timer_set_sleepable_cb() in tracing callbacks.

 Documentation/hid/hid-bpf.rst                      |   2 +-
 drivers/hid/bpf/hid_bpf_dispatch.c                 | 226 ++++++++++++++-------
 drivers/hid/hid-core.c                             |   2 +
 include/linux/hid_bpf.h                            |   3 +
 tools/testing/selftests/hid/config.common          |   1 +
 tools/testing/selftests/hid/hid_bpf.c              | 112 +++++++++-
 tools/testing/selftests/hid/progs/hid.c            |  46 +++++
 .../testing/selftests/hid/progs/hid_bpf_helpers.h  |   6 +
 8 files changed, 324 insertions(+), 74 deletions(-)
---
base-commit: 3e78a6c0d3e02e4cf881dc84c5127e9990f939d6
change-id: 20240314-b4-hid-bpf-new-funcs-ecf05d0ef870

Best regards,
-- 
Benjamin Tissoires <bentiss(a)kernel.org>
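The "defer an event" use case from the cover letter can be read as a small state machine: hold the event back, wait n milliseconds, then either drop it or re-inject it. The following is only a plain userspace C sketch of that decision logic, not HID-BPF code. The struct and the three function names are hypothetical, and it assumes that any further event arriving inside the window means the held-back one should be dropped, which the cover letter does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one held-back event; not a HID-BPF structure. */
struct deferred_event {
	bool pending;          /* an event is currently being held back */
	uint64_t deadline_ms;  /* earliest time at which it may be re-injected */
};

/* Hold the event back for delay_ms milliseconds, starting at now_ms. */
static void defer_event(struct deferred_event *d, uint64_t now_ms,
			uint64_t delay_ms)
{
	assert(d);
	d->pending = true;
	d->deadline_ms = now_ms + delay_ms;
}

/*
 * Another event arrived from the device. Assumption: if it lands inside the
 * waiting window, the held-back event is treated as untrustworthy and dropped.
 * Returns true when the held-back event was dropped.
 */
static bool on_new_event(struct deferred_event *d, uint64_t now_ms)
{
	assert(d);
	if (d->pending && now_ms < d->deadline_ms) {
		d->pending = false;
		return true;
	}
	return false;
}

/*
 * The deferred callback (timer or work item) ran. Returns true when the
 * held-back event survived the window and should now be re-injected.
 */
static bool on_deferred_callback(struct deferred_event *d, uint64_t now_ms)
{
	assert(d);
	if (d->pending && now_ms >= d->deadline_ms) {
		d->pending = false;
		return true;
	}
	return false;
}
```

In the series itself the waiting and the re-injection would happen in kernel context; the sketch only shows why the callback has to run later than the event that triggered it.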


[PATCH v2] selftests: sud_test: return correct emulated syscall value on RISC-V

by Clément Léger

Currently, the sud_test expects the emulated syscall to return the emulated syscall number. This assumption only works on architectures where the syscall calling convention uses the same register for the syscall number and the syscall return value. This is not the case for RISC-V, and thus the return value must also be emulated using the provided ucontext.

Signed-off-by: Clément Léger <cleger(a)rivosinc.com>
Reviewed-by: Palmer Dabbelt <palmer(a)rivosinc.com>
Acked-by: Palmer Dabbelt <palmer(a)rivosinc.com>
---
Changes in V2:
 - Changes comment to be more explicit
 - Use A7 syscall arg rather than hardcoding MAGIC_SYSCALL_1
---
 .../selftests/syscall_user_dispatch/sud_test.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_test.c b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
index b5d592d4099e..d975a6767329 100644
--- a/tools/testing/selftests/syscall_user_dispatch/sud_test.c
+++ b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
@@ -158,6 +158,20 @@ static void handle_sigsys(int sig, siginfo_t *info, void *ucontext)
 
 	/* In preparation for sigreturn. */
 	SYSCALL_DISPATCH_OFF(glob_sel);
+
+	/*
+	 * The tests for argument handling assume that `syscall(x) == x`. This
+	 * is a NOP on x86 because the syscall number is passed in %rax, which
+	 * happens to also be the function ABI return register. Other
+	 * architectures may need to swizzle the arguments around.
+	 */
+#if defined(__riscv)
+/* REG_A7 is not defined in libc headers */
+# define REG_A7 (REG_A0 + 7)
+
+	((ucontext_t *)ucontext)->uc_mcontext.__gregs[REG_A0] =
+			((ucontext_t *)ucontext)->uc_mcontext.__gregs[REG_A7];
+#endif
 }
 
 TEST(dispatch_and_return)
-- 
2.43.0
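To illustrate why the fix is a no-op on x86 but required on RISC-V, here is a minimal standalone sketch that models the saved registers as a plain array. The `fake_regs` type, the `REG_RET` index and `emulate_return()` are hypothetical names for this illustration only; the real test manipulates `uc_mcontext.__gregs` as shown in the diff.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical, simplified register file. Index 0 stands for the register
 * that holds the syscall return value (a0 on RISC-V, rax on x86-64).
 */
enum { REG_RET = 0, NREGS = 8 };

struct fake_regs {
	uint64_t r[NREGS];
};

/*
 * Make the interrupted syscall(x) appear to return x by copying the syscall
 * number into the return register. nr_reg is the index of the register that
 * carries the syscall number: 0 models x86-64 (rax is used for both), 7
 * models RISC-V (number in a7, return value in a0).
 */
static void emulate_return(struct fake_regs *regs, int nr_reg)
{
	assert(regs && nr_reg >= 0 && nr_reg < NREGS);
	regs->r[REG_RET] = regs->r[nr_reg];
}
```

With `nr_reg == 0` the assignment copies a register onto itself, which matches the "NOP on x86" remark in the patch comment; with `nr_reg == 7` the return register is overwritten with the syscall number, which is what the `REG_A0`/`REG_A7` assignment does.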


[RFC PATCH v3 0/8] mm: workingset reporting

by Yuanchu Xie

This patch series provides workingset reporting of user pages in lruvecs, of which coldness can be tracked by accessed bits and fd references. However, the concept of workingset applies generically to all types of memory, which could be kernel slab caches, discardable userspace caches (databases), or CXL.mem. Therefore, data sources might come from slab shrinkers, device drivers, or the userspace. IMO, the kernel should provide a set of workingset interfaces that should be generic enough to accommodate the various use cases, and be extensible to potential future use cases. The current proposed interfaces are not sufficient in that regard, but I would like to start somewhere, solicit feedback, and iterate.

Use cases
==========
Job scheduling
For data center machines, workingset information allows the job scheduler to right-size each job and land more jobs on the same host or NUMA node, and in the case of a job with increasing workingset, policy decisions can be made to migrate other jobs off the host/NUMA node, or oom-kill the misbehaving job. If the job shape is very different from the machine shape, knowing the workingset per-node can also help inform page allocation policies.

Proactive reclaim
Workingset information allows a container manager to proactively reclaim memory while not impacting a job's performance. While PSI may provide a reactive measure of when a proactive reclaim has reclaimed too much, workingset reporting enables the policy to be more accurate and flexible.

Ballooning (similar to proactive reclaim)
While this patch series does not extend the virtio-balloon device, balloon policies benefit from workingset to more precisely determine the size of the memory balloon. On desktops/laptops/mobile devices where memory is scarce and overcommitted, the balloon sizing in multiple VMs running on the same device can be orchestrated with workingset reports from each one.

Promotion/Demotion
Similar to proactive reclaim, a workingset report enables demotion to a slower tier of memory. For promotion, the workingset report interfaces need to be extended to report hotness and gather hotness information from the devices[1].

[1] https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements…

Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main idea here is we break down the workingset per-node per-memcg into time intervals (ms), e.g.

1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892

I realize this does not generalize well to hotness information, but I lack the intuition for an abstraction that presents hotness in a useful way. Based on a recent proposal for move_phys_pages[2], it seems like userspace tiering software would like to move specific physical pages, instead of informing the kernel "move x number of hot pages to y device". Please advise.

[2] https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge…

Implementation
==========
Currently, the reporting of user pages is based off of MGLRU, and therefore requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more fine-grained workingset report. I will make the generation count configurable in the next version. The workingset reporting mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind CONFIG_WORKINGSET_REPORT_AGING.

--
Changes from RFC v2 -> RFC v3:
- Update to v6.8
- Added an aging kernel thread (gated behind config)
- Added basic selftests for sysfs interface files
- Track swapped out pages for reaccesses
- Refactoring and cleanup
- Dropped the virtio-balloon extension to make things manageable

Changes from RFC v1 -> RFC v2:
- Refactored the patches into smaller pieces
- Renamed interfaces and functions from wss to wsr (Working Set Reporting)
- Fixed build errors when CONFIG_WSR is not set
- Changed working_set_num_bins to u8 for virtio-balloon
- Added support for per-NUMA node reporting for virtio-balloon

[rfc v1] https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.co…
[rfc v2] https://lore.kernel.org/linux-mm/20230621180454.973862-1-yuanchu@google.com/

Yuanchu Xie (8):
  mm: multi-gen LRU: ignore non-leaf pmd_young for force_scan=true
  mm: aggregate working set information into histograms
  mm: use refresh interval to rate-limit workingset report aggregation
  mm: report workingset during memory pressure driven scanning
  mm: extend working set reporting to memcgs
  mm: add per-memcg reaccess histogram
  mm: add kernel aging thread for workingset reporting
  mm: test system-wide workingset reporting

 drivers/base/node.c                           |   3 +
 include/linux/memcontrol.h                    |   5 +
 include/linux/mmzone.h                        |   4 +
 include/linux/workingset_report.h             | 107 +++
 mm/Kconfig                                    |  15 +
 mm/Makefile                                   |   2 +
 mm/internal.h                                 |  45 ++
 mm/memcontrol.c                               | 386 ++++++++-
 mm/mmzone.c                                   |   2 +
 mm/vmscan.c                                   |  95 ++-
 mm/workingset.c                               |   9 +-
 mm/workingset_report.c                        | 757 ++++++++++++++++++
 mm/workingset_report_aging.c                  | 127 +++
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   3 +
 .../testing/selftests/mm/workingset_report.c  | 315 ++++++++
 .../testing/selftests/mm/workingset_report.h  |  37 +
 .../selftests/mm/workingset_report_test.c     | 328 ++++++++
 18 files changed, 2231 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c
 create mode 100644 mm/workingset_report_aging.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.h
 create mode 100644 tools/testing/selftests/mm/workingset_report_test.c

-- 
2.44.0.396.g6e790dbe36-goog
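For anyone wanting to consume the histogram format quoted in the cover letter ("<interval in ms> anon=<n> file=<n>" per line), a parser could look roughly like the sketch below. The `ws_bin` struct and `ws_parse_bin()` are hypothetical names, not part of the series, and the sketch makes no assumption about where the file lives or what unit the anon/file values use.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One histogram bin as printed in the example report above. */
struct ws_bin {
	uint64_t interval_ms;  /* upper bound of the bin, in milliseconds */
	uint64_t anon;         /* value after "anon=" */
	uint64_t file;         /* value after "file=" */
};

/*
 * Parse one line such as "1000 anon=137368 file=24530".
 * Returns 0 on success, -1 if the line does not match the expected layout.
 */
static int ws_parse_bin(const char *line, struct ws_bin *out)
{
	assert(line && out);
	if (sscanf(line, "%" SCNu64 " anon=%" SCNu64 " file=%" SCNu64,
		   &out->interval_ms, &out->anon, &out->file) != 3)
		return -1;
	return 0;
}
```

The last bin in the example uses 9223372036854775807 as its interval, so a 64-bit field is needed; a 32-bit millisecond counter would not hold it.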

