In the middle of the thread about a patch to add the skip test result,
I suggested documenting the process of deprecating the KTAP v1 Specification
method of marking a skipped test:
https://lore.kernel.org/all/490271eb-1429-2217-6e38-837c6e5e328b@gmail.com/…
In a reply to that email I suggested that we ought to have a process to transition
the KTAP Specification from v1 to v2, and possibly v3 and future.
This email is meant to be the root of that discussion.
My initial thinking is that there are at least three different types of project
and/or community that may have different needs in this area.
Type 1 - project controls both the test output generation and the test output
parsing tool. Both generation and parsing code are in the same repository
and/or synchronized versions are distributed together.
Devicetree unittests are an example of Type 1. I plan to maintain changes
of test output to KTAP v2 format in coordination with updating the parser
to process KTAP v2 data.
Type 2 - project controls both the test output generation and the test output
parsing tool. The test output generation and a parser modifications may be
controlled by the project BUT there are one or more external testing projects
that (1) may have their own parsers, and (2) may have a single framework that
tests multiple versions of the tests.
I think that kselftest and kunit tests are probably examples of Type 2. I also
think that DT unittests will become a Type 2 project as a result of converting
to KTAP v2 data.
Type 3 - project may create and maintain some tests, but is primarily a consumer
of tests created by other projects. Type 3 projects typically have a single
framework that is able to execute and process multiple versions of the tests.
The Fuego test project is an example of Type 3.
Maybe adding all of this complexity of different Types in my initial thinking
was silly -- maybe everything in this topic is governed by the more complex
Type 3.
My thinking was that the three different Types of project would be impacted
in different ways by transition plans. Type 3 would be the most impacted,
so I wanted to be sure that any transition plan especially considered their
needs.
There is an important aspect of the KTAP format that might ease the transition
from one version to another: All KTAP formatted results begin with a "version
line", so as soon as a parser has processed the first line of a test, it can
apply the appropriate KTAP Specification version to all subsequent lines of
test output. A parser implementation could choose to process all versions,
could choose to invoke a version specific parser, or some other approach
all together.
In the "add skip test results" thread, I suggested deprecating the v1
method of marking a skipped test in v2, with a scheduled removal of
the v1 method in v3. But since the KTAP format version is available
in the very first line of test output, is it necessary to do a slow
deprecation and removal over two versions?
One argument to doing a two version deprecation/removal process is that
a parser that is one version older the the test output _might_ be able
to process the test output without error, but would not be able to take
advantage of features added in the newer version of the Specification.
My opinion is that a two version deprecation/removal process will slow
the Specification update process and lead to more versions of the
Specification over a given time interval.
A one version deprecation/removal process puts more of a burden on Type 3
projects and external parsers for Type 2 projects to implement parsers
that can process the newer Specification more quickly and puts a burden
on test maintainers to delay a move to the newer Specification, or possibly
pressure to support selection of more than one Specification version format
for output data.
One additional item... On the KTAP Specification version 2 process wiki page,
I suggested that it is "desirable for test result parsers that understand the
KTAP Specification version 2 data also be able to parse version 1 data."
With the implication "Converting version 1 compliant data to version 2 compliant
data should not require a "flag day" switch of test result parsers." If this
thread discussion results in a different decision, I will update the wiki.
Thoughts?
-Frank
The perf subsystem today unifies various tracing and monitoring
features, from both software and hardware. One benefit of the perf
subsystem is automatically inheriting events to child tasks, which
enables process-wide events monitoring with low overheads. By default
perf events are non-intrusive, not affecting behaviour of the tasks
being monitored.
For certain use-cases, however, it makes sense to leverage the
generality of the perf events subsystem and optionally allow the tasks
being monitored to receive signals on events they are interested in.
This patch series adds the option to synchronously signal user space on
events.
To better support process-wide synchronous self-monitoring, without
events propagating to children that do not share the current process's
shared environment, two pre-requisite patches are added to optionally
restrict inheritance to CLONE_THREAD, and remove events on exec (without
affecting the parent).
Examples how to use these features can be found in the tests added at
the end of the series. In addition to the tests added, the series has
also been subjected to syzkaller fuzzing (focus on 'kernel/events/'
coverage).
Motivation and Example Uses
---------------------------
1. Our immediate motivation is low-overhead sampling-based race
detection for user space [1]. By using perf_event_open() at
process initialization, we can create hardware
breakpoint/watchpoint events that are propagated automatically
to all threads in a process. As far as we are aware, today no
existing kernel facility (such as ptrace) allows us to set up
process-wide watchpoints with minimal overheads (that are
comparable to mprotect() of whole pages).
2. Other low-overhead error detectors that rely on detecting
accesses to certain memory locations or code, process-wide and
also only in a specific set of subtasks or threads.
[1] https://llvm.org/devmtg/2020-09/slides/Morehouse-GWP-Tsan.pdf
Other ideas for use-cases we found interesting, but should only
illustrate the range of potential to further motivate the utility (we're
sure there are more):
3. Code hot patching without full stop-the-world. Specifically, by
setting a code breakpoint to entry to the patched routine, then
send signals to threads and check that they are not in the
routine, but without stopping them further. If any of the
threads will enter the routine, it will receive SIGTRAP and
pause.
4. Safepoints without mprotect(). Some Java implementations use
"load from a known memory location" as a safepoint. When threads
need to be stopped, the page containing the location is
mprotect()ed and threads get a signal. This could be replaced with
a watchpoint, which does not require a whole page nor DTLB
shootdowns.
5. Threads receiving signals on performance events to
throttle/unthrottle themselves.
6. Tracking data flow globally.
Changelog
---------
v4:
* Fix for parent and child racing to exit in sync_child_event().
* Fix race between irq_work running and task's sighand being released by
release_task().
* Generalize setting si_perf and si_addr independent of event type;
introduces perf_event_attr::sig_data, which can be set by user space
to be propagated to si_perf.
* Warning in perf_sigtrap() if ctx->task and current mismatch; we expect
this on architectures that do not properly implement
arch_irq_work_raise().
* Require events that want sigtrap to be associated with a task.
* Dropped "perf: Add breakpoint information to siginfo on SIGTRAP"
in favor of more generic solution (perf_event_attr::sig_data).
v3:
* Add patch "perf: Rework perf_event_exit_event()" to beginning of
series, courtesy of Peter Zijlstra.
* Rework "perf: Add support for event removal on exec" based on
the added "perf: Rework perf_event_exit_event()".
* Fix kselftests to work with more recent libc, due to the way it forces
using the kernel's own siginfo_t.
* Add basic perf-tool built-in test.
v2/RFC: https://lkml.kernel.org/r/20210310104139.679618-1-elver@google.com
* Patch "Support only inheriting events if cloned with CLONE_THREAD"
added to series.
* Patch "Add support for event removal on exec" added to series.
* Patch "Add kselftest for process-wide sigtrap handling" added to
series.
* Patch "Add kselftest for remove_on_exec" added to series.
* Implicitly restrict inheriting events if sigtrap, but the child was
cloned with CLONE_CLEAR_SIGHAND, because it is not generally safe if
the child cleared all signal handlers to continue sending SIGTRAP.
* Various minor fixes (see details in patches).
v1/RFC: https://lkml.kernel.org/r/20210223143426.2412737-1-elver@google.com
Pre-series: The discussion at [2] led to the changes in this series. The
approach taken in "Add support for SIGTRAP on perf events" to trigger
the signal was suggested by Peter Zijlstra in [3].
[2] https://lore.kernel.org/lkml/CACT4Y+YPrXGw+AtESxAgPyZ84TYkNZdP0xpocX2jwVAbZ…
[3] https://lore.kernel.org/lkml/YBv3rAT566k+6zjg@hirez.programming.kicks-ass.n…
Marco Elver (9):
perf: Apply PERF_EVENT_IOC_MODIFY_ATTRIBUTES to children
perf: Support only inheriting events if cloned with CLONE_THREAD
perf: Add support for event removal on exec
signal: Introduce TRAP_PERF si_code and si_perf to siginfo
perf: Add support for SIGTRAP on perf events
selftests/perf_events: Add kselftest for process-wide sigtrap handling
selftests/perf_events: Add kselftest for remove_on_exec
tools headers uapi: Sync tools/include/uapi/linux/perf_event.h
perf test: Add basic stress test for sigtrap handling
Peter Zijlstra (1):
perf: Rework perf_event_exit_event()
arch/m68k/kernel/signal.c | 3 +
arch/x86/kernel/signal_compat.c | 5 +-
fs/signalfd.c | 4 +
include/linux/compat.h | 2 +
include/linux/perf_event.h | 9 +-
include/linux/signal.h | 1 +
include/uapi/asm-generic/siginfo.h | 6 +-
include/uapi/linux/perf_event.h | 12 +-
include/uapi/linux/signalfd.h | 4 +-
kernel/events/core.c | 302 +++++++++++++-----
kernel/fork.c | 2 +-
kernel/signal.c | 11 +
tools/include/uapi/linux/perf_event.h | 12 +-
tools/perf/tests/Build | 1 +
tools/perf/tests/builtin-test.c | 5 +
tools/perf/tests/sigtrap.c | 150 +++++++++
tools/perf/tests/tests.h | 1 +
.../testing/selftests/perf_events/.gitignore | 3 +
tools/testing/selftests/perf_events/Makefile | 6 +
tools/testing/selftests/perf_events/config | 1 +
.../selftests/perf_events/remove_on_exec.c | 260 +++++++++++++++
tools/testing/selftests/perf_events/settings | 1 +
.../selftests/perf_events/sigtrap_threads.c | 210 ++++++++++++
23 files changed, 924 insertions(+), 87 deletions(-)
create mode 100644 tools/perf/tests/sigtrap.c
create mode 100644 tools/testing/selftests/perf_events/.gitignore
create mode 100644 tools/testing/selftests/perf_events/Makefile
create mode 100644 tools/testing/selftests/perf_events/config
create mode 100644 tools/testing/selftests/perf_events/remove_on_exec.c
create mode 100644 tools/testing/selftests/perf_events/settings
create mode 100644 tools/testing/selftests/perf_events/sigtrap_threads.c
--
2.31.0.208.g409f899ff0-goog
From: Dmitry Vyukov <dvyukov(a)google.com>
POSIX timers using the CLOCK_PROCESS_CPUTIME_ID clock prefer the main
thread of a thread group for signal delivery. However, this has a
significant downside: it requires waking up a potentially idle thread.
Instead, prefer to deliver signals to the current thread (in the same
thread group) if SIGEV_THREAD_ID is not set by the user. This does not
change guaranteed semantics, since POSIX process CPU time timers have
never guaranteed that signal delivery is to a specific thread (without
SIGEV_THREAD_ID set).
The effect is that we no longer wake up potentially idle threads, and
the kernel is no longer biased towards delivering the timer signal to
any particular thread (which better distributes the timer signals esp.
when multiple timers fire concurrently).
Signed-off-by: Dmitry Vyukov <dvyukov(a)google.com>
Suggested-by: Oleg Nesterov <oleg(a)redhat.com>
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Signed-off-by: Marco Elver <elver(a)google.com>
---
v6:
- Split test from this patch.
- Update wording on what this patch aims to improve.
v5:
- Rebased onto v6.2.
v4:
- Restructured checks in send_sigqueue() as suggested.
v3:
- Switched to the completely different implementation (much simpler)
based on the Oleg's idea.
RFC v2:
- Added additional Cc as Thomas asked.
---
kernel/signal.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 8cb28f1df294..605445fa27d4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1003,8 +1003,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
/*
* Now find a thread we can wake up to take the signal off the queue.
*
- * If the main thread wants the signal, it gets first crack.
- * Probably the least surprising to the average bear.
+ * Try the suggested task first (may or may not be the main thread).
*/
if (wants_signal(sig, p))
t = p;
@@ -1970,8 +1969,23 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
ret = -1;
rcu_read_lock();
+ /*
+ * This function is used by POSIX timers to deliver a timer signal.
+ * Where type is PIDTYPE_PID (such as for timers with SIGEV_THREAD_ID
+ * set), the signal must be delivered to the specific thread (queues
+ * into t->pending).
+ *
+ * Where type is not PIDTYPE_PID, signals must just be delivered to the
+ * current process. In this case, prefer to deliver to current if it is
+ * in the same thread group as the target, as it avoids unnecessarily
+ * waking up a potentially idle task.
+ */
t = pid_task(pid, type);
- if (!t || !likely(lock_task_sighand(t, &flags)))
+ if (!t)
+ goto ret;
+ if (type != PIDTYPE_PID && same_thread_group(t, current))
+ t = current;
+ if (!likely(lock_task_sighand(t, &flags)))
goto ret;
ret = 1; /* the signal is ignored */
@@ -1993,6 +2007,11 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
q->info.si_overrun = 0;
signalfd_notify(t, sig);
+ /*
+ * If the type is not PIDTYPE_PID, we just use shared_pending, which
+ * won't guarantee that the specified task will receive the signal, but
+ * is sufficient if t==current in the common case.
+ */
pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
list_add_tail(&q->list, &pending->list);
sigaddset(&pending->signal, sig);
--
2.40.0.rc1.284.g88254d51c5-goog
From: Zi Yan <ziy(a)nvidia.com>
Hi all,
File folio supports any order and people would like to support flexible orders
for anonymous folio[1] too. Currently, split_huge_page() only splits a huge
page to order-0 pages, but splitting to orders higher than 0 is also useful.
This patchset adds support for splitting a huge page to any lower order pages
and uses it during file folio truncate operations.
The patchset is on top of mm-everything-2023-03-27-21-20.
Changelog
===
Since v2
---
1. Fixed an issue in __split_page_owner() introduced during my rebase
Since v1
---
1. Changed split_page_memcg() and split_page_owner() parameter to use order
2. Used folio_test_pmd_mappable() in place of the equivalent code
Details
===
* Patch 1 changes split_page_memcg() to use order instead of nr_pages
* Patch 2 changes split_page_owner() to use order instead of nr_pages
* Patch 3 and 4 add new_order parameter split_page_memcg() and
split_page_owner() and prepare for upcoming changes.
* Patch 5 adds split_huge_page_to_list_to_order() to split a huge page
to any lower order. The original split_huge_page_to_list() calls
split_huge_page_to_list_to_order() with new_order = 0.
* Patch 6 uses split_huge_page_to_list_to_order() in large pagecache folio
truncation instead of split the large folio all the way down to order-0.
* Patch 7 adds a test API to debugfs and test cases in
split_huge_page_test selftests.
Comments and/or suggestions are welcome.
[1] https://lore.kernel.org/linux-mm/Y%2FblF0GIunm+pRIC@casper.infradead.org/
Zi Yan (7):
mm/memcg: use order instead of nr in split_page_memcg()
mm/page_owner: use order instead of nr in split_page_owner()
mm: memcg: make memcg huge page split support any order split.
mm: page_owner: add support for splitting to any order in split
page_owner.
mm: thp: split huge page to any lower order pages.
mm: truncate: split huge page cache page to a non-zero order if
possible.
mm: huge_memory: enable debugfs to split huge pages to any order.
include/linux/huge_mm.h | 10 +-
include/linux/memcontrol.h | 4 +-
include/linux/page_owner.h | 10 +-
mm/huge_memory.c | 137 ++++++++---
mm/memcontrol.c | 10 +-
mm/page_alloc.c | 8 +-
mm/page_owner.c | 8 +-
mm/truncate.c | 21 +-
.../selftests/mm/split_huge_page_test.c | 225 +++++++++++++++++-
9 files changed, 365 insertions(+), 68 deletions(-)
--
2.39.2