From: Dmitry Vyukov <dvyukov(a)google.com>
POSIX timers using the CLOCK_PROCESS_CPUTIME_ID clock prefer the main
thread of a thread group for signal delivery. However, this has a
significant downside: it requires waking up a potentially idle thread.
Instead, prefer to deliver signals to the current thread (in the same
thread group) if SIGEV_THREAD_ID is not set by the user. This does not
change guaranteed semantics, since POSIX process CPU time timers have
never guaranteed that signal delivery is to a specific thread (without
SIGEV_THREAD_ID set).
The effect is that we no longer wake up potentially idle threads, and
the kernel is no longer biased towards delivering the timer signal to
any particular thread (which better distributes the timer signals,
especially when multiple timers fire concurrently).
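For illustration only (not part of this patch), a minimal userspace
sketch of the affected case: a process CPU time timer created without
SIGEV_THREAD_ID, where POSIX leaves the delivery thread unspecified:

#include <signal.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
        struct sigevent sev = { 0 };
        timer_t tid;

        /*
         * SIGEV_SIGNAL without SIGEV_THREAD_ID: any thread in the
         * group may receive the signal; with this change, the kernel
         * prefers whichever thread is currently running.
         */
        sev.sigev_notify = SIGEV_SIGNAL;
        sev.sigev_signo = SIGALRM;

        if (timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &tid)) {
                perror("timer_create");
                return 1;
        }
        /* ... arm with timer_settime() and consume CPU time ... */
        return 0;
}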
Signed-off-by: Dmitry Vyukov <dvyukov(a)google.com>
Suggested-by: Oleg Nesterov <oleg(a)redhat.com>
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Signed-off-by: Marco Elver <elver(a)google.com>
---
v6:
- Split test from this patch.
- Update wording on what this patch aims to improve.
v5:
- Rebased onto v6.2.
v4:
- Restructured checks in send_sigqueue() as suggested.
v3:
- Switched to a completely different (much simpler) implementation
  based on Oleg's idea.
RFC v2:
- Added additional Cc as Thomas asked.
---
kernel/signal.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 8cb28f1df294..605445fa27d4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1003,8 +1003,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
/*
* Now find a thread we can wake up to take the signal off the queue.
*
- * If the main thread wants the signal, it gets first crack.
- * Probably the least surprising to the average bear.
+ * Try the suggested task first (may or may not be the main thread).
*/
if (wants_signal(sig, p))
t = p;
@@ -1970,8 +1969,23 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
ret = -1;
rcu_read_lock();
+ /*
+ * This function is used by POSIX timers to deliver a timer signal.
+ * Where type is PIDTYPE_PID (such as for timers with SIGEV_THREAD_ID
+ * set), the signal must be delivered to the specific thread (queues
+ * into t->pending).
+ *
+ * Where type is not PIDTYPE_PID, signals must just be delivered to the
+ * current process. In this case, prefer to deliver to current if it is
+ * in the same thread group as the target, as it avoids unnecessarily
+ * waking up a potentially idle task.
+ */
t = pid_task(pid, type);
- if (!t || !likely(lock_task_sighand(t, &flags)))
+ if (!t)
+ goto ret;
+ if (type != PIDTYPE_PID && same_thread_group(t, current))
+ t = current;
+ if (!likely(lock_task_sighand(t, &flags)))
goto ret;
ret = 1; /* the signal is ignored */
@@ -1993,6 +2007,11 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
q->info.si_overrun = 0;
signalfd_notify(t, sig);
+ /*
+ * If the type is not PIDTYPE_PID, we just use shared_pending, which
+ * won't guarantee that the specified task will receive the signal, but
+ * is sufficient if t==current in the common case.
+ */
pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
list_add_tail(&q->list, &pending->list);
sigaddset(&pending->signal, sig);
--
2.40.0.rc1.284.g88254d51c5-goog
From: Zi Yan <ziy(a)nvidia.com>
Hi all,
File folios support any order and there is interest in supporting flexible
orders for anonymous folios[1] too. Currently, split_huge_page() only splits
a huge page to order-0 pages, but splitting to orders higher than 0 is also
useful. This patchset adds support for splitting a huge page to any lower
order and uses it during file folio truncate operations.
The patchset is on top of mm-everything-2023-03-27-21-20.
Changelog
===
Since v2
---
1. Fixed an issue in __split_page_owner() introduced during my rebase
Since v1
---
1. Changed split_page_memcg() and split_page_owner() parameter to use order
2. Used folio_test_pmd_mappable() in place of the equivalent code
Details
===
* Patch 1 changes split_page_memcg() to use order instead of nr_pages
* Patch 2 changes split_page_owner() to use order instead of nr_pages
* Patches 3 and 4 add a new_order parameter to split_page_memcg() and
  split_page_owner() and prepare for the upcoming changes.
* Patch 5 adds split_huge_page_to_list_to_order() to split a huge page
  to any lower order (a usage sketch follows this list). The original
  split_huge_page_to_list() calls split_huge_page_to_list_to_order()
  with new_order = 0.
* Patch 6 uses split_huge_page_to_list_to_order() in large pagecache folio
  truncation instead of splitting the large folio all the way down to
  order-0.
* Patch 7 adds a test API to debugfs and test cases in
  split_huge_page_test selftests.
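For illustration, a hypothetical caller of the new interface; the
function name and new_order semantics come from patch 5, but the
wrapper below is a sketch, not code from this series (the signature is
assumed to take the page, an optional list, and the target order):

#include <linux/huge_mm.h>

/*
 * Sketch only: split a large folio down to order-2 chunks instead of
 * all the way to order-0. As with split_huge_page_to_list(), the
 * folio is expected to be locked by the caller.
 */
static int split_folio_to_order_2(struct folio *folio)
{
        /* new_order = 0 would reproduce the old behavior. */
        return split_huge_page_to_list_to_order(&folio->page, NULL, 2);
}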
Comments and/or suggestions are welcome.
[1] https://lore.kernel.org/linux-mm/Y%2FblF0GIunm+pRIC@casper.infradead.org/
Zi Yan (7):
mm/memcg: use order instead of nr in split_page_memcg()
mm/page_owner: use order instead of nr in split_page_owner()
mm: memcg: make memcg huge page split support any order split.
mm: page_owner: add support for splitting to any order in split
page_owner.
mm: thp: split huge page to any lower order pages.
mm: truncate: split huge page cache page to a non-zero order if
possible.
mm: huge_memory: enable debugfs to split huge pages to any order.
include/linux/huge_mm.h | 10 +-
include/linux/memcontrol.h | 4 +-
include/linux/page_owner.h | 10 +-
mm/huge_memory.c | 137 ++++++++---
mm/memcontrol.c | 10 +-
mm/page_alloc.c | 8 +-
mm/page_owner.c | 8 +-
mm/truncate.c | 21 +-
.../selftests/mm/split_huge_page_test.c | 225 +++++++++++++++++-
9 files changed, 365 insertions(+), 68 deletions(-)
--
2.39.2
Hi Reinette, Fenghua,
This series introduces a new mount option enabling an alternate mode for
MBM to work around an issue on present AMD implementations and any other
resctrl implementation where there are more RMIDs (or equivalent) than
hardware counters.
The L3 External Bandwidth Monitoring feature of the AMD PQoS
extension[1] only guarantees that RMIDs currently assigned to a
processor will be tracked by hardware. The counters of any other RMIDs
which are no longer being tracked will be reset to zero. The MBM event
counters return "Unavailable" to indicate when this has happened.
An interval for effectively measuring memory bandwidth typically needs
to be multiple seconds long. In Google's workloads, it is not feasible
to bound the number of jobs with different RMIDs which will run in a
cache domain over any period of time. Consequently, on a
fully-committed system where all RMIDs are allocated, few groups'
counters return non-zero values.
To demonstrate the underlying issue, the first patch provides a test
case in tools/testing/selftests/resctrl/test_rmids.sh.
On an AMD EPYC 7B12 64-Core Processor with the default behavior:
# ./test_rmids.sh
Created 255 monitoring groups.
g1: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g2: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g3: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
[..]
g238: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g239: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g240: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g241: mbm_total_bytes: Unavailable -> 660497472
g242: mbm_total_bytes: Unavailable -> 660793344
g243: mbm_total_bytes: Unavailable -> 660477312
g244: mbm_total_bytes: Unavailable -> 660495360
g245: mbm_total_bytes: Unavailable -> 660775360
g246: mbm_total_bytes: Unavailable -> 660645504
g247: mbm_total_bytes: Unavailable -> 660696128
g248: mbm_total_bytes: Unavailable -> 660605248
g249: mbm_total_bytes: Unavailable -> 660681280
g250: mbm_total_bytes: Unavailable -> 660834240
g251: mbm_total_bytes: Unavailable -> 660440064
g252: mbm_total_bytes: Unavailable -> 660501504
g253: mbm_total_bytes: Unavailable -> 660590720
g254: mbm_total_bytes: Unavailable -> 660548352
g255: mbm_total_bytes: Unavailable -> 660607296
255 groups, 0 returned counts in first pass, 15 in second
successfully measured bandwidth from 15/255 groups
To compare, here is the output from an Intel(R) Xeon(R) Platinum 8173M
CPU:
# ./test_rmids.sh
Created 223 monitoring groups.
g1: mbm_total_bytes: 0 -> 606126080
g2: mbm_total_bytes: 0 -> 613236736
g3: mbm_total_bytes: 0 -> 610254848
[..]
g221: mbm_total_bytes: 0 -> 584679424
g222: mbm_total_bytes: 0 -> 588808192
g223: mbm_total_bytes: 0 -> 587317248
223 groups, 223 returned counts in first pass, 223 in second
successfully measured bandwidth from 223/223 groups
To make better use of the hardware in such a use case, this patchset
introduces a "soft" RMID implementation, where each CPU is permanently
assigned a "hard" RMID. On a context switch which changes the current
soft RMID, the difference between the CPU's current event counts and the
counts recorded at the previous switch is added to the totals for the
outgoing soft RMID.
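To illustrate the accounting, here is a simplified, self-contained
sketch. resctrl_mbm_flush_cpu() is the helper name used by this series,
but the structures and bodies below are assumptions for illustration,
not the patch code, and read_hard_rmid_count() is a hypothetical
stand-in for reading this CPU's hard-RMID MBM counter:

#include <stdint.h>

struct cpu_mbm_state {
        uint64_t prev_bytes;    /* counter value at the last flush */
        uint32_t soft_rmid;     /* soft RMID this CPU is accounting */
};

/* Hypothetical stand-in for reading this CPU's hard-RMID MBM counter. */
extern uint64_t read_hard_rmid_count(void);

/*
 * Fold the bytes counted since the last flush into the outgoing soft
 * RMID's total, then start a new accounting window (the idea behind
 * resctrl_mbm_flush_cpu(), not its actual code).
 */
static void flush_cpu_mbm(struct cpu_mbm_state *s, uint64_t *totals)
{
        uint64_t now = read_hard_rmid_count();

        totals[s->soft_rmid] += now - s->prev_bytes;
        s->prev_bytes = now;
}

/*
 * Called on context switch when the incoming task's soft RMID differs
 * from the one this CPU is currently accounting against.
 */
static void switch_soft_rmid(struct cpu_mbm_state *s, uint64_t *totals,
                             uint32_t new_soft_rmid)
{
        if (s->soft_rmid != new_soft_rmid) {
                flush_cpu_mbm(s, totals);
                s->soft_rmid = new_soft_rmid;
        }
}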
This technique does not work for cache occupancy counters, so this patch
series disables cache occupancy events when soft RMIDs are enabled.
This series adds the "mbm_soft_rmid" mount option to allow users to
opt in to the functionality when they deem it helpful.
When the same system from the earlier AMD example enables the
mbm_soft_rmid mount option:
# ./test_rmids.sh
Created 255 monitoring groups.
g1: mbm_total_bytes: 0 -> 686560576
g2: mbm_total_bytes: 0 -> 668204416
[..]
g252: mbm_total_bytes: 0 -> 672651200
g253: mbm_total_bytes: 0 -> 666956800
g254: mbm_total_bytes: 0 -> 665917056
g255: mbm_total_bytes: 0 -> 671049600
255 groups, 255 returned counts in first pass, 255 in second
successfully measured bandwidth from 255/255 groups
(patches are based on tip/master)
[1] https://www.amd.com/system/files/TechDocs/56375_1.03_PUB.pdf
Peter Newman (8):
selftests/resctrl: Verify all RMIDs count together
x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
x86/resctrl: Flush MBM event counts on soft RMID change
x86/resctrl: Call mon_event_count() directly for soft RMIDs
x86/resctrl: Create soft RMID version of __mon_event_count()
x86/resctrl: Assign HW RMIDs to CPUs for soft RMID
x86/resctrl: Use mbm_update() to push soft RMID counts
x86/resctrl: Add mount option to enable soft RMID
Stephane Eranian (1):
x86/resctrl: Hold a spinlock in __rmid_read() on AMD
arch/x86/include/asm/resctrl.h | 29 +++-
arch/x86/kernel/cpu/resctrl/core.c | 80 ++++++++-
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 9 +-
arch/x86/kernel/cpu/resctrl/internal.h | 19 ++-
arch/x86/kernel/cpu/resctrl/monitor.c | 158 +++++++++++++++++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 52 ++++++
tools/testing/selftests/resctrl/test_rmids.sh | 93 +++++++++++
7 files changed, 425 insertions(+), 15 deletions(-)
create mode 100755 tools/testing/selftests/resctrl/test_rmids.sh
base-commit: dd806e2f030e57dd5bac973372aa252b6c175b73
--
2.40.0.634.g4ca3ef3211-goog
'available_events' is not actually required by
'test.d/event/toplevel-enable.tc', and its existence is already tested in
'test.d/00basic/basic4.tc'.
So the 'available_events' requirement can be dropped, which then allows
adding the 'instance' flag so that 'test.d/event/toplevel-enable.tc' also
runs in a tracing instance.
Test results are shown below:
# ./ftracetest test.d/event/toplevel-enable.tc
=== Ftrace unit tests ===
[1] event tracing - enable/disable with top level files [PASS]
[2] (instance) event tracing - enable/disable with top level files [PASS]
# of passed: 2
# of failed: 0
# of unresolved: 0
# of untested: 0
# of unsupported: 0
# of xfailed: 0
# of undefined(test bug): 0
Signed-off-by: Zheng Yejian <zhengyejian1(a)huawei.com>
---
tools/testing/selftests/ftrace/test.d/event/toplevel-enable.tc | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ftrace/test.d/event/toplevel-enable.tc b/tools/testing/selftests/ftrace/test.d/event/toplevel-enable.tc
index 93c10ea42a68..8b8e1aea985b 100644
--- a/tools/testing/selftests/ftrace/test.d/event/toplevel-enable.tc
+++ b/tools/testing/selftests/ftrace/test.d/event/toplevel-enable.tc
@@ -1,7 +1,8 @@
#!/bin/sh
# SPDX-License-Identifier: GPL-2.0
# description: event tracing - enable/disable with top level files
-# requires: available_events set_event events/enable
+# requires: set_event events/enable
+# flags: instance
do_reset() {
echo > set_event
--
2.25.1
Hi All,
In a TDX guest, the attestation process is used to verify the TDX
guest's trustworthiness to other entities before provisioning secrets
to the guest.
The TDX guest attestation process consists of two steps:
1. TDREPORT generation
2. Quote generation.
The first step (TDREPORT generation) involves getting the TDX guest
measurement data in the format of a TDREPORT, which is further used to
validate the authenticity of the TDX guest. The second step involves
sending the TDREPORT to a Quoting Enclave (QE) server to generate a
remotely verifiable Quote. A TDREPORT by design can only be verified on
the local platform. To support remote verification of the TDREPORT,
TDX leverages the Intel SGX Quoting Enclave to verify the TDREPORT
locally and convert it to a remotely verifiable Quote. Although
attestation software can use communication methods like TCP/IP or
vsock to send the TDREPORT to the QE, not all platforms support these
communication models. So the TDX GHCI specification [1] defines a method
for Quote generation via hypercalls. Please check the discussions from
Google [2] and Alibaba [3], which clarify the need for hypercall-based
Quote generation support. This patch set adds that support.
Support for TDREPORT generation already exists in the TDX guest driver.
This patchset extends the same driver to add the Quote generation
support.
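As a rough illustration of how userspace might drive the new interface,
here is a hypothetical sketch. The device path, the ioctl name, and the
request layout below are assumptions based on this cover letter; the
real definitions are in the UAPI header added by patch 2:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical request layout: a buffer that holds the TDREPORT on
 * input and receives the Quote on output. The ioctl number below is a
 * placeholder for illustration only.
 */
struct tdx_quote_req {
        unsigned long buf;      /* userspace buffer address */
        unsigned long len;      /* buffer length */
};

#define TDX_CMD_GET_QUOTE _IOR('T', 2, struct tdx_quote_req) /* placeholder */

int get_quote_sketch(void *buf, unsigned long len)
{
        struct tdx_quote_req req = { (unsigned long)buf, len };
        int fd, ret;

        fd = open("/dev/tdx_guest", O_RDWR);
        if (fd < 0)
                return -1;
        ret = ioctl(fd, TDX_CMD_GET_QUOTE, &req);
        close(fd);
        return ret;
}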
Following are the details of the patch set:
Patch 1/3 -> Adds event notification IRQ support.
Patch 2/3 -> Adds Quote generation support.
Patch 3/3 -> Adds selftest support for Quote generation feature.
[1] https://cdrdv2.intel.com/v1/dl/getContent/726790, section titled "TDG.VP.VMCALL<GetQuote>".
[2] https://lore.kernel.org/lkml/CAAYXXYxxs2zy_978GJDwKfX5Hud503gPc8=1kQ-+JwG_k…
[3] https://lore.kernel.org/lkml/a69faebb-11e8-b386-d591-dbd08330b008@linux.ali…
Kuppuswamy Sathyanarayanan (3):
x86/tdx: Add TDX Guest event notify interrupt support
virt: tdx-guest: Add Quote generation support
selftests/tdx: Test GetQuote TDX attestation feature
Documentation/virt/coco/tdx-guest.rst | 11 ++
arch/x86/coco/tdx/tdx.c | 194 +++++++++++++++++++
arch/x86/include/asm/tdx.h | 8 +
drivers/virt/coco/tdx-guest/tdx-guest.c | 175 ++++++++++++++++-
include/uapi/linux/tdx-guest.h | 44 +++++
tools/testing/selftests/tdx/tdx_guest_test.c | 65 ++++++-
6 files changed, 490 insertions(+), 7 deletions(-)
--
2.34.1
From: Zi Yan <ziy(a)nvidia.com>
Hi all,
File folios support any order and there is interest in supporting flexible
orders for anonymous folios[1] too. Currently, split_huge_page() only splits
a huge page to order-0 pages, but splitting to orders higher than 0 is also
useful. This patchset adds support for splitting a huge page to any lower
order and uses it during folio truncate operations.
The patchset is on top of mm-everything-2023-03-27-21-20.
Changelog from v1
===
1. Changed split_page_memcg() and split_page_owner() parameter to use order
2. Used folio_test_pmd_mappable() in place of the equivalent code
Details
===
* Patch 1 changes split_page_memcg() to use order instead of nr_pages
* Patch 2 changes split_page_owner() to use order instead of nr_pages
* Patches 3 and 4 add a new_order parameter to split_page_memcg() and
  split_page_owner() and prepare for the upcoming changes.
* Patch 5 adds split_huge_page_to_list_to_order() to split a huge page
  to any lower order. The original split_huge_page_to_list() calls
  split_huge_page_to_list_to_order() with new_order = 0.
* Patch 6 uses split_huge_page_to_list_to_order() in large pagecache folio
  truncation instead of splitting the large folio all the way down to
  order-0.
* Patch 7 adds a test API to debugfs and test cases in
  split_huge_page_test selftests.
Comments and/or suggestions are welcome.
[1] https://lore.kernel.org/linux-mm/Y%2FblF0GIunm+pRIC@casper.infradead.org/
Zi Yan (7):
mm/memcg: use order instead of nr in split_page_memcg()
mm/page_owner: use order instead of nr in split_page_owner()
mm: memcg: make memcg huge page split support any order split.
mm: page_owner: add support for splitting to any order in split
page_owner.
mm: thp: split huge page to any lower order pages.
mm: truncate: split huge page cache page to a non-zero order if
possible.
mm: huge_memory: enable debugfs to split huge pages to any order.
include/linux/huge_mm.h | 10 +-
include/linux/memcontrol.h | 4 +-
include/linux/page_owner.h | 10 +-
mm/huge_memory.c | 137 ++++++++---
mm/memcontrol.c | 10 +-
mm/page_alloc.c | 8 +-
mm/page_owner.c | 10 +-
mm/truncate.c | 21 +-
.../selftests/mm/split_huge_page_test.c | 225 +++++++++++++++++-
9 files changed, 366 insertions(+), 69 deletions(-)
--
2.39.2
Hi all,
the following patch tries to work around an error on ppc64le, where the
zram01.sh LTP test (there is also the kernel selftest
tools/testing/selftests/zram/zram01.sh, but the LTP test got further
updates) often reports mem_used_total 0 although zram is already filled.
The patch repeatedly reads /sys/block/zram*/mm_stat for up to 1 sec,
waiting for mem_used_total > 0 (see the sketch below). The question is
whether this is expected behavior which should be worked around, or a
bug which should be fixed.
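The actual workaround is shell code inside zram01.sh; purely as an
illustration of the polling logic, here is a C sketch (mem_used_total
is the third field of mm_stat; the retry count and delay are
assumptions):

#include <stdio.h>
#include <time.h>

/*
 * Illustrative C version of the polling logic: re-read mm_stat for up
 * to ~1 second until mem_used_total (the third field) becomes
 * non-zero. Returns the value seen, 0 on timeout, -1 on error.
 */
static long long poll_mem_used_total(const char *path)
{
        struct timespec delay = { 0, 50 * 1000 * 1000 };        /* 50 ms */
        int i;

        for (i = 0; i < 20; i++) {
                long long orig, compr, mem_used = 0;
                FILE *f = fopen(path, "r");

                if (!f)
                        return -1;
                if (fscanf(f, "%lld %lld %lld", &orig, &compr,
                           &mem_used) != 3)
                        mem_used = 0;
                fclose(f);
                if (mem_used > 0)
                        return mem_used;
                nanosleep(&delay, NULL);
        }
        return 0;
}

/* Usage: poll_mem_used_total("/sys/block/zram0/mm_stat"); */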
REPRODUCE THE ISSUE
Quickest way to install only zram tests and their dependencies:
make autotools && ./configure && for i in testcases/lib/ testcases/kernel/device-drivers/zram/; do cd $i && make -j$(getconf _NPROCESSORS_ONLN) && make install && cd -; done
Run the test (only on vfat)
PATH="/opt/ltp/testcases/bin:$PATH" LTP_SINGLE_FS_TYPE=vfat zram01.sh
Petr Vorel (1):
zram01.sh: Workaround division by 0 on vfat on ppc64le
.../kernel/device-drivers/zram/zram01.sh | 27 +++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)
--
2.38.0