Changes in v2:
* Dropped patches which were pulled into maintainer trees.
* Split BPF patches out into another series targeting bpf-next.
* trace-agent now falls back to debugfs if tracefs isn't present.
* Added Acked-by from mst(a)redhat.com to series.
* Added a typo fixup for the virtio-trace README.
Steven, assuming there are no objections, would you feel comfortable
taking this series through your tree?
---
The canonical location for the tracefs filesystem is at /sys/kernel/tracing.
But, from Documentation/trace/ftrace.rst:
Before 4.1, all ftrace tracing control files were within the debugfs
file system, which is typically located at /sys/kernel/debug/tracing.
For backward compatibility, when mounting the debugfs file system,
the tracefs file system will be automatically mounted at:
/sys/kernel/debug/tracing
There are many places where this older debugfs path is still used in
code comments, selftests, examples and tools, so let's update them to
avoid confusion.
I've broken up the series as best I could by maintainer or directory,
and I've only sent people the patches that I think they care about to
avoid spamming everyone.
Ross Zwisler (6):
tracing: always use canonical ftrace path
selftests: use canonical ftrace path
leaking_addresses: also skip canonical ftrace path
tools/kvm_stat: use canonical ftrace path
tools/virtio: use canonical ftrace path
tools/virtio: fix typo in README instructions
include/linux/kernel.h | 2 +-
include/linux/tracepoint.h | 4 ++--
kernel/trace/Kconfig | 20 +++++++++----------
kernel/trace/kprobe_event_gen_test.c | 2 +-
kernel/trace/ring_buffer.c | 2 +-
kernel/trace/synth_event_gen_test.c | 2 +-
kernel/trace/trace.c | 2 +-
samples/user_events/example.c | 4 ++--
scripts/leaking_addresses.pl | 1 +
scripts/tracing/draw_functrace.py | 6 +++---
tools/kvm/kvm_stat/kvm_stat | 2 +-
tools/lib/api/fs/tracing_path.c | 4 ++--
.../testing/selftests/user_events/dyn_test.c | 2 +-
.../selftests/user_events/ftrace_test.c | 10 +++++-----
.../testing/selftests/user_events/perf_test.c | 8 ++++----
tools/testing/selftests/vm/protection_keys.c | 4 ++--
tools/tracing/latency/latency-collector.c | 2 +-
tools/virtio/virtio-trace/README | 4 ++--
tools/virtio/virtio-trace/trace-agent.c | 12 +++++++----
19 files changed, 49 insertions(+), 44 deletions(-)
--
2.39.1.637.g21b0678d19-goog
Currently, anonymous PTE-mapped THPs cannot be collapsed in-place:
collapsing (e.g., via MADV_COLLAPSE) implies allocating a fresh THP and
mapping that new THP via a PMD: as it's a fresh anon THP, it will get the
exclusive flag set on the head page and everybody is happy.
However, if the kernel would ever support in-place collapse of anonymous
THPs (replacing a page table mapping each sub-page of a THP via PTEs with a
single PMD mapping the complete THP), exclusivity information stored for
each sub-page would have to be collapsed accordingly:
(1) All PTEs map !exclusive anon sub-pages: the in-place collapsed THP
must not not have the exclusive flag set on the head page mapped by
the PMD. This is the easiest case to handle ("simply don't set any
exclusive flags").
(2) All PTEs map exclusive anon sub-pages: when collapsing, we have to
clear the exclusive flag from all tail pages and only leave the
exclusive flag set for the head page. Otherwise, fork() after
collapse would not clear the exclusive flags from the tail pages
and we'd be in trouble once PTE-mapping the shared THP when writing
to shared tail pages that still have the exclusive flag set. This
would effectively revert what the PTE-mapping code does when
propagating the exclusive flag to all sub-pages.
(3) PTEs map a mixture of exclusive and !exclusive anon sub-pages (can
happen e.g., due to MADV_DONTFORK before fork()). We must not
collapse the THP in-place, otherwise bad things may happen:
the exclusive flags of sub-pages would get ignored and the
exclusive flag of the head page would get used instead.
Now that we have MADV_COLLAPSE in place to trigger collapsing a THP,
let's add some test cases that would bail out early, if we'd
voluntarily/accidantially unlock in-place collapse for anon THPs and
forget about taking proper care of exclusive flags.
Running the test on a kernel with MADV_COLLAPSE support:
# [INFO] Anonymous THP tests
# [RUN] Basic COW after fork() when collapsing before fork()
ok 169 No leak from parent into child
# [RUN] Basic COW after fork() when collapsing after fork() (fully shared)
ok 170 # SKIP MADV_COLLAPSE failed: Invalid argument
# [RUN] Basic COW after fork() when collapsing after fork() (lower shared)
ok 171 No leak from parent into child
# [RUN] Basic COW after fork() when collapsing after fork() (upper shared)
ok 172 No leak from parent into child
For now, MADV_COLLAPSE always seems to fail if all PTEs map shared
sub-pages.
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Zach O'Keefe <zokeefe(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Signed-off-by: David Hildenbrand <david(a)redhat.com>
---
A patch from Hugh made me explore the wonderful world of in-place collapse
of THP, and I was briefly concerned that it would apply to anon THP as
well. After thinking about it a bit, I decided to add test cases, to better
be safe than sorry in any case, and to document how PG_anon_exclusive is to
be handled in that case.
---
tools/testing/selftests/vm/cow.c | 228 +++++++++++++++++++++++++++++++
1 file changed, 228 insertions(+)
diff --git a/tools/testing/selftests/vm/cow.c b/tools/testing/selftests/vm/cow.c
index 26f6ea3079e2..16216d893d96 100644
--- a/tools/testing/selftests/vm/cow.c
+++ b/tools/testing/selftests/vm/cow.c
@@ -30,6 +30,10 @@
#include "../kselftest.h"
#include "vm_util.h"
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
+
static size_t pagesize;
static int pagemap_fd;
static size_t thpsize;
@@ -1178,6 +1182,228 @@ static int tests_per_anon_test_case(void)
return tests;
}
+enum anon_thp_collapse_test {
+ ANON_THP_COLLAPSE_UNSHARED,
+ ANON_THP_COLLAPSE_FULLY_SHARED,
+ ANON_THP_COLLAPSE_LOWER_SHARED,
+ ANON_THP_COLLAPSE_UPPER_SHARED,
+};
+
+static void do_test_anon_thp_collapse(char *mem, size_t size,
+ enum anon_thp_collapse_test test)
+{
+ struct comm_pipes comm_pipes;
+ char buf;
+ int ret;
+
+ ret = setup_comm_pipes(&comm_pipes);
+ if (ret) {
+ ksft_test_result_fail("pipe() failed\n");
+ return;
+ }
+
+ /*
+ * Trigger PTE-mapping the THP by temporarily mapping a single subpage
+ * R/O, such that we can try collapsing it later.
+ */
+ ret = mprotect(mem + pagesize, pagesize, PROT_READ);
+ if (ret) {
+ ksft_test_result_fail("mprotect() failed\n");
+ goto close_comm_pipes;
+ }
+ ret = mprotect(mem + pagesize, pagesize, PROT_READ | PROT_WRITE);
+ if (ret) {
+ ksft_test_result_fail("mprotect() failed\n");
+ goto close_comm_pipes;
+ }
+
+ switch (test) {
+ case ANON_THP_COLLAPSE_UNSHARED:
+ /* Collapse before actually COW-sharing the page. */
+ ret = madvise(mem, size, MADV_COLLAPSE);
+ if (ret) {
+ ksft_test_result_skip("MADV_COLLAPSE failed: %s\n",
+ strerror(errno));
+ goto close_comm_pipes;
+ }
+ break;
+ case ANON_THP_COLLAPSE_FULLY_SHARED:
+ /* COW-share the full PTE-mapped THP. */
+ break;
+ case ANON_THP_COLLAPSE_LOWER_SHARED:
+ /* Don't COW-share the upper part of the THP. */
+ ret = madvise(mem + size / 2, size / 2, MADV_DONTFORK);
+ if (ret) {
+ ksft_test_result_fail("MADV_DONTFORK failed\n");
+ goto close_comm_pipes;
+ }
+ break;
+ case ANON_THP_COLLAPSE_UPPER_SHARED:
+ /* Don't COW-share the lower part of the THP. */
+ ret = madvise(mem, size / 2, MADV_DONTFORK);
+ if (ret) {
+ ksft_test_result_fail("MADV_DONTFORK failed\n");
+ goto close_comm_pipes;
+ }
+ break;
+ default:
+ assert(false);
+ }
+
+ ret = fork();
+ if (ret < 0) {
+ ksft_test_result_fail("fork() failed\n");
+ goto close_comm_pipes;
+ } else if (!ret) {
+ switch (test) {
+ case ANON_THP_COLLAPSE_UNSHARED:
+ case ANON_THP_COLLAPSE_FULLY_SHARED:
+ exit(child_memcmp_fn(mem, size, &comm_pipes));
+ break;
+ case ANON_THP_COLLAPSE_LOWER_SHARED:
+ exit(child_memcmp_fn(mem, size / 2, &comm_pipes));
+ break;
+ case ANON_THP_COLLAPSE_UPPER_SHARED:
+ exit(child_memcmp_fn(mem + size / 2, size / 2,
+ &comm_pipes));
+ break;
+ default:
+ assert(false);
+ }
+ }
+
+ while (read(comm_pipes.child_ready[0], &buf, 1) != 1)
+ ;
+
+ switch (test) {
+ case ANON_THP_COLLAPSE_UNSHARED:
+ break;
+ case ANON_THP_COLLAPSE_UPPER_SHARED:
+ case ANON_THP_COLLAPSE_LOWER_SHARED:
+ /*
+ * Revert MADV_DONTFORK such that we merge the VMAs and are
+ * able to actually collapse.
+ */
+ ret = madvise(mem, size, MADV_DOFORK);
+ if (ret) {
+ ksft_test_result_fail("MADV_DOFORK failed\n");
+ write(comm_pipes.parent_ready[1], "0", 1);
+ wait(&ret);
+ goto close_comm_pipes;
+ }
+ /* FALLTHROUGH */
+ case ANON_THP_COLLAPSE_FULLY_SHARED:
+ /* Collapse before anyone modified the COW-shared page. */
+ ret = madvise(mem, size, MADV_COLLAPSE);
+ if (ret) {
+ ksft_test_result_skip("MADV_COLLAPSE failed: %s\n",
+ strerror(errno));
+ write(comm_pipes.parent_ready[1], "0", 1);
+ wait(&ret);
+ goto close_comm_pipes;
+ }
+ break;
+ default:
+ assert(false);
+ }
+
+ /* Modify the page. */
+ memset(mem, 0xff, size);
+ write(comm_pipes.parent_ready[1], "0", 1);
+
+ wait(&ret);
+ if (WIFEXITED(ret))
+ ret = WEXITSTATUS(ret);
+ else
+ ret = -EINVAL;
+
+ ksft_test_result(!ret, "No leak from parent into child\n");
+close_comm_pipes:
+ close_comm_pipes(&comm_pipes);
+}
+
+static void test_anon_thp_collapse_unshared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_UNSHARED);
+}
+
+static void test_anon_thp_collapse_fully_shared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_FULLY_SHARED);
+}
+
+static void test_anon_thp_collapse_lower_shared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_LOWER_SHARED);
+}
+
+static void test_anon_thp_collapse_upper_shared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_UPPER_SHARED);
+}
+
+/*
+ * Test cases that are specific to anonymous THP: pages in private mappings
+ * that may get shared via COW during fork().
+ */
+static const struct test_case anon_thp_test_cases[] = {
+ /*
+ * Basic COW test for fork() without any GUP when collapsing a THP
+ * before fork().
+ *
+ * Re-mapping a PTE-mapped anon THP using a single PMD ("in-place
+ * collapse") might easily get COW handling wrong when not collapsing
+ * exclusivity information properly.
+ */
+ {
+ "Basic COW after fork() when collapsing before fork()",
+ test_anon_thp_collapse_unshared,
+ },
+ /* Basic COW test, but collapse after COW-sharing a full THP. */
+ {
+ "Basic COW after fork() when collapsing after fork() (fully shared)",
+ test_anon_thp_collapse_fully_shared,
+ },
+ /*
+ * Basic COW test, but collapse after COW-sharing the lower half of a
+ * THP.
+ */
+ {
+ "Basic COW after fork() when collapsing after fork() (lower shared)",
+ test_anon_thp_collapse_lower_shared,
+ },
+ /*
+ * Basic COW test, but collapse after COW-sharing the upper half of a
+ * THP.
+ */
+ {
+ "Basic COW after fork() when collapsing after fork() (upper shared)",
+ test_anon_thp_collapse_upper_shared,
+ },
+};
+
+static void run_anon_thp_test_cases(void)
+{
+ int i;
+
+ if (!thpsize)
+ return;
+
+ ksft_print_msg("[INFO] Anonymous THP tests\n");
+
+ for (i = 0; i < ARRAY_SIZE(anon_thp_test_cases); i++) {
+ struct test_case const *test_case = &anon_thp_test_cases[i];
+
+ ksft_print_msg("[RUN] %s\n", test_case->desc);
+ do_run_with_thp(test_case->fn, THP_RUN_PMD);
+ }
+}
+
+static int tests_per_anon_thp_test_case(void)
+{
+ return thpsize ? 1 : 0;
+}
+
typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size);
static void test_cow(char *mem, const char *smem, size_t size)
@@ -1518,6 +1744,7 @@ int main(int argc, char **argv)
ksft_print_header();
ksft_set_plan(ARRAY_SIZE(anon_test_cases) * tests_per_anon_test_case() +
+ ARRAY_SIZE(anon_thp_test_cases) * tests_per_anon_thp_test_case() +
ARRAY_SIZE(non_anon_test_cases) * tests_per_non_anon_test_case());
gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
@@ -1526,6 +1753,7 @@ int main(int argc, char **argv)
ksft_exit_fail_msg("opening pagemap failed\n");
run_anon_test_cases();
+ run_anon_thp_test_cases();
run_non_anon_test_cases();
err = ksft_get_fail_cnt();
--
2.39.0
This is the basic functionality for iommufd to support
iommufd_device_replace() and IOMMU_HWPT_ALLOC for physical devices.
iommufd_device_replace() allows changing the HWPT associated with the
device to a new IOAS or HWPT. Replace does this in way that failure leaves
things unchanged, and utilizes the iommu iommu_group_replace_domain() API
to allow the iommu driver to perform an optional non-disruptive change.
IOMMU_HWPT_ALLOC allows HWPTs to be explicitly allocated by the user and
used by attach or replace. At this point it isn't very useful since the
HWPT is the same as the automatically managed HWPT from the IOAS. However
a following series will allow userspace to customize the created HWPT.
The implementation is complicated because we have to introduce some
per-iommu_group memory in iommufd and redo how we think about multi-device
groups to be more explicit. This solves all the locking problems in the
prior attempts.
This series is infrastructure work for the following series which:
- Add replace for attach
- Expose replace through VFIO APIs
- Implement driver parameters for HWPT creation (nesting)
Once review of this is complete I will keep it on a side branch and
accumulate the following series when they are ready so we can have a
stable base and make more incremental progress. When we have all the parts
together to get a full implementation it can go to Linus.
I have this on github:
https://github.com/jgunthorpe/linux/commits/iommufd_hwpt
Jason Gunthorpe (12):
iommufd: Move isolated msi enforcement to iommufd_device_bind()
iommufd: Add iommufd_group
iommufd: Replace the hwpt->devices list with iommufd_group
iommufd: Use the iommufd_group to avoid duplicate reserved groups and
msi setup
iommufd: Make sw_msi_start a group global
iommufd: Move putting a hwpt to a helper function
iommufd: Add enforced_cache_coherency to iommufd_hw_pagetable_alloc()
iommufd: Add iommufd_device_replace()
iommufd: Make destroy_rwsem use a lock class per object type
iommufd: Add IOMMU_HWPT_ALLOC
iommufd/selftest: Return the real idev id from selftest mock_domain
iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC
Nicolin Chen (2):
iommu: Introduce a new iommu_group_replace_domain() API
iommufd/selftest: Test iommufd_device_replace()
drivers/iommu/iommu-priv.h | 10 +
drivers/iommu/iommu.c | 30 ++
drivers/iommu/iommufd/device.c | 482 +++++++++++++-----
drivers/iommu/iommufd/hw_pagetable.c | 96 +++-
drivers/iommu/iommufd/io_pagetable.c | 5 +-
drivers/iommu/iommufd/iommufd_private.h | 44 +-
drivers/iommu/iommufd/iommufd_test.h | 7 +
drivers/iommu/iommufd/main.c | 17 +-
drivers/iommu/iommufd/selftest.c | 40 ++
include/linux/iommufd.h | 1 +
include/uapi/linux/iommufd.h | 26 +
tools/testing/selftests/iommu/iommufd.c | 64 ++-
.../selftests/iommu/iommufd_fail_nth.c | 57 ++-
tools/testing/selftests/iommu/iommufd_utils.h | 59 ++-
14 files changed, 758 insertions(+), 180 deletions(-)
create mode 100644 drivers/iommu/iommu-priv.h
base-commit: ac395958f9156733246b5bc3a481c6d38c321a7a
--
2.39.1
This series fixes a few cleanup/error handling problems and cleans up
code.
v2:
- Improved changelogs
- Return NULL directly from malloc_and_init_memory()
- Added patch to convert memalign() to posix_memalign()
- Added patch to correct function comment parameter
- Dropped literal -> define patch for now (likely superceded soon)
Fenghua Yu (1):
selftests/resctrl: Change name from CBM_MASK_PATH to INFO_PATH
Ilpo Järvinen (8):
selftests/resctrl: Return NULL if malloc_and_init_memory() did not
alloc mem
selftests/resctrl: Move ->setup() call outside of test specific
branches
selftests/resctrl: Allow ->setup() to return errors
selftests/resctrl: Check for return value after write_schemata()
selftests/resctrl: Replace obsolete memalign() with posix_memalign()
selftests/resctrl: Change initialize_llc_perf() return type to void
selftests/resctrl: Use remount_resctrlfs() consistently with boolean
selftests/resctrl: Correct get_llc_perf() param in function comment
tools/testing/selftests/resctrl/cache.c | 17 +++++++--------
tools/testing/selftests/resctrl/cat_test.c | 4 ++--
tools/testing/selftests/resctrl/cmt_test.c | 9 ++++----
tools/testing/selftests/resctrl/fill_buf.c | 7 +++++--
tools/testing/selftests/resctrl/mba_test.c | 11 +++++++---
tools/testing/selftests/resctrl/mbm_test.c | 4 ++--
tools/testing/selftests/resctrl/resctrl.h | 6 ++++--
tools/testing/selftests/resctrl/resctrl_val.c | 21 +++++++------------
tools/testing/selftests/resctrl/resctrlfs.c | 2 +-
9 files changed, 41 insertions(+), 40 deletions(-)
--
2.30.2
Add support for sockmap to vsock.
We're testing usage of vsock as a way to redirect guest-local UDS
requests to the host and this patch series greatly improves the
performance of such a setup.
Compared to copying packets via userspace, this improves throughput by
121% in basic testing.
Tested as follows.
Setup: guest unix dgram sender -> guest vsock redirector -> host vsock
server
Threads: 1
Payload: 64k
No sockmap:
- 76.3 MB/s
- The guest vsock redirector was
"socat VSOCK-CONNECT:2:1234 UNIX-RECV:/path/to/sock"
Using sockmap (this patch):
- 168.8 MB/s (+121%)
- The guest redirector was a simple sockmap echo server,
redirecting unix ingress to vsock 2:1234 egress.
- Same sender and server programs
*Note: these numbers are from RFC v1
Only the virtio transport has been tested. The loopback transport was
used in writing bpf/selftests, but not thoroughly tested otherwise.
This series requires the skb patch.
Changes in v3:
- vsock/bpf: Refactor wait logic in vsock_bpf_recvmsg() to avoid
backwards goto
- vsock/bpf: Check psock before acquiring slock
- vsock/bpf: Return bool instead of int of 0 or 1
- vsock/bpf: Wrap macro args __sk/__psock in parens
- vsock/bpf: Place comment trailer */ on separate line
Changes in v2:
- vsock/bpf: rename vsock_dgram_* -> vsock_*
- vsock/bpf: change sk_psock_{get,put} and {lock,release}_sock() order
to minimize slock hold time
- vsock/bpf: use "new style" wait
- vsock/bpf: fix bug in wait log
- vsock/bpf: add check that recvmsg sk_type is one dgram, seqpacket, or
stream. Return error if not one of the three.
- virtio/vsock: comment __skb_recv_datagram() usage
- virtio/vsock: do not init copied in read_skb()
- vsock/bpf: add ifdef guard around struct proto in dgram_recvmsg()
- selftests/bpf: add vsock loopback config for aarch64
- selftests/bpf: add vsock loopback config for s390x
- selftests/bpf: remove vsock device from vmtest.sh qemu machine
- selftests/bpf: remove CONFIG_VIRTIO_VSOCKETS=y from config.x86_64
- vsock/bpf: move transport-related (e.g., if (!vsk->transport)) checks
out of fast path
Signed-off-by: Bobby Eshleman <bobby.eshleman(a)bytedance.com>
---
Bobby Eshleman (3):
vsock: support sockmap
selftests/bpf: add vsock to vmtest.sh
selftests/bpf: Add a test case for vsock sockmap
drivers/vhost/vsock.c | 1 +
include/linux/virtio_vsock.h | 1 +
include/net/af_vsock.h | 17 ++
net/vmw_vsock/Makefile | 1 +
net/vmw_vsock/af_vsock.c | 55 ++++++-
net/vmw_vsock/virtio_transport.c | 2 +
net/vmw_vsock/virtio_transport_common.c | 24 +++
net/vmw_vsock/vsock_bpf.c | 175 +++++++++++++++++++++
net/vmw_vsock/vsock_loopback.c | 2 +
tools/testing/selftests/bpf/config.aarch64 | 2 +
tools/testing/selftests/bpf/config.s390x | 3 +
tools/testing/selftests/bpf/config.x86_64 | 3 +
.../selftests/bpf/prog_tests/sockmap_listen.c | 163 +++++++++++++++++++
13 files changed, 443 insertions(+), 6 deletions(-)
---
base-commit: d83115ce337a632f996e44c9f9e18cadfcf5a094
change-id: 20230118-support-vsock-sockmap-connectible-2e1297d2111a
Best regards,
--
Bobby Eshleman <bobby.eshleman(a)bytedance.com>
---
Bobby Eshleman (3):
vsock: support sockmap
selftests/bpf: add vsock to vmtest.sh
selftests/bpf: add a test case for vsock sockmap
drivers/vhost/vsock.c | 1 +
include/linux/virtio_vsock.h | 1 +
include/net/af_vsock.h | 17 ++
net/vmw_vsock/Makefile | 1 +
net/vmw_vsock/af_vsock.c | 55 ++++++-
net/vmw_vsock/virtio_transport.c | 2 +
net/vmw_vsock/virtio_transport_common.c | 25 +++
net/vmw_vsock/vsock_bpf.c | 174 +++++++++++++++++++++
net/vmw_vsock/vsock_loopback.c | 2 +
tools/testing/selftests/bpf/config.aarch64 | 2 +
tools/testing/selftests/bpf/config.s390x | 3 +
tools/testing/selftests/bpf/config.x86_64 | 3 +
.../selftests/bpf/prog_tests/sockmap_listen.c | 163 +++++++++++++++++++
13 files changed, 443 insertions(+), 6 deletions(-)
---
base-commit: c2ea552065e43d05bce240f53c3185fd3a066204
change-id: 20230227-vsock-sockmap-upstream-9d65c84174a2
Best regards,
--
Bobby Eshleman <bobby.eshleman(a)bytedance.com>
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
Use case 1:
The madvise call is not available in the programming language. An example for
this are programs with forked workloads using a garbage collected language without
pointers. In such a language madvise cannot be made available.
In addition the addresses of objects get moved around as they are garbage
collected. KSM sharing needs to be enabled "from the outside" for these type of
workloads.
Use case 2:
The same interpreter can also be used for workloads where KSM brings no
benefit or even has overhead. We'd like to be able to enable KSM on a workload
by workload basis.
Use case 3:
With the madvise call sharing opportunities are only enabled for the current
process: it is a workload-local decision. A considerable number of sharing
opportuniites may exist across multiple workloads or jobs. Only a higler level
entity like a job scheduler or container can know for certain if its running
one or more instances of a job. That job scheduler however doesn't have
the necessary internal worklaod knowledge to make targeted madvise calls.
Security concerns:
In previous discussions security concerns have been brought up. The problem is
that an individual workload does not have the knowledge about what else is
running on a machine. Therefore it has to be very conservative in what memory
areas can be shared or not. However, if the system is dedicated to running
multiple jobs within the same security domain, its the job scheduler that has
the knowledge that sharing can be safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.
1. New options for prctl system command
This patch series adds two new options to the prctl system call. The first
one allows to enable KSM at the process level and the second one to query the
setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate over all
the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be inherited by
the new child process.
In addition when KSM is disabled for a process, KSM will be disabled for the
VMA's where KSM has been enabled.
3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation, but not
calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat
This adds the process profit and ksm type metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests
This adds an option to specify the merge type to the ksm_tests. This allows to
test madvise and prctl KSM. It also adds a new option to query if prctl KSM has
been enabled. It adds a fork test to verify that the KSM process setting is
inherited by client processes.
Changes:
- V3:
- folded patch 1 - 6
- folded patch 7 - 14
- folded patch 15 - 19
- Expanded on the use cases in the cover letter
- Added a section on security concerns to the cover letter
- V2:
- Added use cases to the cover letter
- Removed the tracing patch from the patch series and posted it as an
individual patch
- Refreshed repo
Stefan Roesch (3):
mm: add new api to enable ksm per process
mm: add new KSM process and sysfs knobs
selftests/mm: add new selftests for KSM
Documentation/ABI/testing/sysfs-kernel-mm-ksm | 8 +
Documentation/admin-guide/mm/ksm.rst | 8 +-
fs/proc/base.c | 5 +
include/linux/ksm.h | 19 +-
include/linux/sched/coredump.h | 1 +
include/uapi/linux/prctl.h | 2 +
kernel/sys.c | 29 ++
mm/ksm.c | 114 +++++++-
tools/include/uapi/linux/prctl.h | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/ksm_tests.c | 254 +++++++++++++++---
11 files changed, 389 insertions(+), 56 deletions(-)
base-commit: 234a68e24b120b98875a8b6e17a9dead277be16a
--
2.30.2
Shuah,
I'd like this to go through your tree as this is timeout related.
In order to help me help developers run tests against the components
I maintain much easily I have enabled selftests support on kdevops [0].
kdevops deals with abstractsions like letting you pick virtualization
or cloud solutions to run the tests using kconfig, installs all
dependencies for you, and with just a few make target commands can get
you the latest linux-next tested against selftests.
If other find this useful and would like support for their selftests on
kdevops feel free to send patches. Eventually the idea is to be able to
run as many selftests in parallel using different guests for each main
selftest to speed up tests.
Prior to this I used to run tests manually, now the selftests helpers
are used (./tools/testing/selftests/run_kselftest.sh -s) and with this
the default selftest timeout is hit. This just increases that for the few
selftests I help maintain where obviously its not enough anymore.
Note: on the firmware side I am spotting an OOM triggered by running
tests in a loop, so far I hit in the android configuration but its
not clear if the issue is just for that setup.
[0] https://github.com/linux-kdevops/kdevops
Luis Chamberlain (2):
selftests/kmod: increase the kmod timeout from 45 to 165
selftests/firmware: increase timeout from 165 to 230
tools/testing/selftests/firmware/settings | 8 +++++++-
tools/testing/selftests/kmod/settings | 4 ++++
2 files changed, 11 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kmod/settings
--
2.39.1
This series implements selftests targeting the feature floated by Chao via:
https://lore.kernel.org/lkml/20221202061347.1070246-10-chao.p.peng@linux.in…
Below changes aim to test the fd based approach for guest private memory
in context of normal (non-confidential) VMs executing on non-confidential
platforms.
private_mem_test.c file adds selftest to access private memory from the
guest via private/shared accesses and checking if the contents can be
leaked to/accessed by vmm via shared memory view before/after conversions.
Updates in V2:
1) Simplified vcpu run loop implementation API
2) Removed VM creation logic from private mem library
Updates in V1 (Compared to RFC v3 patches):
1) Incorporated suggestions from Sean around simplifying KVM changes
2) Addressed comments from Sean
3) Added private mem test with shared memory backed by 2MB hugepages.
V1 series:
https://lore.kernel.org/lkml/20221111014244.1714148-1-vannapurve@google.com…
This series has dependency on following patches:
1) V10 series patches from Chao mentioned above.
Github link for the patches posted as part of this series:
https://github.com/vishals4gh/linux/commits/priv_memfd_selftests_v2
Vishal Annapurve (6):
KVM: x86: Add support for testing private memory
KVM: Selftests: Add support for private memory
KVM: selftests: x86: Add IS_ALIGNED/IS_PAGE_ALIGNED helpers
KVM: selftests: x86: Add helpers to execute VMs with private memory
KVM: selftests: Add get_free_huge_2m_pages
KVM: selftests: x86: Add selftest for private memory
arch/x86/kvm/mmu/mmu_internal.h | 6 +-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 2 +
.../selftests/kvm/include/kvm_util_base.h | 15 +-
.../testing/selftests/kvm/include/test_util.h | 5 +
.../kvm/include/x86_64/private_mem.h | 24 ++
.../selftests/kvm/include/x86_64/processor.h | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 58 ++++-
tools/testing/selftests/kvm/lib/test_util.c | 29 +++
.../selftests/kvm/lib/x86_64/private_mem.c | 139 ++++++++++++
.../selftests/kvm/x86_64/private_mem_test.c | 212 ++++++++++++++++++
virt/kvm/Kconfig | 4 +
virt/kvm/kvm_main.c | 3 +-
13 files changed, 490 insertions(+), 9 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/x86_64/private_mem.h
create mode 100644 tools/testing/selftests/kvm/lib/x86_64/private_mem.c
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_test.c
--
2.39.0.rc0.267.gcb52ba06e7-goog