v3:
- pull forward to v6.8
- style and small fixups recommended by jcameron
- update syscall number (will do all archs when RFC tag drops)
- update for new folio code
- added OCP link to device-tracked address hotness proposal
- kept void* over __u64 simply because it integrates cleanly with existing migration code. If there's strong opinions, I can refactor.
This patch set is a proposal for a syscall, analogous to move_pages, that migrates pages between NUMA nodes using physical addressing.
The intent is to better enable user-land system-wide memory tiering as CXL devices begin to provide memory resources on the PCIe bus.
For example, user-land software which is making decisions based on data sources that expose physical address information no longer has to convert that information to virtual addresses to act upon it (see the background section for how physical addresses are acquired).
The syscall requires CAP_SYS_ADMIN, since sources of physical address information are typically protected by the same capability (or CAP_SYS_NICE).
This patch set is broken into 3 patches:
1) refactor of existing migration code for code reuse
2) the sys_move_phys_pages system call
3) ktest of the syscall
The sys_move_phys_pages system call validates that a page may be migrated by checking the migratable status of each vma mapping the page, and the intersection of the cpuset policies of each vma's task.
Background:
Userspace job schedulers, memory managers, and tiering software solutions depend on page migration syscalls to reallocate resources across NUMA nodes. Currently, these calls enable movement of memory associated with a specific PID. Moves can be requested in coarse, process-sized strokes (as with migrate_pages), and on specific virtual pages (via move_pages).
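For reference, the existing virtual-address interface already supports a query mode: move_pages(2) with a NULL nodes array looks up which node backs each page instead of moving it. A minimal sketch (the wrapper name is mine, not part of any API):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Query which NUMA node currently backs one page of the calling
 * process. move_pages(2) with nodes == NULL performs a lookup rather
 * than a migration; status[i] receives the node id or a negative errno.
 * No special capability is needed to query your own pages.
 */
static long node_of_page(void *page)
{
	int status = -1;

	if (syscall(SYS_move_pages, 0 /* self */, 1UL, &page,
		    NULL /* lookup only, don't move */, &status, 0))
		return -errno;
	return status;
}
```

Note this still requires a virtual address in the caller's own address space, which is exactly the limitation the rest of this cover letter is about.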
However, a number of profiling mechanisms provide system-wide information that would benefit from a physical-addressing version of move_pages.
There are presently at least 4 ways userland can acquire physical address information for use with this interface, and 1 hardware offload mechanism being proposed by opencompute.
1) /proc/pid/pagemap: can be used to do page table translations. This is only really useful for testing, and the ktest was written using this functionality.
2) X86: IBS (AMD) and PEBS (Intel) can be configured to return physical and/or virtual address information.
3) zoneinfo: /proc/zoneinfo exposes the start PFN of zones
4) /sys/kernel/mm/page_idle: A way to query whether a PFN is idle. So long as the page size is known, this can be used to identify system-wide idle pages that could be migrated to lower tiers.
https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
5) CXL Offloaded Hotness Monitoring (Proposed): a CXL memory device may provide hot/cold information about its memory. For example, it may report the hottest device addresses (0-based DPAs) or physical addresses (if it has access to the decoder information needed to convert between the two).
A DPA can be cheaply converted to an HPA by combining it with information exposed under /sys/bus/cxl/ (region address bases).
See: https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-...
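To make source (1) concrete: each pagemap entry is one u64 per virtual page, with bit 63 = page present and bits 0-54 = page frame number (per the kernel's pagemap documentation). A small decoding sketch (helper names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Decode a /proc/<pid>/pagemap entry:
 *   bit  63    page present
 *   bits 0-54  page frame number (when present and not swapped)
 */
static inline int pagemap_present(uint64_t ent)
{
	return (ent >> 63) & 1;
}

static inline uint64_t pagemap_pfn(uint64_t ent)
{
	return ent & ((1ULL << 55) - 1);
}

/* A PFN indexes physical pages, so the physical address is
 * pfn * page_size plus the offset within the page. */
static inline uint64_t pagemap_phys(uint64_t ent, uint64_t vaddr,
				    uint64_t page_size)
{
	return pagemap_pfn(ent) * page_size + (vaddr & (page_size - 1));
}
```

The selftest in patch 3 performs exactly this translation, reading the entry for a known virtual address out of /proc/self/pagemap (which requires CAP_SYS_ADMIN to see PFNs).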
Information from these sources facilitates systemwide resource management, but with the limitations of migrate_pages and move_pages applying to individual tasks, their outputs must be converted back to virtual addresses and re-associated with specific PIDs.
Doing this reverse-translation outside of the kernel requires considerable space and compute, and it will have to be performed again by the existing system calls. Much of this work can be avoided if the pages can be migrated directly with physical memory addressing.
Gregory Price (3):
  mm/migrate: refactor add_page_for_migration for code re-use
  mm/migrate: Create move_phys_pages syscall
  ktest: sys_move_phys_pages ktest

 arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
 include/linux/syscalls.h                |   5 +
 include/uapi/asm-generic/unistd.h       |   8 +-
 kernel/sys_ni.c                         |   1 +
 mm/migrate.c                            | 288 ++++++++++++++++++++----
 tools/include/uapi/asm-generic/unistd.h |   8 +-
 tools/testing/selftests/mm/migration.c  |  99 ++++++++
 8 files changed, 370 insertions(+), 41 deletions(-)
add_page_for_migration presently does two actions:
1) validates the page is present and migratable
2) isolates the page from the LRU and puts it into the migration list

Break add_page_for_migration into 2 functions:
  add_page_for_migration - isolate the page from the LRU and add it to the list
  add_virt_page_for_migration - validate the page and call the above

add_page_for_migration does not require the mm_struct and so can be re-used for a physical addressing version of move_pages.
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/migrate.c | 84 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 50 insertions(+), 34 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index c27b1f8097d4..27071a07ffbb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2066,6 +2066,46 @@ static int do_move_pages_to_node(struct list_head *pagelist, int node)
 	return err;
 }
 
+/*
+ * Isolates the page from the LRU and puts it into the given pagelist
+ * Returns:
+ *     errno - if the page cannot be isolated
+ *     0 - when it doesn't have to be migrated because it is already on the
+ *         target node
+ *     1 - when it has been queued
+ */
+static int add_page_for_migration(struct page *page,
+				  struct folio *folio,
+				  int node,
+				  struct list_head *pagelist,
+				  bool migrate_all)
+{
+	if (folio_is_zone_device(folio))
+		return -ENOENT;
+
+	if (folio_nid(folio) == node)
+		return 0;
+
+	if (page_mapcount(page) > 1 && !migrate_all)
+		return -EACCES;
+
+	if (folio_test_hugetlb(folio)) {
+		if (isolate_hugetlb(folio, pagelist))
+			return 1;
+		return -EBUSY;
+	}
+
+	if (!folio_isolate_lru(folio))
+		return -EBUSY;
+
+	list_add_tail(&folio->lru, pagelist);
+	node_stat_mod_folio(folio,
+			    NR_ISOLATED_ANON + folio_is_file_lru(folio),
+			    folio_nr_pages(folio));
+
+	return 1;
+}
+
 /*
  * Resolves the given address to a struct page, isolates it from the LRU and
  * puts it to the given pagelist.
@@ -2075,19 +2115,19 @@ static int do_move_pages_to_node(struct list_head *pagelist, int node)
  *     target node
  *     1 - when it has been queued
  */
-static int add_page_for_migration(struct mm_struct *mm, const void __user *p,
-		int node, struct list_head *pagelist, bool migrate_all)
+static int add_virt_page_for_migration(struct mm_struct *mm,
+		const void __user *p, int node, struct list_head *pagelist,
+		bool migrate_all)
 {
 	struct vm_area_struct *vma;
 	unsigned long addr;
 	struct page *page;
 	struct folio *folio;
-	int err;
+	int err = -EFAULT;
 
 	mmap_read_lock(mm);
 	addr = (unsigned long)untagged_addr_remote(mm, p);
 
-	err = -EFAULT;
 	vma = vma_lookup(mm, addr);
 	if (!vma || !vma_migratable(vma))
 		goto out;
@@ -2095,41 +2135,17 @@ static int add_page_for_migration(struct mm_struct *mm, const void __user *p,
 	/* FOLL_DUMP to ignore special (like zero) pages */
 	page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
 
-	err = PTR_ERR(page);
-	if (IS_ERR(page))
-		goto out;
-
 	err = -ENOENT;
 	if (!page)
 		goto out;
 
-	folio = page_folio(page);
-	if (folio_is_zone_device(folio))
-		goto out_putfolio;
-
-	err = 0;
-	if (folio_nid(folio) == node)
-		goto out_putfolio;
+	err = PTR_ERR(page);
+	if (IS_ERR(page))
+		goto out;
 
-	err = -EACCES;
-	if (page_mapcount(page) > 1 && !migrate_all)
-		goto out_putfolio;
+	folio = page_folio(page);
+	err = add_page_for_migration(page, folio, node, pagelist, migrate_all);
 
-	err = -EBUSY;
-	if (folio_test_hugetlb(folio)) {
-		if (isolate_hugetlb(folio, pagelist))
-			err = 1;
-	} else {
-		if (!folio_isolate_lru(folio))
-			goto out_putfolio;
-
-		err = 1;
-		list_add_tail(&folio->lru, pagelist);
-		node_stat_mod_folio(folio,
-			NR_ISOLATED_ANON + folio_is_file_lru(folio),
-			folio_nr_pages(folio));
-	}
-out_putfolio:
 	/*
 	 * Either remove the duplicate refcount from folio_isolate_lru()
 	 * or drop the folio ref if it was not isolated.
@@ -2229,7 +2245,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 		 * Errors in the page lookup or isolation are not fatal and we simply
 		 * report them via status
 		 */
-		err = add_page_for_migration(mm, p, current_node, &pagelist,
+		err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
 				flags & MPOL_MF_MOVE_ALL);
 
 		if (err > 0) {
Similar to the move_pages system call, but instead of taking a pid and a list of virtual addresses, this system call takes a list of physical addresses.
Because there is no task to validate the memory policy against, each page needs to be interrogated to determine whether the migration is valid. This is accomplished via an rmap_walk on the folio containing the page, interrogating every task that maps it (by way of each task's vma).
Each page must be interrogated individually, which should be considered when using this to migrate shared regions.
The remaining logic is the same as the move_pages syscall. One change to do_pages_move is made (to check whether an mm_struct is passed) in order to re-use the existing migration code.
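For illustration, userspace would invoke the proposed call much like move_pages() minus the pid. The wrapper below is hypothetical, and the syscall number (462) is provisional per the cover letter; on kernels without this patch the raw syscall simply fails:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_move_phys_pages
#define __NR_move_phys_pages 462	/* provisional RFC number */
#endif

/*
 * Hypothetical wrapper sketch: same shape as move_pages(2), but the
 * "pages" array holds physical addresses and there is no pid argument.
 * Requires CAP_SYS_ADMIN on a kernel carrying this patch; a NULL
 * "nodes" array would query node placement instead of moving.
 */
static long move_phys_pages(unsigned long nr_pages, const void **pages,
			    const int *nodes, int *status, int flags)
{
	return syscall(__NR_move_phys_pages, nr_pages, pages, nodes,
		       status, flags);
}
```

This mirrors the SYSCALL_DEFINE5 signature in the patch below; it is a sketch of the proposed ABI, not a shipped interface.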
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
 include/linux/syscalls.h                |   5 +
 include/uapi/asm-generic/unistd.h       |   8 +-
 kernel/sys_ni.c                         |   1 +
 mm/migrate.c                            | 206 +++++++++++++++++++++++-
 tools/include/uapi/asm-generic/unistd.h |   8 +-
 7 files changed, 222 insertions(+), 8 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 5f8591ce7f25..250c00281029 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -466,3 +466,4 @@
 459	i386	lsm_get_self_attr	sys_lsm_get_self_attr
 460	i386	lsm_set_self_attr	sys_lsm_set_self_attr
 461	i386	lsm_list_modules	sys_lsm_list_modules
+462	i386	move_phys_pages		sys_move_phys_pages
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..a928df7c6f52 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	common	move_phys_pages		sys_move_phys_pages
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 77eb9b0e7685..575ba9d26e30 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -840,6 +840,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned long nr_pages,
 				const void __user * __user *pages,
 				const int __user *nodes,
 				int __user *status,
 				int flags);
+asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
+				const void __user * __user *pages,
+				const int __user *nodes,
+				int __user *status,
+				int flags);
 asmlinkage long sys_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, int sig,
 		siginfo_t __user *uinfo);
 asmlinkage long sys_perf_event_open(
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..13bc8dd16d6b 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,14 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
 
+/* CONFIG_MMU only */
+#ifndef __ARCH_NOMMU
+#define __NR_move_phys_pages 462
+__SYSCALL(__NR_move_phys_pages, sys_move_phys_pages)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..254915fd1e2c 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -196,6 +196,7 @@ COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
 COND_SYSCALL(cachestat);
+COND_SYSCALL(move_phys_pages);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
diff --git a/mm/migrate.c b/mm/migrate.c
index 27071a07ffbb..7213703441f8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2182,9 +2182,119 @@ static int move_pages_and_store_status(int node,
 	return store_status(status, start, node, i - start);
 }
 
+struct rmap_page_ctxt {
+	bool found;
+	bool migratable;
+	bool node_allowed;
+	int node;
+};
+
+/*
+ * Walks each vma mapping a given page and determines if those
+ * vma's are both migratable, and that the target node is within
+ * the allowed cpuset of the owning task.
+ */
+static bool phys_page_migratable(struct folio *folio,
+				 struct vm_area_struct *vma,
+				 unsigned long address,
+				 void *arg)
+{
+	struct rmap_page_ctxt *ctxt = arg;
+#ifdef CONFIG_MEMCG
+	struct task_struct *owner = vma->vm_mm->owner;
+	nodemask_t task_nodes = cpuset_mems_allowed(owner);
+#else
+	nodemask_t task_nodes = node_possible_map;
+#endif
+
+	ctxt->found = true;
+	ctxt->migratable &= vma_migratable(vma);
+	ctxt->node_allowed &= node_isset(ctxt->node, task_nodes);
+
+	return ctxt->migratable && ctxt->node_allowed;
+}
+
+static struct folio *phys_migrate_get_folio(struct page *page)
+{
+	struct folio *folio;
+
+	folio = page_folio(page);
+	if (!folio_test_lru(folio) || !folio_try_get(folio))
+		return NULL;
+	if (unlikely(page_folio(page) != folio || !folio_test_lru(folio))) {
+		folio_put(folio);
+		folio = NULL;
+	}
+	return folio;
+}
+
+/*
+ * Validates the physical address is online and migratable. Walks the folio
+ * containing the page to validate the vma is migratable and the cpuset node
+ * restrictions. Then calls add_page_for_migration to isolate it from the
+ * LRU and place it into the given pagelist.
+ * Returns:
+ *     errno - if the page is not online, migratable, or can't be isolated
+ *     0 - when it doesn't have to be migrated because it is already on the
+ *         target node
+ *     1 - when it has been queued
+ */
+static int add_phys_page_for_migration(const void __user *p, int node,
+				       struct list_head *pagelist,
+				       bool migrate_all)
+{
+	unsigned long pfn;
+	struct page *page;
+	struct folio *folio;
+	int err;
+	struct rmap_page_ctxt rmctxt = {
+		.found = false,
+		.migratable = true,
+		.node_allowed = true,
+		.node = node
+	};
+	struct rmap_walk_control rwc = {
+		.rmap_one = phys_page_migratable,
+		.arg = &rmctxt
+	};
+
+	pfn = ((unsigned long)p) >> PAGE_SHIFT;
+	page = pfn_to_online_page(pfn);
+	if (!page || PageTail(page))
+		return -ENOENT;
+
+	folio = phys_migrate_get_folio(page);
+	if (!folio)
+		return -ENOENT;
+
+	rmap_walk(folio, &rwc);
+
+	if (!rmctxt.found)
+		err = -ENOENT;
+	else if (!rmctxt.migratable)
+		err = -EFAULT;
+	else if (!rmctxt.node_allowed)
+		err = -EACCES;
+	else
+		err = add_page_for_migration(page, folio, node, pagelist,
+					     migrate_all);
+
+	folio_put(folio);
+
+	return err;
+}
+
 /*
  * Migrate an array of page address onto an array of nodes and fill
  * the corresponding array of status.
+ *
+ * When the mm argument is not NULL, task_nodes is expected to be the
+ * cpuset nodemask for the task which owns the mm_struct, and the
+ * values located in (*pages) are expected to be virtual addresses.
+ *
+ * When the mm argument is NULL, the values located at (*pages) are
+ * expected to be physical addresses, and task_nodes is expected to
+ * be empty.
  */
 static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			 unsigned long nr_pages,
@@ -2226,7 +2336,14 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			goto out_flush;
 
 		err = -EACCES;
-		if (!node_isset(node, task_nodes))
+		/*
+		 * if mm is NULL, then the pages are addressed via physical
+		 * address and the task_nodes structure is empty. Validation
+		 * of migratability is deferred to add_phys_page_for_migration
+		 * where vma's that map the address will have their node_mask
+		 * checked to ensure the requested node bit is set.
+		 */
+		if (mm && !node_isset(node, task_nodes))
 			goto out_flush;
 
 		if (current_node == NUMA_NO_NODE) {
@@ -2243,10 +2360,17 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 
 		/*
 		 * Errors in the page lookup or isolation are not fatal and we simply
-		 * report them via status
+		 * report them via status.
+		 *
+		 * If mm is NULL, then p is treated as a physical address.
 		 */
-		err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
-				flags & MPOL_MF_MOVE_ALL);
+		if (mm)
+			err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
+					flags & MPOL_MF_MOVE_ALL);
+		else
+			err = add_phys_page_for_migration(p, current_node, &pagelist,
+					flags & MPOL_MF_MOVE_ALL);
+
 		if (err > 0) {
 			/* The page is successfully queued for migration */
@@ -2334,6 +2458,37 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 	mmap_read_unlock(mm);
 }
 
+/*
+ * Determine the nodes of the pages pointed to by the physical addresses in
+ * the pages array, and store those node values in the status array
+ */
+static void do_phys_pages_stat_array(unsigned long nr_pages,
+				     const void __user **pages, int *status)
+{
+	unsigned long i;
+
+	for (i = 0; i < nr_pages; i++) {
+		unsigned long pfn = (unsigned long)(*pages) >> PAGE_SHIFT;
+		struct page *page = pfn_to_online_page(pfn);
+		int err = -ENOENT;
+
+		if (!page)
+			goto set_status;
+
+		get_page(page);
+
+		if (!is_zone_device_page(page))
+			err = page_to_nid(page);
+
+		put_page(page);
+set_status:
+		*status = err;
+
+		pages++;
+		status++;
+	}
+}
+
 static int get_compat_pages_array(const void __user *chunk_pages[],
 				  const void __user * __user *pages,
 				  unsigned long chunk_nr)
@@ -2376,7 +2531,10 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
 			break;
 		}
 
-		do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
+		if (mm)
+			do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
+		else
+			do_phys_pages_stat_array(chunk_nr, chunk_pages, chunk_status);
 
 		if (copy_to_user(status, chunk_status, chunk_nr * sizeof(*status)))
 			break;
@@ -2449,7 +2607,7 @@ static int kernel_move_pages(pid_t pid, unsigned long nr_pages,
 	nodemask_t task_nodes;
 
 	/* Check flags */
-	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+	if (flags & ~(MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
 		return -EINVAL;
 
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
@@ -2477,6 +2635,42 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
 	return kernel_move_pages(pid, nr_pages, pages, nodes, status, flags);
 }
 
+/*
+ * Move a list of physically-addressed pages to the list of target nodes
+ */
+static int kernel_move_phys_pages(unsigned long nr_pages,
+				  const void __user * __user *pages,
+				  const int __user *nodes,
+				  int __user *status, int flags)
+{
+	nodemask_t dummy_nodes;
+
+	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+		return -EINVAL;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (!nodes)
+		return do_pages_stat(NULL, nr_pages, pages, status);
+
+	/*
+	 * When the mm argument to do_pages_move is null, the task_nodes
+	 * argument is ignored, so pass in an empty nodemask as a dummy.
+	 */
+	nodes_clear(dummy_nodes);
+	return do_pages_move(NULL, dummy_nodes, nr_pages, pages, nodes, status,
+			     flags);
+}
+
+SYSCALL_DEFINE5(move_phys_pages, unsigned long, nr_pages,
+		const void __user * __user *, pages,
+		const int __user *, nodes,
+		int __user *, status, int, flags)
+{
+	return kernel_move_phys_pages(nr_pages, pages, nodes, status, flags);
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Returns true if this is a safe migration target node for misplaced NUMA
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 75f00965ab15..13bc8dd16d6b 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,14 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
 
+/* CONFIG_MMU only */
+#ifndef __ARCH_NOMMU
+#define __NR_move_phys_pages 462
+__SYSCALL(__NR_move_phys_pages, sys_move_phys_pages)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463
 
 /*
  * 32 bit systems traditionally used different
Implement simple ktest that looks up the physical address via /proc/self/pagemap and migrates the page based on that information.
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 tools/testing/selftests/mm/migration.c | 99 ++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)
diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c
index 6908569ef406..c005c98dbdc1 100644
--- a/tools/testing/selftests/mm/migration.c
+++ b/tools/testing/selftests/mm/migration.c
@@ -5,6 +5,8 @@
  */
 
 #include "../kselftest_harness.h"
+#include <stdint.h>
+#include <stdio.h>
 #include <strings.h>
 #include <pthread.h>
 #include <numa.h>
@@ -14,11 +16,17 @@
 #include <sys/types.h>
 #include <signal.h>
 #include <time.h>
+#include <unistd.h>
 
 #define TWOMEG (2<<20)
 #define RUNTIME (20)
 
+#define GET_BIT(X, Y) ((X & ((uint64_t)1<<Y)) >> Y)
+#define GET_PFN(X) (X & 0x7FFFFFFFFFFFFFull)
 #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
+#define PAGEMAP_ENTRY 8
+const int __endian_bit = 1;
+#define is_bigendian() ((*(char *)&__endian_bit) == 0)
 
 FIXTURE(migration)
 {
@@ -94,6 +102,45 @@ int migrate(uint64_t *ptr, int n1, int n2)
 	return 0;
 }
 
+int migrate_phys(uint64_t paddr, int n1, int n2)
+{
+	int ret, tmp;
+	int status = 0;
+	struct timespec ts1, ts2;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &ts1))
+		return -1;
+
+	while (1) {
+		if (clock_gettime(CLOCK_MONOTONIC, &ts2))
+			return -1;
+
+		if (ts2.tv_sec - ts1.tv_sec >= RUNTIME)
+			return 0;
+
+		/*
+		 * FIXME: move_phys_pages was syscall 462 during RFC.
+		 * Update this when an official syscall number is adopted
+		 * and the libnuma interface is implemented.
+		 */
+		ret = syscall(462, 1, (void **) &paddr, &n2, &status,
+			      MPOL_MF_MOVE_ALL);
+		if (ret) {
+			if (ret > 0)
+				printf("Didn't migrate %d pages\n", ret);
+			else
+				perror("Couldn't migrate pages");
+			return -2;
+		}
+
+		tmp = n2;
+		n2 = n1;
+		n1 = tmp;
+	}
+
+	return 0;
+}
+
 void *access_mem(void *ptr)
 {
 	volatile uint64_t y = 0;
@@ -199,4 +246,56 @@ TEST_F_TIMEOUT(migration, private_anon_thp, 2*RUNTIME)
 		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
 }
 
+/*
+ * Same as the basic migration, but test move_phys_pages.
+ */
+TEST_F_TIMEOUT(migration, phys_addr, 2*RUNTIME)
+{
+	uint64_t *ptr;
+	uint64_t pagemap_val = 0, paddr, file_offset;
+	unsigned char c_buf[PAGEMAP_ENTRY];
+	int i, c, status;
+	FILE *f;
+
+	if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+		SKIP(return, "Not enough threads or NUMA nodes available");
+
+	ptr = mmap(NULL, TWOMEG, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	memset(ptr, 0xde, TWOMEG);
+
+	/* PFN of ptr from /proc/self/pagemap */
+	f = fopen("/proc/self/pagemap", "rb");
+	file_offset = ((uint64_t)ptr) / getpagesize() * PAGEMAP_ENTRY;
+	status = fseek(f, file_offset, SEEK_SET);
+	ASSERT_EQ(status, 0);
+	for (i = 0; i < PAGEMAP_ENTRY; i++) {
+		c = getc(f);
+		ASSERT_NE(c, EOF);
+		/* handle endianness differences */
+		if (is_bigendian())
+			c_buf[i] = c;
+		else
+			c_buf[PAGEMAP_ENTRY - i - 1] = c;
+	}
+	fclose(f);
+
+	for (i = 0; i < PAGEMAP_ENTRY; i++)
+		pagemap_val = (pagemap_val << 8) + c_buf[i];
+
+	ASSERT_TRUE(GET_BIT(pagemap_val, 63));
+	/* This reports a pfn, we need to shift this by page size */
+	paddr = GET_PFN(pagemap_val) << __builtin_ctz(getpagesize());
+
+	for (i = 0; i < self->nthreads - 1; i++)
+		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+			perror("Couldn't create thread");
+
+	ASSERT_EQ(migrate_phys(paddr, self->n1, self->n2), 0);
+	for (i = 0; i < self->nthreads - 1; i++)
+		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+}
+
 TEST_HARNESS_MAIN
On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > Implement simple ktest that looks up the physical address via /proc/self/pagemap and migrates the page based on that information.
>
> What? LOL. No.

Also, how is this v3 and the first one to land on linux-mm?

https://lore.kernel.org/linux-mm/?q=move_phys_pages

Also, where is the syscall itself? The only thing here is the ktest.
On Tue, Mar 19, 2024 at 06:08:39PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > > Implement simple ktest that looks up the physical address via /proc/self/pagemap and migrates the page based on that information.
> >
> > What? LOL. No.
>
> Also, how is this v3 and the first one to land on linux-mm?
>
> https://lore.kernel.org/linux-mm/?q=move_phys_pages
>
> Also, where is the syscall itself? The only thing here is the ktest.

OH, I see the confusion now.

There were two other versions, and I have experienced this delivery failure before; I'm not sure why the other commits have not been delivered.

Let me look into this.

~Gregory
On Tue, Mar 19, 2024 at 02:16:01PM -0400, Gregory Price wrote:
> On Tue, Mar 19, 2024 at 06:08:39PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > > > Implement simple ktest that looks up the physical address via /proc/self/pagemap and migrates the page based on that information.
> > >
> > > What? LOL. No.
> >
> > Also, how is this v3 and the first one to land on linux-mm?
> >
> > https://lore.kernel.org/linux-mm/?q=move_phys_pages
> >
> > Also, where is the syscall itself? The only thing here is the ktest.
>
> OH, I see the confusion now.
>
> There were two other versions, and I have experienced this delivery failure before; I'm not sure why the other commits have not been delivered.
>
> Let me look into this.
>
> ~Gregory

Full set of patches:
https://lore.kernel.org/all/20240319172609.332900-1-gregory.price@memverge.c...

I've experienced silent linux-mm delivery failures like this before; I still do not understand the issue.

v1: https://lore.kernel.org/all/20230907075453.350554-1-gregory.price@memverge.c...
v2: https://lore.kernel.org/all/20230919230909.530174-1-gregory.price@memverge.c...

~Gregory
On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > Implement simple ktest that looks up the physical address via /proc/self/pagemap and migrates the page based on that information.
>
> What? LOL. No.

Certainly the test is stupid and requires admin, but I could not come up with an easier test to demonstrate the concept - and the docs say to include a test with all syscall proposals.

Am I missing something else important? (stupid question: of course I am, but alas I must ask it)

~Gregory
On Tue, Mar 19, 2024 at 02:14:33PM -0400, Gregory Price wrote:
> On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > > Implement simple ktest that looks up the physical address via /proc/self/pagemap and migrates the page based on that information.
> >
> > What? LOL. No.
>
> Certainly the test is stupid and requires admin, but I could not come up with an easier test to demonstrate the concept - and the docs say to include a test with all syscall proposals.
>
> Am I missing something else important? (stupid question: of course I am, but alas I must ask it)

It's not that the test is stupid. It's the concept that's stupid.
On Tue, Mar 19, 2024 at 06:20:33PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 19, 2024 at 02:14:33PM -0400, Gregory Price wrote:
> > On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > > > Implement simple ktest that looks up the physical address via /proc/self/pagemap and migrates the page based on that information.
> > >
> > > What? LOL. No.
> >
> > Certainly the test is stupid and requires admin, but I could not come up with an easier test to demonstrate the concept - and the docs say to include a test with all syscall proposals.
> >
> > Am I missing something else important? (stupid question: of course I am, but alas I must ask it)
>
> It's not that the test is stupid. It's the concept that's stupid.

Ok, I'll bite.

The 2 major ways page hotness is detected right now are page faults (induced or otherwise) and things like IBS/PEBS.

Page faults cause overhead, and IBS/PEBS actually miss upwards of ~66% of all traffic (if you want the details I can dig up the presentation, but TL;DR: prefetcher traffic is missed entirely).

So OCP folks have been proposing hotness tracking offloaded to the memory devices themselves:

https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-...

(It's come along further than this white paper, but I need to dig up the new information.)

These devices are incapable of providing virtual addressing information, and doing reverse lookups of addresses is inordinately expensive from user space. This leaves two options: do it all in a kernel task, or give user space an interface to operate on data provided by the device.

The syscall design is mostly being posted right now to collaborate via public channels, but if the idea is so fundamentally offensive then I'll drop it and relay the opinion accordingly.

~Gregory
On Tue, Mar 19, 2024 at 02:32:17PM -0400, Gregory Price wrote:
> On Tue, Mar 19, 2024 at 06:20:33PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 19, 2024 at 02:14:33PM -0400, Gregory Price wrote:
> > > On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> > > > On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > > > > Implement simple ktest that looks up the physical address via /proc/self/pagemap and migrates the page based on that information.
> > > >
> > > > What? LOL. No.
> > >
> > > Certainly the test is stupid and requires admin, but I could not come up with an easier test to demonstrate the concept - and the docs say to include a test with all syscall proposals.
> > >
> > > Am I missing something else important? (stupid question: of course I am, but alas I must ask it)
> >
> > It's not that the test is stupid. It's the concept that's stupid.
>
> Ok, I'll bite.
>
> The 2 major ways page hotness is detected right now are page faults (induced or otherwise) and things like IBS/PEBS.
>
> Page faults cause overhead, and IBS/PEBS actually miss upwards of ~66% of all traffic (if you want the details I can dig up the presentation, but TL;DR: prefetcher traffic is missed entirely).
>
> So OCP folks have been proposing hotness tracking offloaded to the memory devices themselves:
>
> https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-...
>
> (It's come along further than this white paper, but I need to dig up the new information.)
>
> These devices are incapable of providing virtual addressing information, and doing reverse lookups of addresses is inordinately expensive from user space. This leaves two options: do it all in a kernel task, or give user space an interface to operate on data provided by the device.
>
> The syscall design is mostly being posted right now to collaborate via public channels, but if the idea is so fundamentally offensive then I'll drop it and relay the opinion accordingly.

The syscall design is wrong. Exposing physical addresses to userspace is never the right answer. Think rowhammer.

I'm vehemently opposed to all of the bullshit around CXL. However, if you are going to propose something, it should be based around an abstraction. Say "We have 8 pools of memory. This VMA is backed by memory from pools 3 & 6. The relative hotness of the 8 pools are <vector>. The quantities of memory in the 8 pools are <vector>". And then you can say "migrate this range of memory to pool 2".

That's just an initial response to the idea. I refuse to invest a serious amount of time in a dead-end idea like CXL memory pooling.
On Tue, Mar 19, 2024 at 06:38:23PM +0000, Matthew Wilcox wrote:
The syscall design is mostly being posted right now to collaborate via public channels, but if the idea is so fundamentally offensive then i'll drop it and relay the opinion accordingly.
The syscall design is wrong. Exposing physical addresses to userspace is never the right answer. Think rowhammer.
1) The syscall does not expose physical addresses information, it consumes it.
2) The syscall does not allow the user to select target physical address only the target node. Now, that said, if source-pages are zeroed on migration, that's definitely a concern. I did not see this to be the case, however, and the frequency of write required to make use of that for rowhammer seems to be a mitigating factor.
3) There exist 4 interfaces which do expose physical address information: /proc/pid/pagemap, perf / IBS and PEBS, zoneinfo, and /sys/kernel/mm/page_idle (PFNs).
4) The syscall requires CAP_SYS_ADMIN because these other sources require the same, though as v1/v2 discussed there could be an argument for CAP_SYS_NICE.
I'm vehemently opposed to all of the bullshit around CXL. However, if you are going to propose something, it should be based around an abstraction. Say "We have 8 pools of memory. This VMA is backed by memory from pools 3 & 6. The relative hotness of the 8 pools are <vector>. The quantities of memory in the 8 pools are <vector>". And then you can say "migrate this range of memory to pool 2".
That's just an initial response to the idea. I refuse to invest a serious amount of time in a dead-end idea like CXL memory pooling.
Who said anything about pools? Local memory expanders are capable of hosting hotness tracking offload.
~Gregory
Gregory Price gourry.memverge@gmail.com writes:
v3:
- pull forward to v6.8
- style and small fixups recommended by jcameron
- update syscall number (will do all archs when RFC tag drops)
- update for new folio code
- added OCP link to device-tracked address hotness proposal
- kept void* over __u64 simply because it integrates cleanly with existing migration code. If there's strong opinions, I can refactor.
This patch set is a proposal for a syscall analogous to move_pages, that migrates pages between NUMA nodes using physical addressing.
The intent is to better enable user-land system-wide memory tiering as CXL devices begin to provide memory resources on the PCIe bus.
For example, user-land software which is making decisions based on data sources which expose physical address information no longer must convert that information to virtual addressing to act upon it (see background for info on how physical addresses are acquired).
The syscall requires CAP_SYS_ADMIN, since physical address source information is typically protected by the same (or CAP_SYS_NICE).
This patch set is broken into 3 patches:
- refactor of existing migration code for code reuse
- The sys_move_phys_pages system call.
- ktest of the syscall
The sys_move_phys_pages system call validates that the page may be migrated by checking the migratable status of each vma mapping the page, and the intersection of the cpuset policies of each vma's task.
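For reference, the intended userspace invocation mirrors move_pages(2) minus the pid argument. A minimal ctypes sketch of how the arguments would be packed (the helper name is mine, and the exact syscall signature is an assumption for illustration since the number is not yet assigned):

```python
import ctypes

def build_move_phys_pages_args(phys_addrs, target_nodes):
    """Pack arguments in the same shape move_pages(2) uses, minus pid.

    phys_addrs: list of physical addresses (as integers)
    target_nodes: list of destination NUMA node ids, one per page
    Returns (count, pages, nodes, status) ready to hand to syscall().
    """
    assert len(phys_addrs) == len(target_nodes)
    n = len(phys_addrs)
    pages = (ctypes.c_void_p * n)(*phys_addrs)   # void *pages[] as in the proposal
    nodes = (ctypes.c_int * n)(*target_nodes)    # int nodes[]
    status = (ctypes.c_int * n)()                # int status[], filled by the kernel
    return n, pages, nodes, status
```

The actual call would then be something like `libc.syscall(NR_move_phys_pages, n, pages, nodes, status, flags)`, where the syscall number is arch-specific and hypothetical until the RFC tag drops.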
Background:
Userspace job schedulers, memory managers, and tiering software solutions depend on page migration syscalls to reallocate resources across NUMA nodes. Currently, these calls enable movement of memory associated with a specific PID. Moves can be requested in coarse, process-sized strokes (as with migrate_pages), and on specific virtual pages (via move_pages).
However, a number of profiling mechanisms provide system-wide information that would benefit from a physical-addressing version of move_pages.
There are presently at least 4 ways userland can acquire physical address information for use with this interface, and 1 hardware offload mechanism being proposed by opencompute.
/proc/pid/pagemap: can be used to do page table translations. This is only really useful for testing, and the ktest was written using this functionality.
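To illustrate the pagemap mechanism: each virtual page has one 64-bit entry at offset (vaddr / PAGE_SIZE) * 8, with bit 63 flagging presence and bits 0-54 holding the PFN (non-zero PFNs require CAP_SYS_ADMIN). A sketch of the arithmetic, with 4K pages assumed:

```python
PAGE_SIZE = 4096
PAGEMAP_PRESENT = 1 << 63          # bit 63: page present in RAM
PAGEMAP_PFN_MASK = (1 << 55) - 1   # bits 0-54: page frame number

def pagemap_offset(vaddr):
    # Each virtual page has one 8-byte entry in /proc/<pid>/pagemap.
    return (vaddr // PAGE_SIZE) * 8

def entry_to_phys(entry, vaddr):
    # Recover the full physical address from a pagemap entry,
    # re-attaching the offset within the page.
    if not entry & PAGEMAP_PRESENT:
        return None   # not resident; nothing to migrate
    pfn = entry & PAGEMAP_PFN_MASK
    return pfn * PAGE_SIZE + (vaddr % PAGE_SIZE)
```

A real user would open /proc/self/pagemap, seek to `pagemap_offset(vaddr)`, read 8 bytes, and feed the little-endian value to `entry_to_phys`.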
X86: IBS (AMD) and PEBS (Intel) can be configured to return physical and/or virtual address information.
zoneinfo: /proc/zoneinfo exposes the start PFN of zones
/sys/kernel/mm/page_idle: A way to query whether a PFN is idle. So long as the page size is known, this can be used to identify system-wide idle pages that could be migrated to lower tiers.
https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
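Per the doc above, the page_idle bitmap packs one bit per PFN into 8-byte little-endian words, so locating a PFN's idle bit is simple arithmetic. A hedged sketch of just that indexing (reading the file itself needs CAP_SYS_ADMIN):

```python
def idle_bitmap_offset(pfn):
    # /sys/kernel/mm/page_idle/bitmap packs 64 PFNs per 8-byte word.
    return (pfn // 64) * 8

def pfn_is_idle(word, pfn):
    # Test this PFN's bit within the 64-bit word read at that offset.
    return bool((word >> (pfn % 64)) & 1)
```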
CXL Offloaded Hotness Monitoring (Proposed): a CXL memory device may provide hot/cold information about its memory. For example, it may report the hottest device physical addresses (0-based) or host physical addresses (if it has access to the decoders needed to convert its base addresses).
A DPA can be cheaply converted to an HPA by combining it with information exposed under /sys/bus/cxl/ (region address bases).
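For the simple non-interleaved case that conversion is pure arithmetic; a hedged sketch (interleaved regions need the full decoder math, and where the base/size values come from in sysfs is left abstract here):

```python
def dpa_to_hpa(dpa, region_base, region_size):
    """Convert a device physical address to a host physical address
    for a single-device, non-interleaved CXL region.

    region_base/region_size: the region's HPA base and length, as
    read from the region attributes under /sys/bus/cxl/.
    """
    if not 0 <= dpa < region_size:
        raise ValueError("DPA outside the region backed by this device")
    # Direct-mapped case: the region starts at DPA 0 on the device.
    return region_base + dpa
```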
See: https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-...
Information from these sources facilitates system-wide resource management, but because migrate_pages and move_pages apply only to individual tasks, their outputs must be converted back to virtual addresses and re-associated with specific PIDs.
Doing this reverse-translation outside of the kernel requires considerable space and compute, and it will have to be performed again by the existing system calls. Much of this work can be avoided if the pages can be migrated directly with physical memory addressing.
One difficulty with the physical-address approach is that we lack the user-space-specified policy information needed to make decisions. For example, users may want to pin some pages in DRAM to improve latency, or pin some pages in CXL memory for best-effort work. To make the correct decision, we need the PID and virtual address.
Yes, I found that you have tried to avoid breaking the existing policy in the code. But it seems better to consider the policy beforehand, to avoid making the wrong decision in the first place.
-- Best Regards, Huang, Ying
On Wed, Mar 20, 2024 at 10:48:44AM +0800, Huang, Ying wrote:
Gregory Price gourry.memverge@gmail.com writes:
Doing this reverse-translation outside of the kernel requires considerable space and compute, and it will have to be performed again by the existing system calls. Much of this work can be avoided if the pages can be migrated directly with physical memory addressing.
One difficulty with the physical-address approach is that we lack the user-space-specified policy information needed to make decisions. For example, users may want to pin some pages in DRAM to improve latency, or pin some pages in CXL memory for best-effort work. To make the correct decision, we need the PID and virtual address.
I think of this as a second or third order problem. The core problem right now isn't the practicality of how userland would actually use this interface - the core problem is whether the data generated by offloaded monitoring is even worth collecting and operating on in the first place.
So this is a quick hack to do some research about whether it's even worth developing the whole abstraction described by Willy.
This is why it's labeled RFC. I upped a v3 because I know of two groups actively looking at using it for research, and because the folio updates broke the old version. It's also easier for me to engage through the list than via private channels for this particular work.
Do I suggest we merge this interface as-is? No, too many concerns about side channels. However, it's a clean reuse of move_pages code to bootstrap the investigation, and it at least gets the gears turning.
Example notes from a sidebar earlier today:
* An interesting proposal from Dan Williams would be to provide some sort of `/sys/.../memory_tiering/tierN/promote_hot` interface, with a callback mechanism into the relevant hardware drivers that allows for this to be abstracted. This could be done on some interval and some threshold (# pages, hotness threshold, etc).
The code to execute promotions ends up looking like what I have now
1) Validate the page is eligible to be promoted by walking the vmas 2) Invoke the existing move_pages code
The above idea can be implemented trivially in userland without having to plumb through a whole brand new callback system.
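i.e. the userland side reduces to a filter-and-sort over whatever hotness list the device reports, followed by a migration call. A rough sketch (the input format, a map of physical address to hotness score, is an assumption about what a device might report):

```python
def select_promotions(hotness, threshold, max_pages):
    """Pick the hottest physical pages at or above a hotness threshold.

    hotness: {phys_addr: hotness_score} as reported by the device
    Returns up to max_pages addresses, hottest first, ready to feed
    to a move_phys_pages-style migration call.
    """
    hot = [(addr, score) for addr, score in hotness.items() if score >= threshold]
    hot.sort(key=lambda pair: pair[1], reverse=True)
    return [addr for addr, _ in hot[:max_pages]]
```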
Sometimes you have to post stupid ideas to get to the good ones :]
~Gregory
Gregory Price gregory.price@memverge.com writes:
On Wed, Mar 20, 2024 at 10:48:44AM +0800, Huang, Ying wrote:
Gregory Price gourry.memverge@gmail.com writes:
Doing this reverse-translation outside of the kernel requires considerable space and compute, and it will have to be performed again by the existing system calls. Much of this work can be avoided if the pages can be migrated directly with physical memory addressing.
One difficulty with the physical-address approach is that we lack the user-space-specified policy information needed to make decisions. For example, users may want to pin some pages in DRAM to improve latency, or pin some pages in CXL memory for best-effort work. To make the correct decision, we need the PID and virtual address.
I think of this as a second or third order problem. The core problem right now isn't the practicality of how userland would actually use this interface - the core problem is whether the data generated by offloaded monitoring is even worth collecting and operating on in the first place.
So this is a quick hack to do some research about whether it's even worth developing the whole abstraction described by Willy.
This is why it's labeled RFC. I upped a v3 because I know of two groups actively looking at using it for research, and because the folio updates broke the old version. It's also easier for me to engage through the list than via private channels for this particular work.
Do I suggest we merge this interface as-is? No, too many concerns about side channels. However, it's a clean reuse of move_pages code to bootstrap the investigation, and it at least gets the gears turning.
Got it! Thanks for detailed explanation.
I think that one of the difficulties of offloaded monitoring is that it's hard to obey these user-specified policies. The policies may become more complex in the future, for example, allocating DRAM among workloads.
Example notes from a sidebar earlier today:
- An interesting proposal from Dan Williams would be to provide some sort of `/sys/.../memory_tiering/tierN/promote_hot` interface, with a callback mechanism into the relevant hardware drivers that allows for this to be abstracted. This could be done on some interval and some threshold (# pages, hotness threshold, etc).
The code to execute promotions ends up looking like what I have now
- Validate the page is eligible to be promoted by walking the vmas
- Invoke the existing move_pages code
The above idea can be implemented trivially in userland without having to plumb through a whole brand new callback system.
Sometimes you have to post stupid ideas to get to the good ones :]
-- Best Regards, Huang, Ying