From: Zi Yan ziy@nvidia.com
Hi all,
With Matthew's THP in pagecache patches[1], we will be able to handle any size pagecache THPs, but currently split_huge_page can only split a THP to order-0 pages. This can easily erase the benefit of having pagecache THPs when operations like truncate might want to keep pages larger than order-0. In response, here are the patches to add support for splitting a THP into pages of any lower order. In addition, this patchset prepares for my PUD THP patchset[2]: splitting a PUD THP into multiple PMD THPs can be handled by the split_huge_page_to_list_to_order function added by this patchset, which avoids a lot of redundant code that would otherwise come from replicating split_huge_page for PUD THPs.
The patchset is on top of Matthew's pagecache/next tree[3].
To ease testing of the split_huge_page functions, I added a new debugfs interface at <debugfs>/split_huge_pages_in_range_pid, so developers can split THPs in a given virtual address range of the process with the given pid by writing "<pid>,<vaddr_start>,<vaddr_end>,<to_order>" to the interface. I also added a new test program to test 1) splitting PMD THPs, 2) splitting pagecache THPs to any lower order, and 3) truncating a pagecache THP to a page with a lower order.
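For example, a test could drive the interface along these lines (an illustrative, untested userspace sketch; split_range is a hypothetical helper, the debugfs path matches the one added in patch 1, and the <to_order> field only takes effect once the last patch is applied):

    #include <stdio.h>
    #include <unistd.h>

    /* Split every THP mapped in [start, end) of the current process down to
     * pages of order to_order. */
    static void split_range(void *start, void *end, int to_order)
    {
            FILE *f = fopen("/sys/kernel/debug/split_huge_pages_in_range_pid", "w");

            if (!f)
                    return;
            /* Addresses are written in hex without a 0x prefix, matching the
             * sscanf() format used by the debugfs handler. */
            fprintf(f, "%d,%lx,%lx,%d", getpid(), (unsigned long)start,
                    (unsigned long)end, to_order);
            fclose(f);
    }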
Suggestions and comments are welcome. Thanks.
[1] https://lore.kernel.org/linux-mm/20201029193405.29125-1-willy@infradead.org/ [2] https://lore.kernel.org/linux-mm/20200928175428.4110504-1-zi.yan@sent.com/ [3] https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/next
Zi Yan (6): mm: huge_memory: add new debugfs interface to trigger split huge page on any page range. mm: memcg: make memcg huge page split support any order split. mm: page_owner: add support for splitting to any order in split page_owner. mm: thp: add support for split huge page to any lower order pages. mm: truncate: split thp to a non-zero order if possible. mm: huge_memory: enable debugfs to split huge pages to any order.
include/linux/huge_mm.h | 8 + include/linux/memcontrol.h | 5 +- include/linux/page_owner.h | 7 +- mm/huge_memory.c | 177 ++++++++++-- mm/internal.h | 1 + mm/memcontrol.c | 4 +- mm/migrate.c | 2 +- mm/page_alloc.c | 2 +- mm/page_owner.c | 6 +- mm/swap.c | 1 - mm/truncate.c | 22 +- tools/testing/selftests/vm/Makefile | 1 + .../selftests/vm/split_huge_page_test.c | 255 ++++++++++++++++++ 13 files changed, 453 insertions(+), 38 deletions(-) create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
-- 2.28.0
From: Zi Yan ziy@nvidia.com
Huge pages in the process with the given pid and within the given virtual address range are split. This new debugfs interface is used to test the split_huge_page function. In addition, a testing program is added to tools/testing/selftests/vm to utilize the interface by splitting PMD THPs.
Signed-off-by: Zi Yan ziy@nvidia.com --- mm/huge_memory.c | 98 +++++++++++ mm/internal.h | 1 + mm/migrate.c | 2 +- tools/testing/selftests/vm/Makefile | 1 + .../selftests/vm/split_huge_page_test.c | 161 ++++++++++++++++++ 5 files changed, 262 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 207ebca8c654..c4fead5ead31 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -7,6 +7,7 @@
#include <linux/mm.h> #include <linux/sched.h> +#include <linux/sched/mm.h> #include <linux/sched/coredump.h> #include <linux/sched/numa_balancing.h> #include <linux/highmem.h> @@ -2935,10 +2936,107 @@ static int split_huge_pages_set(void *data, u64 val) DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set, "%llu\n");
+static ssize_t split_huge_pages_in_range_pid_write(struct file *file, + const char __user *buf, size_t count, loff_t *ppops) +{ + static DEFINE_MUTEX(mutex); + ssize_t ret; + char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */ + int pid; + unsigned long vaddr_start, vaddr_end, addr; + nodemask_t task_nodes; + struct mm_struct *mm; + + ret = mutex_lock_interruptible(&mutex); + if (ret) + return ret; + + ret = -EFAULT; + + memset(input_buf, 0, 80); + if (copy_from_user(input_buf, buf, min_t(size_t, count, 80))) + goto out; + + input_buf[80] = '\0'; + ret = sscanf(input_buf, "%d,%lx,%lx", &pid, &vaddr_start, &vaddr_end); + if (ret != 3) { + ret = -EINVAL; + goto out; + } + vaddr_start &= PAGE_MASK; + vaddr_end &= PAGE_MASK; + + ret = strlen(input_buf); + pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n", + pid, vaddr_start, vaddr_end); + + mm = find_mm_struct(pid, &task_nodes); + if (IS_ERR(mm)) { + ret = -EINVAL; + goto out; + } + + mmap_read_lock(mm); + for (addr = vaddr_start; addr < vaddr_end;) { + struct vm_area_struct *vma = find_vma(mm, addr); + unsigned int follflags; + struct page *page; + + if (!vma || addr < vma->vm_start || !vma_migratable(vma)) + break; + + /* FOLL_DUMP to ignore special (like zero) pages */ + follflags = FOLL_GET | FOLL_DUMP; + page = follow_page(vma, addr, follflags); + + if (IS_ERR(page)) + break; + if (!page) + break; + + if (!is_transparent_hugepage(page)) + goto next; + + if (!can_split_huge_page(page, NULL)) + goto next; + + if (!trylock_page(page)) + goto next; + + addr += page_size(page) - PAGE_SIZE; + + /* reset addr if split fails */ + if (split_huge_page(page)) + addr -= (page_size(page) - PAGE_SIZE); + + unlock_page(page); +next: + /* next page */ + addr += page_size(page); + put_page(page); + } + mmap_read_unlock(mm); + + + mmput(mm); +out: + mutex_unlock(&mutex); + return ret; + +} + +static const struct file_operations split_huge_pages_in_range_pid_fops = { + .owner = THIS_MODULE, + .write = split_huge_pages_in_range_pid_write, + .llseek = no_llseek, +}; + static int __init split_huge_pages_debugfs(void) { debugfs_create_file("split_huge_pages", 0200, NULL, NULL, &split_huge_pages_fops); + debugfs_create_file("split_huge_pages_in_range_pid", 0200, NULL, NULL, + &split_huge_pages_in_range_pid_fops); return 0; } late_initcall(split_huge_pages_debugfs); diff --git a/mm/internal.h b/mm/internal.h index 3ea43642b99d..fd841a38830f 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -624,4 +624,5 @@ struct migration_target_control {
bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end); void page_cache_free_page(struct address_space *mapping, struct page *page); +struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes); #endif /* __MM_INTERNAL_H */ diff --git a/mm/migrate.c b/mm/migrate.c index a50bbb0e029b..e35654d1087d 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1851,7 +1851,7 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages, return nr_pages ? -EFAULT : 0; }
-static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes) +struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes) { struct task_struct *task; struct mm_struct *mm; diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 62fb15f286ee..d9ead0cdd3e9 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -42,6 +42,7 @@ TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += userfaultfd +TEST_GEN_FILES += split_huge_page_test
ifeq ($(ARCH),x86_64) CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh $(CC) ../x86/trivial_32bit_program.c -m32) diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c new file mode 100644 index 000000000000..c8a32ae9e13a --- /dev/null +++ b/tools/testing/selftests/vm/split_huge_page_test.c @@ -0,0 +1,161 @@ +// SPDX-License-Identifier: GPL-2.0 + +#define _GNU_SOURCE +#include <stdio.h> +#include <stdlib.h> +#include "numa.h" +#include <unistd.h> +#include <errno.h> +#include <inttypes.h> +#include <string.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <sys/mman.h> +#include <sys/time.h> +#include <sys/wait.h> +#include <malloc.h> +#include <stdbool.h> + +#define PAGE_4KB (4096UL) +#define PAGE_2MB (512UL*PAGE_4KB) +#define PAGE_1GB (512UL*PAGE_2MB) + +#define PRESENT_MASK (1UL<<63) +#define SWAPPED_MASK (1UL<<62) +#define PAGE_TYPE_MASK (1UL<<61) +#define PFN_MASK ((1UL<<55)-1) + +#define KPF_THP (1UL<<22) +#define KPF_PUD_THP (1UL<<27) + +#define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages_in_range_pid" +#define SMAP_PATH "/proc/self/smaps" +#define INPUT_MAX 80 + +static int write_file(const char *path, const char *buf, size_t buflen) +{ + int fd; + ssize_t numwritten; + + fd = open(path, O_WRONLY); + if (fd == -1) + return 0; + + numwritten = write(fd, buf, buflen - 1); + close(fd); + if (numwritten < 1) + return 0; + + return (unsigned int) numwritten; +} + +static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end) +{ + char input[INPUT_MAX]; + int ret; + + ret = snprintf(input, INPUT_MAX, "%d,%lx,%lx", pid, vaddr_start, + vaddr_end); + if (ret >= INPUT_MAX) { + printf("%s: Debugfs input is too long\n", __func__); + exit(EXIT_FAILURE); + } + + if (!write_file(SPLIT_DEBUGFS, input, ret + 1)) { + perror(SPLIT_DEBUGFS); + exit(EXIT_FAILURE); + } +} + +#define MAX_LINE_LENGTH 500 + +static bool check_for_pattern(FILE *fp, char *pattern, char *buf) +{ + while (fgets(buf, MAX_LINE_LENGTH, fp) != NULL) { + if (!strncmp(buf, pattern, strlen(pattern))) + return true; + } + return false; +} + +static uint64_t check_huge(void *addr) +{ + uint64_t thp = 0; + int ret; + FILE *fp; + char buffer[MAX_LINE_LENGTH]; + char addr_pattern[MAX_LINE_LENGTH]; + + ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "%08lx-", + (unsigned long) addr); + if (ret >= MAX_LINE_LENGTH) { + printf("%s: Pattern is too long\n", __func__); + exit(EXIT_FAILURE); + } + + + fp = fopen(SMAP_PATH, "r"); + if (!fp) { + printf("%s: Failed to open file %s\n", __func__, SMAP_PATH); + exit(EXIT_FAILURE); + } + if (!check_for_pattern(fp, addr_pattern, buffer)) + goto err_out; + + /* + * Fetch the AnonHugePages: in the same block and check the number of + * hugepages. 
+ */ + if (!check_for_pattern(fp, "AnonHugePages:", buffer)) + goto err_out; + + if (sscanf(buffer, "AnonHugePages:%10ld kB", &thp) != 1) { + printf("Reading smap error\n"); + exit(EXIT_FAILURE); + } + +err_out: + fclose(fp); + return thp; +} + +void split_pmd_thp(void) +{ + char *one_page; + size_t len = 4 * PAGE_2MB; + uint64_t thp_size; + + one_page = memalign(PAGE_1GB, len); + + madvise(one_page, len, MADV_HUGEPAGE); + + memset(one_page, 1, len); + + thp_size = check_huge(one_page); + if (!thp_size) { + printf("No THP is allocatd"); + exit(EXIT_FAILURE); + } + + /* split all possible huge pages */ + write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len); + + *one_page = 0; + + thp_size = check_huge(one_page); + if (thp_size) { + printf("Still %ld kB AnonHugePages not split\n", thp_size); + exit(EXIT_FAILURE); + } + + printf("Split huge pages successful\n"); + free(one_page); +} + +int main(int argc, char **argv) +{ + split_pmd_thp(); + + return 0; +}
On 11/11/20 12:40 PM, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
Huge pages in the process with the given pid and virtual address range are split. It is used to test split huge page function. In addition, a testing program is added to tools/testing/selftests/vm to utilize the interface by splitting PMD THPs.
Signed-off-by: Zi Yan ziy@nvidia.com
mm/huge_memory.c | 98 +++++++++++ mm/internal.h | 1 + mm/migrate.c | 2 +- tools/testing/selftests/vm/Makefile | 1 + .../selftests/vm/split_huge_page_test.c | 161 ++++++++++++++++++ 5 files changed, 262 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
Don't forget to update ".gitignore" to include "split_huge_page_test".
On 12 Nov 2020, at 17:22, Ralph Campbell wrote:
On 11/11/20 12:40 PM, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
Huge pages in the process with the given pid and virtual address range are split. It is used to test split huge page function. In addition, a testing program is added to tools/testing/selftests/vm to utilize the interface by splitting PMD THPs.
Signed-off-by: Zi Yan ziy@nvidia.com
mm/huge_memory.c | 98 +++++++++++ mm/internal.h | 1 + mm/migrate.c | 2 +- tools/testing/selftests/vm/Makefile | 1 + .../selftests/vm/split_huge_page_test.c | 161 ++++++++++++++++++ 5 files changed, 262 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
Don't forget to update ".gitignore" to include "split_huge_page_test".
Sure. Thanks for pointing this out.
— Best Regards, Yan Zi
On Wed, Nov 11, 2020 at 03:40:03PM -0500, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
Huge pages in the process with the given pid and virtual address range are split. It is used to test split huge page function. In addition, a testing program is added to tools/testing/selftests/vm to utilize the interface by splitting PMD THPs.
Signed-off-by: Zi Yan ziy@nvidia.com
mm/huge_memory.c | 98 +++++++++++ mm/internal.h | 1 + mm/migrate.c | 2 +- tools/testing/selftests/vm/Makefile | 1 + .../selftests/vm/split_huge_page_test.c | 161 ++++++++++++++++++ 5 files changed, 262 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 207ebca8c654..c4fead5ead31 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -7,6 +7,7 @@ #include <linux/mm.h> #include <linux/sched.h> +#include <linux/sched/mm.h> #include <linux/sched/coredump.h> #include <linux/sched/numa_balancing.h> #include <linux/highmem.h> @@ -2935,10 +2936,107 @@ static int split_huge_pages_set(void *data, u64 val) DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set, "%llu\n"); +static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
const char __user *buf, size_t count, loff_t *ppops)
+{
- static DEFINE_MUTEX(mutex);
- ssize_t ret;
- char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
- int pid;
- unsigned long vaddr_start, vaddr_end, addr;
- nodemask_t task_nodes;
- struct mm_struct *mm;
- ret = mutex_lock_interruptible(&mutex);
- if (ret)
return ret;
- ret = -EFAULT;
- memset(input_buf, 0, 80);
- if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
goto out;
- input_buf[80] = '\0';
Hm. Out-of-buffer access?
- ret = sscanf(input_buf, "%d,%lx,%lx", &pid, &vaddr_start, &vaddr_end);
Why hex without 0x prefix?
- if (ret != 3) {
ret = -EINVAL;
goto out;
- }
- vaddr_start &= PAGE_MASK;
- vaddr_end &= PAGE_MASK;
- ret = strlen(input_buf);
- pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
pid, vaddr_start, vaddr_end);
- mm = find_mm_struct(pid, &task_nodes);
I don't follow why you need nodemask.
- if (IS_ERR(mm)) {
ret = -EINVAL;
goto out;
- }
- mmap_read_lock(mm);
- for (addr = vaddr_start; addr < vaddr_end;) {
struct vm_area_struct *vma = find_vma(mm, addr);
unsigned int follflags;
struct page *page;
if (!vma || addr < vma->vm_start || !vma_migratable(vma))
break;
/* FOLL_DUMP to ignore special (like zero) pages */
follflags = FOLL_GET | FOLL_DUMP;
page = follow_page(vma, addr, follflags);
if (IS_ERR(page))
break;
if (!page)
break;
if (!is_transparent_hugepage(page))
goto next;
if (!can_split_huge_page(page, NULL))
goto next;
if (!trylock_page(page))
goto next;
addr += page_size(page) - PAGE_SIZE;
Who said it was mapped as huge? mremap() allows constructing a PTE page table filled with PTE-mapped THPs, each of them distinct.
/* reset addr if split fails */
if (split_huge_page(page))
addr -= (page_size(page) - PAGE_SIZE);
unlock_page(page);
+next:
/* next page */
addr += page_size(page);
Isn't addr incremented a second time if split_huge_page() succeeds?
put_page(page);
- }
- mmap_read_unlock(mm);
- mmput(mm);
+out:
- mutex_unlock(&mutex);
- return ret;
+}
+static const struct file_operations split_huge_pages_in_range_pid_fops = {
- .owner = THIS_MODULE,
- .write = split_huge_pages_in_range_pid_write,
- .llseek = no_llseek,
+};
static int __init split_huge_pages_debugfs(void) { debugfs_create_file("split_huge_pages", 0200, NULL, NULL, &split_huge_pages_fops);
- debugfs_create_file("split_huge_pages_in_range_pid", 0200, NULL, NULL,
&split_huge_pages_in_range_pid_fops); return 0;
} late_initcall(split_huge_pages_debugfs); diff --git a/mm/internal.h b/mm/internal.h index 3ea43642b99d..fd841a38830f 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -624,4 +624,5 @@ struct migration_target_control { bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end); void page_cache_free_page(struct address_space *mapping, struct page *page); +struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes); #endif /* __MM_INTERNAL_H */ diff --git a/mm/migrate.c b/mm/migrate.c index a50bbb0e029b..e35654d1087d 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1851,7 +1851,7 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages, return nr_pages ? -EFAULT : 0; } -static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes) +struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes) { struct task_struct *task; struct mm_struct *mm; diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 62fb15f286ee..d9ead0cdd3e9 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -42,6 +42,7 @@ TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += userfaultfd +TEST_GEN_FILES += split_huge_page_test ifeq ($(ARCH),x86_64) CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh $(CC) ../x86/trivial_32bit_program.c -m32) diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c new file mode 100644 index 000000000000..c8a32ae9e13a --- /dev/null +++ b/tools/testing/selftests/vm/split_huge_page_test.c @@ -0,0 +1,161 @@ +// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE +#include <stdio.h> +#include <stdlib.h> +#include "numa.h" +#include <unistd.h> +#include <errno.h> +#include <inttypes.h> +#include <string.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <sys/mman.h> +#include <sys/time.h> +#include <sys/wait.h> +#include <malloc.h> +#include <stdbool.h>
+#define PAGE_4KB (4096UL) +#define PAGE_2MB (512UL*PAGE_4KB) +#define PAGE_1GB (512UL*PAGE_2MB)
+#define PRESENT_MASK (1UL<<63) +#define SWAPPED_MASK (1UL<<62) +#define PAGE_TYPE_MASK (1UL<<61) +#define PFN_MASK ((1UL<<55)-1)
+#define KPF_THP (1UL<<22) +#define KPF_PUD_THP (1UL<<27)
+#define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages_in_range_pid" +#define SMAP_PATH "/proc/self/smaps" +#define INPUT_MAX 80
+static int write_file(const char *path, const char *buf, size_t buflen) +{
- int fd;
- ssize_t numwritten;
- fd = open(path, O_WRONLY);
- if (fd == -1)
return 0;
- numwritten = write(fd, buf, buflen - 1);
- close(fd);
- if (numwritten < 1)
return 0;
- return (unsigned int) numwritten;
+}
+static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end) +{
- char input[INPUT_MAX];
- int ret;
- ret = snprintf(input, INPUT_MAX, "%d,%lx,%lx", pid, vaddr_start,
vaddr_end);
- if (ret >= INPUT_MAX) {
printf("%s: Debugfs input is too long\n", __func__);
exit(EXIT_FAILURE);
- }
- if (!write_file(SPLIT_DEBUGFS, input, ret + 1)) {
perror(SPLIT_DEBUGFS);
exit(EXIT_FAILURE);
- }
+}
+#define MAX_LINE_LENGTH 500
+static bool check_for_pattern(FILE *fp, char *pattern, char *buf) +{
- while (fgets(buf, MAX_LINE_LENGTH, fp) != NULL) {
if (!strncmp(buf, pattern, strlen(pattern)))
return true;
- }
- return false;
+}
+static uint64_t check_huge(void *addr) +{
- uint64_t thp = 0;
- int ret;
- FILE *fp;
- char buffer[MAX_LINE_LENGTH];
- char addr_pattern[MAX_LINE_LENGTH];
- ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "%08lx-",
(unsigned long) addr);
- if (ret >= MAX_LINE_LENGTH) {
printf("%s: Pattern is too long\n", __func__);
exit(EXIT_FAILURE);
- }
- fp = fopen(SMAP_PATH, "r");
- if (!fp) {
printf("%s: Failed to open file %s\n", __func__, SMAP_PATH);
exit(EXIT_FAILURE);
- }
- if (!check_for_pattern(fp, addr_pattern, buffer))
goto err_out;
- /*
* Fetch the AnonHugePages: in the same block and check the number of
* hugepages.
*/
- if (!check_for_pattern(fp, "AnonHugePages:", buffer))
goto err_out;
- if (sscanf(buffer, "AnonHugePages:%10ld kB", &thp) != 1) {
printf("Reading smap error\n");
exit(EXIT_FAILURE);
- }
+err_out:
- fclose(fp);
- return thp;
+}
+void split_pmd_thp(void) +{
- char *one_page;
- size_t len = 4 * PAGE_2MB;
- uint64_t thp_size;
- one_page = memalign(PAGE_1GB, len);
- madvise(one_page, len, MADV_HUGEPAGE);
- memset(one_page, 1, len);
- thp_size = check_huge(one_page);
- if (!thp_size) {
printf("No THP is allocatd");
exit(EXIT_FAILURE);
- }
- /* split all possible huge pages */
- write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len);
- *one_page = 0;
- thp_size = check_huge(one_page);
- if (thp_size) {
printf("Still %ld kB AnonHugePages not split\n", thp_size);
exit(EXIT_FAILURE);
- }
- printf("Split huge pages successful\n");
- free(one_page);
+}
+int main(int argc, char **argv) +{
- split_pmd_thp();
- return 0;
+}
2.28.0
On 16 Nov 2020, at 11:06, Kirill A. Shutemov wrote:
On Wed, Nov 11, 2020 at 03:40:03PM -0500, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
Huge pages in the process with the given pid and virtual address range are split. It is used to test split huge page function. In addition, a testing program is added to tools/testing/selftests/vm to utilize the interface by splitting PMD THPs.
Signed-off-by: Zi Yan ziy@nvidia.com
mm/huge_memory.c | 98 +++++++++++ mm/internal.h | 1 + mm/migrate.c | 2 +- tools/testing/selftests/vm/Makefile | 1 + .../selftests/vm/split_huge_page_test.c | 161 ++++++++++++++++++ 5 files changed, 262 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 207ebca8c654..c4fead5ead31 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -7,6 +7,7 @@
#include <linux/mm.h> #include <linux/sched.h> +#include <linux/sched/mm.h> #include <linux/sched/coredump.h> #include <linux/sched/numa_balancing.h> #include <linux/highmem.h> @@ -2935,10 +2936,107 @@ static int split_huge_pages_set(void *data, u64 val) DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set, "%llu\n");
+static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
const char __user *buf, size_t count, loff_t *ppops)
+{
- static DEFINE_MUTEX(mutex);
- ssize_t ret;
- char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
- int pid;
- unsigned long vaddr_start, vaddr_end, addr;
- nodemask_t task_nodes;
- struct mm_struct *mm;
- ret = mutex_lock_interruptible(&mutex);
- if (ret)
return ret;
- ret = -EFAULT;
- memset(input_buf, 0, 80);
- if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
goto out;
- input_buf[80] = '\0';
Hm. Out-of-buffer access?
Sorry. Will fix it.
- ret = sscanf(input_buf, "%d,%lx,%lx", &pid, &vaddr_start, &vaddr_end);
Why hex without 0x prefix?
No particular reason. Let me add the prefix.
- if (ret != 3) {
ret = -EINVAL;
goto out;
- }
- vaddr_start &= PAGE_MASK;
- vaddr_end &= PAGE_MASK;
- ret = strlen(input_buf);
- pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
pid, vaddr_start, vaddr_end);
- mm = find_mm_struct(pid, &task_nodes);
I don't follow why you need nodemask.
I don’t need it. I just reuse the find_mm_struct function from mm/migrate.c.
- if (IS_ERR(mm)) {
ret = -EINVAL;
goto out;
- }
- mmap_read_lock(mm);
- for (addr = vaddr_start; addr < vaddr_end;) {
struct vm_area_struct *vma = find_vma(mm, addr);
unsigned int follflags;
struct page *page;
if (!vma || addr < vma->vm_start || !vma_migratable(vma))
break;
/* FOLL_DUMP to ignore special (like zero) pages */
follflags = FOLL_GET | FOLL_DUMP;
page = follow_page(vma, addr, follflags);
if (IS_ERR(page))
break;
if (!page)
break;
if (!is_transparent_hugepage(page))
goto next;
if (!can_split_huge_page(page, NULL))
goto next;
if (!trylock_page(page))
goto next;
addr += page_size(page) - PAGE_SIZE;
Who said it was mapped as huge? mremap() allows constructing a PTE page table filled with PTE-mapped THPs, each of them distinct.
I forgot about this. I was trying to be smart and skip the rest of the subpages after splitting a THP. I will always increase addr by PAGE_SIZE to handle this situation.
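For reference, a sequence along these lines (an illustrative, untested userspace sketch, assuming x86_64 with 2MB PMDs) can leave PTE-mapped THPs behind, so the compound page size cannot be used as the mapping granularity:

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t thp = 2UL << 20;
            /* Fault in two PMD-mapped THPs (best effort). */
            char *src = mmap(NULL, 2 * thp, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            madvise(src, 2 * thp, MADV_HUGEPAGE);
            memset(src, 1, 2 * thp);

            /* Move the mapping to a destination that is not PMD-aligned: the
             * kernel has to split the PMD mappings, while the underlying
             * compound pages stay intact, i.e. the THPs become PTE-mapped. */
            char *dst = mmap(NULL, 3 * thp, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            mremap(src, 2 * thp, 2 * thp, MREMAP_MAYMOVE | MREMAP_FIXED,
                   dst + 4096);
            return 0;
    }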
/* reset addr if split fails */
if (split_huge_page(page))
addr -= (page_size(page) - PAGE_SIZE);
unlock_page(page);
+next:
/* next page */
addr += page_size(page);
Isn't addr incremented a second time if split_huge_page() succeeds?
If split_huge_page() succeeds, page_size(page) would be PAGE_SIZE and addr was increased by THP size - PAGE_SIZE above, so addr now should be at the end of the original THP.
Anyway, I will change the code to something like:
	/*
	 * always increase addr by PAGE_SIZE, since we could have a PTE page
	 * table filled with PTE-mapped THPs, each of which is distinct.
	 */
	for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
		...

		if (!trylock_page(page))
			continue;

		split_huge_page(page);

		unlock_page(page);
		put_page(page);
	}
	mmap_read_unlock(mm);
Thanks for reviewing the patch.
— Best Regards, Yan Zi
From: Zi Yan ziy@nvidia.com
It makes mem_cgroup_split_huge_fixup() read thp_nr_pages() and split the memcg data according to the provided new_nr instead of always splitting to single pages. It prepares for upcoming changes to support splitting a huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com --- include/linux/memcontrol.h | 5 +++-- mm/huge_memory.c | 2 +- mm/memcontrol.c | 4 ++-- 3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0f4dd7829fb2..b3bac79ceed6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, }
#ifdef CONFIG_TRANSPARENT_HUGEPAGE -void mem_cgroup_split_huge_fixup(struct page *head); +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr); #endif
#else /* CONFIG_MEMCG */ @@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, return 0; }
-static inline void mem_cgroup_split_huge_fixup(struct page *head) +static inline void mem_cgroup_split_huge_fixup(struct page *head, + unsigned int new_nr) { }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c4fead5ead31..f599f5b9bf7f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, lruvec = mem_cgroup_page_lruvec(head, pgdat);
/* complete memcg works before add pages to LRU */ - mem_cgroup_split_huge_fixup(head); + mem_cgroup_split_huge_fixup(head, 1);
if (PageAnon(head) && PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 33f632689cee..e9705ba6bbcc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) * Because tail pages are not marked as "used", set it. We're under * pgdat->lru_lock and migration entries setup in all page mappings. */ -void mem_cgroup_split_huge_fixup(struct page *head) +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr) { struct mem_cgroup *memcg = page_memcg(head); int i; @@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head) if (mem_cgroup_disabled()) return;
- for (i = 1; i < thp_nr_pages(head); i++) { + for (i = new_nr; i < thp_nr_pages(head); i += new_nr) { css_get(&memcg->css); head[i].memcg_data = (unsigned long)memcg; }
On 11/11/20 12:40 PM, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
It reads thp_nr_pages and splits to provided new_nr. It prepares for upcoming changes to support split huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com
Looks OK to me. Reviewed-by: Ralph Campbell rcampbell@nvidia.com
include/linux/memcontrol.h | 5 +++-- mm/huge_memory.c | 2 +- mm/memcontrol.c | 4 ++-- 3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0f4dd7829fb2..b3bac79ceed6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, } #ifdef CONFIG_TRANSPARENT_HUGEPAGE -void mem_cgroup_split_huge_fixup(struct page *head); +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr); #endif #else /* CONFIG_MEMCG */ @@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, return 0; } -static inline void mem_cgroup_split_huge_fixup(struct page *head) +static inline void mem_cgroup_split_huge_fixup(struct page *head,
unsigned int new_nr) { }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c4fead5ead31..f599f5b9bf7f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, lruvec = mem_cgroup_page_lruvec(head, pgdat); /* complete memcg works before add pages to LRU */
- mem_cgroup_split_huge_fixup(head);
- mem_cgroup_split_huge_fixup(head, 1);
if (PageAnon(head) && PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 33f632689cee..e9705ba6bbcc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
- Because tail pages are not marked as "used", set it. We're under
- pgdat->lru_lock and migration entries setup in all page mappings.
*/ -void mem_cgroup_split_huge_fixup(struct page *head) +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr) { struct mem_cgroup *memcg = page_memcg(head); int i; @@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head) if (mem_cgroup_disabled()) return;
- for (i = 1; i < thp_nr_pages(head); i++) {
- for (i = new_nr; i < thp_nr_pages(head); i += new_nr) { css_get(&memcg->css); head[i].memcg_data = (unsigned long)memcg; }
On 12 Nov 2020, at 12:58, Ralph Campbell wrote:
On 11/11/20 12:40 PM, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
It reads thp_nr_pages and splits to provided new_nr. It prepares for upcoming changes to support split huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com
Looks OK to me. Reviewed-by: Ralph Campbell rcampbell@nvidia.com
Thanks.
— Best Regards, Yan Zi
On Wed, Nov 11, 2020 at 03:40:04PM -0500, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
It reads thp_nr_pages and splits to provided new_nr. It prepares for upcoming changes to support split huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com
include/linux/memcontrol.h | 5 +++-- mm/huge_memory.c | 2 +- mm/memcontrol.c | 4 ++-- 3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0f4dd7829fb2..b3bac79ceed6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, } #ifdef CONFIG_TRANSPARENT_HUGEPAGE -void mem_cgroup_split_huge_fixup(struct page *head); +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr); #endif #else /* CONFIG_MEMCG */ @@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, return 0; } -static inline void mem_cgroup_split_huge_fixup(struct page *head) +static inline void mem_cgroup_split_huge_fixup(struct page *head,
unsigned int new_nr)
{ } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c4fead5ead31..f599f5b9bf7f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, lruvec = mem_cgroup_page_lruvec(head, pgdat); /* complete memcg works before add pages to LRU */
- mem_cgroup_split_huge_fixup(head);
- mem_cgroup_split_huge_fixup(head, 1);
if (PageAnon(head) && PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 33f632689cee..e9705ba6bbcc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
- Because tail pages are not marked as "used", set it. We're under
- pgdat->lru_lock and migration entries setup in all page mappings.
*/ -void mem_cgroup_split_huge_fixup(struct page *head) +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr)
I'd go with unsigned int new_order, then it's obvious that we can split the original page without any leftovers.
Other than that the patch looks good! Acked-by: Roman Gushchin guro@fb.com
Thanks!
{ struct mem_cgroup *memcg = page_memcg(head); int i; @@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head) if (mem_cgroup_disabled()) return;
- for (i = 1; i < thp_nr_pages(head); i++) {
- for (i = new_nr; i < thp_nr_pages(head); i += new_nr) { css_get(&memcg->css); head[i].memcg_data = (unsigned long)memcg; }
-- 2.28.0
On 13 Nov 2020, at 19:23, Roman Gushchin wrote:
On Wed, Nov 11, 2020 at 03:40:04PM -0500, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
It reads thp_nr_pages and splits to provided new_nr. It prepares for upcoming changes to support split huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com
include/linux/memcontrol.h | 5 +++-- mm/huge_memory.c | 2 +- mm/memcontrol.c | 4 ++-- 3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0f4dd7829fb2..b3bac79ceed6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, }
#ifdef CONFIG_TRANSPARENT_HUGEPAGE -void mem_cgroup_split_huge_fixup(struct page *head); +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr); #endif
#else /* CONFIG_MEMCG */ @@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, return 0; }
-static inline void mem_cgroup_split_huge_fixup(struct page *head) +static inline void mem_cgroup_split_huge_fixup(struct page *head,
unsigned int new_nr)
{ }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c4fead5ead31..f599f5b9bf7f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, lruvec = mem_cgroup_page_lruvec(head, pgdat);
/* complete memcg works before add pages to LRU */
- mem_cgroup_split_huge_fixup(head);
mem_cgroup_split_huge_fixup(head, 1);
if (PageAnon(head) && PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 33f632689cee..e9705ba6bbcc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
- Because tail pages are not marked as "used", set it. We're under
- pgdat->lru_lock and migration entries setup in all page mappings.
*/ -void mem_cgroup_split_huge_fixup(struct page *head) +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr)
I'd go with unsigned int new_order, then it's obvious that we can split the original page without any leftovers.
Makes sense. Will change it.
Other than that the patch looks good! Acked-by: Roman Gushchin guro@fb.com
Thanks.
{ struct mem_cgroup *memcg = page_memcg(head); int i; @@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head) if (mem_cgroup_disabled()) return;
- for (i = 1; i < thp_nr_pages(head); i++) {
- for (i = new_nr; i < thp_nr_pages(head); i += new_nr) { css_get(&memcg->css); head[i].memcg_data = (unsigned long)memcg; }
-- 2.28.0
— Best Regards, Yan Zi
From: Zi Yan ziy@nvidia.com
It adds a new_order parameter to set the new page order in page owner. It prepares for upcoming changes to support splitting a huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com --- include/linux/page_owner.h | 7 ++++--- mm/huge_memory.c | 2 +- mm/page_alloc.c | 2 +- mm/page_owner.c | 6 +++--- 4 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index 3468794f83d2..215cbb159568 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page, __set_page_owner(page, order, gfp_mask); }
-static inline void split_page_owner(struct page *page, unsigned int nr) +static inline void split_page_owner(struct page *page, unsigned int nr, + unsigned int new_order) { if (static_branch_unlikely(&page_owner_inited)) - __split_page_owner(page, nr); + __split_page_owner(page, nr, new_order); } static inline void copy_page_owner(struct page *oldpage, struct page *newpage) { @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page, { } static inline void split_page_owner(struct page *page, - unsigned int order) + unsigned int nr, unsigned int new_order) { } static inline void copy_page_owner(struct page *oldpage, struct page *newpage) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f599f5b9bf7f..8b7d771ee962 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
ClearPageCompound(head);
- split_page_owner(head, nr); + split_page_owner(head, nr, 1);
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d77220615fd5..a9eead0e091a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i); - split_page_owner(page, 1 << order); + split_page_owner(page, 1 << order, 1); } EXPORT_SYMBOL_GPL(split_page);
diff --git a/mm/page_owner.c b/mm/page_owner.c index b735a8eafcdb..2b7f7e9056dc 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int reason) page_owner->last_migrate_reason = reason; }
-void __split_page_owner(struct page *page, unsigned int nr) +void __split_page_owner(struct page *page, unsigned int nr, unsigned int new_order) { int i; struct page_ext *page_ext = lookup_page_ext(page); @@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr) if (unlikely(!page_ext)) return;
- for (i = 0; i < nr; i++) { + for (i = 0; i < nr; i += (1 << new_order)) { page_owner = get_page_owner(page_ext); - page_owner->order = 0; + page_owner->order = new_order; page_ext = page_ext_next(page_ext); } }
On 11/11/20 12:40 PM, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
It adds a new_order parameter to set new page order in page owner. It prepares for upcoming changes to support split huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com
Except for a minor fix below, you can add: Reviewed-by: Ralph Campbell rcampbell@nvidia.com
include/linux/page_owner.h | 7 ++++--- mm/huge_memory.c | 2 +- mm/page_alloc.c | 2 +- mm/page_owner.c | 6 +++--- 4 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index 3468794f83d2..215cbb159568 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page, __set_page_owner(page, order, gfp_mask); } -static inline void split_page_owner(struct page *page, unsigned int nr) +static inline void split_page_owner(struct page *page, unsigned int nr,
unsigned int new_order) { if (static_branch_unlikely(&page_owner_inited))
__split_page_owner(page, nr);
__split_page_owner(page, nr, new_order); } static inline void copy_page_owner(struct page *oldpage, struct page *newpage) {
@@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page, { } static inline void split_page_owner(struct page *page,
unsigned int order)
unsigned int nr, unsigned int new_order) { } static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f599f5b9bf7f..8b7d771ee962 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, ClearPageCompound(head);
- split_page_owner(head, nr);
- split_page_owner(head, nr, 1);
Shouldn't this be 0, not 1? (new_order not new_nr).
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d77220615fd5..a9eead0e091a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order) for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i);
- split_page_owner(page, 1 << order);
- split_page_owner(page, 1 << order, 1);
Ditto, 0.
} EXPORT_SYMBOL_GPL(split_page);
diff --git a/mm/page_owner.c b/mm/page_owner.c index b735a8eafcdb..2b7f7e9056dc 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int reason) page_owner->last_migrate_reason = reason; } -void __split_page_owner(struct page *page, unsigned int nr) +void __split_page_owner(struct page *page, unsigned int nr, unsigned int new_order) { int i; struct page_ext *page_ext = lookup_page_ext(page); @@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr) if (unlikely(!page_ext)) return;
- for (i = 0; i < nr; i++) {
- for (i = 0; i < nr; i += (1 << new_order)) { page_owner = get_page_owner(page_ext);
page_owner->order = 0;
page_owner->order = new_order; page_ext = page_ext_next(page_ext); } }
On 12 Nov 2020, at 12:57, Ralph Campbell wrote:
On 11/11/20 12:40 PM, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
It adds a new_order parameter to set new page order in page owner. It prepares for upcoming changes to support split huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com
Except for a minor fix below, you can add: Reviewed-by: Ralph Campbell rcampbell@nvidia.com
Thanks.
include/linux/page_owner.h | 7 ++++--- mm/huge_memory.c | 2 +- mm/page_alloc.c | 2 +- mm/page_owner.c | 6 +++--- 4 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index 3468794f83d2..215cbb159568 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page, __set_page_owner(page, order, gfp_mask); } -static inline void split_page_owner(struct page *page, unsigned int nr) +static inline void split_page_owner(struct page *page, unsigned int nr,
unsigned int new_order) { if (static_branch_unlikely(&page_owner_inited))
__split_page_owner(page, nr);
__split_page_owner(page, nr, new_order); } static inline void copy_page_owner(struct page *oldpage, struct page *newpage) {
@@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page, { } static inline void split_page_owner(struct page *page,
unsigned int order)
unsigned int nr, unsigned int new_order) { } static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f599f5b9bf7f..8b7d771ee962 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, ClearPageCompound(head);
- split_page_owner(head, nr);
- split_page_owner(head, nr, 1);
Shouldn't this be 0, not 1? (new_order not new_nr).
Yes, I forgot to fix the call site after I changed the function signature. Thanks.
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d77220615fd5..a9eead0e091a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order) for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i);
- split_page_owner(page, 1 << order);
- split_page_owner(page, 1 << order, 1);
Ditto, 0.
Sure, will fix this too.
— Best Regards, Yan Zi
On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
It adds a new_order parameter to set new page order in page owner. It prepares for upcoming changes to support split huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com
include/linux/page_owner.h | 7 ++++--- mm/huge_memory.c | 2 +- mm/page_alloc.c | 2 +- mm/page_owner.c | 6 +++--- 4 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index 3468794f83d2..215cbb159568 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page, __set_page_owner(page, order, gfp_mask); } -static inline void split_page_owner(struct page *page, unsigned int nr) +static inline void split_page_owner(struct page *page, unsigned int nr,
unsigned int new_order)
{ if (static_branch_unlikely(&page_owner_inited))
__split_page_owner(page, nr);
__split_page_owner(page, nr, new_order);
} static inline void copy_page_owner(struct page *oldpage, struct page *newpage) { @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page, { } static inline void split_page_owner(struct page *page,
unsigned int order)
unsigned int nr, unsigned int new_order)
With the addition of the new argument it's a bit hard to understand what the function is supposed to do. It seems like nr == page_order(page), is it right? Maybe we can pass old_order and new_order? Or just the page and the new order?
{ } static inline void copy_page_owner(struct page *oldpage, struct page *newpage) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f599f5b9bf7f..8b7d771ee962 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, ClearPageCompound(head);
- split_page_owner(head, nr);
- split_page_owner(head, nr, 1);
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d77220615fd5..a9eead0e091a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order) for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i);
- split_page_owner(page, 1 << order);
- split_page_owner(page, 1 << order, 1);
} EXPORT_SYMBOL_GPL(split_page); diff --git a/mm/page_owner.c b/mm/page_owner.c index b735a8eafcdb..2b7f7e9056dc 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int reason) page_owner->last_migrate_reason = reason; } -void __split_page_owner(struct page *page, unsigned int nr) +void __split_page_owner(struct page *page, unsigned int nr, unsigned int new_order) { int i; struct page_ext *page_ext = lookup_page_ext(page); @@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr) if (unlikely(!page_ext)) return;
- for (i = 0; i < nr; i++) {
- for (i = 0; i < nr; i += (1 << new_order)) { page_owner = get_page_owner(page_ext);
page_owner->order = 0;
page_owner->order = new_order; page_ext = page_ext_next(page_ext);
I believe there cannot be any leftovers because nr is always a power of 2. Is it true? Converting nr argument to order (if it's possible) will make it obvious.
Other than that the patch looks good to me.
Thanks!
On 13 Nov 2020, at 19:15, Roman Gushchin wrote:
On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
It adds a new_order parameter to set new page order in page owner. It prepares for upcoming changes to support split huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com
include/linux/page_owner.h | 7 ++++--- mm/huge_memory.c | 2 +- mm/page_alloc.c | 2 +- mm/page_owner.c | 6 +++--- 4 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index 3468794f83d2..215cbb159568 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page, __set_page_owner(page, order, gfp_mask); }
-static inline void split_page_owner(struct page *page, unsigned int nr) +static inline void split_page_owner(struct page *page, unsigned int nr,
unsigned int new_order)
{ if (static_branch_unlikely(&page_owner_inited))
__split_page_owner(page, nr);
__split_page_owner(page, nr, new_order);
} static inline void copy_page_owner(struct page *oldpage, struct page *newpage) { @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page, { } static inline void split_page_owner(struct page *page,
unsigned int order)
unsigned int nr, unsigned int new_order)
With the addition of the new argument it's a bit hard to understand what the function is supposed to do. It seems like nr == page_order(page), is it right? Maybe we can pass old_order and new_order? Or just the page and the new order?
Yeah, it is a bit confusing. Please see more below.
{ } static inline void copy_page_owner(struct page *oldpage, struct page *newpage) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f599f5b9bf7f..8b7d771ee962 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
ClearPageCompound(head);
- split_page_owner(head, nr);
split_page_owner(head, nr, 1);
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d77220615fd5..a9eead0e091a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i);
- split_page_owner(page, 1 << order);
- split_page_owner(page, 1 << order, 1);
} EXPORT_SYMBOL_GPL(split_page);
diff --git a/mm/page_owner.c b/mm/page_owner.c index b735a8eafcdb..2b7f7e9056dc 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int reason) page_owner->last_migrate_reason = reason; }
-void __split_page_owner(struct page *page, unsigned int nr) +void __split_page_owner(struct page *page, unsigned int nr, unsigned int new_order) { int i; struct page_ext *page_ext = lookup_page_ext(page); @@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr) if (unlikely(!page_ext)) return;
- for (i = 0; i < nr; i++) {
- for (i = 0; i < nr; i += (1 << new_order)) { page_owner = get_page_owner(page_ext);
page_owner->order = 0;
page_owner->order = new_order; page_ext = page_ext_next(page_ext);
I believe there cannot be any leftovers because nr is always a power of 2. Is it true? Converting nr argument to order (if it's possible) will make it obvious.
Right. nr = thp_nr_pages(head), which is a power of 2, and (1 << new_order) divides it evenly since new_order is lower than the original order, so there would not be any leftovers.
Matthew recently converted split_page_owner to take nr instead of order.[1] But I am not sure why, since it seems to me that two call sites (__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order information.
[1]https://lore.kernel.org/linux-mm/20200908195539.25896-4-willy@infradead.org/
— Best Regards, Yan Zi
On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
On 13 Nov 2020, at 19:15, Roman Gushchin wrote:
On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
It adds a new_order parameter to set new page order in page owner. It prepares for upcoming changes to support split huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com
include/linux/page_owner.h | 7 ++++--- mm/huge_memory.c | 2 +- mm/page_alloc.c | 2 +- mm/page_owner.c | 6 +++--- 4 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index 3468794f83d2..215cbb159568 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page, __set_page_owner(page, order, gfp_mask); }
-static inline void split_page_owner(struct page *page, unsigned int nr) +static inline void split_page_owner(struct page *page, unsigned int nr,
unsigned int new_order)
{ if (static_branch_unlikely(&page_owner_inited))
__split_page_owner(page, nr);
__split_page_owner(page, nr, new_order);
} static inline void copy_page_owner(struct page *oldpage, struct page *newpage) { @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page, { } static inline void split_page_owner(struct page *page,
unsigned int order)
unsigned int nr, unsigned int new_order)
With the addition of the new argument it's a bit hard to understand what the function is supposed to do. It seems like nr == page_order(page), is it right? Maybe we can pass old_order and new_order? Or just the page and the new order?
Yeah, it is a bit confusing. Please see more below.
{ } static inline void copy_page_owner(struct page *oldpage, struct page *newpage) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f599f5b9bf7f..8b7d771ee962 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
ClearPageCompound(head);
- split_page_owner(head, nr);
split_page_owner(head, nr, 1);
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d77220615fd5..a9eead0e091a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i);
- split_page_owner(page, 1 << order);
- split_page_owner(page, 1 << order, 1);
} EXPORT_SYMBOL_GPL(split_page);
diff --git a/mm/page_owner.c b/mm/page_owner.c index b735a8eafcdb..2b7f7e9056dc 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int reason) page_owner->last_migrate_reason = reason; }
-void __split_page_owner(struct page *page, unsigned int nr) +void __split_page_owner(struct page *page, unsigned int nr, unsigned int new_order) { int i; struct page_ext *page_ext = lookup_page_ext(page); @@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr) if (unlikely(!page_ext)) return;
- for (i = 0; i < nr; i++) {
- for (i = 0; i < nr; i += (1 << new_order)) { page_owner = get_page_owner(page_ext);
page_owner->order = 0;
page_owner->order = new_order; page_ext = page_ext_next(page_ext);
I believe there cannot be any leftovers because nr is always a power of 2. Is it true? Converting nr argument to order (if it's possible) will make it obvious.
Right. nr = thp_nr_pages(head), which is a power of 2. There would not be any leftover.
Matthew recently converted split_page_owner to take nr instead of order.[1] But I am not sure why, since it seems to me that two call sites (__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order information.
Yeah, I'm not sure why either. Maybe Matthew has some input here? You can also pass new_nr, but IMO orders look much better here.
Thanks!
On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
Matthew recently converted split_page_owner to take nr instead of order.[1] But I am not sure why, since it seems to me that two call sites (__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order information.
Yeah, I'm not sure why too. Maybe Matthew has some input here? You can also pass new_nr, but IMO orders look so much better here.
If only I'd written that information in the changelog ... oh wait, I did!
mm/page_owner: change split_page_owner to take a count
The implementation of split_page_owner() prefers a count rather than the old order of the page. When we support a variable size THP, we won't have the order at this point, but we will have the number of pages. So change the interface to what the caller and callee would prefer.
On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:
On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
Matthew recently converted split_page_owner to take nr instead of order.[1] But I am not sure why, since it seems to me that two call sites (__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order information.
Yeah, I'm not sure why too. Maybe Matthew has some input here? You can also pass new_nr, but IMO orders look so much better here.
If only I'd written that information in the changelog ... oh wait, I did!
mm/page_owner: change split_page_owner to take a count The implementation of split_page_owner() prefers a count rather than the old order of the page. When we support a variable size THP, we won't have the order at this point, but we will have the number of pages. So change the interface to what the caller and callee would prefer.
There are two callers, split_page in mm/page_alloc.c and __split_huge_page in mm/huge_memory.c. The former has the page order. The latter has the page order information before __split_huge_page_tail is called, so we can do old_order = thp_order(head) instead of nr = thp_nr_pages(head) and use old_order. What am I missing there?
Thanks.
— Best Regards, Yan Zi
On Tue, Nov 17, 2020 at 04:12:03PM -0500, Zi Yan wrote:
On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:
On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
Matthew recently converted split_page_owner to take nr instead of order.[1] But I am not sure why, since it seems to me that two call sites (__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order information.
Yeah, I'm not sure why either. Maybe Matthew has some input here? You can also pass new_nr, but IMO orders look so much better here.
If only I'd written that information in the changelog ... oh wait, I did!
mm/page_owner: change split_page_owner to take a count The implementation of split_page_owner() prefers a count rather than the old order of the page. When we support a variable size THP, we won't have the order at this point, but we will have the number of pages. So change the interface to what the caller and callee would prefer.
There are two callers, split_page in mm/page_alloc.c and __split_huge_page in mm/huge_memory.c. The former has the page order. The latter has the page order information before __split_huge_page_tail is called, so we can do old_order = thp_order(head) instead of nr = thp_nr_pages(head) and use old_order. What am I missing there?
Sure, we could also do that. But what I wrote was true at the time I wrote it.
On 17 Nov 2020, at 16:22, Matthew Wilcox wrote:
On Tue, Nov 17, 2020 at 04:12:03PM -0500, Zi Yan wrote:
On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:
On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
Matthew recently converted split_page_owner to take nr instead of order.[1] But I am not sure why, since it seems to me that two call sites (__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order information.
Yeah, I'm not sure why either. Maybe Matthew has some input here? You can also pass new_nr, but IMO orders look so much better here.
If only I'd written that information in the changelog ... oh wait, I did!
mm/page_owner: change split_page_owner to take a count The implementation of split_page_owner() prefers a count rather than the old order of the page. When we support a variable size THP, we won't have the order at this point, but we will have the number of pages. So change the interface to what the caller and callee would prefer.
There are two callers, split_page in mm/page_alloc.c and __split_huge_page in mm/huge_memory.c. The former has the page order. The latter has the page order information before __split_huge_page_tail is called, so we can do old_order = thp_order(head) instead of nr = thp_nr_pages(head) and use old_order. What am I missing there?
Sure, we could also do that. But what I wrote was true at the time I wrote it.
Got it. Thanks. Will change it to use old_order to make split_page_owner parameters look more consistent.
— Best Regards, Yan Zi
On Tue, Nov 17, 2020 at 09:22:55PM +0000, Matthew Wilcox wrote:
On Tue, Nov 17, 2020 at 04:12:03PM -0500, Zi Yan wrote:
On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:
On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
Matthew recently converted split_page_owner to take nr instead of order.[1] But I am not sure why, since it seems to me that two call sites (__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order information.
Yeah, I'm not sure why either. Maybe Matthew has some input here? You can also pass new_nr, but IMO orders look so much better here.
If only I'd written that information in the changelog ... oh wait, I did!
mm/page_owner: change split_page_owner to take a count The implementation of split_page_owner() prefers a count rather than the old order of the page. When we support a variable size THP, we won't have the order at this point, but we will have the number of pages. So change the interface to what the caller and callee would prefer.
There are two callers, split_page in mm/page_alloc.c and __split_huge_page in mm/huge_memory.c. The former has the page order. The latter has the page order information before __split_huge_page_tail is called, so we can do old_order = thp_order(head) instead of nr = thp_nr_pages(head) and use old_order. What am I missing there?
Sure, we could also do that. But what I wrote was true at the time I wrote it.
Sure, I was asking whether you're OK with going back to orders, or whether there are better ideas. I'm sorry if that wasn't clear and sounded otherwise.
It just seems to me that when a function takes both nr and order (as in Zi's last version), I'd expect it to mean a number of pages of the given order, or something like that. So I'd avoid mixing them. Since nr is always a power of two, orders are slightly better: it's just more obvious from looking at the code.
Thanks!
On Tue, Nov 17, 2020 at 01:35:37PM -0800, Roman Gushchin wrote:
On Tue, Nov 17, 2020 at 09:22:55PM +0000, Matthew Wilcox wrote:
On Tue, Nov 17, 2020 at 04:12:03PM -0500, Zi Yan wrote:
On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:
On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
Matthew recently converted split_page_owner to take nr instead of order.[1] But I am not sure why, since it seems to me that two call sites (__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order information.
Yeah, I'm not sure why either. Maybe Matthew has some input here? You can also pass new_nr, but IMO orders look so much better here.
If only I'd written that information in the changelog ... oh wait, I did!
mm/page_owner: change split_page_owner to take a count The implementation of split_page_owner() prefers a count rather than the old order of the page. When we support a variable size THP, we won't have the order at this point, but we will have the number of pages. So change the interface to what the caller and callee would prefer.
There are two callers, split_page in mm/page_alloc.c and __split_huge_page in mm/huge_memory.c. The former has the page order. The latter has the page order information before __split_huge_page_tail is called, so we can do old_order = thp_order(head) instead of nr = thp_nr_pages(head) and use old_order. What am I missing there?
Sure, we could also do that. But what I wrote was true at the time I wrote it.
Sure, I was asking whether you're OK with going back to orders, or whether there are better ideas. I'm sorry if that wasn't clear and sounded otherwise.
It just seems to me that when a function takes both nr and order (as in Zi's last version), I'd expect it to mean a number of pages of the given order, or something like that. So I'd avoid mixing them. Since nr is always a power of two, orders are slightly better: it's just more obvious from looking at the code.
I think it's awkward no matter which way round we do it.
If we pass old_order, new_order then we create extra work for both caller and callee.
If we pass old_nr, new_order, it looks weird for humans.
At the end of the day, I'm not that invested in which we do.
On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
It adds a new_order parameter to set the new page order in page owner. This prepares for upcoming changes that support splitting a huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com
include/linux/page_owner.h | 7 ++++--- mm/huge_memory.c | 2 +- mm/page_alloc.c | 2 +- mm/page_owner.c | 6 +++--- 4 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index 3468794f83d2..215cbb159568 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page, __set_page_owner(page, order, gfp_mask); } -static inline void split_page_owner(struct page *page, unsigned int nr) +static inline void split_page_owner(struct page *page, unsigned int nr,
unsigned int new_order)
{ if (static_branch_unlikely(&page_owner_inited))
-		__split_page_owner(page, nr);
+		__split_page_owner(page, nr, new_order);
Hm. Where do you correct the __split_page_owner() declaration? I don't see it.
On 16 Nov 2020, at 11:25, Kirill A. Shutemov wrote:
On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
It adds a new_order parameter to set the new page order in page owner. This prepares for upcoming changes that support splitting a huge page to any lower order.
Signed-off-by: Zi Yan ziy@nvidia.com
include/linux/page_owner.h | 7 ++++--- mm/huge_memory.c | 2 +- mm/page_alloc.c | 2 +- mm/page_owner.c | 6 +++--- 4 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index 3468794f83d2..215cbb159568 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page, __set_page_owner(page, order, gfp_mask); }
-static inline void split_page_owner(struct page *page, unsigned int nr) +static inline void split_page_owner(struct page *page, unsigned int nr,
unsigned int new_order)
{ if (static_branch_unlikely(&page_owner_inited))
-		__split_page_owner(page, nr);
+		__split_page_owner(page, nr, new_order);
Hm. Where do you correct the __split_page_owner() declaration? I don't see it.
I missed it. Will add it. Thanks.
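Presumably the missing piece is just the prototype under CONFIG_PAGE_OWNER in include/linux/page_owner.h; a sketch of what it might look like, mirroring the new definition (exact placement in the header is assumed):

#ifdef CONFIG_PAGE_OWNER
extern void __split_page_owner(struct page *page, unsigned int nr,
			       unsigned int new_order);
#endif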
— Best Regards, Yan Zi
On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
-	for (i = 0; i < nr; i++) {
+	for (i = 0; i < nr; i += (1 << new_order)) {
 		page_owner = get_page_owner(page_ext);
-		page_owner->order = 0;
+		page_owner->order = new_order;
 		page_ext = page_ext_next(page_ext);
 	}
This doesn't do what you're hoping it will. It's going to set ->order to new_order for the first N pages instead of every 1/N pages.
You'll need to do something like
page_ext = lookup_page_ext(page + i);
or add a new page_ext_add(page_ext, 1 << new_order);
On 17 Nov 2020, at 16:10, Matthew Wilcox wrote:
On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
-	for (i = 0; i < nr; i++) {
+	for (i = 0; i < nr; i += (1 << new_order)) {
 		page_owner = get_page_owner(page_ext);
-		page_owner->order = 0;
+		page_owner->order = new_order;
 		page_ext = page_ext_next(page_ext);
 	}
This doesn't do what you're hoping it will. It's going to set ->order to new_order for the first N pages instead of every 1/N pages.
You'll need to do something like
page_ext = lookup_page_ext(page + i);
Will use this. Thanks.
or add a new page_ext_add(page_ext, 1 << new_order);
— Best Regards, Yan Zi
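With Matthew's suggestion folded in, the fixed loop might look roughly like this; this is only a sketch (and the first parameter may become old_order per the earlier discussion), not necessarily what the next version will do:

void __split_page_owner(struct page *page, unsigned int nr,
			unsigned int new_order)
{
	int i;
	struct page_ext *page_ext;
	struct page_owner *page_owner;

	/* Visit the head page of each new (1 << new_order)-page chunk. */
	for (i = 0; i < nr; i += (1 << new_order)) {
		page_ext = lookup_page_ext(page + i);
		if (unlikely(!page_ext))
			return;
		page_owner = get_page_owner(page_ext);
		page_owner->order = new_order;
	}
}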
From: Zi Yan ziy@nvidia.com
To split a THP to any lower order, we need to re-form THPs on the subpages at the given order and add page refcounts based on the new page order. We also need to reinitialize page_deferred_list after removing the page from the split_queue; otherwise a subsequent split will see list corruption when checking the page_deferred_list again.
This has many uses, such as minimizing the number of pages after truncating a pagecache THP. For anonymous THPs, we can only split them to order-0 as before, until we add support for any-size anonymous THPs.
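As a rough usage sketch (illustrative only; the caller is assumed to hold the page lock and satisfy the usual split_huge_page preconditions, and the helper name is made up):

/* Split a locked pagecache THP into order-2 pieces, e.g. after a partial truncate. */
static int split_to_order_2(struct page *page)
{
	if (!PageTransHuge(page))
		return 0;
	return split_huge_page_to_list_to_order(page, NULL, 2);
}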
Signed-off-by: Zi Yan ziy@nvidia.com --- include/linux/huge_mm.h | 8 +++++ mm/huge_memory.c | 78 +++++++++++++++++++++++++++++------------ mm/swap.c | 1 - 3 files changed, 63 insertions(+), 24 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 60a907a19f7d..9819cd9b4619 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
bool can_split_huge_page(struct page *page, int *pextra_pins); int split_huge_page_to_list(struct page *page, struct list_head *list); +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, + unsigned int new_order); static inline int split_huge_page(struct page *page) { return split_huge_page_to_list(page, NULL); @@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list) { return 0; } +static inline int +split_huge_page_to_order_to_list(struct page *page, struct list_head *list, + unsigned int new_order) +{ + return 0; +} static inline int split_huge_page(struct page *page) { return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8b7d771ee962..88f50da40c9b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, static void unmap_page(struct page *page) { enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | - TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD; + TTU_RMAP_LOCKED; bool unmap_success;
VM_BUG_ON_PAGE(!PageHead(page), page);
+ if (thp_order(page) >= HPAGE_PMD_ORDER) + ttu_flags |= TTU_SPLIT_HUGE_PMD; + if (PageAnon(page)) ttu_flags |= TTU_SPLIT_FREEZE;
@@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page) VM_BUG_ON_PAGE(!unmap_success, page); }
-static void remap_page(struct page *page, unsigned int nr) +static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr) { int i; - if (PageTransHuge(page)) { + if (thp_nr_pages(page) == nr) { remove_migration_ptes(page, page, true); } else { - for (i = 0; i < nr; i++) + for (i = 0; i < nr; i += new_nr) remove_migration_ptes(page + i, page + i, true); } }
static void __split_huge_page_tail(struct page *head, int tail, - struct lruvec *lruvec, struct list_head *list) + struct lruvec *lruvec, struct list_head *list, unsigned int new_order) { struct page *page_tail = head + tail; + unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
@@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int tail, #ifdef CONFIG_64BIT (1L << PG_arch_2) | #endif + compound_head_flag | (1L << PG_dirty)));
/* ->mapping in first tail page is compound_mapcount */ @@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, int tail, * which needs correct compound_head(). */ clear_compound_head(page_tail); + if (new_order) { + prep_compound_page(page_tail, new_order); + thp_prep(page_tail); + }
/* Finally unfreeze refcount. Additional reference from page cache. */ - page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) || - PageSwapCache(head))); + page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) || + PageSwapCache(head)) ? + thp_nr_pages(page_tail) : 0));
if (page_is_young(head)) set_page_young(page_tail); @@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int tail, }
static void __split_huge_page(struct page *page, struct list_head *list, - pgoff_t end, unsigned long flags) + pgoff_t end, unsigned long flags, unsigned int new_order) { struct page *head = compound_head(page); pg_data_t *pgdat = page_pgdat(head); @@ -2424,12 +2434,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, struct address_space *swap_cache = NULL; unsigned long offset = 0; unsigned int nr = thp_nr_pages(head); + unsigned int new_nr = 1 << new_order; int i;
lruvec = mem_cgroup_page_lruvec(head, pgdat);
/* complete memcg works before add pages to LRU */ - mem_cgroup_split_huge_fixup(head, 1); + mem_cgroup_split_huge_fixup(head, new_nr);
if (PageAnon(head) && PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; @@ -2439,14 +2450,14 @@ static void __split_huge_page(struct page *page, struct list_head *list, xa_lock(&swap_cache->i_pages); }
- for (i = nr - 1; i >= 1; i--) { - __split_huge_page_tail(head, i, lruvec, list); + for (i = nr - new_nr; i >= new_nr; i -= new_nr) { + __split_huge_page_tail(head, i, lruvec, list, new_order); /* Some pages can be beyond i_size: drop them from page cache */ if (head[i].index >= end) { ClearPageDirty(head + i); __delete_from_page_cache(head + i, NULL); if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head)) - shmem_uncharge(head->mapping->host, 1); + shmem_uncharge(head->mapping->host, new_nr); put_page(head + i); } else if (!PageAnon(page)) { __xa_store(&head->mapping->i_pages, head[i].index, @@ -2457,28 +2468,31 @@ static void __split_huge_page(struct page *page, struct list_head *list, } }
- ClearPageCompound(head); + if (!new_order) + ClearPageCompound(head); + else + set_compound_order(head, new_order);
- split_page_owner(head, nr, 1); + split_page_owner(head, nr, new_nr);
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { /* Additional pin to swap cache */ if (PageSwapCache(head)) { - page_ref_add(head, 2); + page_ref_add(head, 1 + new_nr); xa_unlock(&swap_cache->i_pages); } else { page_ref_inc(head); } } else { /* Additional pin to page cache */ - page_ref_add(head, 2); + page_ref_add(head, 1 + new_nr); xa_unlock(&head->mapping->i_pages); }
spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- remap_page(head, nr); + remap_page(head, nr, new_nr);
if (PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; @@ -2486,7 +2500,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, split_swap_cluster(entry); }
- for (i = 0; i < nr; i++) { + for (i = 0; i < nr; i += new_nr) { struct page *subpage = head + i; if (subpage == page) continue; @@ -2620,21 +2634,39 @@ bool can_split_huge_page(struct page *page, int *pextra_pins) * us. */ int split_huge_page_to_list(struct page *page, struct list_head *list) +{ + return split_huge_page_to_list_to_order(page, list, 0); +} + +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, + unsigned int new_order) { struct page *head = compound_head(page); struct pglist_data *pgdata = NODE_DATA(page_to_nid(head)); struct deferred_split *ds_queue = get_deferred_split_queue(head); - XA_STATE(xas, &head->mapping->i_pages, head->index); + /* reset xarray order to new order after split */ + XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index, new_order); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int count, mapcount, extra_pins, ret; unsigned long flags; pgoff_t end;
+ VM_BUG_ON(thp_order(head) <= new_order); VM_BUG_ON_PAGE(is_huge_zero_page(head), head); VM_BUG_ON_PAGE(!PageLocked(head), head); VM_BUG_ON_PAGE(!PageCompound(head), head);
+ if (new_order == 1) { + WARN_ONCE(1, "Cannot split THP to order-1 (no order-1 THPs)"); + return -EINVAL; + } + + if (PageAnon(head) && new_order) { + WARN_ONCE(1, "Split anonymous THP to non-zero order not support"); + return -EINVAL; + } + if (PageWriteback(head)) return -EBUSY;
@@ -2720,18 +2752,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) { if (!list_empty(page_deferred_list(head))) { ds_queue->split_queue_len--; - list_del(page_deferred_list(head)); + list_del_init(page_deferred_list(head)); } spin_unlock(&ds_queue->split_queue_lock); if (mapping) { if (PageSwapBacked(head)) __dec_lruvec_page_state(head, NR_SHMEM_THPS); - else + else if (!new_order) __mod_lruvec_page_state(head, NR_FILE_THPS, -thp_nr_pages(head)); }
- __split_huge_page(page, list, end, flags); + __split_huge_page(page, list, end, flags, new_order); ret = 0; } else { if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) { @@ -2746,7 +2778,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) fail: if (mapping) xas_unlock(&xas); spin_unlock_irqrestore(&pgdata->lru_lock, flags); - remap_page(head, thp_nr_pages(head)); + remap_page(head, thp_nr_pages(head), 0); ret = -EBUSY; }
diff --git a/mm/swap.c b/mm/swap.c index 14c3bac607a6..6c33e6165597 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -983,7 +983,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail, struct lruvec *lruvec, struct list_head *list) { VM_BUG_ON_PAGE(!PageHead(page), page); - VM_BUG_ON_PAGE(PageCompound(page_tail), page); VM_BUG_ON_PAGE(PageLRU(page_tail), page); lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
On 11/11/20 12:40 PM, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
To split a THP to any lower order, we need to re-form THPs on the subpages at the given order and add page refcounts based on the new page order. We also need to reinitialize page_deferred_list after removing the page from the split_queue; otherwise a subsequent split will see list corruption when checking the page_deferred_list again.
This has many uses, such as minimizing the number of pages after truncating a pagecache THP. For anonymous THPs, we can only split them to order-0 as before, until we add support for any-size anonymous THPs.
Signed-off-by: Zi Yan ziy@nvidia.com
include/linux/huge_mm.h | 8 +++++ mm/huge_memory.c | 78 +++++++++++++++++++++++++++++------------ mm/swap.c | 1 - 3 files changed, 63 insertions(+), 24 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 60a907a19f7d..9819cd9b4619 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page); bool can_split_huge_page(struct page *page, int *pextra_pins); int split_huge_page_to_list(struct page *page, struct list_head *list); +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
static inline int split_huge_page(struct page *page) { return split_huge_page_to_list(page, NULL);unsigned int new_order);
@@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list) { return 0; } +static inline int +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
unsigned int new_order)
+{
- return 0;
+} static inline int split_huge_page(struct page *page) { return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8b7d771ee962..88f50da40c9b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, static void unmap_page(struct page *page) { enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
bool unmap_success;TTU_RMAP_LOCKED;
VM_BUG_ON_PAGE(!PageHead(page), page);
- if (thp_order(page) >= HPAGE_PMD_ORDER)
ttu_flags |= TTU_SPLIT_HUGE_PMD;
- if (PageAnon(page)) ttu_flags |= TTU_SPLIT_FREEZE;
@@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page) VM_BUG_ON_PAGE(!unmap_success, page); } -static void remap_page(struct page *page, unsigned int nr) +static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr) { int i;
Use unsigned int i? Maybe a blank line here and the {}'s around if/else aren't needed.
- if (PageTransHuge(page)) {
- if (thp_nr_pages(page) == nr) { remove_migration_ptes(page, page, true); } else {
for (i = 0; i < nr; i++)
} }for (i = 0; i < nr; i += new_nr) remove_migration_ptes(page + i, page + i, true);
static void __split_huge_page_tail(struct page *head, int tail,
struct lruvec *lruvec, struct list_head *list)
{ struct page *page_tail = head + tail;struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
- unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail); @@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int tail, #ifdef CONFIG_64BIT (1L << PG_arch_2) | #endif
compound_head_flag | (1L << PG_dirty)));
/* ->mapping in first tail page is compound_mapcount */ @@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, int tail, * which needs correct compound_head(). */ clear_compound_head(page_tail);
- if (new_order) {
prep_compound_page(page_tail, new_order);
thp_prep(page_tail);
- }
/* Finally unfreeze refcount. Additional reference from page cache. */
- page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
PageSwapCache(head)));
- page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
PageSwapCache(head)) ?
thp_nr_pages(page_tail) : 0));
if (page_is_young(head)) set_page_young(page_tail); @@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int tail, } static void __split_huge_page(struct page *page, struct list_head *list,
pgoff_t end, unsigned long flags)
{ struct page *head = compound_head(page); pg_data_t *pgdat = page_pgdat(head);pgoff_t end, unsigned long flags, unsigned int new_order)
@@ -2424,12 +2434,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, struct address_space *swap_cache = NULL; unsigned long offset = 0; unsigned int nr = thp_nr_pages(head);
- unsigned int new_nr = 1 << new_order; int i;
lruvec = mem_cgroup_page_lruvec(head, pgdat); /* complete memcg works before add pages to LRU */
- mem_cgroup_split_huge_fixup(head, 1);
- mem_cgroup_split_huge_fixup(head, new_nr);
if (PageAnon(head) && PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; @@ -2439,14 +2450,14 @@ static void __split_huge_page(struct page *page, struct list_head *list, xa_lock(&swap_cache->i_pages); }
- for (i = nr - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
- for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
/* Some pages can be beyond i_size: drop them from page cache */ if (head[i].index >= end) { ClearPageDirty(head + i); __delete_from_page_cache(head + i, NULL); if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))__split_huge_page_tail(head, i, lruvec, list, new_order);
shmem_uncharge(head->mapping->host, 1);
} else if (!PageAnon(page)) { __xa_store(&head->mapping->i_pages, head[i].index,shmem_uncharge(head->mapping->host, new_nr); put_page(head + i);
@@ -2457,28 +2468,31 @@ static void __split_huge_page(struct page *page, struct list_head *list, } }
- ClearPageCompound(head);
- if (!new_order)
ClearPageCompound(head);
- else
set_compound_order(head, new_order);
- split_page_owner(head, nr, 1);
- split_page_owner(head, nr, new_nr);
This needs to be "new_order" instead of "new_nr".
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { /* Additional pin to swap cache */ if (PageSwapCache(head)) {
page_ref_add(head, 2);
} else { page_ref_inc(head); } } else { /* Additional pin to page cache */page_ref_add(head, 1 + new_nr); xa_unlock(&swap_cache->i_pages);
page_ref_add(head, 2);
xa_unlock(&head->mapping->i_pages); }page_ref_add(head, 1 + new_nr);
spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- remap_page(head, nr);
- remap_page(head, nr, new_nr);
if (PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; @@ -2486,7 +2500,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, split_swap_cluster(entry); }
- for (i = 0; i < nr; i++) {
- for (i = 0; i < nr; i += new_nr) { struct page *subpage = head + i; if (subpage == page) continue;
@@ -2620,21 +2634,39 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
- us.
*/ int split_huge_page_to_list(struct page *page, struct list_head *list) +{
- return split_huge_page_to_list_to_order(page, list, 0);
+}
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
{ struct page *head = compound_head(page); struct pglist_data *pgdata = NODE_DATA(page_to_nid(head)); struct deferred_split *ds_queue = get_deferred_split_queue(head);unsigned int new_order)
- XA_STATE(xas, &head->mapping->i_pages, head->index);
- /* reset xarray order to new order after split */
- XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index, new_order); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int count, mapcount, extra_pins, ret; unsigned long flags; pgoff_t end;
- VM_BUG_ON(thp_order(head) <= new_order); VM_BUG_ON_PAGE(is_huge_zero_page(head), head); VM_BUG_ON_PAGE(!PageLocked(head), head); VM_BUG_ON_PAGE(!PageCompound(head), head);
- if (new_order == 1) {
WARN_ONCE(1, "Cannot split THP to order-1 (no order-1 THPs)");
return -EINVAL;
- }
- if (PageAnon(head) && new_order) {
WARN_ONCE(1, "Split anonymous THP to non-zero order not support");
return -EINVAL;
- }
- if (PageWriteback(head)) return -EBUSY;
@@ -2720,18 +2752,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) { if (!list_empty(page_deferred_list(head))) { ds_queue->split_queue_len--;
list_del(page_deferred_list(head));
} spin_unlock(&ds_queue->split_queue_lock); if (mapping) { if (PageSwapBacked(head)) __dec_lruvec_page_state(head, NR_SHMEM_THPS);list_del_init(page_deferred_list(head));
else
}else if (!new_order) __mod_lruvec_page_state(head, NR_FILE_THPS, -thp_nr_pages(head));
__split_huge_page(page, list, end, flags);
ret = 0; } else { if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {__split_huge_page(page, list, end, flags, new_order);
@@ -2746,7 +2778,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) fail: if (mapping) xas_unlock(&xas); spin_unlock_irqrestore(&pgdata->lru_lock, flags);
remap_page(head, thp_nr_pages(head));
remap_page(head, thp_nr_pages(head), 0);
Shouldn't this be "1" instead of zero? remap_page() takes new_nr.
ret = -EBUSY;
} diff --git a/mm/swap.c b/mm/swap.c index 14c3bac607a6..6c33e6165597 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -983,7 +983,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail, struct lruvec *lruvec, struct list_head *list) { VM_BUG_ON_PAGE(!PageHead(page), page);
- VM_BUG_ON_PAGE(PageCompound(page_tail), page); VM_BUG_ON_PAGE(PageLRU(page_tail), page); lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
On 12 Nov 2020, at 17:01, Ralph Campbell wrote:
On 11/11/20 12:40 PM, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
To split a THP to any lower order, we need to re-form THPs on the subpages at the given order and add page refcounts based on the new page order. We also need to reinitialize page_deferred_list after removing the page from the split_queue; otherwise a subsequent split will see list corruption when checking the page_deferred_list again.
This has many uses, such as minimizing the number of pages after truncating a pagecache THP. For anonymous THPs, we can only split them to order-0 as before, until we add support for any-size anonymous THPs.
Signed-off-by: Zi Yan ziy@nvidia.com
include/linux/huge_mm.h | 8 +++++ mm/huge_memory.c | 78 +++++++++++++++++++++++++++++------------ mm/swap.c | 1 - 3 files changed, 63 insertions(+), 24 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 60a907a19f7d..9819cd9b4619 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page); bool can_split_huge_page(struct page *page, int *pextra_pins); int split_huge_page_to_list(struct page *page, struct list_head *list); +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
static inline int split_huge_page(struct page *page) { return split_huge_page_to_list(page, NULL);unsigned int new_order);
@@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list) { return 0; } +static inline int +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
unsigned int new_order)
+{
- return 0;
+} static inline int split_huge_page(struct page *page) { return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8b7d771ee962..88f50da40c9b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, static void unmap_page(struct page *page) { enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
bool unmap_success; VM_BUG_ON_PAGE(!PageHead(page), page);TTU_RMAP_LOCKED;
- if (thp_order(page) >= HPAGE_PMD_ORDER)
ttu_flags |= TTU_SPLIT_HUGE_PMD;
- if (PageAnon(page)) ttu_flags |= TTU_SPLIT_FREEZE;
@@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page) VM_BUG_ON_PAGE(!unmap_success, page); } -static void remap_page(struct page *page, unsigned int nr) +static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr) { int i;
Use unsigned int i? Maybe a blank line here and the {}'s around if/else aren't needed.
- if (PageTransHuge(page)) {
- if (thp_nr_pages(page) == nr) { remove_migration_ptes(page, page, true); } else {
for (i = 0; i < nr; i++)
} } static void __split_huge_page_tail(struct page *head, int tail,for (i = 0; i < nr; i += new_nr) remove_migration_ptes(page + i, page + i, true);
struct lruvec *lruvec, struct list_head *list)
{ struct page *page_tail = head + tail;struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
- unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0; VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
@@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int tail, #ifdef CONFIG_64BIT (1L << PG_arch_2) | #endif
/* ->mapping in first tail page is compound_mapcount */compound_head_flag | (1L << PG_dirty)));
@@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, int tail, * which needs correct compound_head(). */ clear_compound_head(page_tail);
- if (new_order) {
prep_compound_page(page_tail, new_order);
thp_prep(page_tail);
- } /* Finally unfreeze refcount. Additional reference from page cache. */
- page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
PageSwapCache(head)));
- page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
PageSwapCache(head)) ?
if (page_is_young(head)) set_page_young(page_tail);thp_nr_pages(page_tail) : 0));
@@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int tail, } static void __split_huge_page(struct page *page, struct list_head *list,
pgoff_t end, unsigned long flags)
{ struct page *head = compound_head(page); pg_data_t *pgdat = page_pgdat(head);pgoff_t end, unsigned long flags, unsigned int new_order)
@@ -2424,12 +2434,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, struct address_space *swap_cache = NULL; unsigned long offset = 0; unsigned int nr = thp_nr_pages(head);
- unsigned int new_nr = 1 << new_order; int i; lruvec = mem_cgroup_page_lruvec(head, pgdat); /* complete memcg works before add pages to LRU */
- mem_cgroup_split_huge_fixup(head, 1);
- mem_cgroup_split_huge_fixup(head, new_nr); if (PageAnon(head) && PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) };
@@ -2439,14 +2450,14 @@ static void __split_huge_page(struct page *page, struct list_head *list, xa_lock(&swap_cache->i_pages); }
- for (i = nr - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
- for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
/* Some pages can be beyond i_size: drop them from page cache */ if (head[i].index >= end) { ClearPageDirty(head + i); __delete_from_page_cache(head + i, NULL); if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))__split_huge_page_tail(head, i, lruvec, list, new_order);
shmem_uncharge(head->mapping->host, 1);
} else if (!PageAnon(page)) { __xa_store(&head->mapping->i_pages, head[i].index,shmem_uncharge(head->mapping->host, new_nr); put_page(head + i);
@@ -2457,28 +2468,31 @@ static void __split_huge_page(struct page *page, struct list_head *list, } }
- ClearPageCompound(head);
- if (!new_order)
ClearPageCompound(head);
- else
set_compound_order(head, new_order);
- split_page_owner(head, nr, 1);
- split_page_owner(head, nr, new_nr);
This needs to be "new_order" instead of "new_nr".
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { /* Additional pin to swap cache */ if (PageSwapCache(head)) {
page_ref_add(head, 2);
} else { page_ref_inc(head); } } else { /* Additional pin to page cache */page_ref_add(head, 1 + new_nr); xa_unlock(&swap_cache->i_pages);
page_ref_add(head, 2);
xa_unlock(&head->mapping->i_pages); } spin_unlock_irqrestore(&pgdat->lru_lock, flags);page_ref_add(head, 1 + new_nr);
- remap_page(head, nr);
- remap_page(head, nr, new_nr); if (PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) };
@@ -2486,7 +2500,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, split_swap_cluster(entry); }
- for (i = 0; i < nr; i++) {
- for (i = 0; i < nr; i += new_nr) { struct page *subpage = head + i; if (subpage == page) continue;
@@ -2620,21 +2634,39 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
- us.
*/ int split_huge_page_to_list(struct page *page, struct list_head *list) +{
- return split_huge_page_to_list_to_order(page, list, 0);
+}
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
{ struct page *head = compound_head(page); struct pglist_data *pgdata = NODE_DATA(page_to_nid(head)); struct deferred_split *ds_queue = get_deferred_split_queue(head);unsigned int new_order)
- XA_STATE(xas, &head->mapping->i_pages, head->index);
- /* reset xarray order to new order after split */
- XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index, new_order); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int count, mapcount, extra_pins, ret; unsigned long flags; pgoff_t end;
- VM_BUG_ON(thp_order(head) <= new_order); VM_BUG_ON_PAGE(is_huge_zero_page(head), head); VM_BUG_ON_PAGE(!PageLocked(head), head); VM_BUG_ON_PAGE(!PageCompound(head), head);
- if (new_order == 1) {
WARN_ONCE(1, "Cannot split THP to order-1 (no order-1 THPs)");
return -EINVAL;
- }
- if (PageAnon(head) && new_order) {
WARN_ONCE(1, "Split anonymous THP to non-zero order not support");
return -EINVAL;
- }
- if (PageWriteback(head)) return -EBUSY;
@@ -2720,18 +2752,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) { if (!list_empty(page_deferred_list(head))) { ds_queue->split_queue_len--;
list_del(page_deferred_list(head));
} spin_unlock(&ds_queue->split_queue_lock); if (mapping) { if (PageSwapBacked(head)) __dec_lruvec_page_state(head, NR_SHMEM_THPS);list_del_init(page_deferred_list(head));
else
}else if (!new_order) __mod_lruvec_page_state(head, NR_FILE_THPS, -thp_nr_pages(head));
__split_huge_page(page, list, end, flags);
ret = 0; } else { if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {__split_huge_page(page, list, end, flags, new_order);
@@ -2746,7 +2778,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) fail: if (mapping) xas_unlock(&xas); spin_unlock_irqrestore(&pgdata->lru_lock, flags);
remap_page(head, thp_nr_pages(head));
remap_page(head, thp_nr_pages(head), 0);
Shouldn't this be "1" instead of zero? remap_page() takes new_nr.
ret = -EBUSY;
} diff --git a/mm/swap.c b/mm/swap.c index 14c3bac607a6..6c33e6165597 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -983,7 +983,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail, struct lruvec *lruvec, struct list_head *list) { VM_BUG_ON_PAGE(!PageHead(page), page);
- VM_BUG_ON_PAGE(PageCompound(page_tail), page); VM_BUG_ON_PAGE(PageLRU(page_tail), page); lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
Thanks for catching all these. Will fix them in the next version.
— Best Regards, Yan Zi
On Wed, Nov 11, 2020 at 03:40:06PM -0500, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
To split a THP to any lower order, we need to re-form THPs on the subpages at the given order and add page refcounts based on the new page order. We also need to reinitialize page_deferred_list after removing the page from the split_queue; otherwise a subsequent split will see list corruption when checking the page_deferred_list again.
This has many uses, such as minimizing the number of pages after truncating a pagecache THP. For anonymous THPs, we can only split them to order-0 as before, until we add support for any-size anonymous THPs.
Signed-off-by: Zi Yan ziy@nvidia.com
include/linux/huge_mm.h | 8 +++++ mm/huge_memory.c | 78 +++++++++++++++++++++++++++++------------ mm/swap.c | 1 - 3 files changed, 63 insertions(+), 24 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 60a907a19f7d..9819cd9b4619 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page); bool can_split_huge_page(struct page *page, int *pextra_pins); int split_huge_page_to_list(struct page *page, struct list_head *list); +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order);
static inline int split_huge_page(struct page *page) { return split_huge_page_to_list(page, NULL); @@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list) { return 0; } +static inline int +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
unsigned int new_order)
It was int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, unsigned int new_order); above.
+{
- return 0;
+} static inline int split_huge_page(struct page *page) { return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8b7d771ee962..88f50da40c9b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, static void unmap_page(struct page *page) { enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
bool unmap_success;TTU_RMAP_LOCKED;
VM_BUG_ON_PAGE(!PageHead(page), page);
- if (thp_order(page) >= HPAGE_PMD_ORDER)
ttu_flags |= TTU_SPLIT_HUGE_PMD;
- if (PageAnon(page)) ttu_flags |= TTU_SPLIT_FREEZE;
@@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page) VM_BUG_ON_PAGE(!unmap_success, page); } -static void remap_page(struct page *page, unsigned int nr) +static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr) { int i;
- if (PageTransHuge(page)) {
- if (thp_nr_pages(page) == nr) { remove_migration_ptes(page, page, true); } else {
for (i = 0; i < nr; i++)
}for (i = 0; i < nr; i += new_nr) remove_migration_ptes(page + i, page + i, true);
} static void __split_huge_page_tail(struct page *head, int tail,
struct lruvec *lruvec, struct list_head *list)
struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
{ struct page *page_tail = head + tail;
- unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail); @@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int tail, #ifdef CONFIG_64BIT (1L << PG_arch_2) | #endif
compound_head_flag | (1L << PG_dirty)));
/* ->mapping in first tail page is compound_mapcount */ @@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, int tail, * which needs correct compound_head(). */ clear_compound_head(page_tail);
- if (new_order) {
prep_compound_page(page_tail, new_order);
thp_prep(page_tail);
- }
/* Finally unfreeze refcount. Additional reference from page cache. */
- page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
PageSwapCache(head)));
- page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
PageSwapCache(head)) ?
thp_nr_pages(page_tail) : 0));
if (page_is_young(head)) set_page_young(page_tail); @@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int tail, } static void __split_huge_page(struct page *page, struct list_head *list,
pgoff_t end, unsigned long flags)
pgoff_t end, unsigned long flags, unsigned int new_order)
{ struct page *head = compound_head(page); pg_data_t *pgdat = page_pgdat(head); @@ -2424,12 +2434,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, struct address_space *swap_cache = NULL; unsigned long offset = 0; unsigned int nr = thp_nr_pages(head);
- unsigned int new_nr = 1 << new_order; int i;
lruvec = mem_cgroup_page_lruvec(head, pgdat); /* complete memcg works before add pages to LRU */
- mem_cgroup_split_huge_fixup(head, 1);
- mem_cgroup_split_huge_fixup(head, new_nr);
if (PageAnon(head) && PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; @@ -2439,14 +2450,14 @@ static void __split_huge_page(struct page *page, struct list_head *list, xa_lock(&swap_cache->i_pages); }
- for (i = nr - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
- for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
/* Some pages can be beyond i_size: drop them from page cache */ if (head[i].index >= end) { ClearPageDirty(head + i); __delete_from_page_cache(head + i, NULL); if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))__split_huge_page_tail(head, i, lruvec, list, new_order);
shmem_uncharge(head->mapping->host, 1);
} else if (!PageAnon(page)) { __xa_store(&head->mapping->i_pages, head[i].index,shmem_uncharge(head->mapping->host, new_nr); put_page(head + i);
@@ -2457,28 +2468,31 @@ static void __split_huge_page(struct page *page, struct list_head *list, } }
- ClearPageCompound(head);
- if (!new_order)
ClearPageCompound(head);
- else
set_compound_order(head, new_order);
- split_page_owner(head, nr, 1);
- split_page_owner(head, nr, new_nr);
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { /* Additional pin to swap cache */ if (PageSwapCache(head)) {
page_ref_add(head, 2);
} else { page_ref_inc(head); } } else { /* Additional pin to page cache */page_ref_add(head, 1 + new_nr); xa_unlock(&swap_cache->i_pages);
page_ref_add(head, 2);
xa_unlock(&head->mapping->i_pages); }page_ref_add(head, 1 + new_nr);
spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- remap_page(head, nr);
- remap_page(head, nr, new_nr);
if (PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; @@ -2486,7 +2500,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, split_swap_cluster(entry); }
- for (i = 0; i < nr; i++) {
- for (i = 0; i < nr; i += new_nr) { struct page *subpage = head + i; if (subpage == page) continue;
@@ -2620,21 +2634,39 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
- us.
*/ int split_huge_page_to_list(struct page *page, struct list_head *list) +{
- return split_huge_page_to_list_to_order(page, list, 0);
+}
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order)
{ struct page *head = compound_head(page); struct pglist_data *pgdata = NODE_DATA(page_to_nid(head)); struct deferred_split *ds_queue = get_deferred_split_queue(head);
- XA_STATE(xas, &head->mapping->i_pages, head->index);
- /* reset xarray order to new order after split */
- XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index, new_order); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int count, mapcount, extra_pins, ret; unsigned long flags; pgoff_t end;
- VM_BUG_ON(thp_order(head) <= new_order); VM_BUG_ON_PAGE(is_huge_zero_page(head), head); VM_BUG_ON_PAGE(!PageLocked(head), head); VM_BUG_ON_PAGE(!PageCompound(head), head);
- if (new_order == 1) {
WARN_ONCE(1, "Cannot split THP to order-1 (no order-1 THPs)");
return -EINVAL;
- }
- if (PageAnon(head) && new_order) {
WARN_ONCE(1, "Split anonymous THP to non-zero order not support");
return -EINVAL;
- }
I'd convert those into VM_BUG_ON()'s. Unlikely they will be hit at arbitrary moments by random users.
Thanks!
On 13 Nov 2020, at 19:52, Roman Gushchin wrote:
On Wed, Nov 11, 2020 at 03:40:06PM -0500, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
To split a THP to any lower order, we need to re-form THPs on the subpages at the given order and add page refcounts based on the new page order. We also need to reinitialize page_deferred_list after removing the page from the split_queue; otherwise a subsequent split will see list corruption when checking the page_deferred_list again.
This has many uses, such as minimizing the number of pages after truncating a pagecache THP. For anonymous THPs, we can only split them to order-0 as before, until we add support for any-size anonymous THPs.
Signed-off-by: Zi Yan ziy@nvidia.com
include/linux/huge_mm.h | 8 +++++ mm/huge_memory.c | 78 +++++++++++++++++++++++++++++------------ mm/swap.c | 1 - 3 files changed, 63 insertions(+), 24 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 60a907a19f7d..9819cd9b4619 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
bool can_split_huge_page(struct page *page, int *pextra_pins); int split_huge_page_to_list(struct page *page, struct list_head *list); +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order);
static inline int split_huge_page(struct page *page) { return split_huge_page_to_list(page, NULL); @@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list) { return 0; } +static inline int +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
unsigned int new_order)
It was int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, unsigned int new_order); above.
Right. It should be split_huge_page_to_list_to_order. Will fix it.
+{
- return 0;
+} static inline int split_huge_page(struct page *page) { return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8b7d771ee962..88f50da40c9b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, static void unmap_page(struct page *page) { enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
TTU_RMAP_LOCKED;
bool unmap_success;
VM_BUG_ON_PAGE(!PageHead(page), page);
if (thp_order(page) >= HPAGE_PMD_ORDER)
ttu_flags |= TTU_SPLIT_HUGE_PMD;
if (PageAnon(page)) ttu_flags |= TTU_SPLIT_FREEZE;
@@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page) VM_BUG_ON_PAGE(!unmap_success, page); }
-static void remap_page(struct page *page, unsigned int nr) +static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr) { int i;
- if (PageTransHuge(page)) {
- if (thp_nr_pages(page) == nr) { remove_migration_ptes(page, page, true); } else {
for (i = 0; i < nr; i++)
}for (i = 0; i < nr; i += new_nr) remove_migration_ptes(page + i, page + i, true);
}
static void __split_huge_page_tail(struct page *head, int tail,
struct lruvec *lruvec, struct list_head *list)
struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
{ struct page *page_tail = head + tail;
unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
@@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int tail, #ifdef CONFIG_64BIT (1L << PG_arch_2) | #endif
compound_head_flag | (1L << PG_dirty)));
/* ->mapping in first tail page is compound_mapcount */
@@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, int tail, * which needs correct compound_head(). */ clear_compound_head(page_tail);
if (new_order) {
prep_compound_page(page_tail, new_order);
thp_prep(page_tail);
}
/* Finally unfreeze refcount. Additional reference from page cache. */
- page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
PageSwapCache(head)));
page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
PageSwapCache(head)) ?
thp_nr_pages(page_tail) : 0));
if (page_is_young(head)) set_page_young(page_tail);
@@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int tail, }
static void __split_huge_page(struct page *page, struct list_head *list,
pgoff_t end, unsigned long flags)
pgoff_t end, unsigned long flags, unsigned int new_order)
{ struct page *head = compound_head(page); pg_data_t *pgdat = page_pgdat(head); @@ -2424,12 +2434,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, struct address_space *swap_cache = NULL; unsigned long offset = 0; unsigned int nr = thp_nr_pages(head);
unsigned int new_nr = 1 << new_order; int i;
lruvec = mem_cgroup_page_lruvec(head, pgdat);
/* complete memcg works before add pages to LRU */
- mem_cgroup_split_huge_fixup(head, 1);
mem_cgroup_split_huge_fixup(head, new_nr);
if (PageAnon(head) && PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) };
@@ -2439,14 +2450,14 @@ static void __split_huge_page(struct page *page, struct list_head *list, xa_lock(&swap_cache->i_pages); }
- for (i = nr - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
- for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
/* Some pages can be beyond i_size: drop them from page cache */ if (head[i].index >= end) { ClearPageDirty(head + i); __delete_from_page_cache(head + i, NULL); if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))__split_huge_page_tail(head, i, lruvec, list, new_order);
shmem_uncharge(head->mapping->host, 1);
} else if (!PageAnon(page)) { __xa_store(&head->mapping->i_pages, head[i].index,shmem_uncharge(head->mapping->host, new_nr); put_page(head + i);
@@ -2457,28 +2468,31 @@ static void __split_huge_page(struct page *page, struct list_head *list, } }
- ClearPageCompound(head);
- if (!new_order)
ClearPageCompound(head);
- else
set_compound_order(head, new_order);
- split_page_owner(head, nr, 1);
split_page_owner(head, nr, new_nr);
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { /* Additional pin to swap cache */ if (PageSwapCache(head)) {
page_ref_add(head, 2);
} else { page_ref_inc(head); } } else { /* Additional pin to page cache */page_ref_add(head, 1 + new_nr); xa_unlock(&swap_cache->i_pages);
page_ref_add(head, 2);
page_ref_add(head, 1 + new_nr);
xa_unlock(&head->mapping->i_pages); }
spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- remap_page(head, nr);
remap_page(head, nr, new_nr);
if (PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) };
@@ -2486,7 +2500,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, split_swap_cluster(entry); }
- for (i = 0; i < nr; i++) {
- for (i = 0; i < nr; i += new_nr) { struct page *subpage = head + i; if (subpage == page) continue;
@@ -2620,21 +2634,39 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
- us.
*/ int split_huge_page_to_list(struct page *page, struct list_head *list) +{
- return split_huge_page_to_list_to_order(page, list, 0);
+}
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order)
{ struct page *head = compound_head(page); struct pglist_data *pgdata = NODE_DATA(page_to_nid(head)); struct deferred_split *ds_queue = get_deferred_split_queue(head);
- XA_STATE(xas, &head->mapping->i_pages, head->index);
/* reset xarray order to new order after split */
XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index, new_order); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int count, mapcount, extra_pins, ret; unsigned long flags; pgoff_t end;
VM_BUG_ON(thp_order(head) <= new_order); VM_BUG_ON_PAGE(is_huge_zero_page(head), head); VM_BUG_ON_PAGE(!PageLocked(head), head); VM_BUG_ON_PAGE(!PageCompound(head), head);
if (new_order == 1) {
WARN_ONCE(1, "Cannot split THP to order-1 (no order-1 THPs)");
return -EINVAL;
}
if (PageAnon(head) && new_order) {
WARN_ONCE(1, "Split anonymous THP to non-zero order not support");
return -EINVAL;
}
I'd convert those into VM_BUG_ON()'s. Unlikely they will be hit at arbitrary moments by random users.
Sure. Will change them.
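For what it's worth, a sketch of what the converted checks might look like (the exact form in the next version may well differ):

	/* No order-1 compound pages exist, and anonymous THPs only split to order-0 for now. */
	VM_BUG_ON_PAGE(new_order == 1, head);
	VM_BUG_ON_PAGE(PageAnon(head) && new_order, head);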
Thanks.
— Best Regards, Yan Zi
From: Zi Yan ziy@nvidia.com
To minimize the number of pages after a truncation, when truncating a THP, we do not need to split it all the way down to order-0. The THP has at most three parts: the part before the offset, the part to be truncated, and the part remaining at the end. Use the non-zero minimum of these to decide what order to split the THP to.
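For example (illustrative numbers only): truncating the byte range [1MB, 1.5MB) out of a 2MB THP that starts at file offset 0 gives offset = 1MB, length = 0.5MB, and 0.5MB remaining after the hole. The non-zero minimum is 0.5MB, so the THP is split to order-7 (512KB with 4KB pages) instead of all the way down to order-0.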
Signed-off-by: Zi Yan ziy@nvidia.com --- mm/truncate.c | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/mm/truncate.c b/mm/truncate.c index 20bd17538ec2..6d8e3c6115bc 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -237,7 +237,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page) bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end) { loff_t pos = page_offset(page); - unsigned int offset, length; + unsigned int offset, length, left, min_subpage_size = PAGE_SIZE;
if (pos < start) offset = start - pos; @@ -248,6 +248,7 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end) length = length - offset; else length = end + 1 - pos - offset; + left = thp_size(page) - offset - length;
wait_on_page_writeback(page); if (length == thp_size(page)) { @@ -267,7 +268,24 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end) do_invalidatepage(page, offset, length); if (!PageTransHuge(page)) return true; - return split_huge_page(page) == 0; + + /* + * find the non-zero minimum of offset, length, and left and use it to + * decide the new order of the page after split + */ + if (offset && left) + min_subpage_size = min_t(unsigned int, + min_t(unsigned int, offset, length), + left); + else if (!offset) + min_subpage_size = min_t(unsigned int, length, left); + else /* !left */ + min_subpage_size = min_t(unsigned int, length, offset); + + min_subpage_size = max_t(unsigned int, PAGE_SIZE, min_subpage_size); + + return split_huge_page_to_list_to_order(page, NULL, + ilog2(min_subpage_size/PAGE_SIZE)) == 0; }
/*
On 11/11/20 12:40 PM, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
To minimize the number of pages after a truncation, when truncating a THP, we do not need to split it all the way down to order-0. The THP has at most three parts, the part before offset, the part to be truncated, the part left at the end. Use the non-zero minimum of them to decide what order we split the THP to.
Signed-off-by: Zi Yan ziy@nvidia.com
mm/truncate.c | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/mm/truncate.c b/mm/truncate.c index 20bd17538ec2..6d8e3c6115bc 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -237,7 +237,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page) bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end) { loff_t pos = page_offset(page);
- unsigned int offset, length;
+	unsigned int offset, length, left, min_subpage_size = PAGE_SIZE;
Maybe use "remaining" instead of "left" since I think of the latter as the length of the left side (offset).
if (pos < start) offset = start - pos; @@ -248,6 +248,7 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end) length = length - offset; else length = end + 1 - pos - offset;
+	left = thp_size(page) - offset - length;
wait_on_page_writeback(page); if (length == thp_size(page)) { @@ -267,7 +268,24 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end) do_invalidatepage(page, offset, length); if (!PageTransHuge(page)) return true;
- return split_huge_page(page) == 0;
+
+	/*
+	 * find the non-zero minimum of offset, length, and left and use it to
+	 * decide the new order of the page after split
+	 */
+	if (offset && left)
+		min_subpage_size = min_t(unsigned int,
+					 min_t(unsigned int, offset, length),
+					 left);
+	else if (!offset)
+		min_subpage_size = min_t(unsigned int, length, left);
+	else /* !left */
+		min_subpage_size = min_t(unsigned int, length, offset);
+
+	min_subpage_size = max_t(unsigned int, PAGE_SIZE, min_subpage_size);
+
+	return split_huge_page_to_list_to_order(page, NULL,
+			ilog2(min_subpage_size/PAGE_SIZE)) == 0;
 }
What if "min_subpage_size" is 1/2 the THP but offset isn't aligned to 1/2? Splitting the page in half wouldn't result in a page that could be freed but maybe splitting to 1/4 would (assuming the THP is at least 8x PAGE_SIZE).
On 12 Nov 2020, at 17:08, Ralph Campbell wrote:
On 11/11/20 12:40 PM, Zi Yan wrote:
From: Zi Yan ziy@nvidia.com
To minimize the number of pages after a truncation, when truncating a THP, we do not need to split it all the way down to order-0. The THP has at most three parts, the part before offset, the part to be truncated, the part left at the end. Use the non-zero minimum of them to decide what order we split the THP to.
Signed-off-by: Zi Yan ziy@nvidia.com
mm/truncate.c | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/mm/truncate.c b/mm/truncate.c index 20bd17538ec2..6d8e3c6115bc 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -237,7 +237,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page) bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end) { loff_t pos = page_offset(page);
- unsigned int offset, length;
+	unsigned int offset, length, left, min_subpage_size = PAGE_SIZE;
Maybe use "remaining" instead of "left" since I think of the latter as the length of the left side (offset).
Sure. Will change the name.
if (pos < start) offset = start - pos; @@ -248,6 +248,7 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end) length = length - offset; else length = end + 1 - pos - offset;
+	left = thp_size(page) - offset - length;
 	wait_on_page_writeback(page); if (length == thp_size(page)) {
@@ -267,7 +268,24 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end) do_invalidatepage(page, offset, length); if (!PageTransHuge(page)) return true;
- return split_huge_page(page) == 0;
+
+	/*
+	 * find the non-zero minimum of offset, length, and left and use it to
+	 * decide the new order of the page after split
+	 */
+	if (offset && left)
+		min_subpage_size = min_t(unsigned int,
+					 min_t(unsigned int, offset, length),
+					 left);
+	else if (!offset)
+		min_subpage_size = min_t(unsigned int, length, left);
+	else /* !left */
+		min_subpage_size = min_t(unsigned int, length, offset);
+
+	min_subpage_size = max_t(unsigned int, PAGE_SIZE, min_subpage_size);
+
+	return split_huge_page_to_list_to_order(page, NULL,
+			ilog2(min_subpage_size/PAGE_SIZE)) == 0;
 }
What if "min_subpage_size" is 1/2 the THP but offset isn't aligned to 1/2? Splitting the page in half wouldn't result in a page that could be freed but maybe splitting to 1/4 would (assuming the THP is at least 8x PAGE_SIZE).
Is that possible? The whole THP is divided into three parts: offset, length, and remaining (renamed from left). If offset is not aligned to 1/2, it is either greater than or smaller than 1/2 of the THP. If it is greater, then length and remaining must both be smaller than 1/2, so min_subpage_size cannot be 1/2. If it is smaller, min_subpage_size cannot be 1/2 either, because min_subpage_size is the smallest non-zero value of offset, length, and remaining, and offset itself is already below 1/2. Let me know if I miss anything.
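A concrete case (hypothetical numbers, 2MB THP with 4KB base pages): if offset is 768KB, length and remaining together cover the other 1280KB, so the smallest non-zero part is at most 768KB and ilog2(768KB / 4KB) picks order 7, i.e. 512KB subpages, never the 1MB half. min_subpage_size can only equal half of the THP when offset is exactly 0 or exactly half, both of which are half-aligned.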
— Best Regards, Yan Zi
From: Zi Yan ziy@nvidia.com
Extend the debugfs interface with an extra to_order field, so it can be used to test split_huge_page_to_list_to_order on pagecache THPs. Also add test cases for split_huge_page_to_list_to_order via both debugfs and truncating a file.
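For example, writing something like "1234,700000000000,700000400000,4" to <debugfs>/split_huge_pages_in_range_pid asks the kernel to split the THPs mapped in that virtual address range of pid 1234 down to order-4 pages (the pid and addresses here are made up for illustration; omitting the last field keeps the old split-to-order-0 behavior).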
Signed-off-by: Zi Yan ziy@nvidia.com --- mm/huge_memory.c | 13 +-- .../selftests/vm/split_huge_page_test.c | 102 +++++++++++++++++- 2 files changed, 105 insertions(+), 10 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 88f50da40c9b..b7470607a08b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2974,7 +2974,7 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file, static DEFINE_MUTEX(mutex); ssize_t ret; char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */ - int pid; + int pid, to_order = 0; unsigned long vaddr_start, vaddr_end, addr; nodemask_t task_nodes; struct mm_struct *mm; @@ -2990,8 +2990,9 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file, goto out;
input_buf[80] = '\0'; - ret = sscanf(input_buf, "%d,%lx,%lx", &pid, &vaddr_start, &vaddr_end); - if (ret != 3) { + ret = sscanf(input_buf, "%d,%lx,%lx,%d", &pid, &vaddr_start, &vaddr_end, &to_order); + /* order-1 THPs do not exist, so reject splitting to order 1 */ + if ((ret != 3 && ret != 4) || to_order == 1) { ret = -EINVAL; goto out; } @@ -2999,8 +3000,8 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file, vaddr_end &= PAGE_MASK;
ret = strlen(input_buf); - pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n", - pid, vaddr_start, vaddr_end); + pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx], to order: %d\n", + pid, vaddr_start, vaddr_end, to_order);
mm = find_mm_struct(pid, &task_nodes); if (IS_ERR(mm)) { @@ -3038,7 +3039,7 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file, addr += page_size(page) - PAGE_SIZE;
/* reset addr if split fails */ - if (split_huge_page(page)) + if (split_huge_page_to_list_to_order(page, NULL, to_order)) addr -= (page_size(page) - PAGE_SIZE);
unlock_page(page); diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c index c8a32ae9e13a..bcbc5a9d327c 100644 --- a/tools/testing/selftests/vm/split_huge_page_test.c +++ b/tools/testing/selftests/vm/split_huge_page_test.c @@ -16,6 +16,7 @@ #include <sys/wait.h> #include <malloc.h> #include <stdbool.h> +#include <time.h>
#define PAGE_4KB (4096UL) #define PAGE_2MB (512UL*PAGE_4KB) @@ -31,6 +32,7 @@
#define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages_in_range_pid" #define SMAP_PATH "/proc/self/smaps" +#define THP_FS_PATH "/mnt/thp_fs" #define INPUT_MAX 80
static int write_file(const char *path, const char *buf, size_t buflen) @@ -50,13 +52,13 @@ static int write_file(const char *path, const char *buf, size_t buflen) return (unsigned int) numwritten; }
-static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end) +static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end, int order) { char input[INPUT_MAX]; int ret;
- ret = snprintf(input, INPUT_MAX, "%d,%lx,%lx", pid, vaddr_start, - vaddr_end); + ret = snprintf(input, INPUT_MAX, "%d,%lx,%lx,%d", pid, vaddr_start, + vaddr_end, order); if (ret >= INPUT_MAX) { printf("%s: Debugfs input is too long\n", __func__); exit(EXIT_FAILURE); @@ -139,7 +141,7 @@ void split_pmd_thp(void) }
/* split all possible huge pages */ - write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len); + write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len, 0);
*one_page = 0;
@@ -153,9 +155,101 @@ void split_pmd_thp(void) free(one_page); }
+void create_pagecache_thp_and_fd(size_t fd_size, int *fd, char **addr) +{ + const char testfile[] = THP_FS_PATH "/test"; + size_t i; + int dummy; + + srand(time(NULL)); + + *fd = open(testfile, O_CREAT | O_RDWR, 0664); + + for (i = 0; i < fd_size; i++) { + unsigned char byte = rand(); + + write(*fd, &byte, sizeof(byte)); + } + close(*fd); + sync(); + *fd = open("/proc/sys/vm/drop_caches", O_WRONLY); + if (*fd == -1) { + perror("open drop_caches"); + exit(EXIT_FAILURE); + } + if (write(*fd, "3", 1) != 1) { + perror("write to drop_caches"); + exit(EXIT_FAILURE); + } + close(*fd); + + *fd = open(testfile, O_RDWR); + + *addr = mmap(NULL, fd_size, PROT_READ|PROT_WRITE, MAP_SHARED, *fd, 0); + if (*addr == (char *)-1) { + perror("cannot mmap"); + exit(1); + } + madvise(*addr, fd_size, MADV_HUGEPAGE); + + for (size_t i = 0; i < fd_size; i++) + dummy += *(*addr + i); +} + +void split_thp_in_pagecache_to_order(int order) +{ + int fd; + char *addr; + size_t fd_size = 2 * PAGE_2MB, i; + + create_pagecache_thp_and_fd(fd_size, &fd, &addr); + + printf("split %ld kB pagecache page to order %d ... ", fd_size >> 10, order); + write_debugfs(getpid(), (uint64_t)addr, (uint64_t)addr + fd_size, order); + + for (i = 0; i < fd_size; i++) + *(addr + i) = (char)i; + + close(fd); + printf("done\n"); +} + +void truncate_thp_in_pagecache_to_order(int order) +{ + int fd; + char *addr; + size_t fd_size = 2 * PAGE_2MB, i; + + create_pagecache_thp_and_fd(fd_size, &fd, &addr); + + printf("truncate %ld kB pagecache page to size %lu kB ... ", fd_size >> 10, 4UL << order); + ftruncate(fd, PAGE_4KB << order); + + for (i = 0; i < (PAGE_4KB << order); i++) + *(addr + i) = (char)i; + + close(fd); + printf("done\n"); +} + int main(int argc, char **argv) { + int i; + + if (geteuid() != 0) { + printf("Please run the benchmark as root\n"); + exit(EXIT_FAILURE); + } + split_pmd_thp();
+ for (i = 8; i >= 0; i--) + if (i != 1) + split_thp_in_pagecache_to_order(i); + + for (i = 8; i >= 0; i--) + if (i != 1) + truncate_thp_in_pagecache_to_order(i); + return 0; }
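A note on running the test (my reading of the code above, not spelled out in the patch): it has to run as root, and the pagecache cases expect a THP-capable filesystem to be mounted at /mnt/thp_fs (THP_FS_PATH), since the test file is created there; the PMD THP case additionally relies on the /sys/kernel/debug/split_huge_pages_in_range_pid interface added earlier in this series.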