Bingbu reported in [1] that udmabuf vmap fails, and in [2] we discussed the folio vmap scenario that results from the misuse of vmap_pfn() in udmabuf.
We reached the conclusion that vmap_pfn() prohibits the use of page-backed PFNs. Christoph Hellwig: 'No, vmap_pfn is entirely for memory not backed by pages or folios, i.e. PCIe BARs and similar memory. This must not be mixed with proper folio backed memory.'
But udmabuf still has to handle vmap of HVO-backed folios, and the vmap issue still needs fixing. This RFC demonstrates the two points I mentioned in [2] and discusses them in more depth:
Point 1: simply copy the vmap_pfn() code into udmabuf rather than touching the common vmap_pfn(), and drop the pfn_valid() check.

Point 2: implement a folio-array-based vmap (vmap_folios) that accepts a range (offset, nr_pages) for each folio, so it can handle vmap of HVO folios.
Patches 1-2 implement point 1 and add a simple test set in the udmabuf driver. Patches 3-5 implement point 2, which can be tested the same way.
Kasireddy also pointed out that 'another option is to just limit udmabuf's vmap() to only shmem folios' (I guess folio_test_hugetlb_vmemmap_optimized() can help with that).
But I prefer point 2 as the solution to this issue, and IMO a folio-based vmap is still needed in general.
Compared with a page-based (or pfn-based) vmap, we have to split each large folio into individual struct pages, which needs a much larger array and a longer iteration. And if the tail struct pages do not exist (as with HVO), only a pfn-based vmap is possible, but there is no common API for that.

In [2] we discussed that udmabuf can use hugetlb as its memory provider and can use a sub-range of it. So when HVO is used with hugetlb, each folio's tail struct pages may be freed, meaning a page-based vmap cannot be used and only a pfn-based one works, as point 1 shows.
Furthermore, a folio-based vmap only needs to record each folio (plus offset and nr_pages if a range is needed). For a 20MB vmap, a page-based array needs 5120 pages (40KB of pointers), while 2MB folios need only 10 folios (80 bytes).
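As a quick sanity check of those numbers, here is a standalone back-of-the-envelope sketch (my own illustration, assuming 4KB pages, 2MB folios and 8-byte pointers; it is not part of the series):

	#include <stdio.h>

	int main(void)
	{
		unsigned long map_size = 20UL << 20;	/* 20MB mapping */
		unsigned long page_size = 4096UL;	/* 4KB pages */
		unsigned long folio_size = 2UL << 20;	/* 2MB folios */

		/* One pointer-sized entry per page vs. one per folio. */
		printf("page array:  %lu entries, %lu bytes\n",
		       map_size / page_size,
		       map_size / page_size * sizeof(void *));
		printf("folio array: %lu entries, %lu bytes\n",
		       map_size / folio_size,
		       map_size / folio_size * sizeof(void *));
		return 0;
	}

This prints 5120 entries (40960 bytes) versus 10 entries (80 bytes), matching the figures above.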
Matthew pointed out that Vishal has also offered a folio-based vmap, vmap_file() [3]. This RFC wants range-based folio mapping, not only mapping of full folios (like a file's folios), to solve problems like vmap of an HVO folio range.
Please give me more suggestions.
Test case:

  # enable/disable HVO
  1. echo [1|0] > /proc/sys/vm/hugetlb_optimize_vmemmap
  # prepare hugetlb pages
  2. echo 10 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
  3. ./udmabuf_vmap
  4. Check the output, and dmesg for any warnings.
[1] https://lore.kernel.org/all/9172a601-c360-0d5b-ba1b-33deba430455@linux.intel...
[2] https://lore.kernel.org/lkml/20250312061513.1126496-1-link@vivo.com/
[3] https://lore.kernel.org/linux-mm/20250131001806.92349-1-vishal.moola@gmail.c...
Huan Yang (6):
  udmabuf: try fix udmabuf vmap
  udmabuf: try udmabuf vmap test
  mm/vmalloc: try add vmap folios range
  udmabuf: use vmap_range_folios
  udmabuf: vmap test suit for pages and pfns compare
  udmabuf: remove no need code
 drivers/dma-buf/udmabuf.c | 29 +++++++++-----------
 include/linux/vmalloc.h   | 57 +++++++++++++++++++++++++++++++++++++++
 mm/vmalloc.c              | 47 ++++++++++++++++++++++++++++++++
 3 files changed, 117 insertions(+), 16 deletions(-)
--
2.48.1
vmap_pfn() is designed to only allow mappings of PFNs that are not backed by struct pages.

This patch simply copies the pfn vmap code into udmabuf to fix the vmap failure.

Also fix vunmap_udmabuf() to match vmap_udmabuf(), by using vunmap().
Signed-off-by: Huan Yang <link@vivo.com>
---
 drivers/dma-buf/udmabuf.c | 49 +++++++++++++++++++++++++++++++++++----
 1 file changed, 45 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index cc7398cc17d6..2dfe639230dc 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -106,6 +106,49 @@ static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
 	return 0;
 }
 
+struct udmabuf_pfn_data {
+	unsigned long *pfns;
+	pgprot_t prot;
+	unsigned int idx;
+};
+
+static int udmabuf_vmap_pfn_apply(pte_t *pte, unsigned long addr, void *private)
+{
+	struct udmabuf_pfn_data *data = private;
+	pte_t ptent;
+
+	ptent = pte_mkspecial(pfn_pte(data->pfns[data->idx], data->prot));
+	set_pte_at(&init_mm, addr, pte, ptent);
+
+	data->idx++;
+	return 0;
+}
+
+static void *udmabuf_vmap_pfn(unsigned long *pfns, unsigned int count,
+			      pgprot_t prot)
+{
+	struct udmabuf_pfn_data data = { .pfns = pfns,
+					 .prot = pgprot_nx(prot) };
+	struct vm_struct *area;
+
+	area = get_vm_area_caller(count * PAGE_SIZE, 0,
+				  __builtin_return_address(0));
+	if (!area)
+		return NULL;
+
+	if (apply_to_page_range(&init_mm, (unsigned long)area->addr,
+				count * PAGE_SIZE, udmabuf_vmap_pfn_apply,
+				&data)) {
+		free_vm_area(area);
+		return NULL;
+	}
+
+	flush_cache_vmap((unsigned long)area->addr,
+			 (unsigned long)area->addr + count * PAGE_SIZE);
+
+	return area->addr;
+}
+
 static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
 {
 	struct udmabuf *ubuf = buf->priv;
@@ -130,7 +173,7 @@ static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
 		pfns[pg] = pfn;
 	}
 
-	vaddr = vmap_pfn(pfns, ubuf->pagecount, PAGE_KERNEL);
+	vaddr = udmabuf_vmap_pfn(pfns, ubuf->pagecount, PAGE_KERNEL);
 	kvfree(pfns);
 	if (!vaddr)
 		return -EINVAL;
@@ -141,11 +184,9 @@ static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
 
 static void vunmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
 {
-	struct udmabuf *ubuf = buf->priv;
-
 	dma_resv_assert_held(buf->resv);
 
-	vm_unmap_ram(map->vaddr, ubuf->pagecount);
+	vunmap(map->vaddr);
 }
 
 static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
This patch adds a test ioctl to udmabuf to show that vmap works, by comparing the pfn-based vmap against the page-based vmap.

The comparison is skipped for HVO folios, since the page-based vmap cannot be used there.
Signed-off-by: Huan Yang <link@vivo.com>
---
 drivers/dma-buf/udmabuf.c                     |  71 ++++++++
 include/uapi/linux/udmabuf.h                  |   5 +
 .../selftests/drivers/dma-buf/Makefile        |   1 +
 .../selftests/drivers/dma-buf/udmabuf_vmap.c  | 166 ++++++++++++++++++
 4 files changed, 243 insertions(+)
 create mode 100644 tools/testing/selftests/drivers/dma-buf/udmabuf_vmap.c
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 2dfe639230dc..fbe4b59b4c97 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -557,6 +557,74 @@ static long udmabuf_ioctl_create_list(struct file *filp, unsigned long arg)
 	return ret;
 }
 
+static long udmabuf_vmap_test(struct file *filp, unsigned long arg)
+{
+	struct udmabuf_vmap uv;
+	struct dma_buf *dmabuf;
+	bool can_page = true;
+	struct iosys_map map;
+	struct udmabuf *ubuf;
+	struct page **pages;
+	void *vaddr, *pvaddr;
+	struct file *file;
+	int ret = 0, i;
+
+	if (copy_from_user(&uv, (void __user *)arg, sizeof(uv)))
+		return -EFAULT;
+	file = fget(uv.dma_buf_fd);
+	if (!file)
+		return -EINVAL;
+
+	dmabuf = file->private_data;
+	ret = dma_buf_vmap(dmabuf, &map);
+	if (ret)
+		goto out;
+	vaddr = map.vaddr;
+
+	ubuf = dmabuf->priv;
+	for (i = 0; i < ubuf->pagecount; ++i) {
+		struct folio *folio = ubuf->folios[i];
+
+		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
+			can_page = false;
+			break;
+		}
+	}
+
+	if (!can_page)
+		goto out_vaddr;
+
+	pages = kvmalloc_array(ubuf->pagecount, sizeof(*pages), GFP_KERNEL);
+	if (WARN_ON(!pages)) {
+		ret = -ENOMEM;
+		goto out_vaddr;
+	}
+
+	for (i = 0; i < ubuf->pagecount; ++i)
+		pages[i] = folio_page(ubuf->folios[i],
+				      ubuf->offsets[i] >> PAGE_SHIFT);
+
+	pvaddr = vmap(pages, ubuf->pagecount, 0, PAGE_KERNEL);
+	if (WARN_ON(!pvaddr)) {
+		ret = -ENOMEM;
+		goto out_pages;
+	}
+
+	/* compare whether the page and pfn mappings see the same contents */
+	if (WARN_ON(memcmp(vaddr, pvaddr, ubuf->pagecount * PAGE_SIZE) != 0))
+		ret = -EINVAL;
+
+	vunmap(pvaddr);
+out_pages:
+	kvfree(pages);
+out_vaddr:
+	dma_buf_vunmap(dmabuf, &map);
+out:
+	fput(file);
+
+	return ret;
+}
+
 static long udmabuf_ioctl(struct file *filp, unsigned int ioctl,
 			  unsigned long arg)
 {
@@ -569,6 +637,9 @@ static long udmabuf_ioctl(struct file *filp, unsigned int ioctl,
 	case UDMABUF_CREATE_LIST:
 		ret = udmabuf_ioctl_create_list(filp, arg);
 		break;
+	case UDMABUF_VMAP:
+		ret = udmabuf_vmap_test(filp, arg);
+		break;
 	default:
 		ret = -ENOTTY;
 		break;
diff --git a/include/uapi/linux/udmabuf.h b/include/uapi/linux/udmabuf.h
index 46b6532ed855..88f5e5516286 100644
--- a/include/uapi/linux/udmabuf.h
+++ b/include/uapi/linux/udmabuf.h
@@ -27,7 +27,12 @@ struct udmabuf_create_list {
 	struct udmabuf_create_item list[];
 };
 
+struct udmabuf_vmap {
+	int dma_buf_fd;
+};
+
 #define UDMABUF_CREATE _IOW('u', 0x42, struct udmabuf_create)
 #define UDMABUF_CREATE_LIST _IOW('u', 0x43, struct udmabuf_create_list)
+#define UDMABUF_VMAP _IOW('u', 0x44, struct udmabuf_vmap)
 
 #endif /* _UAPI_LINUX_UDMABUF_H */
diff --git a/tools/testing/selftests/drivers/dma-buf/Makefile b/tools/testing/selftests/drivers/dma-buf/Makefile
index 441407bb0e80..e5b131dcc2c3 100644
--- a/tools/testing/selftests/drivers/dma-buf/Makefile
+++ b/tools/testing/selftests/drivers/dma-buf/Makefile
@@ -2,6 +2,7 @@
 CFLAGS += $(KHDR_INCLUDES)
 TEST_GEN_PROGS := udmabuf
+TEST_GEN_PROGS += udmabuf_vmap
 
 top_srcdir ?=../../../../..
diff --git a/tools/testing/selftests/drivers/dma-buf/udmabuf_vmap.c b/tools/testing/selftests/drivers/dma-buf/udmabuf_vmap.c
new file mode 100644
index 000000000000..7bd46c909bdf
--- /dev/null
+++ b/tools/testing/selftests/drivers/dma-buf/udmabuf_vmap.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <malloc.h>
+#include <stdbool.h>
+
+#include <sys/ioctl.h>
+#include <sys/syscall.h>
+#include <sys/mman.h>
+#include <linux/memfd.h>
+#include <linux/udmabuf.h>
+#include "../../kselftest.h"
+
+#define TEST_PREFIX "drivers/dma-buf/udmabuf"
+#define NUM_PAGES 4
+#define NUM_ENTRIES 4
+#define MEMFD_SIZE 1024 /* in pages */
+
+static unsigned int page_size;
+
+static int create_memfd_with_seals(off64_t size, bool hpage)
+{
+	int memfd, ret;
+	unsigned int flags = MFD_ALLOW_SEALING;
+
+	if (hpage)
+		flags |= MFD_HUGETLB;
+
+	memfd = memfd_create("udmabuf-test", flags);
+	if (memfd < 0) {
+		ksft_print_msg("%s: [skip,no-memfd]\n", TEST_PREFIX);
+		exit(KSFT_SKIP);
+	}
+
+	ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
+	if (ret < 0) {
+		ksft_print_msg("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX);
+		exit(KSFT_SKIP);
+	}
+
+	ret = ftruncate(memfd, size);
+	if (ret == -1) {
+		ksft_print_msg("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX);
+		exit(KSFT_FAIL);
+	}
+
+	return memfd;
+}
+
+static int create_udmabuf_list(int devfd, int memfd, off64_t memfd_size)
+{
+	struct udmabuf_create_list *list;
+	int ubuf_fd, i;
+
+	list = malloc(sizeof(struct udmabuf_create_list) +
+		      sizeof(struct udmabuf_create_item) * NUM_ENTRIES);
+	if (!list) {
+		ksft_print_msg("%s: [FAIL, udmabuf-malloc]\n", TEST_PREFIX);
+		exit(KSFT_FAIL);
+	}
+
+	for (i = 0; i < NUM_ENTRIES; i++) {
+		list->list[i].memfd = memfd;
+		list->list[i].offset = i * (memfd_size / NUM_ENTRIES);
+		list->list[i].size = memfd_size / NUM_ENTRIES;
+	}
+
+	list->count = NUM_ENTRIES;
+	list->flags = UDMABUF_FLAGS_CLOEXEC;
+	ubuf_fd = ioctl(devfd, UDMABUF_CREATE_LIST, list);
+	free(list);
+	if (ubuf_fd < 0) {
+		ksft_print_msg("%s: [FAIL, udmabuf-create]\n", TEST_PREFIX);
+		exit(KSFT_FAIL);
+	}
+
+	return ubuf_fd;
+}
+
+static void *mmap_fd(int fd, off64_t size)
+{
+	void *addr;
+
+	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED) {
+		ksft_print_msg("%s: ubuf_fd mmap fail\n", TEST_PREFIX);
+		exit(KSFT_FAIL);
+	}
+
+	return addr;
+}
+
+int main(int argc, char *argv[])
+{
+	struct udmabuf_create create;
+	int devfd, memfd, buf, ret;
+	struct udmabuf_vmap vm;
+	unsigned long *vaddr;
+	off64_t size;
+	int i;
+
+	ksft_print_header();
+	ksft_set_plan(2);
+
+	devfd = open("/dev/udmabuf", O_RDWR);
+	if (devfd < 0) {
+		ksft_print_msg(
+			"%s: [skip,no-udmabuf: Unable to access DMA buffer device file]\n",
+			TEST_PREFIX);
+		exit(KSFT_SKIP);
+	}
+
+	/*
+	 * Normal test
+	 */
+	size = getpagesize() * 512 + getpagesize() * 256;
+	memfd = create_memfd_with_seals(size, false);
+	buf = create_udmabuf_list(devfd, memfd, size);
+	vaddr = (unsigned long *)mmap_fd(buf, size);
+	for (i = 0; i < size / sizeof(unsigned long); i++)
+		vaddr[i] = random();
+
+	vm.dma_buf_fd = buf;
+
+	ret = ioctl(devfd, UDMABUF_VMAP, &vm);
+	if (ret < 0)
+		ksft_test_result_fail("%s: [FAIL, normal test]\n", TEST_PREFIX);
+	else
+		ksft_test_result_pass("%s: [PASS, normal test]\n", TEST_PREFIX);
+	munmap(vaddr, size);
+	close(buf);
+	close(memfd);
+
+	/*
+	 * Hugetlb test, 2MB
+	 */
+	size = getpagesize() * 512;
+	memfd = create_memfd_with_seals(size, true);
+	buf = create_udmabuf_list(devfd, memfd, size);
+	vaddr = (unsigned long *)mmap_fd(buf, size);
+	for (i = 0; i < size / sizeof(unsigned long); i++)
+		vaddr[i] = random();
+
+	vm.dma_buf_fd = buf;
+
+	ret = ioctl(devfd, UDMABUF_VMAP, &vm);
+	if (ret < 0)
+		ksft_test_result_fail("%s: [FAIL, huge test]\n", TEST_PREFIX);
+	else
+		ksft_test_result_pass("%s: [PASS, huge test]\n", TEST_PREFIX);
+	munmap(vaddr, size);
+	close(buf);
+	close(memfd);
+
+	ksft_print_msg("%s: ok\n", TEST_PREFIX);
+	ksft_print_cnts();
+
+	return 0;
+}
This patch adds a new API, vmap_range_folios().

It supports both whole-folio arrays and range-based folio arrays.

With this, cases such as vmapping a range of an HVO folio can be supported. It also reduces the metadata array size: compared with a 4K page array, only each folio and its range (offset, nr_pages) need to be recorded.
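For illustration, a minimal caller sketch against the API added by this patch (map_folio_range() is a hypothetical helper, not part of the series):

	/* Map the page sub-range [pgoff, pgoff + nr) of one large folio. */
	static void *map_folio_range(struct folio *folio, pgoff_t pgoff,
				     unsigned int nr)
	{
		struct vmap_folios_range range = {
			.folio = folio,
			.pgoff = pgoff,
			.nr_pages = nr,
		};
		struct vmap_folios_struct folios = {
			._range_folios = &range,
			.range = true,
		};

		/* vmap_range_folios() validates each range and computes
		 * the total mapping size before building the mapping. */
		return vmap_range_folios(&folios, 1, 0, PAGE_KERNEL);
	}

The returned address is later released with vunmap().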
Signed-off-by: Huan Yang <link@vivo.com>
---
 include/linux/vmalloc.h | 57 +++++++++++++++++++++++++++++++++++++++++
 mm/vmalloc.c            | 47 +++++++++++++++++++++++++++++++++
 2 files changed, 104 insertions(+)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 31e9ffd936e3..007a398152af 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -11,6 +11,7 @@
 #include <asm/page.h>		/* pgprot_t */
 #include <linux/rbtree.h>
 #include <linux/overflow.h>
+#include <linux/mm.h>
 
 #include <asm/vmalloc.h>
 
@@ -83,6 +84,24 @@ struct vmap_area {
 	unsigned long flags; /* mark type of vm_map_ram area */
 };
 
+/**
+ * folio range [pgoff, pgoff + nr_pages) in [0, folio_nr_pages)
+ */
+struct vmap_folios_range {
+	struct folio *folio;
+
+	pgoff_t pgoff;
+	unsigned int nr_pages;
+};
+
+struct vmap_folios_struct {
+	union {
+		struct folio **_folios;
+		struct vmap_folios_range *_range_folios;
+	};
+	bool range;
+};
+
 /* archs that select HAVE_ARCH_HUGE_VMAP should override one or more of these */
 #ifndef arch_vmap_p4d_supported
 static inline bool arch_vmap_p4d_supported(pgprot_t prot)
@@ -195,6 +214,44 @@ extern void *vmap(struct page **pages, unsigned int count,
 void *vmap_pfn(unsigned long *pfns, unsigned int count, pgprot_t prot);
 extern void vunmap(const void *addr);
 
+extern void *__vmap_range_folios(struct vmap_folios_struct *folios,
+				 unsigned int count, unsigned int size,
+				 unsigned long flags, pgprot_t prot);
+
+static inline void *vmap_range_folios(struct vmap_folios_struct *folios,
+				      unsigned int count, unsigned long flags,
+				      pgprot_t prot)
+{
+	unsigned int size, i;
+
+	for (i = 0, size = 0; i < count; ++i) {
+		struct vmap_folios_range *range = &folios->_range_folios[i];
+
+		if (WARN_ON(!range || !range->folio))
+			return NULL;
+
+		if (WARN_ON(range->pgoff + range->nr_pages >
+			    folio_nr_pages(range->folio)))
+			return NULL;
+
+		size += PAGE_SIZE * range->nr_pages;
+	}
+
+	return __vmap_range_folios(folios, count, size, flags, prot);
+}
+
+static inline void *vmap_folios(struct vmap_folios_struct *folios,
+				unsigned int count, unsigned long flags,
+				pgprot_t prot)
+{
+	unsigned int size, i;
+
+	for (i = 0, size = 0; i < count; ++i)
+		size += folio_size(folios->_folios[i]);
+
+	return __vmap_range_folios(folios, count, size, flags, prot);
+}
+
 extern int remap_vmalloc_range_partial(struct vm_area_struct *vma,
 				       unsigned long uaddr, void *kaddr,
 				       unsigned long pgoff, unsigned long size);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 044af7088359..247a4b940be1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3530,6 +3530,53 @@ void *vmap_pfn(unsigned long *pfns, unsigned int count, pgprot_t prot)
 EXPORT_SYMBOL_GPL(vmap_pfn);
 #endif /* CONFIG_VMAP_PFN */
 
+void *__vmap_range_folios(struct vmap_folios_struct *folios, unsigned int count,
+			  unsigned int size, unsigned long flags, pgprot_t prot)
+{
+	struct vm_struct *area;
+	unsigned long addr;
+	unsigned int i;
+
+	area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+	if (!area)
+		return NULL;
+
+	addr = (unsigned long)area->addr;
+	for (i = 0; i < count; ++i) {
+		unsigned int nr_pages;
+		unsigned long pfn;
+		int err;
+
+		if (folios->range) {
+			struct vmap_folios_range *folio_range =
+				&folios->_range_folios[i];
+
+			pfn = folio_pfn(folio_range->folio) +
+			      folio_range->pgoff;
+			nr_pages = folio_range->nr_pages;
+		} else {
+			struct folio *folio = folios->_folios[i];
+
+			pfn = folio_pfn(folio);
+			nr_pages = folio_nr_pages(folio);
+		}
+
+		err = vmap_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
+					 PFN_PHYS(pfn), prot, PAGE_SHIFT);
+		if (err) {
+			free_vm_area(area);
+			return NULL;
+		}
+
+		addr += (nr_pages << PAGE_SHIFT);
+	}
+
+	flush_cache_vmap((unsigned long)area->addr,
+			 (unsigned long)area->addr + size);
+	return area->addr;
+}
+EXPORT_SYMBOL_GPL(__vmap_range_folios);
+
 static inline unsigned int
 vm_area_alloc_pages(gfp_t gfp, int nid,
 		unsigned int order, unsigned int nr_pages, struct page **pages)
This patch switches udmabuf from a pfn-based vmap to a folio-based vmap.
Signed-off-by: Huan Yang <link@vivo.com>
---
 drivers/dma-buf/udmabuf.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index fbe4b59b4c97..f50e11f0f034 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -152,29 +152,28 @@ static void *udmabuf_vmap_pfn(unsigned long *pfns, unsigned int count,
 static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
 {
 	struct udmabuf *ubuf = buf->priv;
-	unsigned long *pfns;
+	struct vmap_folios_struct folios;
+	struct vmap_folios_range *ranges;
 	void *vaddr;
 	pgoff_t pg;
 
 	dma_resv_assert_held(buf->resv);
 
-	/**
-	 * HVO may free tail pages, so just use pfn to map each folio
-	 * into vmalloc area.
-	 */
-	pfns = kvmalloc_array(ubuf->pagecount, sizeof(*pfns), GFP_KERNEL);
-	if (!pfns)
+	ranges = kvmalloc_array(ubuf->pagecount, sizeof(*ranges), GFP_KERNEL);
+	if (!ranges)
 		return -ENOMEM;
 
 	for (pg = 0; pg < ubuf->pagecount; pg++) {
-		unsigned long pfn = folio_pfn(ubuf->folios[pg]);
-
-		pfn += ubuf->offsets[pg] >> PAGE_SHIFT;
-		pfns[pg] = pfn;
+		ranges[pg].folio = ubuf->folios[pg];
+		ranges[pg].pgoff = ubuf->offsets[pg] >> PAGE_SHIFT;
+		ranges[pg].nr_pages = 1;
 	}
 
-	vaddr = udmabuf_vmap_pfn(pfns, ubuf->pagecount, PAGE_KERNEL);
-	kvfree(pfns);
+	folios._range_folios = ranges;
+	folios.range = true;
+
+	vaddr = vmap_range_folios(&folios, ubuf->pagecount, 0, PAGE_KERNEL);
+	kvfree(ranges);
 	if (!vaddr)
 		return -EINVAL;
This patch compares both the page-based and pfn-based vmaps, to show that the folio-range vmap and the pfn-based vmap work.
Signed-off-by: Huan Yang <link@vivo.com>
---
 drivers/dma-buf/udmabuf.c | 64 ++++++++++++++++++++++++++++-----------
 1 file changed, 46 insertions(+), 18 deletions(-)
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index f50e11f0f034..78549a9f24ca 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -556,6 +556,48 @@ static long udmabuf_ioctl_create_list(struct file *filp, unsigned long arg)
 	return ret;
 }
 
+static void *udmabuf_vmap_by_pfns(struct udmabuf *udmabuf)
+{
+	unsigned long *pfns;
+	void *vaddr = NULL;
+	unsigned int i;
+
+	pfns = kvmalloc_array(udmabuf->pagecount, sizeof(*pfns), GFP_KERNEL);
+	if (WARN_ON(!pfns))
+		return NULL;
+
+	for (i = 0; i < udmabuf->pagecount; ++i)
+		pfns[i] = folio_pfn(udmabuf->folios[i]) +
+			  (udmabuf->offsets[i] >> PAGE_SHIFT);
+
+	vaddr = udmabuf_vmap_pfn(pfns, udmabuf->pagecount, PAGE_KERNEL);
+	WARN_ON(!vaddr);
+
+	kvfree(pfns);
+	return vaddr;
+}
+
+static void *udmabuf_vmap_by_pages(struct udmabuf *udmabuf)
+{
+	struct page **pages;
+	void *vaddr = NULL;
+	unsigned int i;
+
+	pages = kvmalloc_array(udmabuf->pagecount, sizeof(*pages), GFP_KERNEL);
+	if (WARN_ON(!pages))
+		return NULL;
+
+	for (i = 0; i < udmabuf->pagecount; ++i)
+		pages[i] = folio_page(udmabuf->folios[i],
+				      udmabuf->offsets[i] >> PAGE_SHIFT);
+
+	vaddr = vmap(pages, udmabuf->pagecount, 0, PAGE_KERNEL);
+	WARN_ON(!vaddr);
+
+	kvfree(pages);
+	return vaddr;
+}
+
 static long udmabuf_vmap_test(struct file *filp, unsigned long arg)
 {
 	struct udmabuf_vmap uv;
@@ -563,7 +605,6 @@ static long udmabuf_vmap_test(struct file *filp, unsigned long arg)
 	bool can_page = true;
 	struct iosys_map map;
 	struct udmabuf *ubuf;
-	struct page **pages;
 	void *vaddr, *pvaddr;
 	struct file *file;
 	int ret = 0, i;
@@ -591,31 +632,18 @@ static long udmabuf_vmap_test(struct file *filp, unsigned long arg)
 	}
 
 	if (!can_page)
-		goto out_vaddr;
+		pvaddr = udmabuf_vmap_by_pfns(ubuf);
+	else
+		pvaddr = udmabuf_vmap_by_pages(ubuf);
 
-	pages = kvmalloc_array(ubuf->pagecount, sizeof(*pages), GFP_KERNEL);
-	if (WARN_ON(!pages)) {
-		ret = -ENOMEM;
+	if (!pvaddr)
 		goto out_vaddr;
-	}
-
-	for (i = 0; i < ubuf->pagecount; ++i)
-		pages[i] = folio_page(ubuf->folios[i],
-				      ubuf->offsets[i] >> PAGE_SHIFT);
-
-	pvaddr = vmap(pages, ubuf->pagecount, 0, PAGE_KERNEL);
-	if (WARN_ON(!pvaddr)) {
-		ret = -ENOMEM;
-		goto out_pages;
-	}
 
 	/* compare whether the page and pfn mappings see the same contents */
 	if (WARN_ON(memcmp(vaddr, pvaddr, ubuf->pagecount * PAGE_SIZE) != 0))
 		ret = -EINVAL;
 
 	vunmap(pvaddr);
-out_pages:
-	kvfree(pages);
 out_vaddr:
 	dma_buf_vunmap(dmabuf, &map);
 out:
This patch removes all of the test code.
Signed-off-by: Huan Yang <link@vivo.com>
---
 drivers/dma-buf/udmabuf.c                     | 142 ---------------
 include/uapi/linux/udmabuf.h                  |   5 -
 .../selftests/drivers/dma-buf/Makefile        |   1 -
 .../selftests/drivers/dma-buf/udmabuf_vmap.c  | 166 ------------------
 4 files changed, 314 deletions(-)
 delete mode 100644 tools/testing/selftests/drivers/dma-buf/udmabuf_vmap.c
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 78549a9f24ca..67ab50914a31 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -106,49 +106,6 @@ static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
 	return 0;
 }
 
-struct udmabuf_pfn_data {
-	unsigned long *pfns;
-	pgprot_t prot;
-	unsigned int idx;
-};
-
-static int udmabuf_vmap_pfn_apply(pte_t *pte, unsigned long addr, void *private)
-{
-	struct udmabuf_pfn_data *data = private;
-	pte_t ptent;
-
-	ptent = pte_mkspecial(pfn_pte(data->pfns[data->idx], data->prot));
-	set_pte_at(&init_mm, addr, pte, ptent);
-
-	data->idx++;
-	return 0;
-}
-
-static void *udmabuf_vmap_pfn(unsigned long *pfns, unsigned int count,
-			      pgprot_t prot)
-{
-	struct udmabuf_pfn_data data = { .pfns = pfns,
-					 .prot = pgprot_nx(prot) };
-	struct vm_struct *area;
-
-	area = get_vm_area_caller(count * PAGE_SIZE, 0,
-				  __builtin_return_address(0));
-	if (!area)
-		return NULL;
-
-	if (apply_to_page_range(&init_mm, (unsigned long)area->addr,
-				count * PAGE_SIZE, udmabuf_vmap_pfn_apply,
-				&data)) {
-		free_vm_area(area);
-		return NULL;
-	}
-
-	flush_cache_vmap((unsigned long)area->addr,
-			 (unsigned long)area->addr + count * PAGE_SIZE);
-
-	return area->addr;
-}
-
 static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
 {
 	struct udmabuf *ubuf = buf->priv;
@@ -556,102 +513,6 @@ static long udmabuf_ioctl_create_list(struct file *filp, unsigned long arg)
 	return ret;
 }
 
-static void *udmabuf_vmap_by_pfns(struct udmabuf *udmabuf)
-{
-	unsigned long *pfns;
-	void *vaddr = NULL;
-	unsigned int i;
-
-	pfns = kvmalloc_array(udmabuf->pagecount, sizeof(*pfns), GFP_KERNEL);
-	if (WARN_ON(!pfns))
-		return NULL;
-
-	for (i = 0; i < udmabuf->pagecount; ++i)
-		pfns[i] = folio_pfn(udmabuf->folios[i]) +
-			  (udmabuf->offsets[i] >> PAGE_SHIFT);
-
-	vaddr = udmabuf_vmap_pfn(pfns, udmabuf->pagecount, PAGE_KERNEL);
-	WARN_ON(!vaddr);
-
-	kvfree(pfns);
-	return vaddr;
-}
-
-static void *udmabuf_vmap_by_pages(struct udmabuf *udmabuf)
-{
-	struct page **pages;
-	void *vaddr = NULL;
-	unsigned int i;
-
-	pages = kvmalloc_array(udmabuf->pagecount, sizeof(*pages), GFP_KERNEL);
-	if (WARN_ON(!pages))
-		return NULL;
-
-	for (i = 0; i < udmabuf->pagecount; ++i)
-		pages[i] = folio_page(udmabuf->folios[i],
-				      udmabuf->offsets[i] >> PAGE_SHIFT);
-
-	vaddr = vmap(pages, udmabuf->pagecount, 0, PAGE_KERNEL);
-	WARN_ON(!vaddr);
-
-	kvfree(pages);
-	return vaddr;
-}
-
-static long udmabuf_vmap_test(struct file *filp, unsigned long arg)
-{
-	struct udmabuf_vmap uv;
-	struct dma_buf *dmabuf;
-	bool can_page = true;
-	struct iosys_map map;
-	struct udmabuf *ubuf;
-	void *vaddr, *pvaddr;
-	struct file *file;
-	int ret = 0, i;
-
-	if (copy_from_user(&uv, (void __user *)arg, sizeof(uv)))
-		return -EFAULT;
-	file = fget(uv.dma_buf_fd);
-	if (!file)
-		return -EINVAL;
-
-	dmabuf = file->private_data;
-	ret = dma_buf_vmap(dmabuf, &map);
-	if (ret)
-		goto out;
-	vaddr = map.vaddr;
-
-	ubuf = dmabuf->priv;
-	for (i = 0; i < ubuf->pagecount; ++i) {
-		struct folio *folio = ubuf->folios[i];
-
-		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
-			can_page = false;
-			break;
-		}
-	}
-
-	if (!can_page)
-		pvaddr = udmabuf_vmap_by_pfns(ubuf);
-	else
-		pvaddr = udmabuf_vmap_by_pages(ubuf);
-
-	if (!pvaddr)
-		goto out_vaddr;
-
-	/* compare whether the page and pfn mappings see the same contents */
-	if (WARN_ON(memcmp(vaddr, pvaddr, ubuf->pagecount * PAGE_SIZE) != 0))
-		ret = -EINVAL;
-
-	vunmap(pvaddr);
-out_vaddr:
-	dma_buf_vunmap(dmabuf, &map);
-out:
-	fput(file);
-
-	return ret;
-}
-
 static long udmabuf_ioctl(struct file *filp, unsigned int ioctl,
 			  unsigned long arg)
 {
@@ -664,9 +525,6 @@ static long udmabuf_ioctl(struct file *filp, unsigned int ioctl,
 	case UDMABUF_CREATE_LIST:
 		ret = udmabuf_ioctl_create_list(filp, arg);
 		break;
-	case UDMABUF_VMAP:
-		ret = udmabuf_vmap_test(filp, arg);
-		break;
 	default:
 		ret = -ENOTTY;
 		break;
diff --git a/include/uapi/linux/udmabuf.h b/include/uapi/linux/udmabuf.h
index 88f5e5516286..46b6532ed855 100644
--- a/include/uapi/linux/udmabuf.h
+++ b/include/uapi/linux/udmabuf.h
@@ -27,12 +27,7 @@ struct udmabuf_create_list {
 	struct udmabuf_create_item list[];
 };
 
-struct udmabuf_vmap {
-	int dma_buf_fd;
-};
-
 #define UDMABUF_CREATE _IOW('u', 0x42, struct udmabuf_create)
 #define UDMABUF_CREATE_LIST _IOW('u', 0x43, struct udmabuf_create_list)
-#define UDMABUF_VMAP _IOW('u', 0x44, struct udmabuf_vmap)
 
 #endif /* _UAPI_LINUX_UDMABUF_H */
diff --git a/tools/testing/selftests/drivers/dma-buf/Makefile b/tools/testing/selftests/drivers/dma-buf/Makefile
index e5b131dcc2c3..441407bb0e80 100644
--- a/tools/testing/selftests/drivers/dma-buf/Makefile
+++ b/tools/testing/selftests/drivers/dma-buf/Makefile
@@ -2,7 +2,6 @@
 CFLAGS += $(KHDR_INCLUDES)
 TEST_GEN_PROGS := udmabuf
-TEST_GEN_PROGS += udmabuf_vmap
 
 top_srcdir ?=../../../../..
diff --git a/tools/testing/selftests/drivers/dma-buf/udmabuf_vmap.c b/tools/testing/selftests/drivers/dma-buf/udmabuf_vmap.c
deleted file mode 100644
index 7bd46c909bdf..000000000000
--- a/tools/testing/selftests/drivers/dma-buf/udmabuf_vmap.c
+++ /dev/null
@@ -1,166 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#define _GNU_SOURCE
-#define __EXPORTED_HEADERS__
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-#include <string.h>
-#include <errno.h>
-#include <fcntl.h>
-#include <malloc.h>
-#include <stdbool.h>
-
-#include <sys/ioctl.h>
-#include <sys/syscall.h>
-#include <sys/mman.h>
-#include <linux/memfd.h>
-#include <linux/udmabuf.h>
-#include "../../kselftest.h"
-
-#define TEST_PREFIX "drivers/dma-buf/udmabuf"
-#define NUM_PAGES 4
-#define NUM_ENTRIES 4
-#define MEMFD_SIZE 1024 /* in pages */
-
-static unsigned int page_size;
-
-static int create_memfd_with_seals(off64_t size, bool hpage)
-{
-	int memfd, ret;
-	unsigned int flags = MFD_ALLOW_SEALING;
-
-	if (hpage)
-		flags |= MFD_HUGETLB;
-
-	memfd = memfd_create("udmabuf-test", flags);
-	if (memfd < 0) {
-		ksft_print_msg("%s: [skip,no-memfd]\n", TEST_PREFIX);
-		exit(KSFT_SKIP);
-	}
-
-	ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
-	if (ret < 0) {
-		ksft_print_msg("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX);
-		exit(KSFT_SKIP);
-	}
-
-	ret = ftruncate(memfd, size);
-	if (ret == -1) {
-		ksft_print_msg("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX);
-		exit(KSFT_FAIL);
-	}
-
-	return memfd;
-}
-
-static int create_udmabuf_list(int devfd, int memfd, off64_t memfd_size)
-{
-	struct udmabuf_create_list *list;
-	int ubuf_fd, i;
-
-	list = malloc(sizeof(struct udmabuf_create_list) +
-		      sizeof(struct udmabuf_create_item) * NUM_ENTRIES);
-	if (!list) {
-		ksft_print_msg("%s: [FAIL, udmabuf-malloc]\n", TEST_PREFIX);
-		exit(KSFT_FAIL);
-	}
-
-	for (i = 0; i < NUM_ENTRIES; i++) {
-		list->list[i].memfd = memfd;
-		list->list[i].offset = i * (memfd_size / NUM_ENTRIES);
-		list->list[i].size = memfd_size / NUM_ENTRIES;
-	}
-
-	list->count = NUM_ENTRIES;
-	list->flags = UDMABUF_FLAGS_CLOEXEC;
-	ubuf_fd = ioctl(devfd, UDMABUF_CREATE_LIST, list);
-	free(list);
-	if (ubuf_fd < 0) {
-		ksft_print_msg("%s: [FAIL, udmabuf-create]\n", TEST_PREFIX);
-		exit(KSFT_FAIL);
-	}
-
-	return ubuf_fd;
-}
-
-static void *mmap_fd(int fd, off64_t size)
-{
-	void *addr;
-
-	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
-	if (addr == MAP_FAILED) {
-		ksft_print_msg("%s: ubuf_fd mmap fail\n", TEST_PREFIX);
-		exit(KSFT_FAIL);
-	}
-
-	return addr;
-}
-
-int main(int argc, char *argv[])
-{
-	struct udmabuf_create create;
-	int devfd, memfd, buf, ret;
-	struct udmabuf_vmap vm;
-	unsigned long *vaddr;
-	off64_t size;
-	int i;
-
-	ksft_print_header();
-	ksft_set_plan(2);
-
-	devfd = open("/dev/udmabuf", O_RDWR);
-	if (devfd < 0) {
-		ksft_print_msg(
-			"%s: [skip,no-udmabuf: Unable to access DMA buffer device file]\n",
-			TEST_PREFIX);
-		exit(KSFT_SKIP);
-	}
-
-	/*
-	 * Normal test
-	 */
-	size = getpagesize() * 512 + getpagesize() * 256;
-	memfd = create_memfd_with_seals(size, false);
-	buf = create_udmabuf_list(devfd, memfd, size);
-	vaddr = (unsigned long *)mmap_fd(buf, size);
-	for (i = 0; i < size / sizeof(unsigned long); i++)
-		vaddr[i] = random();
-
-	vm.dma_buf_fd = buf;
-
-	ret = ioctl(devfd, UDMABUF_VMAP, &vm);
-	if (ret < 0)
-		ksft_test_result_fail("%s: [FAIL, normal test]\n", TEST_PREFIX);
-	else
-		ksft_test_result_pass("%s: [PASS, normal test]\n", TEST_PREFIX);
-	munmap(vaddr, size);
-	close(buf);
-	close(memfd);
-
-	/*
-	 * Hugetlb test, 2MB
-	 */
-	size = getpagesize() * 512;
-	memfd = create_memfd_with_seals(size, true);
-	buf = create_udmabuf_list(devfd, memfd, size);
-	vaddr = (unsigned long *)mmap_fd(buf, size);
-	for (i = 0; i < size / sizeof(unsigned long); i++)
-		vaddr[i] = random();
-
-	vm.dma_buf_fd = buf;
-
-	ret = ioctl(devfd, UDMABUF_VMAP, &vm);
-	if (ret < 0)
-		ksft_test_result_fail("%s: [FAIL, huge test]\n", TEST_PREFIX);
-	else
-		ksft_test_result_pass("%s: [PASS, huge test]\n", TEST_PREFIX);
-	munmap(vaddr, size);
-	close(buf);
-	close(memfd);
-
-	ksft_print_msg("%s: ok\n", TEST_PREFIX);
-	ksft_print_cnts();
-
-	return 0;
-}
On Thu, Mar 27, 2025 at 05:28:27PM +0800, Huan Yang wrote:
Matthew pointed out that Vishal has also offered a folio-based vmap, vmap_file() [3]. This RFC wants range-based folio mapping, not only mapping of full folios (like a file's folios), to solve problems like vmap of an HVO folio range.
Hmmm, I should've been more communicative, sorry about that. V1 was poorly implemented, and I've had a V2 sitting around that does exactly what you want.

I'll send V2 to the mailing list and you can take a look at it; preferably you integrate that into this patchset instead (it would make both the udmabuf and vmalloc code much neater).
After the btrfs compressed bio discussion I think the hugetlb changes that skip the tail pages are fundamentally unsafe in the current kernel.
That is because the bio_vec representation assumes tail pages do exist, so as soon as you are doing direct I/O that generates a bvec starting beyond the present head page, things will blow up. Other users of bio_vecs might do the same, but the way the block bio_vecs are generated is very susceptible to that. So we'll first need to sort that out, and a few other things, before we can even think of enabling such a feature.
On Apr 4, 2025, at 17:01, Christoph Hellwig <hch@lst.de> wrote:
After the btrfs compressed bio discussion I think the hugetlb changes that skip the tail pages are fundamentally unsafe in the current kernel.
That is because the bio_vec representation assumes tail pages do exist, so as soon as you are doing direct I/O that generates a bvec starting beyond the present head page, things will blow up. Other users of bio_vecs might do the same, but the way the block bio_vecs are generated is very susceptible to that. So we'll first need to sort that out, and a few other things, before we can even think of enabling such a feature.
I would like to express my gratitude to Christoph for including me in the thread. I have carefully read the cover letter in [1], which indicates that an issue has arisen due to the improper use of `vmap_pfn()`. I'm wondering if we could consider using `vmap()` instead. In the HVO scenario, the tail struct pages do **exist**, but they are read-only. I've examined the code of `vmap()`, and it appears that it only reads the struct page. Therefore, it seems feasible for us to use `vmap()` (I am not an expert in udmabuf). Right?
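A minimal sketch of that suggestion, assuming it simply mirrors the page-based test helper that appears later in this thread (hypothetical, untested):

	static void *udmabuf_vmap_via_pages(struct udmabuf *ubuf)
	{
		struct page **pages;
		void *vaddr;
		pgoff_t pg;

		pages = kvmalloc_array(ubuf->pagecount, sizeof(*pages),
				       GFP_KERNEL);
		if (!pages)
			return NULL;

		/* Only reads the struct pages, which HVO permits. */
		for (pg = 0; pg < ubuf->pagecount; pg++)
			pages[pg] = folio_page(ubuf->folios[pg],
					       ubuf->offsets[pg] >> PAGE_SHIFT);

		vaddr = vmap(pages, ubuf->pagecount, 0, PAGE_KERNEL);
		kvfree(pages);
		return vaddr;
	}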
[1] https://lore.kernel.org/linux-mm/20250327092922.536-1-link@vivo.com/T/#m055b...
Thanks, Muchun.
On Apr 4, 2025, at 17:38, Muchun Song <muchun.song@linux.dev> wrote:
I would like to express my gratitude to Christoph for including me in the thread. I have carefully read the cover letter in [1], which indicates that an issue has arisen due to the improper use of `vmap_pfn()`. I'm wondering if we could consider using `vmap()` instead. In the HVO scenario, the tail struct pages do **exist**, but they are read-only. I've examined the code of `vmap()`, and it appears that it only reads the struct page. Therefore, it seems feasible for us to use `vmap()` (I am not an expert in udmabuf). Right?
I believe my stance is correct. I've also reviewed another thread in [2]. Allow me to clarify and correct the viewpoints you presented. You stated: " So by HVO, it also not backed by pages, only contains folio head, each tail pfn's page struct go away. " This statement is entirely inaccurate. The tail pages do not cease to exist; rather, they are read-only. For your specific use-case, please use `vmap()` to resolve the issue at hand. If you wish to gain a comprehensive understanding of the fundamentals of HVO, I kindly suggest a thorough review of the document in [3].
[2] https://lore.kernel.org/lkml/5229b24f-1984-4225-ae03-8b952de56e3b@vivo.com/#... [3] Documentation/mm/vmemmap_dedup.rst
Thanks, Muchun.
On 2025/4/4 18:07, Muchun Song wrote:
This statement is entirely inaccurate. The tail pages do not cease to exist; rather, they are read-only. For your specific use-case, please use `vmap()` to resolve the issue at hand. If you wish to gain a comprehensive understanding of the fundamentals of HVO, I kindly suggest a thorough review of the document in [3].
I see the document gives a simple diagram to illustrate this:
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    PMD    |                     +-----------+                       | | |
 |   level   |                     |     5     | ----------------------+ | |
 |  mapping  |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+
If I understand correctly, the struct pages for tail pages 2-7 are freed. So if I only need to map pages 2-7, can vmap() still do the right thing?

Or is there something I still misunderstand? Please correct me.
Thanks,
Huan Yang
On Apr 7, 2025, at 09:59, Huan Yang <link@vivo.com> wrote:
If I understand correctly, the struct pages for tail pages 2-7 are freed. So if I only need to map pages 2-7, can vmap() still do the right thing?
The answer is you can. It is essential to distinguish between virtual address (VA) and physical address (PA). The VAs of tail struct pages aren't freed but remapped to the physical page mapped by the VA of the head struct page (since contents of those tail physical pages are the same). Thus, the freed pages are the physical pages mapped by original tail struct pages, not their virtual addresses. Moreover, while it is possible to read the virtual addresses of these tail struct pages, any write operations are prohibited since it is within the realm of acceptability that the kernel is expected to perform write operations solely on the head struct page of a compound head and conduct read operations only on the tail struct pages. BTW, folio infrastructure is also based on this assumption.
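To make that concrete, a small sketch of why the usual accessors keep working (my own illustration, assuming CONFIG_SPARSEMEM_VMEMMAP; not part of the thread):

	/* Sketch: tail struct pages remain readable under HVO. */
	static unsigned long tail_page_reads_are_fine(struct folio *folio)
	{
		/* folio_page() is pointer arithmetic into the vmemmap; the
		 * tail struct page's virtual address is still mapped under
		 * HVO, just read-only (remapped to a shared physical page). */
		struct page *tail = folio_page(folio, 2);

		/* A read: compound_head() only loads from *tail. */
		VM_WARN_ON(compound_head(tail) != folio_page(folio, 0));

		/* Address arithmetic, no struct page access at all. */
		return page_to_pfn(tail);
	}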
Thanks, Muchun.
On 2025/4/7 10:57, Muchun Song wrote:
The answer is you can. It is essential to distinguish between virtual address (VA) and physical address (PA). The VAs of tail struct pages aren't freed but remapped to the physical page mapped by the VA of the head struct page (since contents of those tail physical pages are the same).
Thanks for your reply, but I still can't understand it. For example, suppose I need to vmap pages 2-7 of a hugetlb HVO folio:

	struct page **pages = kvmalloc_array(6, sizeof(*pages), GFP_KERNEL);

	for (i = 2; i < 8; ++i)
		pages[i - 2] = folio_page(folio, i);	/* collect pages 2-7 */

	void *vaddr = vmap(pages, 6, 0, PAGE_KERNEL);

For non-HVO pages this works. But with HVO enabled, does folio_page(folio, i) just return the head page? And how can vmap() map each page correctly?
Please correct me. :)
Thanks,
Huan Yang
On Apr 7, 2025, at 11:21, Huan Yang <link@vivo.com> wrote:
Thanks for your reply, but I still can't understand it. For example, suppose I need to vmap pages 2-7 of a hugetlb HVO folio:

	struct page **pages = kvmalloc_array(6, sizeof(*pages), GFP_KERNEL);

	for (i = 2; i < 8; ++i)
		pages[i - 2] = folio_page(folio, i);	/* collect pages 2-7 */

	void *vaddr = vmap(pages, 6, 0, PAGE_KERNEL);

For non-HVO pages this works. But with HVO enabled, does folio_page(folio, i) just return the head page? And how can vmap() map each page correctly?
Why do you think folio_page(folio, i) (i ≠ 0) returns the head page? Is it speculation or tested? Please base it on the actual situation instead of indulging in wild thoughts.
Thanks, Muchun.
On Apr 7, 2025, at 11:37, Muchun Song <muchun.song@linux.dev> wrote:
On Apr 7, 2025, at 11:21, Huan Yang link@vivo.com wrote:
在 2025/4/7 10:57, Muchun Song 写道:
On Apr 7, 2025, at 09:59, Huan Yang link@vivo.com wrote:
在 2025/4/4 18:07, Muchun Song 写道:
On Apr 4, 2025, at 17:38, Muchun Song muchun.song@linux.dev wrote:
> On Apr 4, 2025, at 17:01, Christoph Hellwig hch@lst.de wrote: > > After the btrfs compressed bio discussion I think the hugetlb changes that > skip the tail pages are fundamentally unsafe in the current kernel. > > That is because the bio_vec representation assumes tail pages do exist, so > as soon as you are doing direct I/O that generates a bvec starting beyond > the present head page things will blow up. Other users of bio_vecs might > do the same, but the way the block bio_vecs are generated are very suspect > to that. So we'll first need to sort that out and a few other things > before we can even think of enabling such a feature. > I would like to express my gratitude to Christoph for including me in the thread. I have carefully read the cover letter in [1], which indicates that an issue has arisen due to the improper use of `vmap_pfn()`. I'm wondering if we could consider using `vmap()` instead. In the HVO scenario, the tail struct pages do **exist**, but they are read-only. I've examined the code of `vmap()`, and it appears that it only reads the struct page. Therefore, it seems feasible for us to use `vmap()` (I am not a expert in udmabuf.). Right?
I believe my stance is correct. I've also reviewed another thread in [2]. Allow me to clarify and correct the viewpoints you presented. You stated: " So by HVO, it also not backed by pages, only contains folio head, each tail pfn's page struct go away. " This statement is entirely inaccurate. The tail pages do not cease to exist; rather, they are read-only. For your specific use-case, please use `vmap()` to resolve the issue at hand. If you wish to gain a comprehensive understanding
I see the document give a simple graph to point:
+-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | -------------> | 1 | | | +-----------+ +-----------+ | | | 2 | ----------------^ ^ ^ ^ ^ ^ | | +-----------+ | | | | | | | | 3 | ------------------+ | | | | | | +-----------+ | | | | | | | 4 | --------------------+ | | | | PMD | +-----------+ | | | | level | | 5 | ----------------------+ | | | mapping | +-----------+ | | | | | 6 | ------------------------+ | | | +-----------+ | | | | 7 | --------------------------+ | | +-----------+ | | | | | | +-----------+
If I understand correct, each 2-7 tail's page struct is freed, so if I just need map page 2-7, can we use vmap do
something correctly?
The answer is you can. It is essential to distinguish between virtual
Thanks for your reply, but I still can't understand it. For example, I need vmap a hugetlb HVO folio's
2-7 page:
struct page **pages = kvmalloc(sizeof(*pages), 6, GFP_KENREL);
for (i = 2; i < 8; ++i)
pages[i] = folio_page(folio, i); //set 2-7 range page into pages,
void *vaddr = vmap(pages, 6, 0, PAGE_KERNEL);
For non-HVO pages, this can work. If HVO is enabled, does "pages[i - 2] = folio_page(folio, i);" just
get the head page? And how can vmap correctly map each page?
Why do you think folio_page(folio, i) (i ≠ 0) returns the head page? Is it speculation or tested? Please base it on the actual situation instead of indulging in wild thoughts.
By the way, in case you truly struggle to comprehend the fundamental aspects of HVO, I would like to summarize for you the user-visible behaviors in comparison to the situation where HVO is disabled.
HVO Status    Tail Page Structures    Head Page Structures
Enabled       Read-Only (RO)          Read-Write (RW)
Disabled      Read-Write (RW)         Read-Write (RW)
The sole distinction between the two scenarios lies in whether the tail page structures are allowed to be written or not. Please refrain from getting bogged down in the details of the implementation of HVO.
Thanks, Muchun.
Thanks, Muchun.
Please correct me. :)
Thanks,
Huan Yang
address (VA) and physical address (PA). The VAs of tail struct pages aren't freed but remapped to the physical page mapped by the VA of the head struct page (since contents of those tail physical pages are the same). Thus, the freed pages are the physical pages mapped by original tail struct pages, not their virtual addresses. Moreover, while it is possible to read the virtual addresses of these tail struct pages, any write operations are prohibited since it is within the realm of acceptability that the kernel is expected to perform write operations solely on the head struct page of a compound head and conduct read operations only on the tail struct pages. BTW, folio infrastructure is also based on this assumption.
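To make the remapping above concrete, a rough sketch of the idea, loosely modeled on vmemmap_remap_pte() in mm/hugetlb_vmemmap.c (simplified for illustration; not the exact kernel code):

	/*
	 * Repoint one tail vmemmap PTE at the page frame that already backs
	 * the head's struct pages, read-only. Afterwards the physical page
	 * that used to back this vmemmap VA can be freed; the VA stays valid.
	 */
	static void remap_tail_vmemmap_pte(pte_t *ptep, unsigned long addr,
					   struct page *reuse_page)
	{
		pte_t entry = mk_pte(reuse_page, PAGE_KERNEL_RO);

		set_pte_at(&init_mm, addr, ptep, entry);
	}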
Thanks, Muchun.
Or something I still misunderstand, please correct me.
Thanks,
Huan Yang
of the fundamentals of HVO, I kindly suggest a thorough review of the document in [3].
[2] https://lore.kernel.org/lkml/5229b24f-1984-4225-ae03-8b952de56e3b@vivo.com/#...
[3] Documentation/mm/vmemmap_dedup.rst
[1] https://lore.kernel.org/linux-mm/20250327092922.536-1-link@vivo.com/T/#m055b...
Thanks, Muchun.
On 2025/4/7 14:43, Muchun Song wrote:
On Apr 7, 2025, at 11:37, Muchun Song muchun.song@linux.dev wrote:
On Apr 7, 2025, at 11:21, Huan Yang link@vivo.com wrote:
On 2025/4/7 10:57, Muchun Song wrote:
On Apr 7, 2025, at 09:59, Huan Yang link@vivo.com wrote:
On 2025/4/4 18:07, Muchun Song wrote:
> On Apr 4, 2025, at 17:38, Muchun Song muchun.song@linux.dev wrote:
>
>
>> On Apr 4, 2025, at 17:01, Christoph Hellwig hch@lst.de wrote:
>>
>> After the btrfs compressed bio discussion I think the hugetlb changes that
>> skip the tail pages are fundamentally unsafe in the current kernel.
>>
>> That is because the bio_vec representation assumes tail pages do exist, so
>> as soon as you are doing direct I/O that generates a bvec starting beyond
>> the present head page things will blow up. Other users of bio_vecs might
>> do the same, but the way the block bio_vecs are generated are very suspect
>> to that. So we'll first need to sort that out and a few other things
>> before we can even think of enabling such a feature.
>>
> I would like to express my gratitude to Christoph for including me in the
> thread. I have carefully read the cover letter in [1], which indicates
> that an issue has arisen due to the improper use of `vmap_pfn()`. I'm
> wondering if we could consider using `vmap()` instead. In the HVO scenario,
> the tail struct pages do **exist**, but they are read-only. I've examined
> the code of `vmap()`, and it appears that it only reads the struct page.
> Therefore, it seems feasible for us to use `vmap()` (I am not an expert in
> udmabuf.). Right?
I believe my stance is correct. I've also reviewed another thread in [2].
Allow me to clarify and correct the viewpoints you presented. You stated:
"
So by HVO, it also not backed by pages, only contains folio head, each
tail pfn's page struct go away.
"
This statement is entirely inaccurate. The tail pages do not cease to exist;
rather, they are read-only. For your specific use-case, please use `vmap()`
to resolve the issue at hand. If you wish to gain a comprehensive understanding
I see the document gives a simple graph to illustrate the point:
+-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
|           |                     |     0     | -------------> |     0     |
|           |                     +-----------+                +-----------+
|           |                     |     1     | -------------> |     1     |
|           |                     +-----------+                +-----------+
|           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
|           |                     +-----------+                   | | | | |
|           |                     |     3     | ------------------+ | | | |
|           |                     +-----------+                     | | | |
|           |                     |     4     | --------------------+ | | |
|    PMD    |                     +-----------+                       | | |
|   level   |                     |     5     | ----------------------+ | |
|  mapping  |                     +-----------+                         | |
|           |                     |     6     | ------------------------+ |
|           |                     +-----------+                           |
|           |                     |     7     | --------------------------+
|           |                     +-----------+
|           |
|           |
|           |
+-----------+
If I understand correctly, each of the 2-7 tail page structs is freed, so if I just need to map pages 2-7, can we use
vmap to do something correctly?
The answer is you can. It is essential to distinguish between virtual
Thanks for your reply, but I still can't understand it. For example, I need to vmap pages 2-7 of a hugetlb HVO
folio:
	struct page **pages = kvmalloc_array(6, sizeof(*pages), GFP_KERNEL);

	for (i = 2; i < 8; ++i)
		pages[i - 2] = folio_page(folio, i); /* set the 2-7 range of pages into the array */

	void *vaddr = vmap(pages, 6, VM_MAP, PAGE_KERNEL);
For non-HVO pages, this can work. If HVO is enabled, does "pages[i - 2] = folio_page(folio, i);" just
get the head page? And how can vmap correctly map each page?
Why do you think folio_page(folio, i) (i ≠ 0) returns the head page? Is it speculation or tested? Please base it on the actual situation instead of indulging in wild thoughts.
By the way, in case you truly struggle to comprehend the fundamental aspects of HVO, I would like to summarize for you the user-visible behaviors in comparison to the situation where HVO is disabled.
HVO Status    Tail Page Structures    Head Page Structures
Enabled       Read-Only (RO)          Read-Write (RW)
Disabled      Read-Write (RW)         Read-Write (RW)
The sole distinction between the two scenarios lies in whether the tail page structures are allowed to be written or not. Please refrain from getting bogged down in the details of the implementation of HVO.
Thanks, I did a test and figured out that I had totally misunderstood it.
Even if HVO is enabled, with the tail page structs freed and pointing to the head, the vmemmap mapping still exists, so page_to_pfn,
page_to_virt (also the folio versions) still work if we start from the head page and compute each needed page, as folio_page does:
hvo head 0xfffff9de849d0000, pfn=0x127400, wish offset_pfn 0x1275f1, idx 497 is 0xfffff9de849d7c40, pfn=0x1275f1.
When we vmap, we don't need to touch the actual page's contents, we just turn each page into a pfn, so it works well.
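For reference, a hedged sketch of the kind of check that produces the debug line above (folio and the index 497 come from the test; the print format is illustrative):

	struct page *head = folio_page(folio, 0);
	struct page *tail = folio_page(folio, 497);

	/* Tail pointers computed from the head still yield the expected pfns. */
	pr_info("hvo head %px, pfn=%#lx, idx 497 is %px, pfn=%#lx\n",
		head, page_to_pfn(head), tail, page_to_pfn(tail));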
BTW, even if we did need to touch the actual input page struct, it points to the head page, which I guess would affect nothing.
If I still misunderstand anything, please correct me. :)
Muchun, thank you for your patience,
Huan Yang
Thanks, Muchun.
Thanks, Muchun.
Please correct me. :)
Thanks,
Huan Yang
address (VA) and physical address (PA). The VAs of tail struct pages aren't freed but remapped to the physical page mapped by the VA of the head struct page (since contents of those tail physical pages are the same). Thus, the freed pages are the physical pages mapped by original tail struct pages, not their virtual addresses. Moreover, while it is possible to read the virtual addresses of these tail struct pages, any write operations are prohibited since it is within the realm of acceptability that the kernel is expected to perform write operations solely on the head struct page of a compound head and conduct read operations only on the tail struct pages. BTW, folio infrastructure is also based on this assumption.
Thanks, Muchun.
Or something I still misunderstand, please correct me.
Thanks,
Huan Yang
of the fundamentals of HVO, I kindly suggest a thorough review of the document in [3].
[2] https://lore.kernel.org/lkml/5229b24f-1984-4225-ae03-8b952de56e3b@vivo.com/#...
[3] Documentation/mm/vmemmap_dedup.rst
> [1] https://lore.kernel.org/linux-mm/20250327092922.536-1-link@vivo.com/T/#m055b...
>
> Thanks,
> Muchun.
On Apr 7, 2025, at 15:09, Huan Yang link@vivo.com wrote:
On 2025/4/7 14:43, Muchun Song wrote:
On Apr 7, 2025, at 11:37, Muchun Song muchun.song@linux.dev wrote:
On Apr 7, 2025, at 11:21, Huan Yang link@vivo.com wrote:
On 2025/4/7 10:57, Muchun Song wrote:
On Apr 7, 2025, at 09:59, Huan Yang link@vivo.com wrote:
On 2025/4/4 18:07, Muchun Song wrote:
>> On Apr 4, 2025, at 17:38, Muchun Song muchun.song@linux.dev wrote:
>>
>>
>>
>>> On Apr 4, 2025, at 17:01, Christoph Hellwig hch@lst.de wrote:
>>>
>>> After the btrfs compressed bio discussion I think the hugetlb changes that
>>> skip the tail pages are fundamentally unsafe in the current kernel.
>>>
>>> That is because the bio_vec representation assumes tail pages do exist, so
>>> as soon as you are doing direct I/O that generates a bvec starting beyond
>>> the present head page things will blow up. Other users of bio_vecs might
>>> do the same, but the way the block bio_vecs are generated are very suspect
>>> to that. So we'll first need to sort that out and a few other things
>>> before we can even think of enabling such a feature.
>>>
>> I would like to express my gratitude to Christoph for including me in the
>> thread. I have carefully read the cover letter in [1], which indicates
>> that an issue has arisen due to the improper use of `vmap_pfn()`. I'm
>> wondering if we could consider using `vmap()` instead. In the HVO scenario,
>> the tail struct pages do **exist**, but they are read-only. I've examined
>> the code of `vmap()`, and it appears that it only reads the struct page.
>> Therefore, it seems feasible for us to use `vmap()` (I am not an expert in
>> udmabuf.). Right?
> I believe my stance is correct. I've also reviewed another thread in [2].
> Allow me to clarify and correct the viewpoints you presented. You stated:
> "
> So by HVO, it also not backed by pages, only contains folio head, each
> tail pfn's page struct go away.
> "
> This statement is entirely inaccurate. The tail pages do not cease to exist;
> rather, they are read-only. For your specific use-case, please use `vmap()`
> to resolve the issue at hand. If you wish to gain a comprehensive understanding
I see the document gives a simple graph to illustrate the point:
+-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
|           |                     |     0     | -------------> |     0     |
|           |                     +-----------+                +-----------+
|           |                     |     1     | -------------> |     1     |
|           |                     +-----------+                +-----------+
|           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
|           |                     +-----------+                   | | | | |
|           |                     |     3     | ------------------+ | | | |
|           |                     +-----------+                     | | | |
|           |                     |     4     | --------------------+ | | |
|    PMD    |                     +-----------+                       | | |
|   level   |                     |     5     | ----------------------+ | |
|  mapping  |                     +-----------+                         | |
|           |                     |     6     | ------------------------+ |
|           |                     +-----------+                           |
|           |                     |     7     | --------------------------+
|           |                     +-----------+
|           |
|           |
|           |
+-----------+
If I understand correctly, each of the 2-7 tail page structs is freed, so if I just need to map pages 2-7, can we use
vmap to do something correctly?
The answer is you can. It is essential to distinguish between virtual
Thanks for your reply, but I still can't understand it. For example, I need to vmap pages 2-7 of a hugetlb HVO
folio:
	struct page **pages = kvmalloc_array(6, sizeof(*pages), GFP_KERNEL);

	for (i = 2; i < 8; ++i)
		pages[i - 2] = folio_page(folio, i); /* set the 2-7 range of pages into the array */

	void *vaddr = vmap(pages, 6, VM_MAP, PAGE_KERNEL);
For non-HVO pages, this can work. If HVO is enabled, does "pages[i - 2] = folio_page(folio, i);" just
get the head page? And how can vmap correctly map each page?
Why do you think folio_page(folio, i) (i ≠ 0) returns the head page? Is it speculation or tested? Please base it on the actual situation instead of indulging in wild thoughts.
By the way, in case you truly struggle to comprehend the fundamental aspects of HVO, I would like to summarize for you the user-visible behaviors in comparison to the situation where HVO is disabled.
HVO Status    Tail Page Structures    Head Page Structures
Enabled       Read-Only (RO)          Read-Write (RW)
Disabled      Read-Write (RW)         Read-Write (RW)
The sole distinction between the two scenarios lies in whether the tail page structures are allowed to be written or not. Please refrain from getting bogged down in the details of the implementation of HVO.
Thanks, I did a test and figured out that I had totally misunderstood it.
Even if HVO is enabled, with the tail page structs freed and pointing to the head, the vmemmap mapping still exists, so page_to_pfn,
page_to_virt (also the folio versions) still work if we start from the head page and compute each needed page, as folio_page does:
hvo head 0xfffff9de849d0000, pfn=0x127400, wish offset_pfn 0x1275f1, idx 497 is 0xfffff9de849d7c40, pfn=0x1275f1.
When we vmap, we don't need to touch the actual page's contents, we just turn each page into a pfn, so it works well.
You are able to read those tail page structures. The reason why vmap can function is not that it doesn't read those page structures. What I mean is that vmap will still work even if it does read the page structures, because those tail page structures do indeed exist.
BTW, even if we did need to touch the actual input page struct, it points to the head page, which I guess would affect nothing.
Allow me to clarify this for you to ensure that we have a shared understanding. Those tail page structures (virtual addresses in the vmemmap area) are mapped to the same page frame (physical page) to which the head page structures (virtual addresses in the vmemmap area) are mapped. It is analogous to the shared-mapping mechanism in the user space.
If I still misunderstand anything, please correct me. :)
Muchun, thank you for your patience,
Huan Yang
Thanks, Muchun.
Thanks, Muchun.
Please correct me. :)
Thanks,
Huan Yang
address (VA) and physical address (PA). The VAs of tail struct pages aren't freed but remapped to the physical page mapped by the VA of the head struct page (since contents of those tail physical pages are the same). Thus, the freed pages are the physical pages mapped by original tail struct pages, not their virtual addresses. Moreover, while it is possible to read the virtual addresses of these tail struct pages, any write operations are prohibited since it is within the realm of acceptability that the kernel is expected to perform write operations solely on the head struct page of a compound head and conduct read operations only on the tail struct pages. BTW, folio infrastructure is also based on this assumption.
Thanks, Muchun.
Or something I still misunderstand, please correct me.
Thanks,
Huan Yang
> of the fundamentals of HVO, I kindly suggest a thorough review of the document
> in [3].
>
> [2] https://lore.kernel.org/lkml/5229b24f-1984-4225-ae03-8b952de56e3b@vivo.com/#...
> [3] Documentation/mm/vmemmap_dedup.rst
>
>> [1] https://lore.kernel.org/linux-mm/20250327092922.536-1-link@vivo.com/T/#m055b...
>>
>> Thanks,
>> Muchun.
On 2025/4/7 15:22, Muchun Song wrote:
On Apr 7, 2025, at 15:09, Huan Yang link@vivo.com wrote:
On 2025/4/7 14:43, Muchun Song wrote:
On Apr 7, 2025, at 11:37, Muchun Song muchun.song@linux.dev wrote:
On Apr 7, 2025, at 11:21, Huan Yang link@vivo.com wrote:
On 2025/4/7 10:57, Muchun Song wrote:
> On Apr 7, 2025, at 09:59, Huan Yang link@vivo.com wrote:
>
>
> On 2025/4/4 18:07, Muchun Song wrote:
>>> On Apr 4, 2025, at 17:38, Muchun Song muchun.song@linux.dev wrote:
>>>
>>>
>>>
>>>> On Apr 4, 2025, at 17:01, Christoph Hellwig hch@lst.de wrote:
>>>>
>>>> After the btrfs compressed bio discussion I think the hugetlb changes that
>>>> skip the tail pages are fundamentally unsafe in the current kernel.
>>>>
>>>> That is because the bio_vec representation assumes tail pages do exist, so
>>>> as soon as you are doing direct I/O that generates a bvec starting beyond
>>>> the present head page things will blow up. Other users of bio_vecs might
>>>> do the same, but the way the block bio_vecs are generated are very suspect
>>>> to that. So we'll first need to sort that out and a few other things
>>>> before we can even think of enabling such a feature.
>>>>
>>> I would like to express my gratitude to Christoph for including me in the
>>> thread. I have carefully read the cover letter in [1], which indicates
>>> that an issue has arisen due to the improper use of `vmap_pfn()`. I'm
>>> wondering if we could consider using `vmap()` instead. In the HVO scenario,
>>> the tail struct pages do **exist**, but they are read-only. I've examined
>>> the code of `vmap()`, and it appears that it only reads the struct page.
>>> Therefore, it seems feasible for us to use `vmap()` (I am not an expert in
>>> udmabuf.). Right?
>> I believe my stance is correct. I've also reviewed another thread in [2].
>> Allow me to clarify and correct the viewpoints you presented. You stated:
>> "
>> So by HVO, it also not backed by pages, only contains folio head, each
>> tail pfn's page struct go away.
>> "
>> This statement is entirely inaccurate. The tail pages do not cease to exist;
>> rather, they are read-only. For your specific use-case, please use `vmap()`
>> to resolve the issue at hand. If you wish to gain a comprehensive understanding
> I see the document gives a simple graph to illustrate the point:
>
> +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> |           |                     |     0     | -------------> |     0     |
> |           |                     +-----------+                +-----------+
> |           |                     |     1     | -------------> |     1     |
> |           |                     +-----------+                +-----------+
> |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
> |           |                     +-----------+                   | | | | |
> |           |                     |     3     | ------------------+ | | | |
> |           |                     +-----------+                     | | | |
> |           |                     |     4     | --------------------+ | | |
> |    PMD    |                     +-----------+                       | | |
> |   level   |                     |     5     | ----------------------+ | |
> |  mapping  |                     +-----------+                         | |
> |           |                     |     6     | ------------------------+ |
> |           |                     +-----------+                           |
> |           |                     |     7     | --------------------------+
> |           |                     +-----------+
> |           |
> |           |
> |           |
> +-----------+
>
> If I understand correctly, each of the 2-7 tail page structs is freed, so if I
> just need to map pages 2-7, can we use vmap to do something correctly?
The answer is you can. It is essential to distinguish between virtual
Thanks for your reply, but I still can't understand it. For example, I need to vmap pages 2-7 of a hugetlb HVO
folio:
	struct page **pages = kvmalloc_array(6, sizeof(*pages), GFP_KERNEL);

	for (i = 2; i < 8; ++i)
		pages[i - 2] = folio_page(folio, i); /* set the 2-7 range of pages into the array */

	void *vaddr = vmap(pages, 6, VM_MAP, PAGE_KERNEL);
For non-HVO pages, this can work. If HVO is enabled, does "pages[i - 2] = folio_page(folio, i);" just
get the head page? And how can vmap correctly map each page?
Why do you think folio_page(folio, i) (i ≠ 0) returns the head page? Is it speculation or tested? Please base it on the actual situation instead of indulging in wild thoughts.
By the way, in case you truly struggle to comprehend the fundamental aspects of HVO, I would like to summarize for you the user-visible behaviors in comparison to the situation where HVO is disabled.
HVO Status    Tail Page Structures    Head Page Structures
Enabled       Read-Only (RO)          Read-Write (RW)
Disabled      Read-Write (RW)         Read-Write (RW)
The sole distinction between the two scenarios lies in whether the tail page structures are allowed to be written or not. Please refrain from getting bogged down in the details of the implementation of HVO.
Thanks, I did a test and figured out that I had totally misunderstood it.
Even if HVO is enabled, with the tail page structs freed and pointing to the head, the vmemmap mapping still exists, so page_to_pfn,
page_to_virt (also the folio versions) still work if we start from the head page and compute each needed page, as folio_page does:
hvo head 0xfffff9de849d0000, pfn=0x127400, wish offset_pfn 0x1275f1, idx 497 is 0xfffff9de849d7c40, pfn=0x1275f1.
When we vmap, we don't need to touch the actual page's contents, we just turn each page into a pfn, so it works well.
You are able to read those tail page structures. The reason why vmap can function is not that it doesn't read those page structures. What I mean is that vmap will still work even if it does read the page structures, because those tail page structures do indeed exist.
BTW, even if we did need to touch the actual input page struct, it points to the head page, which I guess would affect nothing.
Allow me to clarify this for you to ensure that we have a shared understanding. Those tail page structures (virtual addresses in the vmemmap area) are mapped to the same page frame (physical page) to which the head page structures (virtual addresses in the vmemmap area) are mapped. It is analogous to the shared-mapping mechanism in the user space.
Thank you for your answer. I think I understand it now.
HVO does not release the vmemmap page struct pointers; it just changes their VAs to point at the head page's page frame (vmemmap_remap_pte).
So:
1. Any handling of page struct pointers still works; we can get the right pfn and so on.
2. Any read through these VAs still works; we can get correct folio info, but we can't change it (PAGE_KERNEL_RO).
What I misunderstood before was thinking that the vmemmap's page struct pointers were also freed; what a fool. :(
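A hypothetical snippet illustrating the two rules above (nothing here is from the patchset; the write shown in the comment is what would fault):

	struct page *tail = folio_page(folio, 1);

	/* Rule 1: pointer arithmetic and reads of the tail struct still work. */
	unsigned long pfn = page_to_pfn(tail);

	/*
	 * Rule 2: a write through the tail pointer would oops, because the
	 * tail vmemmap PTEs were installed with PAGE_KERNEL_RO, e.g.:
	 *
	 *	WRITE_ONCE(tail->private, 0);
	 */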
Thanks,
Huan Yang
If I still misunderstand anything, please correct me. :)
Muchun, thank you for your patience,
Huan Yang
Thanks, Muchun.
Thanks, Muchun.
Please correct me. :)
Thanks,
Huan Yang
address (VA) and physical address (PA). The VAs of tail struct pages aren't freed but remapped to the physical page mapped by the VA of the head struct page (since contents of those tail physical pages are the same). Thus, the freed pages are the physical pages mapped by original tail struct pages, not their virtual addresses. Moreover, while it is possible to read the virtual addresses of these tail struct pages, any write operations are prohibited since it is within the realm of acceptability that the kernel is expected to perform write operations solely on the head struct page of a compound head and conduct read operations only on the tail struct pages. BTW, folio infrastructure is also based on this assumption.
Thanks, Muchun.
> Or something I still misunderstand, please correct me.
>
> Thanks,
>
> Huan Yang
>
>> of the fundamentals of HVO, I kindly suggest a thorough review of the document
>> in [3].
>>
>> [2] https://lore.kernel.org/lkml/5229b24f-1984-4225-ae03-8b952de56e3b@vivo.com/#...
>> [3] Documentation/mm/vmemmap_dedup.rst
>>
>>> [1] https://lore.kernel.org/linux-mm/20250327092922.536-1-link@vivo.com/T/#m055b...
>>>
>>> Thanks,
>>> Muchun.
On Mon, Apr 07, 2025 at 02:43:20PM +0800, Muchun Song wrote:
By the way, in case you truly struggle to comprehend the fundamental aspects of HVO, I would like to summarize for you the user-visible behaviors in comparison to the situation where HVO is disabled.
HVO Status    Tail Page Structures    Head Page Structures
Enabled       Read-Only (RO)          Read-Write (RW)
Disabled      Read-Write (RW)         Read-Write (RW)
The sole distinction between the two scenarios lies in whether the tail page structures are allowed to be written or not. Please refrain from getting bogged down in the details of the implementation of HVO.
This feels extremely fragile to me. I doubt many people know which operations need read vs write access to tail pages, or, for higher-level operations, whether they need access to tail pages at all.
On Apr 7, 2025, at 16:59, Christoph Hellwig hch@lst.de wrote:
On Mon, Apr 07, 2025 at 02:43:20PM +0800, Muchun Song wrote:
By the way, in case you truly struggle to comprehend the fundamental aspects of HVO, I would like to summarize for you the user-visible behaviors in comparison to the situation where HVO is disabled.
HVO Status    Tail Page Structures    Head Page Structures
Enabled       Read-Only (RO)          Read-Write (RW)
Disabled      Read-Write (RW)         Read-Write (RW)
The sole distinction between the two scenarios lies in whether the tail page structures are allowed to be written or not. Please refrain from getting bogged down in the details of the implementation of HVO.
This feels extremely fragile to me. I doubt many people know which operations need read vs write access to tail pages, or, for higher-level operations, whether they need access to tail pages at all.
A compound page should modify its head page structure (e.g., refcount), which is why `compound_head()` is widely used. Modifying its tail page structures is incorrect. Users needn't worry about whether to modify tail page structures. They should use `compound_head(tail)` to get the head page structure and update it. All users must follow this rule (I think folio-infrastructure also requires this). If a user tries to write to a HugeTLB page's tail page, an exception will be raised as these tail pages are read-only mapped to catch invalid operations.
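A small hypothetical example of that rule (the helper name is illustrative): resolve to the head page first, and only ever read tail page structures:

	/* Correct: modify compound state only through the head page. */
	static void pin_compound_page(struct page *page)
	{
		struct page *head = compound_head(page); /* safe even under HVO */

		get_page(head); /* updates the head's refcount (RW mapping) */
	}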