On Thu, Jun 20, 2024 at 8:28 PM Huang, Ying ying.huang@intel.com wrote:
Barry Song 21cnbao@gmail.com writes:
On Thu, Jun 20, 2024 at 8:01 PM Huang, Ying ying.huang@intel.com wrote:
Barry Song 21cnbao@gmail.com writes:
On Thu, Jun 20, 2024 at 6:36 PM Huang, Ying ying.huang@intel.com wrote:
Barry Song 21cnbao@gmail.com writes:
On Thu, Jun 20, 2024 at 5:22 PM Huang, Ying ying.huang@intel.com wrote: > > Barry Song 21cnbao@gmail.com writes: > > > On Thu, Jun 20, 2024 at 1:55 PM Huang, Ying ying.huang@intel.com wrote: > >> > >> Barry Song 21cnbao@gmail.com writes: > >> > >> > From: Barry Song v-songbaohua@oppo.com > >> > > >> > Both Ryan and Chris have been utilizing the small test program to aid > >> > in debugging and identifying issues with swap entry allocation. While > >> > a real or intricate workload might be more suitable for assessing the > >> > correctness and effectiveness of the swap allocation policy, a small > >> > test program presents a simpler means of understanding the problem and > >> > initially verifying the improvements being made. > >> > > >> > Let's endeavor to integrate it into the self-test suite. Although it > >> > presently only accommodates 64KB and 4KB, I'm optimistic that we can > >> > expand its capabilities to support multiple sizes and simulate more > >> > complex systems in the future as required. > >> > >> IIUC, this is a performance test program instead of functionality test > >> program. Does it match the purpose of the kernel selftest? > > > > I have a differing perspective. I maintain that the functionality is > > not functioning > > as expected. Despite having all the necessary resources for allocation, failure > > persists, indicating a lack of functionality. > > Is there any user visual functionality issue?
Definitely not. If a plane can't take off, taking a train and pretending there's no functionality issue isn't a solution.
I always think that performance optimization is great work. However, it is not functionality work.
I have never assigned blame for any mistakes here. On the contrary, I have 100% appreciation for Ryan's work in at least initiating mTHP swapout w/o being split.
It took countless experiments for humans to make airplanes commercially viable, but the person who created the first flying airplane remains the greatest. Similarly, Ryan's efforts, combined with your review of his patch, have enabled us to achieve a better goal here. Without your work, we can't get here at all.
Thanks!
However, this is never a reason to refuse to acknowledge that this feature is not actually working.
It just works for some workloads, not for some others.
> > >> > >> > Signed-off-by: Barry Song v-songbaohua@oppo.com > >> > --- > >> > tools/testing/selftests/mm/Makefile | 1 + > >> > .../selftests/mm/thp_swap_allocator_test.c | 192 ++++++++++++++++++ > >> > 2 files changed, 193 insertions(+) > >> > create mode 100644 tools/testing/selftests/mm/thp_swap_allocator_test.c > >> > > >> > diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile > >> > index e1aa09ddaa3d..64164ad66835 100644 > >> > --- a/tools/testing/selftests/mm/Makefile > >> > +++ b/tools/testing/selftests/mm/Makefile > >> > @@ -65,6 +65,7 @@ TEST_GEN_FILES += mseal_test > >> > TEST_GEN_FILES += seal_elf > >> > TEST_GEN_FILES += on-fault-limit > >> > TEST_GEN_FILES += pagemap_ioctl > >> > +TEST_GEN_FILES += thp_swap_allocator_test > >> > TEST_GEN_FILES += thuge-gen > >> > TEST_GEN_FILES += transhuge-stress > >> > TEST_GEN_FILES += uffd-stress > >> > diff --git a/tools/testing/selftests/mm/thp_swap_allocator_test.c b/tools/testing/selftests/mm/thp_swap_allocator_test.c > >> > new file mode 100644 > >> > index 000000000000..4443a906d0f8 > >> > --- /dev/null > >> > +++ b/tools/testing/selftests/mm/thp_swap_allocator_test.c > >> > @@ -0,0 +1,192 @@ > >> > +// SPDX-License-Identifier: GPL-2.0-or-later > >> > +/* > >> > + * thp_swap_allocator_test > >> > + * > >> > + * The purpose of this test program is helping check if THP swpout > >> > + * can correctly get swap slots to swap out as a whole instead of > >> > + * being split. It randomly releases swap entries through madvise > >> > + * DONTNEED and do swapout on two memory areas: a memory area for > >> > + * 64KB THP and the other area for small folios. The second memory > >> > + * can be enabled by "-s". > >> > + * Before running the program, we need to setup a zRAM or similar > >> > + * swap device by: > >> > + * echo lzo > /sys/block/zram0/comp_algorithm > >> > + * echo 64M > /sys/block/zram0/disksize > >> > + * echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > >> > + * echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled > >> > + * mkswap /dev/zram0 > >> > + * swapon /dev/zram0 > >> > + * The expected result should be 0% anon swpout fallback ratio w/ or > >> > + * w/o "-s". > >> > + * > >> > + * Author(s): Barry Song v-songbaohua@oppo.com > >> > + */ > >> > + > >> > +#define _GNU_SOURCE > >> > +#include <stdio.h> > >> > +#include <stdlib.h> > >> > +#include <unistd.h> > >> > +#include <string.h> > >> > +#include <sys/mman.h> > >> > +#include <errno.h> > >> > +#include <time.h> > >> > + > >> > +#define MEMSIZE_MTHP (60 * 1024 * 1024) > >> > +#define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024) > >> > +#define ALIGNMENT_MTHP (64 * 1024) > >> > +#define ALIGNMENT_SMALLFOLIO (4 * 1024) > >> > +#define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024) > >> > +#define TOTAL_DONTNEED_SMALLFOLIO (768 * 1024) > >> > +#define MTHP_FOLIO_SIZE (64 * 1024) > >> > + > >> > +#define SWPOUT_PATH \ > >> > + "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout" > >> > +#define SWPOUT_FALLBACK_PATH \ > >> > + "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback" > >> > + > >> > +static void *aligned_alloc_mem(size_t size, size_t alignment) > >> > +{ > >> > + void *mem = NULL; > >> > + > >> > + if (posix_memalign(&mem, alignment, size) != 0) { > >> > + perror("posix_memalign"); > >> > + return NULL; > >> > + } > >> > + return mem; > >> > +} > >> > + > >> > +static void random_madvise_dontneed(void *mem, size_t mem_size, > >> > + size_t align_size, size_t total_dontneed_size) > >> > +{ > >> > + size_t num_pages = total_dontneed_size / align_size; > >> > + size_t i; > >> > + size_t offset; > >> > + void *addr; > >> > + > >> > + for (i = 0; i < num_pages; ++i) { > >> > + offset = (rand() % (mem_size / align_size)) * align_size; > >> > + addr = (char *)mem + offset; > >> > + if (madvise(addr, align_size, MADV_DONTNEED) != 0) > >> > + perror("madvise dontneed"); > >> > >> IIUC, this simulates align_size (generally 64KB) swap-in. That is, it > >> simulate the effect of large size swap-in when it's not available in > >> kernel. If we have large size swap-in in kernel in the future, this > >> becomes unnecessary. > >> > >> Additionally, we have not reached the consensus that we should always > >> swap-in with swapped-out size. So, I suspect that this test may not > >> reflect real situation in the future. Although it doesn't reflect > >> current situation too. > > > > Disagree again. releasing the whole mTHP swaps is the best case. Even in > > the best-case scenario, if we fail, it raises concerns for handling potentially > > more challenging situations. > > Repeating sequential anonymous pages writing is the best case.
I define the best case as the scenario with the least chance of creating fragments within swapfiles for mTHP to swap out. There is no real difference whether this is done through swapin or madv_dontneed.
IMO, swapin is much more important than madv_dontneed. Because most users use swapin automatically, but few use madv_dontneed by hand. So, I think swapin/swapout test is much more important than madv_dontneed. I don't like this test case because madv_dontneed isn't typical or basic.
Disliking DONTNEED isn't a sufficient reason to reject this test program because no single small program can report swapout counters, swapout fallback counters, and fallback ratios within several minutes for 100 iterations. That's precisely why we need it, at least initially. We can enhance it further if it lacks certain functionalities that people desire.
The entire purpose of MADV_DONTNEED is to simulate a scenario where all slots are released as a whole, preventing the creation of fragments, which is most favorable for swap allocation. I believe there is no difference between using MADV_DONTNEED or swapin for this purpose. But I am perfectly fine with switching to swapin to replace MADV_DONTNEED in v2.
Great! Thanks for doing this!
And even better, can we not make swap-in address aligned and size aligned? It's too unrealistic. It's good to consider some level of spatial locality, for example, swap-in random number of pages sequentially at some random addresses. That could be a good general test program. We can use it to evaluate further swap optimizations, for example, to evaluate the memory wastage of some swap-in size policy.
I wholeheartedly agree with everything mentioned above; these are actually part of my plan as incremental patches. This initial commit serves as the first step of the three I proposed in the last email.
It will be a small test program to implement all these. Don't need to use 3 steps. IMHO, it's not good to optimize for a unrealistic test case with address aligned and size aligned swap-in. It's trivial to remove the alignment requirements.
I'll preserve alignment as an option for Chris and Ryan to compare aligned cases with unaligned ones, rather than removing the alignment code altogether. Additionally, it can aid us in understanding the potential outcomes when dealing with large folio swap-ins.
And, we don't need PAGEOUT too, just use large virtual address space in test programs. We can trigger swapout in more common way.
I'm not particularly enthusiastic about this idea, as I expect the test program to run quickly. A large virtual address space would result in long waiting times for the test results, as it relies on vmscan. Therefore, I hope we can use real workloads to achieve this instead.
I have use test program with large virtual address space (in vm-scalability) to do swap test before. It runs really fast. Please give it a try.
I'm not so optimistic. Having worked on both server and embedded systems, I understand the disparity in resources between rich server environments and resource-constrained embedded systems. I'm constantly frustrated by how slow my tests run and deploy.
-- Best Regards, Huang, Ying
Thanks Barry