Hello,
Thank you for reviewing and commenting.
On 8/10/22 2:03 PM, David Hildenbrand wrote:
On 26.07.22 18:18, Muhammad Usama Anjum wrote:
Hello,
Hi,
This patch series implements a new syscall, process_memwatch. Currently, only support for watching the soft-dirty PTE bit is added. The syscall is generic enough to watch the memory of a process, and there is room to add more memory-watching operations in the future.
The soft-dirty PTE bit of memory pages can be viewed through the pagemap procfs file, and the soft-dirty PTE bits for a process's memory can be cleared by writing to the clear_refs file. This series adds features that aren't possible through the procfs interface:
- There is no way to atomically get the soft-dirty PTE bit status and clear it in one operation.
Such an interface might be easy to add, no?
Are you referring to an ioctl? I think this syscall can be used in the future to add other operations like the soft-dirty one; that is why a syscall was added.
If the community doesn't agree, I can translate this syscall into an ioctl with the same semantics.
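For reference, the current procfs flow looks roughly like the sketch below (my illustration, not code from the series): the status read from pagemap and the clear through clear_refs are two separate steps, so a write landing between them is lost, and the clear always affects the whole process.

/*
 * Minimal sketch of the existing (non-atomic) procfs flow for a page at
 * virtual address addr in process pid: the soft-dirty status is read from
 * /proc/pid/pagemap (bit 55 of the 64-bit entry), and it is cleared
 * separately by writing "4" to /proc/pid/clear_refs.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static int pagemap_soft_dirty(pid_t pid, unsigned long addr)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	char path[64];
	uint64_t entry;
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	/* One 8-byte entry per virtual page; bit 55 is soft-dirty. */
	if (pread(fd, &entry, sizeof(entry),
		  (addr / pagesize) * sizeof(entry)) != sizeof(entry)) {
		close(fd);
		return -1;
	}
	close(fd);
	return (entry >> 55) & 1;
}

static int clear_refs_soft_dirty(pid_t pid)
{
	char path[64];
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* "4" clears the soft-dirty bits of the *entire* process. */
	if (write(fd, "4", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}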
- The soft-dirty PTE bit cannot be cleared for only a part of the process's memory.
Same.
So I'm curious why we need a new syscall for that.
Historically, soft-dirty PTE bit tracking has been used by the CRIU project, where the procfs interface is enough because, as I understand it, the process is frozen while it is tracked. We have a use case where we need to track the soft-dirty PTE bit of a running process. We need this tracking-and-clearing mechanism on a region of memory while the process is running in order to emulate Windows' getWriteWatch() API, which games use to keep track of dirty pages and process only those. The syscall can also be used by the CRIU project and other applications that require soft-dirty PTE bit information.
As in the current kernel there is no way to clear the soft-dirty bits for only a part of the memory (instead of clearing them for the entire process), and the get+clear operation cannot be performed atomically, the only ways to mimic this functionality entirely in userspace come with poor performance:
- The mprotect syscall and SIGSEGV handler for bookkeeping
- The userfaultfd syscall with the handler for bookkeeping
You write "poor performance". Did you actually implement a prototype using userfaultfd-wp? Can you share numbers for comparison?
Adding a new syscall just for handling a corner-case feature (soft-dirty, which we all love, of course) needs good justification.
The cycle counts are given in thousands (e.g., 60 means 60k cycles) and were measured with rdtsc().
|   | Region size in Pages |    1 |   10 |   100 |  1000 |  10000 |
|---|----------------------|------|------|-------|-------|--------|
| 1 | MEMWATCH             |    7 |   58 |   281 |  1178 |  17563 |
| 2 | MEMWATCH Perf        |    4 |   23 |   107 |  1331 |   8924 |
| 3 | USERFAULTFD          | 5405 | 6550 | 10387 | 55708 | 621522 |
| 4 | MPROTECT_SEGV        |   35 |  611 |  1060 |  6646 |  60149 |
1. MEMWATCH --> process_memwatch considering VM_SOFTDIRTY (splitting is possible)
2. MEMWATCH Perf --> process_memwatch without considering VM_SOFTDIRTY
3. USERFAULTFD --> userfaultfd with handling in userspace
4. MPROTECT_SEGV --> mprotect and signal handler in userspace
Note: The implementation of mprotect_segv is very similar to userfaultfd; in both, the signal/fault is handled in userspace. In mprotect_segv, the memory region is write-protected through mprotect, and a SEGV signal is received when something is written to the region. The signal handler is where we do the bookkeeping about soft-dirty pages. The mprotect_segv mechanism should be lighter than userfaultfd inside the kernel.
My benchmark application is purely single-threaded to keep the effort to a minimum until we decide to spend more time on it. It measures the time taken by a serial execution of these statements, without locks. If a multi-threaded application were used and randomization introduced, it should hurt the MPROTECT_SEGV and USERFAULTFD implementations more than memwatch. But in this particular setting, memwatch and mprotect_segv perform closely.
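For context, the MPROTECT_SEGV scheme in the table boils down to something like the following single-threaded sketch (illustrative only, not the benchmark source; the dirty[] bookkeeping is local to this example):

/*
 * The watched region is mapped read-only; the SIGSEGV handler records the
 * faulting page as dirty and re-enables the write on that page only.
 */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 16

static uint8_t dirty[NPAGES];	/* per-page dirty bookkeeping */
static char *region;
static long pagesize;

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
	uintptr_t page = (uintptr_t)info->si_addr & ~(pagesize - 1);

	dirty[(page - (uintptr_t)region) / pagesize] = 1;
	/* Let this and later writes to the page proceed. */
	mprotect((void *)page, pagesize, PROT_READ | PROT_WRITE);
}

int main(void)
{
	struct sigaction sa = { .sa_sigaction = segv_handler,
				.sa_flags = SA_SIGINFO };

	pagesize = sysconf(_SC_PAGESIZE);
	sigaction(SIGSEGV, &sa, NULL);

	region = mmap(NULL, NPAGES * pagesize, PROT_READ,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	region[3 * pagesize] = 1;	/* faults once, marks page 3 dirty */
	region[3 * pagesize] = 2;	/* no fault, page already writable */

	/* "Get and clear": harvest dirty[], re-protect, reset bookkeeping. */
	mprotect(region, NPAGES * pagesize, PROT_READ);
	memset(dirty, 0, sizeof(dirty));
	return 0;
}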
long process_memwatch(int pidfd, unsigned long start, int len, unsigned int flags, void *vec, int vec_len);
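For illustration, usage could look like the sketch below. Note the assumptions: the syscall number and header come from the patched kernel, and the MEMWATCH_SD_GET/MEMWATCH_SD_CLEAR flag names as well as the vec output format are placeholders here; only MEMWATCH_SD_NO_REUSED_REGIONS is spelled out in this cover letter.

#include <sys/syscall.h>	/* __NR_process_memwatch, from the patched kernel */
#include <unistd.h>

static long process_memwatch(int pidfd, unsigned long start, int len,
			     unsigned int flags, void *vec, int vec_len)
{
	return syscall(__NR_process_memwatch, pidfd, start, len, flags,
		       vec, vec_len);
}

/*
 * Atomically get and clear the soft-dirty state for a range in the process
 * behind pidfd (e.g. from pidfd_open(2)); vec receives the soft-dirty page
 * information in the format defined by the series, left uninterpreted here.
 */
static long get_and_clear(int pidfd, unsigned long start, int len,
			  void *vec, int vec_len)
{
	return process_memwatch(pidfd, start, len,
				MEMWATCH_SD_GET | MEMWATCH_SD_CLEAR,
				vec, vec_len);
}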
This syscall can be used by the CRIU project and other applications that require soft-dirty PTE bit information. The following operations are supported by this syscall:
- Get the pages that are soft-dirty.
- Clear the pages which are soft-dirty.
- An optional flag to ignore VM_SOFTDIRTY and only track the per-page soft-dirty PTE bit
Huh, why? VM_SOFTDIRTY is an internal implementation detail and should remain such.
VM_SOFTDIRTY translates to "all pages in this VMA are soft-dirty".
Clearing the soft-dirty bit for a range of memory may result in splitting the VMA: the per-page soft-dirty bits need to be cleared, and the VM_SOFTDIRTY flag needs to be cleared from the resulting split VMA. The kernel may later decide to merge the split VMA back, and note that the kernel does not take the VM_SOFTDIRTY flag into account when deciding to merge VMAs. This not only costs performance; the non-dirty pages of the whole VMA also start to appear dirty again after the merge. To avoid this penalty, the MEMWATCH_SD_NO_REUSED_REGIONS flag has been added to ignore VM_SOFTDIRTY and rely only on the per-page soft-dirty bits. The user accepts the constraint that newly created regions will not be reported dirty when this flag is specified.
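Building on the hypothetical wrapper sketched earlier (the GET/CLEAR flag names remain assumptions), a getWriteWatch()-style tracking pass with this flag could look like:

/*
 * One tracking iteration that avoids the VMA split/merge penalty by
 * relying on per-page soft-dirty bits only. Caveat from above: pages in
 * regions mapped since the previous pass are not reported dirty.
 */
static long track_once(int pidfd, unsigned long start, int len,
		       void *vec, int vec_len)
{
	return process_memwatch(pidfd, start, len,
				MEMWATCH_SD_GET | MEMWATCH_SD_CLEAR |
				MEMWATCH_SD_NO_REUSED_REGIONS,
				vec, vec_len);
}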