On Tue, Dec 22, 2020 at 08:15:53PM +0000, Matthew Wilcox wrote:
On Tue, Dec 22, 2020 at 02:31:52PM -0500, Andrea Arcangeli wrote:
My previous suggestion to serialize userfaultfd_writeprotect with a mutex will still work, but since we can run as many wrprotect and un-wrprotect operations in parallel as we want, as long as a wrprotect and an un-wrprotect never run simultaneously, we can do much better than a mutex.
Ideally we would need a new two_group_semaphore, where each group can run as many parallel instances as it wants, but no instance of one group can run in parallel with any instance of the other group. AFAIK such a lock doesn't exist right now.
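A minimal userspace sketch of such a two-group lock, assuming pthreads as a stand-in for the kernel primitives (the names two_group_sem, tg_lock and tg_unlock are hypothetical, not an existing kernel API). Holders from the same group share the lock; a caller from the other group waits until the holder count drains to zero. It has no fairness at all, so a steady stream from one group can starve the other.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical "two-group semaphore": any number of holders from one
 * group may share it, but the two groups never hold it at the same time.
 * No fairness: a continuous stream of one group starves the other. */
struct two_group_sem {
	pthread_mutex_t mutex;
	pthread_cond_t cond;
	int active_group;	/* -1 when idle, otherwise 0 or 1 */
	int active_count;	/* current holders, all from active_group */
};

#define TWO_GROUP_SEM_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, -1, 0 }

void tg_lock(struct two_group_sem *s, int group)
{
	assert(group == 0 || group == 1);
	pthread_mutex_lock(&s->mutex);
	/* Block only while the *other* group owns the lock. */
	while (s->active_group != -1 && s->active_group != group)
		pthread_cond_wait(&s->cond, &s->mutex);
	s->active_group = group;
	s->active_count++;
	pthread_mutex_unlock(&s->mutex);
}

void tg_unlock(struct two_group_sem *s)
{
	pthread_mutex_lock(&s->mutex);
	assert(s->active_count > 0);
	if (--s->active_count == 0) {
		/* Last holder out: go idle and let the other group in. */
		s->active_group = -1;
		pthread_cond_broadcast(&s->cond);
	}
	pthread_mutex_unlock(&s->mutex);
}
```

For the userfaultfd case, group 0 could stand for wrprotect and group 1 for un-wrprotect.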
Kent and I worked on one for a bit, and we called it a red-black mutex. If team red had the lock, more members of team red could join in. If team black had the lock, more members of team black could join in. I forget what our rule was around fairness (if team red has the lock, and somebody from team black is waiting, can another member of team red take the lock, or must they block?)
In this case they would need to block and provide full fairness.
Well, maybe a bit of unfairness (letting a few more through the door before it shuts) wouldn't be a deal breaker, but it would need to be bounded or it could starve the other color/side indefinitely: an ioctl with mode_wp = true would block forever if more ioctls with mode_wp = false kept arriving on other CPUs (or the other way around).
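One way to bound the unfairness, sketched under the same pthread assumption (fair_two_group_sem, ftg_lock, ftg_unlock and the grace parameter are all hypothetical). Once the other group is waiting, newcomers from the active group are admitted only while the current phase's entry count is below a limit; when the active group drains, the lock is handed straight to the waiting group. grace = 0 gives the strict "must block" behavior, while a small positive grace lets a bounded number more through the door before it shuts.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical two-group lock with bounded unfairness. */
struct fair_two_group_sem {
	pthread_mutex_t mutex;
	pthread_cond_t cond;
	int active_group;	/* -1 when idle, otherwise 0 or 1 */
	int active_count;	/* current holders, all from active_group */
	int waiting[2];		/* threads blocked in ftg_lock(), per group */
	int entered;		/* entries in the current phase */
	int limit;		/* entries allowed per phase while the other group waits */
	int grace;		/* extra entries tolerated beyond the initial batch */
};

#define FTG_INIT(grace_)					\
	{ .mutex = PTHREAD_MUTEX_INITIALIZER,			\
	  .cond = PTHREAD_COND_INITIALIZER,			\
	  .active_group = -1, .grace = (grace_) }

static int ftg_can_enter(const struct fair_two_group_sem *s, int group)
{
	if (s->active_group == -1)
		return 1;	/* idle: anyone may start a phase */
	if (s->active_group != group)
		return 0;	/* the other group holds (or was handed) the lock */
	/* Same group: join freely unless the other group is waiting. */
	return s->waiting[1 - group] == 0 || s->entered < s->limit;
}

void ftg_lock(struct fair_two_group_sem *s, int group)
{
	assert(group == 0 || group == 1);
	pthread_mutex_lock(&s->mutex);
	s->waiting[group]++;
	while (!ftg_can_enter(s, group))
		pthread_cond_wait(&s->cond, &s->mutex);
	s->waiting[group]--;
	if (s->active_group == -1) {
		/* Start a phase from idle: a batch of one, plus grace. */
		s->active_group = group;
		s->entered = 0;
		s->limit = 1 + s->grace;
	}
	s->entered++;
	s->active_count++;
	pthread_mutex_unlock(&s->mutex);
}

void ftg_unlock(struct fair_two_group_sem *s)
{
	pthread_mutex_lock(&s->mutex);
	assert(s->active_count > 0);
	if (--s->active_count == 0) {
		int other = 1 - s->active_group;

		if (s->waiting[other] > 0) {
			/* Hand off directly: allow as many entries as there
			 * are waiters of the other group, plus grace. */
			s->active_group = other;
			s->entered = 0;
			s->limit = s->waiting[other] + s->grace;
		} else {
			s->active_group = -1;
		}
		pthread_cond_broadcast(&s->cond);
	}
	pthread_mutex_unlock(&s->mutex);
}
```

The direct hand-off is what prevents the mode_wp starvation described above: the waiting side never competes with late arrivals of the active side for an idle lock.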
The approximation with an rwsem and two atomics provides full fairness in both read and write mode (IIRC, originally readers could starve writers, which was an issue for mprotect etc.; not anymore, thankfully).
It was to solve the direct-IO vs buffered-IO problem (you can have as many direct-IO readers/writers at once or you can have as many buffered-IO readers/writers at once, but exclude a mix of direct and buffered I/O). In the end, we decided it didn't work all that well.
Well, mixing buffered and direct I/O is certainly not good practice, so it's reasonable to leave it to userland to serialize if such a mix is needed; the kernel behavior is undefined if buffered and direct I/O are mixed concurrently without any ordering.