On Tue, Jun 01, 2021 at 03:45:08PM -0400, Josef Bacik wrote:
We have been hitting some early ENOSPC issues in production with more recent kernels, and I tracked it down to us simply not flushing delalloc as aggressively as we should be. With tracing I saw us failing all tickets with all of the block rsvs at or around 0, with very little pinned space, but still around 120MiB of outstanding bytes_may_use. Upon further investigation I saw that we were flushing around 14 pages per shrink call for delalloc, despite having around 2GiB of delalloc outstanding.
Consider the example of an 8-way machine, all CPUs trying to create a file in parallel, which at the time of this commit requires 5 items to do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim size waiting on reservations. Now assume we have 128MiB of delalloc outstanding. With our current math we would set items to 20, and then set to_reclaim to 20 * 256k, or 5MiB.
Assuming that we went through this loop all 3 times, for both FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop twice, we'd only flush 60MiB of the 128MiB of delalloc space. This could leave a fair bit of delalloc reservations still hanging around by the time we go to ENOSPC out all the remaining tickets.
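To make the arithmetic above concrete, here is a small userspace sketch (not kernel code). The 256k per-item figure is reconstructed from the numbers in the text (nodesize * 2 * max tree level = 16k * 2 * 8), and the halving of the item count is likewise inferred from "items to 20", so treat both as illustrative assumptions rather than the kernel's exact formula:

```python
# Userspace model of the old shrink_delalloc math described above.
# per_item_rsv and the halving are reconstructed from the text, not
# lifted from the kernel source.
KIB = 1024
MIB = 1024 * KIB

def per_item_rsv(nodesize=16 * KIB, max_level=8):
    # ~256KiB reserved per metadata item at a 16k leaf size
    return nodesize * 2 * max_level

ticket_items = 8 * 5                             # 8 CPUs x 5 items per file create
reclaim_size = ticket_items * per_item_rsv()     # 10MiB waiting on reservations
items = reclaim_size // per_item_rsv() // 2      # halved -> 20
to_reclaim = items * per_item_rsv()              # 20 * 256k = 5MiB per flush call

# 3 passes each for FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and the
# full loop twice: 5MiB * 3 * 2 * 2 = 60MiB of the 128MiB of delalloc.
flushed = to_reclaim * 3 * 2 * 2
```

This is how the 60MiB figure falls out: the flush target is sized from the ticket reclaim size, so it never scales with the 128MiB actually outstanding.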
Fix this in two ways. First, change the calculations to be a fraction of the total delalloc bytes on the system. Prior to my change we were calculating based on dirty inodes, so the math made more sense; now it's just completely unrelated to what we're actually doing.
Second, add a FLUSH_DELALLOC_FULL state, which we hold off on until we've gone through the flush states at least once. This will empty the system of all delalloc, so we're sure to be truly out of space when we start failing tickets.
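A rough model of the two changes, again as a userspace sketch: the specific fraction (half of outstanding delalloc) is purely illustrative and should be checked against the actual patch, as is the function name:

```python
# Illustrative model of the fixed reclaim sizing, not the patch itself.
MIB = 1024 * 1024

def to_reclaim_for_delalloc(delalloc_bytes, full_flush=False):
    if full_flush:
        # FLUSH_DELALLOC_FULL: empty the system of all delalloc so that
        # failing tickets afterwards means we are truly out of space.
        return delalloc_bytes
    # Scale the flush target with what is actually outstanding on the
    # system, rather than with the ticket reclaim size (fraction is a
    # placeholder).
    return delalloc_bytes // 2
```

With 128MiB of delalloc outstanding, the earlier-stage flushes now target a meaningful fraction of it, and the FULL state drains the remainder before tickets are failed.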
I'm tagging stable 5.10 and forward, because this is where we started using the page stuff heavily again. This affects earlier kernel versions as well, but it would be a pain to backport to them as the flushing mechanisms aren't the same.
For 5.10 it depends on f00c42dd4cc8b856e6 ("btrfs: introduce a FORCE_COMMIT_TRANS flush operation") and is followed by the preemptive flushing series. Prior to the commit introducing COMMIT_TRANS there are 3 patches that seem lightweight enough for a stable backport to 5.10, but that should be evaluated first.
The 5.11.x stable series is EOL, so it's fine to pick this for 5.12, but in case there's interest in backporting it to 5.10, more work is needed than just tagging.