在 2025/10/24 18:37, Filipe Manana 写道:
On Thu, Oct 23, 2025 at 9:44 PM Qu Wenruo wqu@suse.com wrote:
在 2025/10/24 01:56, Filipe Manana 写道:
On Thu, Oct 23, 2025 at 10:33 AM Qu Wenruo wqu@suse.com wrote:
[...]
So two suggestions:
- Move this into btrfs_error_commit_super(), to have all code related
to fs in error state in one single place.
- Instead of of calling filemap_write_and_wait(), make this simpler
by doing the iput() of the btree inode right before calling btrfs_stop_all_workers() and removing the call to invalidate_inode_pages2() which is irrelevant since the final iput() removes everything from the page cache except dirty pages (but the iput() already triggered writeback of them).
In fact for this scenario the call to invalidate_inode_pages2() must be returning -EBUSY due to the dirty pages, but we have always ignored its return value.
From a quick glance, it seems to me that suggestion 2 should work.
Yes, that's the original workaround I went with, the problem is we're still submitting metadata writes after a trans abort.
I don't feel that comfort writing back metadata in that situation. Maybe the trans abort is triggered because a corrupted extent/free space tree which allows us to allocate new tree blocks where we shouldn't (aka, metadata COW is already broken).
Metadata COW is broken why??
Consider this situation, no free space cache, and extent tree by somehow lacks the backref item for tree block A.
Then a new transaction is started, allocator choose the bytenr of tree block A for a new tree block B.
At this stage, tree block B will overwrite tree block A, but no writeback yet. And at this time transaction is not yet aborted.
Then by somehoe the tree block A got its reference dropped by one (e.g. subvolume deletion), but extent tree is corrupted and failed to find the backref item, and transaction is aborted.
Even after a transaction aborts, it's ok, but pointless, to allocate extents and trigger writeback for them.
Writeback can be triggered by a lot of other reasons, e.g. memory pressure, and in that case if we try to writeback tree block B, it will overwrite tree block A even if it's after transaction abort.
Not to mention the final iput() in close_ctree().
Thanks, Qu
As long as we don't allow the transaction to be committed and new transactions to be started, we are safe - in fact that's the only thing a transaction abort guarantees.
We may have many places that could check if a transaction was aborted and avoid extent allocation and writeback, but that's ok, as long as we don't allow transaction commits.
Thus I consider delaying btrfs_stop_all_workers() until iput() is only a workaround, it still allows us to submit unnecessary writes.
I'd prefer the solution 1) in this case, still with the extra handling in write_one_eb().
Thanks for the review and suggestion, will follow the advice of the remaining part.
Thanks, Qu