From: Sarah Newman srn@prgmr.com
[ Upstream commit 00c9c9628b49e368d140cfa61d7df9b8922ec2a8 ]
With `two-primaries` enabled, DRBD tries to detect "concurrent" writes and handle write conflicts, so that even if you write to the same sector simultaneously on both nodes, they end up with the identical data once the writes are completed.
In handling "superseeded" writes, we forgot a kref_get, resulting in a premature drbd_destroy_device and use after free, and further to kernel crashes with symptoms.
Relevance: No one should use DRBD as a random data generator, and apparently all users of "two-primaries" handle concurrent writes correctly on layer up. That is cluster file systems use some distributed lock manager, and live migration in virtualization environments stops writes on one node before starting writes on the other node.
Which means that other than for "test cases", this code path is never taken in real life.
FYI, in DRBD 9, things are handled differently nowadays. We still detect "write conflicts", but no longer try to be smart about them. We decided to disconnect hard instead: upper layers must not submit concurrent writes. If they do, that's their fault.
Signed-off-by: Sarah Newman srn@prgmr.com Signed-off-by: Lars Ellenberg lars@linbit.com Signed-off-by: Christoph Böhmwalder christoph.boehmwalder@linbit.com Link: https://lore.kernel.org/r/20250627095728.800688-1-christoph.boehmwalder@linb... Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org ---
LLM Generated explanations, may be completely bogus:
Based on my analysis of the commit and the surrounding code context, here is my assessment:
**YES** - This commit should be backported to stable kernel trees.
Here's my extensive explanation:
1. **Critical Bug Fix**: This commit fixes a missing `kref_get` that causes a use-after-free vulnerability. The code at drivers/block/drbd/drbd_receiver.c:2503-2507 shows the fix adds the missing reference count increment before queuing work that will later decrement the reference count in `drbd_send_acks_wf()`.
2. **Clear Use-After-Free Pattern**: Looking at the code pattern: - Before the fix: The code queues work (`queue_work`) that will call `drbd_send_acks_wf()`, which contains `kref_put(&device->kref, drbd_destroy_device)` at line 6119 - Without the matching `kref_get`, this leads to premature destruction of the device structure - The commit message explicitly states this results in "premature drbd_destroy_device and use after free"
3. **Follows Established Pattern**: The fix follows the exact same pattern used elsewhere in the DRBD code. At drivers/block/drbd/drbd_worker.c:150-152, we see: ```c kref_get(&device->kref); /* put is in drbd_send_acks_wf() */ if (!queue_work(connection->ack_sender, &peer_device->send_acks_work)) kref_put(&device->kref, drbd_destroy_device); ``` This shows the fix is correct and consistent with the codebase.
4. **Small and Contained Fix**: The change is minimal (adding 4 lines), well-understood, and doesn't introduce any architectural changes. It simply ensures proper reference counting.
5. **Real Security Issue**: Use-after-free bugs can lead to kernel crashes and potentially be exploited for privilege escalation. Even though the commit notes this code path is rarely taken in production ("two-primaries" mode with concurrent writes), when it is triggered, it causes kernel crashes.
6. **Meets Stable Criteria**: According to Documentation/process/stable- kernel-rules.rst: - It fixes "a real bug that bothers people" (kernel crashes) - It's "obviously correct and tested" (follows established pattern) - It's under 100 lines - It fixes "data corruption, a real security issue" (use-after-free)
7. **No Side Effects**: The fix only adds proper reference counting and doesn't change any logic or introduce new features. The only behavior change is preventing the use-after-free bug.
The fact that this affects a rarely-used code path ("two-primaries" mode) doesn't diminish its importance for backporting, as users who do use this feature would experience kernel crashes without this fix. The fix is safe, minimal, and prevents a serious bug.
drivers/block/drbd/drbd_receiver.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c index 0c9f54197768..ac18d36b0ea8 100644 --- a/drivers/block/drbd/drbd_receiver.c +++ b/drivers/block/drbd/drbd_receiver.c @@ -2500,7 +2500,11 @@ static int handle_write_conflicts(struct drbd_device *device, peer_req->w.cb = superseded ? e_send_superseded : e_send_retry_write; list_add_tail(&peer_req->w.list, &device->done_ee); - queue_work(connection->ack_sender, &peer_req->peer_device->send_acks_work); + /* put is in drbd_send_acks_wf() */ + kref_get(&device->kref); + if (!queue_work(connection->ack_sender, + &peer_req->peer_device->send_acks_work)) + kref_put(&device->kref, drbd_destroy_device);
err = -ENOENT; goto out;