On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote:
On Tue, Dec 13, 2022 at 11:17:42AM +0100, Petr Mladek wrote:
On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote:
On Mon, Dec 05, 2022 at 11:35:57AM +0100, Petr Pavlu wrote:
diff --git a/kernel/module/main.c b/kernel/module/main.c index d02d39c7174e..7a627345d4fd 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) sched_annotate_sleep(); mutex_lock(&module_mutex); mod = find_module_all(name, strlen(name), true);
- ret = !mod || mod->state == MODULE_STATE_LIVE;
ret = !mod || mod->state == MODULE_STATE_LIVE
|| mod->state == MODULE_STATE_GOING;
mutex_unlock(&module_mutex);
return ret;
@@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod)
mod->state = MODULE_STATE_UNFORMED;
-again:
So this is part of my biggest concern for regression, the removal of this tag and its use.
Before this we always looped back to trying again and again.
Just to be sure that we are on the same page.
The loop was _not_ infinite. It serialized all attempts to load the same module. In our case, it serialized all failures and prolonged the pain.
That's fair yes. The loop happens so long as an already existing module is present with the same name.
mutex_lock(&module_mutex); old = find_module_all(mod->name, strlen(mod->name), true); if (old != NULL) {
if (old->state != MODULE_STATE_LIVE) {
if (old->state == MODULE_STATE_COMING
|| old->state == MODULE_STATE_UNFORMED) { /* Wait in case it fails to load. */ mutex_unlock(&module_mutex); err = wait_event_interruptible(module_wq, finished_loading(mod->name)); if (err) goto out_unlocked;
goto again;
We essentially bound this now, and before we didn't.
Yes we we wait for finished_loading() of the module -- but if udev is hammering tons of same requests, well, we *will* surely hit this, as many requests managed to get in before userspace saw the module present.
While this typically can be useful, it means *quite a bit* of conditions which definitely *did* happen before will now *bail out* fast, to the extent that I'm not even sure why we just re-try once now.
I do not understand this. We do _not_ re-try the load in the new version. We just wait for the result of the parallel attempt to load the module.
Maybe, you are confused that we repeat find_module_all(). But it is the way how to find the result of the parallel load.
My point is that prior to the buggy commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading") and even after that commit it we 'goto again' if an old request is found. We now simply bound this right away. Yes, the loop was not infinite, but in theory at least a few iterations were possible before whereas now immediately return -EBUSY and I don't think all use cases may be ready yet.
If we're going to just re-check *once* why not do something graceful like *at least* cond_resched() to let the system breathe for a *tiny bit*.
We must check the result under module_mutex. We have to take this sleeping lock. There is actually a rescheduling. I do not think that cond_resched() would do any difference.
Makes sense.
/* The module might have gone in the meantime. */
mutex_lock(&module_mutex);
old = find_module_all(mod->name, strlen(mod->name),
}true);
err = -EEXIST;
/*
* We are here only when the same module was being loaded. Do
* not try to load it again right now. It prevents long delays
* caused by serialized module load failures. It might happen
* when more devices of the same type trigger load of
* a particular module.
*/
if (old && old->state == MODULE_STATE_LIVE)
err = -EEXIST;
else
err = -EBUSY;
And for all those use cases we end up here now, with -EBUSY. So udev before was not bounded, and kept busy-looping on the retry in-kernel, and we now immediately bound its condition to just 2 tries to see if the old module existed and now *return* a new value to userspace.
My main concerns are:
- Why not use cond_resched() if we're just going to check twice?
We take module_mutex. It should cause even bigger delay than cond_resched().
ACK.
- How are we sure we are not regressing userspace by removing the boundless
loop there? (even if the endless loop was stupid)
We could not be sure. On the other hand, if more attempts help to load the module then it is racy and not reliable. The new approach would make it better reproducible and fix the race.
Yes, but the short cut it is a userspace visible change.
- How is it we expect that we won't resgress userspace now by bounding
that check and pretty much returning -EBUSY right away? This last part seems dangerous, in that if userspace did not expect -EBUSY and if an error before caused a module to fail and fail boot, why wouldn't we fail boot now by bailing out faster??
Same answer as for 1)
- *Fixing* a kernel regression by adding new expected API for testing
against -EBUSY seems not ideal.
IMHO, the right solution is to fix the subsystems so that they send only one uevent.
Makes sense, but that can take time and some folks are stuck on old kernels and perhaps porting fixes for this on subsystems may take time to land to some enterprisy kernels. And then there is also systemd that issues the requests too, at least that was reflected in commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading") that commit claims it was systemd issueing the requests which I mean to interpret finit_module(), not calling modprobe.
just a comment on this, if it helps anything: commit 6e6de3dee51a says systemd, but it would be more accurate to say udev or systemd-udev to disambiguate with pid 1.
udev uses libkmod to load modules so it's pretty much the same logic as modprobe.
Lucas De Marchi