On Tue, Jan 29, 2019 at 7:19 AM Greg KH gregkh@linuxfoundation.org wrote:
On Tue, Jan 29, 2019 at 06:35:48AM -0800, Guenter Roeck wrote:
On Mon, Jan 28, 2019 at 10:58 PM Greg KH gregkh@linuxfoundation.org wrote:
On Mon, Jan 28, 2019 at 09:31:30AM -0800, Zubin Mithra wrote:
From: Benjamin Herrenschmidt benh@kernel.crashing.org
commit 726e41097920a73e4c7c33385dcc0debb1281e18 upstream
For devices with a class, we create a "glue" directory between the parent device and the new device with the class name.
This directory is never "explicitely" removed when empty however, this is left to the implicit sysfs removal done by kobject_release() when the object loses its last reference via kobject_put().
This is problematic because as long as it's not been removed from sysfs, it is still present in the class kset and in sysfs directory structure.
The presence in the class kset exposes a use after free bug fixed by the previous patch, but the presence in sysfs means that until the kobject is released, which can take a while (especially with kobject debugging), any attempt at re-creating such as binding a new device for that class/parent pair, will result in a sysfs duplicate file name error.
This fixes it by instead doing an explicit kobject_del() when the glue dir is empty, by keeping track of the number of child devices of the gluedir.
This is made easy by the fact that all glue dir operations are done with a global mutex, and there's already a function (cleanup_glue_dir) called in all the right places taking that mutex that can be enhanced for this. It appears that this was in fact the intent of the function, but the implementation was wrong.
Backport Note: kref_read() is not present in 4.4. Hence, use atomic_read(&kref.refcount) instead of kref_read(&kref).
Signed-off-by: Benjamin Herrenschmidt benh@kernel.crashing.org Acked-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zubin Mithra zsm@chromium.org
drivers/base/core.c | 2 ++ include/linux/kobject.h | 17 +++++++++++++++++ 2 files changed, 19 insertions(+)
Wait, why is this needed?
We have syzcaller/syzbot hits, and there is a reproducer. Is this sufficient ?
Nice, where?
See below.
And why only for 4.4? What about 4.9 and 4.14? Do you want to upgrade and suddenly hit the same "bug" that you fixed before?
Good point. Sorry, that got lost. We are, after all, oly human. I'd ask Zubin to provide backports, or do it myself, but we'll have to resolve the issue you bring up below first.
There was a reason that I did not backport this to the stable tree when it was submitted, and that was because this was an odd race to ever hit. Are you hitting this in the real world without kobject deferred release enabled? And if so, are you hitting the WARN_ON that is added here?
I think we may need updated rules for stable. Many bug fixes are backported to stable releases without having been seen in the "real world" (whatever that means). At the same time we do see many races in the real world, many of them not fully understood. Our policy so far is to fix as many problems as possible if they are understood, in the hope that it fixes at least some of those problems seen in the field.
I don't have a problem with backporting this, but it went in as a "fixes a theoritical issue, and let's WARN_ON if it really ever is hit".
So I would like to see where we are hitting that WARN_ON, as it sounds like you found an easy-to-reproduce way to do it.
If there is a new rule that problems have to have been observed in the real world (ie without debugging options enabled, and without specific reproducer) before a patch is applied to stable, can we have that as generic rule that is not only selectively applied ?
It's not a specific rule, it's just that this specific patch went through a 30+ email chain of us arguing about it, so I really want to see why this is needed, and why Ben was right and I was wrong :)
I'm not trying to be extra hard here, it's just that I have a lot of history with this patch...
So if you all have a reproducer, I would love to see it.
Turns out you were already on Cc: for one of the reports:
https://b.corp.google.com/issues/112630408
I added you to the other:
https://b.corp.google.com/issues/112630534
The syzbot dashboard links in those bugs point to the reproducers.
Oh, and the backports to the other kernel versions as well, you don't want to have to fix this again in a year when you all upgrade to a new release.
For v4.14: The upstream patch (726e41097920) applies directly. For v4.9.y: The patch provided for v4.4.y also applies to v4.9.y.
Thanks, Guenter