On Wed, Feb 28, 2018 at 11:39:56AM -0800, Brian Norris wrote:
Hi Greg,
Hi,
Sorry for the delay, wanted to think about this one for a while...
...
Thanks, but I've now released all of these with this patch committed, so we are now "bug compatible" :)
So, is that to say that the boilerplate above about objections is meaningless? This is the second time that this same "feature" has been pushed (degrading the quality of my systems) despite my objections, under the banner of "bug compatibility" [1]. The first attempt to revert was back around Dec 20 of last year, but I see that there were 10 "stable" 4.4 kernels released in the meantime [2] where that original bug was still present. (Commit fd865802c66b "Bluetooth: btusb: fix QCA Rome suspend/resume" was proven undeniably buggy.)
Sorry, I know you were frustrated, but for some subsystems/minor devices like this, regressions happen and getting them fixed properly can take a few weeks.
And yes, something doesn't feel "minor" when it affects your devices, but really you have the control here to revert the change on your side (more on that below...).
Next: we see this current valiant attempt at a less buggy fix, by Hans. It's an OK solution, but it still wastes power for me. I objected above, but instead of delaying applying it, you applied it in the same release as you finally fixed the original crap (v4.4.116). So all-in-all, my system (if using 4.4.x directly) hasn't had decent Bluetooth since v4.4.99.
I'm amazed bluetooth works at all at times, given the mess of the hardware involved, and the horrid spec and all of the intermediate pieces. Luckily 4.15+ seems really good for me now, but I know you can't upgrade :)
That being said, some subsystems have problems with stuff like this due to crazy hardware, where one fix breaks another and the like. There is also the issue of maintainers that don't work on the subsystem "full" or even "part" time. From my side, I submitted a known-security-bugfix and it was ignored by the bluetooth maintainers for weeks, so I had to route around them and push it directly to Linus just to get it fixed. So I feel your pain, but we are dealing with people with different priorities, none of which we directly control, so we have to handle it the best we can.
In the end, it's amazing any of this works at all, but it does, it just sometimes takes longer than we all like :)
It's a tough trade-off. If I dropped this patch, the normal mode of operation would be for it to get merged into device kernels and then forgotten about. Only if/when the user with the problem moves to a newer release a long time later would the regression normally appear again, and everyone would have to remember what happened and try to piece it all together again as to what commit caused the issue.
Note that I didn't suggest we have to completely drop the patch. And I also don't suspect you need to delay all -rc1 bugfixes. I'd just suggest delaying the patch for a few weeks, when there are objections raised. (Or, reverting and scheduling to re-queue in a few weeks if no progress...or something like that.) Is that not something that could work, in order to keep "stable" releases *actually* stable? In most software release processes, buggy patches are reverted as quickly as possible while alternatives are worked out. Not all fixes are security fixes that need to be out the door as soon as they see the light of day...
Having the "bug compatible" stable kernels is controversial. And I don't always follow that rule, depending on the subsystem/bug involved (see a recent btrfs bug for one such example.) That being said, I have found that it is the best thing to do overall, as it provides the needed pressure on the developer/maintainer/user to get the bug fixed and pushed to Linus as soon as possible.
And in the meantime, if you, as a user, know which patch is the problem, you can always revert it on your own. We all have local patches, you more than me, but that's just part of dealing with open source projects. Not a big deal at all: add the revert to your stack, and when the bug gets fixed you drop your patch and all is good/fine.
If you aren't using tools to make this easier, well that can be fixed (I strongly recommend quilt, not git, to work with a device kernel, but that's another long rant/email for another time...)
By you adding the revert to your device kernel now, you have a record of this being a problem, how upstream isn't fixing the issue, and when/if you do move to a newer kernel, that bugfix will still be there in your patch stack to forward port.
So, you rely entirely on device kernels to manage the pain that your release process causes? We're actively trying to stay much closer to upstream these days, and would essentially like to eliminate the concept of "device" kernels, at least for Chrom{e,ium} OS, if possible. But it's crap like this that proves that we can't.
Oh come on, you all have _thousands_ of graphic driver patches in your tree on top of mainline. Dealing with 2-4 device-specific patches is a total drop in the bucket.
Yes, it takes testing and finding the problem, but look at the benefit! You are getting 10 patches a day that are curated and hopefully tested and maintained by the community. If you only have 1 failure a week, your odds are still way in your favor of having more bugs fixed (security and otherwise) than if you ignored the stable patches entirely.
And that's the point to drive home here. If you stay away from updating to stable patches, you have a huge boatload of KNOWN SECURITY HOLES in your product. If you take them, you have the _possibility_ of some bugs added, but overall, the rate is _VERY_ small. Guenter has numbers showing that 2-4 patches per year cause problems. That's lower than ANY other development model I have ever seen anywhere.
So, stick with known buggy/insecure devices? Or take the updates and handle the 1-2 problems a year they provide you. I think the cost-analysis is easy to make here :)
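As a rough back-of-the-envelope sketch of the figures above: it assumes a flat 10 patches a day over a 365-day year, which is a simplification for illustration and not a measured number, and the function name is made up for this example.

```python
# Back-of-the-envelope regression rates for stable patches.
# Assumption (not from any measurement): a flat 10 patches/day, 365 days/year.
PATCHES_PER_DAY = 10
DAYS_PER_YEAR = 365
patches_per_year = PATCHES_PER_DAY * DAYS_PER_YEAR


def regression_rate(bad_patches_per_year: int) -> float:
    """Fraction of a year's stable patches that cause a problem."""
    return bad_patches_per_year / patches_per_year


# Scenarios taken from the numbers quoted in the text.
scenarios = [
    ("1 failure a week", 52),
    ("4 problem patches a year", 4),
    ("2 problem patches a year", 2),
]

for label, bad in scenarios:
    print(f"{label}: {bad} of {patches_per_year} patches = {regression_rate(bad):.3%}")
```

Even the pessimistic "1 failure a week" case works out to under 2% of patches under these assumptions, and the 2-4 per year case is roughly a tenth of a percent or less.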
And don't try to say "I'll just cherry-pick the security patches." It never works. I have audited a ton of device kernels, from loads of companies that say they know what they are doing. All of them were full of holes and missed obvious bugs. As proof of that, I can get root and/or crash almost every single major Android device on the market right now due to them not taking updated kernels. It's a sad state, and one that I am working with the OEMs and Google to resolve.
Yeah, you all are normally better than that, and I trust that you will push to get this resolved, hopefully soon. But for the most part, this method works best overall for the majority of the cases like this as not all bug reporters are persistent, and if not, the maintainer usually forgets about it as no one is saying anything and they have other things to work on.
Well, bluetooth is known to not have responsive maintainers, so who am I kidding here, odds are it's only going to get fixed as Hans is involved, despite the bluetooth maintainers :)
You can't pin this completely on the bluetooth maintainers. *You* maintain the -stable trees, yet you effectively ignored both of my objections, forcing me to rely on said original maintainers to queue up alternatives. Yes, yes, I know the "forcing me [and/or Hans] to work" is basically working as intended for you, but the hard facts show that *your* release was broken for far too long.
That's fine, it's my _job_ to push back on the maintainers here, as they are the ones that are doing the wrong thing, I'm not going to do their job, just like they aren't going to do mine.
And I don't scale, they do, as they only have to maintain a single subsystem, I have the whole stable tree to worry about.
And again, just revert and move on, I think we have taken more time with these emails about the whole process than it would have taken to do this on your side :)
[1] BTW, I've had multiple people laugh at me when I mentioned this phrase in explaining our predicament to people.
Those people have no understanding of how this whole thing that they rely on for their job is developed, which is sad on so many other levels...
thanks,
greg k-h