On Tue, May 14, 2019 at 11:19:34AM +0200, Greg Kroah-Hartman wrote:
On Mon, May 13, 2019 at 11:55:18AM -0600, Raul E Rangel wrote:
I think we should cherry-pick 41e3efd07d5a02c80f503e29d755aa1bbb4245de https://lore.kernel.org/patchwork/patch/856512/ into 4.14. It fixes a potential resource leak when shutting down the request queue.
Potential meaning "it does happen", or "it can happen if we do this", or just "maybe it might happen, we really do not know?"
It does happen if the AMD SDHCI patches are cherry-picked into 4.14. https://lkml.org/lkml/2019/5/1/398 It can be mitigated by changing the line in the patch with `mmc_detect_change(host, 0)` to mmc_detect_change(host, 200)`, but that's just a workaround to play with the timing so the race condition doesn't happen.
Once this patch is applied, there is a potential for a null pointer dereference. That's what the second patch fixes.
What is the git id of that upstream fix?
So there is no specific upstream fix. There was a large patch set that migrated mmc to using blk-mq, so the bug just kind of went away. https://lwn.net/Articles/739774/ or 0fbfd12518303e9b32ac9fd231439459eac848f9
The third patch is just an optimization to stop processing earlier.
That's not how stable kernels work :(
Oops, I guess we can ignore that patch. It just prevents mmc_init_request from being called, but it doesn't matter since the 2nd patch actually checks for NULL now.
See https://patchwork.kernel.org/patch/10925469/ for the initial motivation.
I don't understand the motivation from that link at all :(
This commit applies to v4.14.116. It is already included in 4.19. 4.19 doesn't suffer from the null pointer dereference because later commits migrate the mmc stack to blk-mq.
What are those later commits?
Commit 0fbfd12518303e9b32ac9fd231439459eac848f9 specifically deletes the code for the 2nd patch. As I said above, the NULL pointer dereference just kind of went away as part of the blk-mq migration, so there is no upstream fix :(
I tested this patch set by randomly connecting/disconnecting the SD card. I got over 189650 itarations without a problem.
And if you do not have these patches, on 4.14.y, how many iterations cause a problem? If you just apply the first patch, does that work?
If I apply the AMD SDHCI patches and nothing else, then I can cause a resource leak within 10 iterations. If I apply just the first patch then I can cause a NULL pointer error within 10 iterations. If I apply both 1 and 2, then everything works as expected and I can't cause a problem.
_EVERY_ time we take a patch that is not upstream, something usually is broken and needs to be fixed. We have a long long long history of this, so if you want to have a patch that is not upstream applied to a stable kernel release, you need a whole lot of justification and explanation and begging. And you need to be around to fix the fallout for when it breaks :)
It also looks like 2361bfb055f948eac6583fa3c75a014da84fe554 includes a fix for 41e3efd07d5a02c80f503e29d755aa1bbb4245de, so that would need to be cherry picked in.
I guess I should have included a fixes: line in my second patch.
So to summarize: - cherry-pick 41e3efd07d5a02c80f503e29d755aa1bbb4245de - cherry-pick 2361bfb055f948eac6583fa3c75a014da84fe554 - apply 2nd patch but add to commit message: Fixes: 41e3efd07d5a ("mmc: block: Simplify cleaning up the queue") - Ignore patch 3 since it's an optimization.
thanks,
greg k-h
Thanks again for your time, and sorry for the really late response!
Raul