On Mon, 27 Jan 2020, Jouni Högander wrote:
Lukas Bulwahn lukas.bulwahn@gmail.com writes:
On Wed, 22 Jan 2020, Jouni Högander wrote:
Greg Kroah-Hartman gregkh@linuxfoundation.org writes:
Now queued up, I'll push out -rc2 versions with this fix.
greg k-h
We have also been informed about another regression these two commits are causing:
https://lore.kernel.org/lkml/ace19af4-7cae-babd-bac5-cd3505dcd874@I-love.SAK...
I suggest to drop these two patches from this queue, and give us a week to shake out the regressions of the change, and once ready, we can include the complete set of fixes to stable (probably in a week or two).
Ok, thanks for the information, I've now dropped them from all of the queues that had them in them.
greg k-h
I have now run more extensive Syzkaller testing on following patches:
cb626bf566eb net-sysfs: Fix reference count leak ddd9b5e3e765 net-sysfs: Call dev_hold always in rx_queue_add_kobject e0b60903b434 net-sysfs: Call dev_hold always in netdev_queue_add_kobje 48a322b6f996 net-sysfs: fix netdev_queue_add_kobject() breakage b8eb718348b8 net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject
These patches are fixing couple of memory leaks including this one found by Syzbot: https://syzkaller.appspot.com/bug?extid=ad8ca40ecd77896d51e2
I can reproduce these memory leaks in following stable branches: 4.14, 4.19, and 5.4.
These are all now merged into net/master tree and based on my testing they are ready to be taken into stable branches as well.
- syzkaller list
Jouni et. al, please drop Linus in further responses; Linus, it was wrong to add you to this thread in the first place (reason is explained below)
Jouni, thanks for investigating.
It raises the following questions and comments:
- Does the memory leak NOT appear on 4.9 and earlier LTS branches (or did
you not check that)? If it does not appear, can you bisect it with the reproducer to the commit between 4.14 and 4.9?
I tested and these memory leaks are not reproucible in 4.9 and earlier.
- Do the reproducers you found with your syzkaller testing show the same
behaviour (same bisection) as the reproducers from syzbot?
Yes, they are same.
- I fear syzbot's automatic bisection on is wrong, and Linus' commit
0e034f5c4bc4 ("iwlwifi: fix mis-merge that breaks the driver") is not to blame here; that commit did not cause the memory leak, but fixed some unrelated issue that simply confuses syzbot's automatic bisection.
Just FYI: Dmitry Vyukov's evaluation of the syzbot bisection shows that about 50% are wrong, e.g., due to multiple bugs being triggered with one reproducer and the difficulty of automatically identifying them of being different due to different root causes (despite the smart heuristics of syzkaller & syzbot). So, to identify the actual commit on which the memory leak first appeared, you need to bisect manually with your own judgement if the reported bug stack trace fits to the issue you investigating. Or you use syzbot's automatic bisection but then with a reduced kernel config that cannot be confused by other issues. You might possibly also hit a "beginning of time" in your bisection, where KASAN was simply not supported, then the initially causing commit can simply not determined by bisection with the reproducer and needs some code inspection and archaeology with git. Can you go ahead try to identify the correct commit for this issue?
These two commits (that are not in 4.9 and earlier) are intorducing these leaks:
commit e331c9066901dfe40bea4647521b86e9fb9901bb Author: YueHaibing yuehaibing@huawei.com Date: Tue Mar 19 10:16:53 2019 +0800
net-sysfs: call dev_hold if kobject_init_and_add success
[ Upstream commit a3e23f719f5c4a38ffb3d30c8d7632a4ed8ccd9e ] In netdev_queue_add_kobject and rx_queue_add_kobject, if sysfs_create_group failed, kobject_put will call netdev_queue_release to decrease dev refcont, however dev_hold has not be called. So we will see this while unregistering dev: unregister_netdevice: waiting for bcsh0 to become free. Usage count = -1 Reported-by: Hulk Robot hulkci@huawei.com Fixes: d0d668371679 ("net: don't decrement kobj reference count on init fail ure") Signed-off-by: YueHaibing yuehaibing@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
commit d0d6683716791b2a2761a1bb025c613eb73da6c3 Author: stephen hemminger stephen@networkplumber.org Date: Fri Aug 18 13:46:19 2017 -0700
net: don't decrement kobj reference count on init failure
If kobject_init_and_add failed, then the failure path would decrement the reference count of the queue kobject whose reference count was already zero. Fixes: 114cf5802165 ("bql: Byte queue limits") Signed-off-by: Stephen Hemminger sthemmin@microsoft.com Signed-off-by: David S. Miller davem@davemloft.net
But, it seems that we now have just a long sequences of fix patches.
This commit from 2011 seems to be the initial buggy one:
commit 114cf5802165ee93e3ab461c9c505cd94a08b800 Author: Tom Herbert therbert@google.com Date: Mon Nov 28 16:33:09 2011 +0000
bql: Byte queue limits
And then we just have fixes over fixes:
114cf5802165ee93e3ab461c9c505cd94a08b800 fixed by d0d6683716791b2a2761a1bb025c613eb73da6c3 fixed by a3e23f719f5c4a38ffb3d30c8d7632a4ed8ccd9e fixed by the sequence of your five patches, mentioned above
If that is right, we should be able to find a reproducer with syzkaller on the versions before d0d668371679 ("net: don't decrement kobj reference count on init failure") with fault injection enabled or some manually injected fault by modifying the source code to always fail on init to really trigger the init failure, and see the reference count go below zero.
All further issues should also have reproducers found with syzkaller. If we have a good feeling on the reproducers and this series of fixes really fixed the issue now here for all cases, we should suggest to backport all of the fixes to 4.4 and 4.9.
We should NOT just have Greg pick up a subset of the patches and backport them to 4.4 and 4.9, that will likely break more than it fixes.
Jouni, did you see Greg's bot inform you that he would pick up your latest patch for 4.4 and 4.9? Please respond to those emails to make sure a complete set of patches is picked up, which we tested with all those intermediate reproducers and an extensive syzkaller run hitting the net-sysfs interface (e.g., by configuring the corpus and check coverage).
If you cannot do this testing for 4.4 and 4.9 now quickly (you potentially have less than 24 hours), we should hold those new patches back for 4.4 and 4.9, as none of the fixes seem to be applied at all right now and the users have not complained yet on 4.4 and 4.9. Once testing of the whole fix sequence is done, we request to backport all patches at once for 4.4 and 4.9.
Lukas