This is a follow-up to my v7 series of fixes for the zram driver [0], which ended up uncovering a generic deadlock issue with sysfs and module removal. I first reported this issue and proposed a few patches in March 2021 [1]. At the end of this email you will find an itemized list of changes since that v1 series. You can also find these changes on my branch 20210927-sysfs-generic-deadlock-fix [4], which is based on linux-next tag next-20210927.
Just a heads up: I'm going on vacation in two days and won't be back until Monday, October 11th.
In this v8 I incorporate feedback from the v7 series, namely:

  - Tejun requested I move the struct module to the last attribute when
    extending functions
  - As per discussion with Tejun, trimmed and clarified the commit log
    and documentation on the generic fix on patch 7
  - As requested by Bart Van Assche, I simplified the setting of the
    struct test_config *config into one line instead of two in many
    places on patch 3, which adds the new sysfs selftest
  - Dan Williams had some questions about patch 7, and so I clarified
    these using a more elaborate example in the commit log to show
    where the lock call was happening
  - Trimmed the Cc list considerably as it was way too long before
  - Rebased onto linux-next tag next-20210927
Below is a list of changes to this patch set since its inception:

On v1:
  - Open coded the fix for the sysfs deadlock race, localized to the
    zram driver

Changes on v2:
  - Used bdgrab() as well for another race which was speculated by
    Minchan
  - Improved documentation of fixes

Changes on v3:
  - Used localized zram macros for the sysfs attributes instead of
    open coding on each routine
  - Replaced the bdget() approach with a generic get_device() and
    bus_get() on dev_attr_show() / dev_attr_store() for the issue
    speculated by Minchan

Changes on v4:
  - Cosmetic fixes on the zram fixes as requested by Greg
  - Split out the driver core fix as requested by Greg for the issue
    speculated by Minchan. This fix ended up getting up to its 4th
    patch iteration [2] and eventually hit linux-next. We got a 0day
    suspend stress failure for this patch [3]

Changes on v5:
  - I ended up writing a test_sysfs driver and with it I proved that
    the issue speculated by Minchan was not possible, and so I asked
    Greg to drop the patch from his queue titled "sysfs: fix kobject
    refcount to address races with kobject removal"
  - checkpatch fixes for the zram changes

Changes on v6:
  - I submitted my test_sysfs driver for inclusion upstream which
    easily abstracted the deadlock issue in a driver generically [4]
  - I rebased the zram fixes and also added a new patch for zram to
    use ATTRIBUTE_GROUPS
  - As per Minchan I sent the patches to be merged through Andrew
    Morton
  - Greg ended up NACK'ing the patchset because he was not sure the
    fix was still correct

Changes on v7:
  - Formalizes the originally proposed generic sysfs fix instead of
    using macro helpers to work around the issue
  - I decided it is best to merge all the effort together into one
    patch set because communication was being lost when I split the
    patches up. This was not helping in any way to either fix the zram
    issues or come to consensus on a generic solution. The patches are
    also merged now because they are all related now.
  - Running checkpatch exposed that S_IRWXUGO and
    S_IRWXU|S_IRUGO|S_IXUGO should be replaced, so I did that in this
    series in two new patches
  - Adds a try_module_get() documentation extension with tribal
    knowledge and new information which I suspect some folks still do
    not believe. The new test_sysfs selftest however proves this
    information to be correct; the same selftest can be used to try to
    prove that documentation incorrect
  - Because the fix is now generic, zram's deadlock can easily be fixed
    by just making it use ATTRIBUTE_GROUPS()
[0] https://lkml.kernel.org/r/YUjLAbnEB5qPfnL8@slm.duckdns.org
[1] https://lkml.kernel.org/r/20210306022035.11266-1-mcgrof@kernel.org
[2] https://lkml.kernel.org/r/20210623215007.862787-1-mcgrof@kernel.org
[3] https://lkml.kernel.org/r/20210701022737.GC21279@xsang-OptiPlex-9020
[4] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h...
Luis Chamberlain (12):
  LICENSES: Add the copyleft-next-0.3.1 license
  testing: use the copyleft-next-0.3.1 SPDX tag
  selftests: add tests_sysfs module
  kernfs: add initial failure injection support
  test_sysfs: add support to use kernfs failure injection
  kernel/module: add documentation for try_module_get()
  fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on kernfs_create_link()
  fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns()
  sysfs: fix deadlock race with module removal
  test_sysfs: enable deadlock tests by default
  zram: fix crashes with cpu hotplug multistate
  zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal
 .../fault-injection/fault-injection.rst  |   22 +
 LICENSES/dual/copyleft-next-0.3.1        |  237 +++
 MAINTAINERS                              |    9 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c   |    4 +-
 drivers/block/zram/zram_drv.c            |   74 +-
 fs/kernfs/Makefile                       |    1 +
 fs/kernfs/dir.c                          |   44 +-
 fs/kernfs/failure-injection.c            |   91 ++
 fs/kernfs/file.c                         |   19 +-
 fs/kernfs/kernfs-internal.h              |   75 +-
 fs/kernfs/symlink.c                      |    4 +-
 fs/sysfs/dir.c                           |    5 +-
 fs/sysfs/file.c                          |    6 +-
 fs/sysfs/group.c                         |    3 +-
 include/linux/kernfs.h                   |   19 +-
 include/linux/module.h                   |   34 +-
 include/linux/sysfs.h                    |   52 +-
 kernel/cgroup/cgroup.c                   |    2 +-
 lib/Kconfig.debug                        |   25 +
 lib/Makefile                             |    1 +
 lib/test_kmod.c                          |   12 +-
 lib/test_sysctl.c                        |   12 +-
 lib/test_sysfs.c                         |  952 ++++++++++++
 tools/testing/selftests/kmod/kmod.sh     |   13 +-
 tools/testing/selftests/sysctl/sysctl.sh |   12 +-
 tools/testing/selftests/sysfs/Makefile   |   12 +
 tools/testing/selftests/sysfs/config     |    5 +
 tools/testing/selftests/sysfs/sysfs.sh   | 1383 +++++++++++++++++
 28 files changed, 3026 insertions(+), 102 deletions(-)
 create mode 100644 LICENSES/dual/copyleft-next-0.3.1
 create mode 100644 fs/kernfs/failure-injection.c
 create mode 100644 lib/test_sysfs.c
 create mode 100644 tools/testing/selftests/sysfs/Makefile
 create mode 100644 tools/testing/selftests/sysfs/config
 create mode 100755 tools/testing/selftests/sysfs/sysfs.sh
Add the full text of the copyleft-next-0.3.1 license to the kernel tree as well as the required tags for reference and tooling. The license text was copied directly from the copyleft-next project's git tree [0].
Discussion of using copyleft-next-0.3.1 on Linux started in June 2016 [1]. In the end Linus' preference was to have drivers use MODULE_LICENSE("GPL") to make it clear that the GPL applies when it comes to Linux [2]. Additionally, even though copyleft-next-0.3.1 has been found to be GPLv2 compatible by three attorneys at SUSE and Red Hat [3], to err on the side of caution we simply recommend always using the "OR" language for this license [4].
Even though it has been a goal of the project to be GPLv2 compatible, to be certain, in 2016 I asked for a clarification about what makes copyleft-next GPLv2 compatible and also asked for a summary of benefits. This prompted some minor changes to make compatibility even clearer, and as of copyleft-next 0.3.1 compatibility should be crystal clear [5].
The summary of why copyleft-next 0.3.1 is compatible with GPLv2 is explained as follows:
Like GPLv2, copyleft-next requires distribution of derivative works ("Derived Works" in copyleft-next 0.3.x) to be under the same license. Ordinarily this would make the two licenses incompatible. However, copyleft-next 0.3.1 says: "If the Derived Work includes material licensed under the GPL, You may instead license the Derived Work under the GPL." "GPL" is defined to include GPLv2.
In practice this means copyleft-next code in Linux may be licensed under the GPLv2; however, there are additional obvious gains for bringing contributions from Linux outbound where copyleft-next is preferred. A summary of benefits why projects outside of Linux might prefer to use copyleft-next >= 0.3.1 over GPLv2:

  o It is much shorter and simpler
  o It has an explicit patent license grant, unlike GPLv2
  o Its notice preservation conditions are clearer
  o More free software/open source licenses are compatible with it
    (via section 4)
  o The source code requirement triggered by binary distribution is
    much simpler in a procedural sense
  o Recipients potentially have a contract claim against distributors
    who are noncompliant with the source code requirement
  o There is a built-in inbound=outbound policy for upstream
    contributions (cf. Apache License 2.0 section 5)
  o There are disincentives to engage in the controversial practice of
    copyleft/proprietary dual-licensing
  o In 15 years copyleft expires, which can be advantageous for legacy
    code
  o There are explicit disincentives to bringing patent infringement
    claims accusing the licensed work of infringement (see 10b)
  o There is a cure period for licensees who are not compliant with the
    license (there is no cure opportunity in GPLv2)
  o copyleft-next has a 'built-in or-later' provision
The first driver submission to Linux under this dual strategy was lib/test_sysctl.c through commit 9308f2f9e7f05 ("test_sysctl: add dedicated proc sysctl test driver") merged in July 2017. Shortly after that I also added test_kmod through commit d9c6a72d6fa29 ("kmod: add test driver to stress test the module loader") in the same month. These two drivers went in just a few months before the SPDX license practice kicked in. In 2018 Kuno Woudt went through the process to get SPDX identifiers for copyleft-next [6] [7]. Although there are SPDX tags for copyleft-next-0.3.0, we only document use in Linux starting from copyleft-next-0.3.1 which makes GPLv2 compatibility crystal clear.
This patch will let us update the two Linux selftest drivers in subsequent patches with their respective SPDX license identifiers and let us remove repetitive license boilerplate.
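The recommendation to always pair copyleft-next-0.3.1 with a GPLv2 identifier through "OR" can be checked mechanically. The following is only a minimal userspace sketch: the helper name is_dual_copyleft_next() is hypothetical, it is not part of this patch or of any kernel tooling, and it only accepts the four identifier pairs listed in the usage guide of the license file added by this patch.

```c
#include <stdbool.h>
#include <string.h>

/* The four GPLv2 identifiers listed in the usage guide of this patch. */
static const char *const gpl2_ids[] = {
	"GPL-2.0-or-later", "GPL-2.0-only", "GPL-2.0+", "GPL-2.0",
};

/*
 * Hypothetical helper, for illustration only: returns true if @expr is
 * exactly "<GPLv2 identifier> OR copyleft-next-0.3.1".
 */
static bool is_dual_copyleft_next(const char *expr)
{
	static const char suffix[] = " OR copyleft-next-0.3.1";
	size_t i;

	for (i = 0; i < sizeof(gpl2_ids) / sizeof(gpl2_ids[0]); i++) {
		size_t n = strlen(gpl2_ids[i]);

		/* The GPL identifier must come first, then the exact suffix. */
		if (strncmp(expr, gpl2_ids[i], n) == 0 &&
		    strcmp(expr + n, suffix) == 0)
			return true;
	}
	return false;
}
```

A checker like this rejects a stand-alone "copyleft-next-0.3.1" tag as well as a misspelled identifier such as "copyleft-next 0.3.1" (missing hyphen).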
[0] https://github.com/copyleft-next/copyleft-next/blob/master/Releases/copyleft...
[1] https://lore.kernel.org/lkml/1465929311-13509-1-git-send-email-mcgrof@kernel...
[2] https://lore.kernel.org/lkml/CA+55aFyhxcvD+q7tp+-yrSFDKfR0mOHgyEAe=f_94aKLsO...
[3] https://lore.kernel.org/lkml/20170516232702.GL17314@wotan.suse.de/
[4] https://lkml.kernel.org/r/1495234558.7848.122.camel@linux.intel.com
[5] https://lists.fedorahosted.org/archives/list/copyleft-next@lists.fedorahoste...
[6] https://spdx.org/licenses/copyleft-next-0.3.0.html
[7] https://spdx.org/licenses/copyleft-next-0.3.1.html
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Kuno Woudt <kuno@frob.nl>
Cc: Richard Fontana <fontana@sharpeleven.org>
Cc: copyleft-next@lists.fedorahosted.org
Cc: Ciaran Farrell <Ciaran.Farrell@suse.com>
Cc: Christopher De Nicolo <Christopher.DeNicolo@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 LICENSES/dual/copyleft-next-0.3.1 | 237 ++++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)
 create mode 100644 LICENSES/dual/copyleft-next-0.3.1
diff --git a/LICENSES/dual/copyleft-next-0.3.1 b/LICENSES/dual/copyleft-next-0.3.1
new file mode 100644
index 000000000000..086bcb74b478
--- /dev/null
+++ b/LICENSES/dual/copyleft-next-0.3.1
@@ -0,0 +1,237 @@
+Valid-License-Identifier: copyleft-next-0.3.1
+SPDX-URL: https://spdx.org/licenses/copyleft-next-0.3.1
+Usage-Guide:
+  This license can be used in code, it has been found to be GPLv2 compatible
+  by attorneys at Redhat and SUSE, however to err on the side of caution,
+  it's best to only use it together with a GPL2 compatible license using "OR".
+  To use the copyleft-next-0.3.1 license put the following SPDX tag/value
+  pair into a comment according to the placement guidelines in the
+  licensing rules documentation:
+    SPDX-License-Identifier: GPL-2.0 OR copyleft-next-0.3.1
+    SPDX-License-Identifier: GPL-2.0-only OR copyleft-next-0.3.1
+    SPDX-License-Identifier: GPL-2.0+ OR copyleft-next-0.3.1
+    SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+License-Text:
+
+=======================================================================
+
+                      copyleft-next 0.3.1 ("this License")
+                            Release date: 2016-04-29
+
+1. License Grants; No Trademark License
+
+   Subject to the terms of this License, I grant You:
+
+   a) A non-exclusive, worldwide, perpetual, royalty-free, irrevocable
+      copyright license, to reproduce, Distribute, prepare derivative works
+      of, publicly perform and publicly display My Work.
+
+   b) A non-exclusive, worldwide, perpetual, royalty-free, irrevocable
+      patent license under Licensed Patents to make, have made, use, sell,
+      offer for sale, and import Covered Works.
+
+   This License does not grant any rights in My name, trademarks, service
+   marks, or logos.
+
+2. Distribution: General Conditions
+
+   You may Distribute Covered Works, provided that You (i) inform
+   recipients how they can obtain a copy of this License; (ii) satisfy the
+   applicable conditions of sections 3 through 6; and (iii) preserve all
+   Legal Notices contained in My Work (to the extent they remain
+   pertinent). "Legal Notices" means copyright notices, license notices,
+   license texts, and author attributions, but does not include logos,
+   other graphical images, trademarks or trademark legends.
+
+3. Conditions for Distributing Derived Works; Outbound GPL Compatibility
+
+   If You Distribute a Derived Work, You must license the entire Derived
+   Work as a whole under this License, with prominent notice of such
+   licensing. This condition may not be avoided through such means as
+   separate Distribution of portions of the Derived Work.
+
+   If the Derived Work includes material licensed under the GPL, You may
+   instead license the Derived Work under the GPL.
+
+4. Condition Against Further Restrictions; Inbound License Compatibility
+
+   When Distributing a Covered Work, You may not impose further
+   restrictions on the exercise of rights in the Covered Work granted under
+   this License. This condition is not excused merely because such
+   restrictions result from Your compliance with conditions or obligations
+   extrinsic to this License (such as a court order or an agreement with a
+   third party).
+
+   However, You may Distribute a Covered Work incorporating material
+   governed by a license that is both OSI-Approved and FSF-Free as of the
+   release date of this License, provided that compliance with such
+   other license would not conflict with any conditions stated in other
+   sections of this License.
+
+5. Conditions for Distributing Object Code
+
+   You may Distribute an Object Code form of a Covered Work, provided that
+   you accompany the Object Code with a URL through which the Corresponding
+   Source is made available, at no charge, by some standard or customary
+   means of providing network access to source code.
+
+   If you Distribute the Object Code in a physical product or tangible
+   storage medium ("Product"), the Corresponding Source must be available
+   through such URL for two years from the date of Your most recent
+   Distribution of the Object Code in the Product. However, if the Product
+   itself contains or is accompanied by the Corresponding Source (made
+   available in a customarily accessible manner), You need not also comply
+   with the first paragraph of this section.
+
+   Each direct and indirect recipient of the Covered Work from You is an
+   intended third-party beneficiary of this License solely as to this
+   section 5, with the right to enforce its terms.
+
+6. Symmetrical Licensing Condition for Upstream Contributions
+
+   If You Distribute a work to Me specifically for inclusion in or
+   modification of a Covered Work (a "Patch"), and no explicit licensing
+   terms apply to the Patch, You license the Patch under this License, to
+   the extent of Your copyright in the Patch. This condition does not
+   negate the other conditions of this License, if applicable to the Patch.
+
+7. Nullification of Copyleft/Proprietary Dual Licensing
+
+   If I offer to license, for a fee, a Covered Work under terms other than
+   a license that is OSI-Approved or FSF-Free as of the release date of this
+   License or a numbered version of copyleft-next released by the
+   Copyleft-Next Project, then the license I grant You under section 1 is no
+   longer subject to the conditions in sections 3 through 5.
+
+8. Copyleft Sunset
+
+   The conditions in sections 3 through 5 no longer apply once fifteen
+   years have elapsed from the date of My first Distribution of My Work
+   under this License.
+
+9. Pass-Through
+
+   When You Distribute a Covered Work, the recipient automatically receives
+   a license to My Work from Me, subject to the terms of this License.
+
+10. Termination
+
+   Your license grants under section 1 are automatically terminated if You
+
+   a) fail to comply with the conditions of this License, unless You cure
+      such noncompliance within thirty days after becoming aware of it, or
+
+   b) initiate a patent infringement litigation claim (excluding
+      declaratory judgment actions, counterclaims, and cross-claims)
+      alleging that any part of My Work directly or indirectly infringes
+      any patent.
+
+   Termination of Your license grants extends to all copies of Covered
+   Works You subsequently obtain. Termination does not terminate the
+   rights of those who have received copies or rights from You subject to
+   this License.
+
+   To the extent permission to make copies of a Covered Work is necessary
+   merely for running it, such permission is not terminable.
+
+11. Later License Versions
+
+   The Copyleft-Next Project may release new versions of copyleft-next,
+   designated by a distinguishing version number ("Later Versions").
+   Unless I explicitly remove the option of Distributing Covered Works
+   under Later Versions, You may Distribute Covered Works under any Later
+   Version.
+
+** 12. No Warranty                                                      **
+**                                                                      **
+**     My Work is provided "as-is", without warranty. You bear the risk **
+**     of using it. To the extent permitted by applicable law, each     **
+**     Distributor of My Work excludes the implied warranties of title, **
+**     merchantability, fitness for a particular purpose and            **
+**     non-infringement.                                                **
+
+** 13. Limitation of Liability                                          **
+**                                                                      **
+**     To the extent permitted by applicable law, in no event will any  **
+**     Distributor of My Work be liable to You for any damages          **
+**     whatsoever, whether direct, indirect, special, incidental, or    **
+**     consequential damages, whether arising under contract, tort      **
+**     (including negligence), or otherwise, even where the Distributor **
+**     knew or should have known about the possibility of such damages. **
+
+14. Severability
+
+   The invalidity or unenforceability of any provision of this License
+   does not affect the validity or enforceability of the remainder of
+   this License. Such provision is to be reformed to the minimum extent
+   necessary to make it valid and enforceable.
+
+15. Definitions
+
+   "Copyleft-Next Project" means the project that maintains the source
+   code repository at https://github.com/copyleft-next/copyleft-next.git/
+   as of the release date of this License.
+
+   "Corresponding Source" of a Covered Work in Object Code form means (i)
+   the Source Code form of the Covered Work; (ii) all scripts,
+   instructions and similar information that are reasonably necessary for
+   a skilled developer to generate such Object Code from the Source Code
+   provided under (i); and (iii) a list clearly identifying all Separate
+   Works (other than those provided in compliance with (ii)) that were
+   specifically used in building and (if applicable) installing the
+   Covered Work (for example, a specified proprietary compiler including
+   its version number). Corresponding Source must be machine-readable.
+
+   "Covered Work" means My Work or a Derived Work.
+
+   "Derived Work" means a work of authorship that copies from, modifies,
+   adapts, is based on, is a derivative work of, transforms, translates or
+   contains all or part of My Work, such that copyright permission is
+   required. The following are not Derived Works: (i) Mere Aggregation;
+   (ii) a mere reproduction of My Work; and (iii) if My Work fails to
+   explicitly state an expectation otherwise, a work that merely makes
+   reference to My Work.
+
+   "Distribute" means to distribute, transfer or make a copy available to
+   someone else, such that copyright permission is required.
+
+   "Distributor" means Me and anyone else who Distributes a Covered Work.
+
+   "FSF-Free" means classified as 'free' by the Free Software Foundation.
+
+   "GPL" means a version of the GNU General Public License or the GNU
+   Affero General Public License.
+
+   "I"/"Me"/"My" refers to the individual or legal entity that places My
+   Work under this License. "You"/"Your" refers to the individual or legal
+   entity exercising rights in My Work under this License. A legal entity
+   includes each entity that controls, is controlled by, or is under
+   common control with such legal entity. "Control" means (a) the power to
+   direct the actions of such legal entity, whether by contract or
+   otherwise, or (b) ownership of more than fifty percent of the
+   outstanding shares or beneficial ownership of such legal entity.
+
+   "Licensed Patents" means all patent claims licensable royalty-free by
+   Me, now or in the future, that are necessarily infringed by making,
+   using, or selling My Work, and excludes claims that would be infringed
+   only as a consequence of further modification of My Work.
+
+   "Mere Aggregation" means an aggregation of a Covered Work with a
+   Separate Work.
+
+   "My Work" means the particular work of authorship I license to You
+   under this License.
+
+   "Object Code" means any form of a work that is not Source Code.
+
+   "OSI-Approved" means approved as 'Open Source' by the Open Source
+   Initiative.
+
+   "Separate Work" means a work that is separate from and independent of a
+   particular Covered Work and is not by its nature an extension or
+   enhancement of the Covered Work, and/or a runtime library, standard
+   library or similar component that is used to generate an Object Code
+   form of a Covered Work.
+
+   "Source Code" means the preferred form of a work for making
+   modifications to it.
Two selftest drivers exist under the copyleft-next license. These drivers were added prior to SPDX practice taking full swing in the kernel. Now that we have an SPDX tag for copyleft-next-0.3.1 documented, embrace it and remove the boilerplate.
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Kuno Woudt <kuno@frob.nl>
Cc: Richard Fontana <fontana@sharpeleven.org>
Cc: copyleft-next@lists.fedorahosted.org
Cc: Ciaran Farrell <Ciaran.Farrell@suse.com>
Cc: Christopher De Nicolo <Christopher.DeNicolo@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 lib/test_kmod.c                          | 12 +-----------
 lib/test_sysctl.c                        | 12 +-----------
 tools/testing/selftests/kmod/kmod.sh     | 13 +------------
 tools/testing/selftests/sysctl/sysctl.sh | 12 +-----------
 4 files changed, 4 insertions(+), 45 deletions(-)
diff --git a/lib/test_kmod.c b/lib/test_kmod.c
index ce1589391413..d62afd89dc63 100644
--- a/lib/test_kmod.c
+++ b/lib/test_kmod.c
@@ -1,18 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 /*
  * kmod stress test driver
  *
  * Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or at your option any
- * later version; or, when distributed separately from the Linux kernel or
- * when incorporated into other software packages, subject to the following
- * license:
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of copyleft-next (version 0.3.1 or later) as published
- * at http://copyleft-next.org/.
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
diff --git a/lib/test_sysctl.c b/lib/test_sysctl.c
index 3750323973f4..9e5bd10a930a 100644
--- a/lib/test_sysctl.c
+++ b/lib/test_sysctl.c
@@ -1,18 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 /*
  * proc sysctl test driver
  *
  * Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or at your option any
- * later version; or, when distributed separately from the Linux kernel or
- * when incorporated into other software packages, subject to the following
- * license:
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of copyleft-next (version 0.3.1 or later) as published
- * at http://copyleft-next.org/.
  */

 /*
diff --git a/tools/testing/selftests/kmod/kmod.sh b/tools/testing/selftests/kmod/kmod.sh
index afd42387e8b2..7189715d7960 100755
--- a/tools/testing/selftests/kmod/kmod.sh
+++ b/tools/testing/selftests/kmod/kmod.sh
@@ -1,18 +1,7 @@
 #!/bin/bash
-#
+# SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 # Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
 #
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of the GNU General Public License as published by the Free
-# Software Foundation; either version 2 of the License, or at your option any
-# later version; or, when distributed separately from the Linux kernel or
-# when incorporated into other software packages, subject to the following
-# license:
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of copyleft-next (version 0.3.1 or later) as published
-# at http://copyleft-next.org/.
-
 # This is a stress test script for kmod, the kernel module loader. It uses
 # test_kmod which exposes a series of knobs for the API for us so we can
 # tweak each test in userspace rather than in kernelspace.
diff --git a/tools/testing/selftests/sysctl/sysctl.sh b/tools/testing/selftests/sysctl/sysctl.sh
index 19515dcb7d04..2046c603a4d4 100755
--- a/tools/testing/selftests/sysctl/sysctl.sh
+++ b/tools/testing/selftests/sysctl/sysctl.sh
@@ -1,16 +1,6 @@
 #!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 # Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of the GNU General Public License as published by the Free
-# Software Foundation; either version 2 of the License, or at your option any
-# later version; or, when distributed separately from the Linux kernel or
-# when incorporated into other software packages, subject to the following
-# license:
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of copyleft-next (version 0.3.1 or later) as published
-# at http://copyleft-next.org/.

 # This performs a series tests against the proc sysctl interface.
On Mon, Sep 27, 2021 at 09:37:55AM -0700, Luis Chamberlain wrote:
> Two selftests drivers exist under the copyleft-next license. These drivers were added prior to SPDX practice taking full swing in the kernel. Now that we have an SPDX tag for copylef-next-0.3.1 documented, embrace it and remove the boiler plate.
>
> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Cc: Kuno Woudt <kuno@frob.nl>
> Cc: Richard Fontana <fontana@sharpeleven.org>
> Cc: copyleft-next@lists.fedorahosted.org
> Cc: Ciaran Farrell <Ciaran.Farrell@suse.com>
> Cc: Christopher De Nicolo <Christopher.DeNicolo@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Thorsten Leemhuis <linux@leemhuis.info>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
You're the primary author, and it cleans up boilerplate, so LGTM.
Reviewed-by: Kees Cook <keescook@chromium.org>
This adds a new selftest module which can be used to test sysfs, which would otherwise require using an existing driver. This lets us muck with a template driver to test breaking things without affecting system behaviour or requiring the dependencies of a real device driver.
A series of 28 tests is added. Two device types are supported:

  * misc
  * block
Contrary to sysctls, sysfs requires a full write to happen at once, and so we reduce the digit tests to single writes. Two main sysfs knobs are provided for testing reading/storing: one which doesn't incur any delays and another which can incur programmed delays. Which locks are held, if any, is configurable at module load time or through dynamic configuration at run time.
Since sysfs is technically a filesystem, but a pseudo one, which requires a kernel user, our test_sysfs module and respective test script embrace the fstests format for tests in the kernel ring buffer. Likewise, a scraper for kernel crashes is provided which matches what fstests does as well.
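To illustrate the single-write requirement from userspace, here is a minimal sketch of storing a value with exactly one write() call. The helper name write_attr_once() is hypothetical and the attribute path is whatever the caller passes in; neither is taken from the test_sysfs driver or sysfs.sh.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: store @val into the
 * attribute at @path using exactly one write() call, so the whole value
 * is handed over at once. Returns 0 on success and -1 on an open error,
 * a write error or a short write.
 */
static int write_attr_once(const char *path, const char *val)
{
	size_t len = strlen(val);
	ssize_t ret;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	/* One write for the full buffer; no retry loop on purpose. */
	ret = write(fd, val, len);
	close(fd);

	return (ret == (ssize_t)len) ? 0 : -1;
}
```

A short write is treated as a failure rather than retried, since splitting a value across several writes is exactly what the text above says sysfs does not support.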
Two tests are kept disabled as they currently cause a deadlock, and so this provides a mechanism to easily show proof and demo how the deadlock can happen:
Demos the deadlock with a device specific lock:

  ./tools/testing/selftests/sysfs/sysfs.sh -t 0027

Demos the deadlock with rtnl_lock():

  ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
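For readers who want the circular wait spelled out, below is a toy userspace state model. It is not kernel code and it is not how the tests are implemented; all names are made up. It assumes my understanding of the race: module removal takes a lock and then waits for in-flight sysfs operations on the attribute to finish, while a store operation that is already in flight waits for that same lock.

```c
#include <stdbool.h>

/* Toy state; none of these names exist in the kernel. */
struct toy_dev {
	bool lock_held;		/* driver lock (device mutex or rtnl) held by removal */
	int active_refs;	/* sysfs operations currently in flight */
};

/* An in-flight store needs the driver lock before it can finish. */
static bool store_can_proceed(const struct toy_dev *d)
{
	return !d->lock_held;
}

/* Removal, holding the lock, must wait for in-flight operations to drain. */
static bool removal_can_proceed(const struct toy_dev *d)
{
	return d->active_refs == 0;
}

/* Deadlock: an operation is in flight and neither side can make progress. */
static bool is_deadlocked(const struct toy_dev *d)
{
	return d->active_refs > 0 &&
	       !store_can_proceed(d) && !removal_can_proceed(d);
}
```

With lock_held set and active_refs at 1, neither side can move, which is the situation the two disabled tests are meant to reproduce on a real kernel.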
Two separate solutions to the deadlock issue have been proposed, and so now it's a matter of either documenting this limitation or eventually adopting a generic fix.
This selftest will shortly be expanded upon with more tests which require further kernel changes in order to provide better test coverage.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 MAINTAINERS                            |    7 +
 lib/Kconfig.debug                      |   12 +
 lib/Makefile                           |    1 +
 lib/test_sysfs.c                       |  921 ++++++++++++++++++
 tools/testing/selftests/sysfs/Makefile |   12 +
 tools/testing/selftests/sysfs/config   |    2 +
 tools/testing/selftests/sysfs/sysfs.sh | 1208 ++++++++++++++++++++++++
 7 files changed, 2163 insertions(+)
 create mode 100644 lib/test_sysfs.c
 create mode 100644 tools/testing/selftests/sysfs/Makefile
 create mode 100644 tools/testing/selftests/sysfs/config
 create mode 100755 tools/testing/selftests/sysfs/sysfs.sh
diff --git a/MAINTAINERS b/MAINTAINERS index 0f28fb4b4e5c..1b4cefcb064c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18216,6 +18216,13 @@ L: linux-mmc@vger.kernel.org S: Maintained F: drivers/mmc/host/sdhci-pci-dwc-mshc.c
+SYSFS TEST DRIVER +M: Luis Chamberlain mcgrof@kernel.org +L: linux-kernel@vger.kernel.org +S: Maintained +F: lib/test_sysfs.c +F: tools/testing/selftests/sysfs/ + SYSTEM CONFIGURATION (SYSCON) M: Lee Jones lee.jones@linaro.org M: Arnd Bergmann arnd@arndb.de diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index dbff322207ad..ae19bf1a21b8 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2343,6 +2343,18 @@ config TEST_SYSCTL
If unsure, say N.
+config TEST_SYSFS + tristate "sysfs test driver" + depends on SYSFS + depends on NET + depends on BLOCK + help + This builds the "test_sysfs" module. This driver enables to test the + sysfs file system safely without affecting production knobs which + might alter system functionality. + + If unsure, say N. + config BITFIELD_KUNIT tristate "KUnit test bitfield functions at runtime" depends on KUNIT diff --git a/lib/Makefile b/lib/Makefile index 2cfd33917ad5..5143d65f90d6 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -61,6 +61,7 @@ obj-$(CONFIG_TEST_FIRMWARE) += test_firmware.o obj-$(CONFIG_TEST_BITOPS) += test_bitops.o CFLAGS_test_bitops.o += -Werror obj-$(CONFIG_TEST_SYSCTL) += test_sysctl.o +obj-$(CONFIG_TEST_SYSFS) += test_sysfs.o obj-$(CONFIG_TEST_HASH) += test_hash.o test_siphash.o obj-$(CONFIG_TEST_IDA) += test_ida.o obj-$(CONFIG_KASAN_KUNIT_TEST) += test_kasan.o diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c new file mode 100644 index 000000000000..2043ca494af8 --- /dev/null +++ b/lib/test_sysfs.c @@ -0,0 +1,921 @@ +// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1 +/* + * sysfs test driver + * + * Copyright (C) 2021 Luis Chamberlain mcgrof@kernel.org + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or at your option any + * later version; or, when distributed separately from the Linux kernel or + * when incorporated into other software packages, subject to the following + * license: + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of copyleft-next (version 0.3.1 or later) as published + * at http://copyleft-next.org/. + */ + +/* + * This module allows us to add race conditions which we can test for + * against the sysfs filesystem. 
+ */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/init.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/printk.h> +#include <linux/fs.h> +#include <linux/miscdevice.h> +#include <linux/slab.h> +#include <linux/uaccess.h> +#include <linux/async.h> +#include <linux/delay.h> +#include <linux/vmalloc.h> +#include <linux/debugfs.h> +#include <linux/rtnetlink.h> +#include <linux/genhd.h> +#include <linux/blkdev.h> + +static bool enable_lock; +module_param(enable_lock, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_lock, + "enable locking on reads / stores from the start"); + +static bool enable_lock_on_rmmod; +module_param(enable_lock_on_rmmod, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_lock_on_rmmod, + "enable locking on rmmod"); + +static bool use_rtnl_lock; +module_param(use_rtnl_lock, bool_enable_only, 0644); +MODULE_PARM_DESC(use_rtnl_lock, + "use an rtnl_lock instead of the device mutex_lock"); + +static unsigned int write_delay_msec_y = 500; +module_param_named(write_delay_msec_y, write_delay_msec_y, uint, 0644); +MODULE_PARM_DESC(write_delay_msec_y, "msec write delay for writes to y"); + +static unsigned int test_devtype; +module_param_named(devtype, test_devtype, uint, 0644); +MODULE_PARM_DESC(devtype, "device type to register"); + +static bool enable_busy_alloc; +module_param(enable_busy_alloc, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_busy_alloc, "do a fake allocation during writes"); + +static bool enable_debugfs; +module_param(enable_debugfs, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_debugfs, "enable a few debugfs files"); + +static bool enable_verbose_writes; +module_param(enable_verbose_writes, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_verbose_writes, "enable stores to print verbose information"); + +static unsigned int delay_rmmod_ms; +module_param_named(delay_rmmod_ms, delay_rmmod_ms, uint, 0644); +MODULE_PARM_DESC(delay_rmmod_ms, "if set how many ms to delay rmmod before device
deletion"); + +static bool enable_verbose_rmmod; +module_param(enable_verbose_rmmod, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod"); + +static int sysfs_test_major; + +/** + * test_config - used for configuring how the sysfs test device will behave + * + * @enable_lock: if enabled a lock will be used when reading/storing variables + * @enable_lock_on_rmmod: if enabled a lock will be used when reading/storing + * sysfs attributes, but it will also be used to lock on rmmod. This is + * useful to test for a deadlock. + * @use_rtnl_lock: if enabled instead of configuration specific mutex, we'll + * use the rtnl_lock. If your test case is modifying this on the fly + * while doing other stores / reads, things will break as a lock can be + * left contending. Best is that tests use this knob serially, without + * allowing userspace to modify other knobs while this one changes. + * @write_delay_msec_y: the amount of delay to use when writing to y + * @enable_busy_alloc: if enabled we'll do a large allocation between + * writes. We free it right away. We also schedule to give the + * kernel some time to re-use any memory we don't need. This is intended + * to mimic typical driver behaviour. + */ +struct test_config { + bool enable_lock; + bool enable_lock_on_rmmod; + bool use_rtnl_lock; + unsigned int write_delay_msec_y; + bool enable_busy_alloc; +}; + +/** + * enum sysfs_test_devtype - sysfs device type + * @TESTDEV_TYPE_MISC: misc device type + * @TESTDEV_TYPE_BLOCK: use a block device for the sysfs test device.
+ */ +enum sysfs_test_devtype { + TESTDEV_TYPE_MISC = 0, + TESTDEV_TYPE_BLOCK, +}; + +/** + * sysfs_test_device - test device to help test sysfs + * + * @devtype: the type of device to use + * @config: configuration for the test + * @config_mutex: protects configuration of test + * @misc_dev: we use a misc device under the hood + * @disk: represents a disk when used as a block device + * @dev: pointer to misc_dev's own struct device + * @dev_idx: unique ID for test device + * @x: variable we can use to test read / store + * @y: slow variable we can use to test read / store + */ +struct sysfs_test_device { + enum sysfs_test_devtype devtype; + struct test_config config; + struct mutex config_mutex; + struct miscdevice misc_dev; + struct gendisk *disk; + struct device *dev; + int dev_idx; + int x; + int y; +}; + +static struct sysfs_test_device *first_test_dev; + +static struct miscdevice *dev_to_misc_dev(struct device *dev) +{ + return dev_get_drvdata(dev); +} + +static struct sysfs_test_device *misc_dev_to_test_dev(struct miscdevice *misc_dev) +{ + return container_of(misc_dev, struct sysfs_test_device, misc_dev); +} + +static struct sysfs_test_device *devblock_to_test_dev(struct device *dev) +{ + return (struct sysfs_test_device *)dev_to_disk(dev)->private_data; +} + +static struct sysfs_test_device *devmisc_to_testdev(struct device *dev) +{ + struct miscdevice *misc_dev; + + misc_dev = dev_to_misc_dev(dev); + return misc_dev_to_test_dev(misc_dev); +} + +static struct sysfs_test_device *dev_to_test_dev(struct device *dev) +{ + if (test_devtype == TESTDEV_TYPE_MISC) + return devmisc_to_testdev(dev); + else if (test_devtype == TESTDEV_TYPE_BLOCK) + return devblock_to_test_dev(dev); + return NULL; +} + +static void test_dev_config_lock(struct sysfs_test_device *test_dev) +{ + struct test_config *config = &test_dev->config; + + if (config->enable_lock) { + if (config->use_rtnl_lock) + rtnl_lock(); + else + mutex_lock(&test_dev->config_mutex); + } +} + +static void 
test_dev_config_unlock(struct sysfs_test_device *test_dev) +{ + struct test_config *config = &test_dev->config; + + if (config->enable_lock) { + if (config->use_rtnl_lock) + rtnl_unlock(); + else + mutex_unlock(&test_dev->config_mutex); + } +} + +static void test_dev_config_lock_rmmod(struct sysfs_test_device *test_dev) +{ + struct test_config *config = &test_dev->config; + + if (config->enable_lock_on_rmmod) + test_dev_config_lock(test_dev); +} + +static void test_dev_config_unlock_rmmod(struct sysfs_test_device *test_dev) +{ + struct test_config *config = &test_dev->config; + + if (config->enable_lock_on_rmmod) + test_dev_config_unlock(test_dev); +} + +static void free_test_dev_sysfs(struct sysfs_test_device *test_dev) +{ + if (test_dev) { + kfree_const(test_dev->misc_dev.name); + test_dev->misc_dev.name = NULL; + kfree(test_dev); + test_dev = NULL; + } +} + +static void test_sysfs_reset_vals(struct sysfs_test_device *test_dev) +{ + test_dev->x = 3; + test_dev->y = 4; +} + +static ssize_t config_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + int len = 0; + + test_dev_config_lock(test_dev); + + len += snprintf(buf, PAGE_SIZE, + "Configuration for: %s\n", + dev_name(dev)); + + len += snprintf(buf+len, PAGE_SIZE - len, + "x:\t%d\n", + test_dev->x); + + len += snprintf(buf+len, PAGE_SIZE - len, + "y:\t%d\n", + test_dev->y); + + len += snprintf(buf+len, PAGE_SIZE - len, + "enable_lock:\t%s\n", + config->enable_lock ? "true" : "false"); + + len += snprintf(buf+len, PAGE_SIZE - len, + "enable_lock_on_rmmod:\t%s\n", + config->enable_lock_on_rmmod ? "true" : "false"); + + len += snprintf(buf+len, PAGE_SIZE - len, + "use_rtnl_lock:\t%s\n", + config->use_rtnl_lock ?
"true" : "false"); + + len += snprintf(buf+len, PAGE_SIZE - len, + "write_delay_msec_y:\t%d\n", + config->write_delay_msec_y); + + len += snprintf(buf+len, PAGE_SIZE - len, + "enable_busy_alloc:\t%s\n", + config->enable_busy_alloc ? "true" : "false"); + + len += snprintf(buf+len, PAGE_SIZE - len, + "enable_debugfs:\t%s\n", + enable_debugfs ? "true" : "false"); + + len += snprintf(buf+len, PAGE_SIZE - len, + "enable_verbose_writes:\t%s\n", + enable_verbose_writes ? "true" : "false"); + + test_dev_config_unlock(test_dev); + + return len; +} +static DEVICE_ATTR_RO(config); + +static ssize_t reset_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + + /* + * We compromise and simplify this condition and do not use a lock + * here as the lock type can change. + */ + config->enable_lock = false; + config->enable_lock_on_rmmod = false; + config->use_rtnl_lock = false; + config->enable_busy_alloc = false; + test_sysfs_reset_vals(test_dev); + + dev_info(dev, "reset\n"); + + return count; +} +static DEVICE_ATTR_WO(reset); + +static void test_dev_busy_alloc(struct sysfs_test_device *test_dev) +{ + struct test_config *config = &test_dev->config; + char *ignore; + + if (!config->enable_busy_alloc) + return; + + ignore = kzalloc(sizeof(struct sysfs_test_device) * 10, GFP_KERNEL); + kfree(ignore); + + schedule(); +} + +static ssize_t test_dev_x_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + int ret; + + test_dev_busy_alloc(test_dev); + test_dev_config_lock(test_dev); + + ret = kstrtoint(buf, 10, &test_dev->x); + if (ret) + count = ret; + + if (enable_verbose_writes) + dev_info(test_dev->dev, "wrote x = %d\n", test_dev->x); + + test_dev_config_unlock(test_dev); + + return count; +} + +static ssize_t 
test_dev_x_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + int ret; + + test_dev_config_lock(test_dev); + ret = snprintf(buf, PAGE_SIZE, "%d\n", test_dev->x); + test_dev_config_unlock(test_dev); + + return ret; +} +static DEVICE_ATTR_RW(test_dev_x); + +static ssize_t test_dev_y_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + int y; + int ret; + + test_dev_busy_alloc(test_dev); + test_dev_config_lock(test_dev); + + config = &test_dev->config; + + ret = kstrtoint(buf, 10, &y); + if (ret) + count = ret; + + msleep(config->write_delay_msec_y); + test_dev->y = test_dev->x + y + 7; + + if (enable_verbose_writes) + dev_info(test_dev->dev, "wrote y = %d\n", test_dev->y); + + test_dev_config_unlock(test_dev); + + return count; +} + +static ssize_t test_dev_y_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + int ret; + + test_dev_config_lock(test_dev); + ret = snprintf(buf, PAGE_SIZE, "%d\n", test_dev->y); + test_dev_config_unlock(test_dev); + + return ret; +} +static DEVICE_ATTR_RW(test_dev_y); + +static ssize_t config_enable_lock_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + int ret; + int val; + + ret = kstrtoint(buf, 10, &val); + if (ret) + return ret; + + /* + * We compromise for simplicity and do not lock when changing + * locking configuration, with the assumption userspace tests + * will know this.
+ */ + if (val) + config->enable_lock = true; + else + config->enable_lock = false; + + return count; +} + +static ssize_t config_enable_lock_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + ssize_t ret; + + test_dev_config_lock(test_dev); + ret = snprintf(buf, PAGE_SIZE, "%d\n", config->enable_lock); + test_dev_config_unlock(test_dev); + + return ret; +} +static DEVICE_ATTR_RW(config_enable_lock); + +static ssize_t config_enable_lock_on_rmmod_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + int ret; + int val; + + ret = kstrtoint(buf, 10, &val); + if (ret) + return ret; + + test_dev_config_lock(test_dev); + if (val) + config->enable_lock_on_rmmod = true; + else + config->enable_lock_on_rmmod = false; + test_dev_config_unlock(test_dev); + + return count; +} + +static ssize_t config_enable_lock_on_rmmod_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + ssize_t ret; + + test_dev_config_lock(test_dev); + ret = snprintf(buf, PAGE_SIZE, "%d\n", config->enable_lock_on_rmmod); + test_dev_config_unlock(test_dev); + + return ret; +} +static DEVICE_ATTR_RW(config_enable_lock_on_rmmod); + +static ssize_t config_use_rtnl_lock_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + int ret; + int val; + + ret = kstrtoint(buf, 10, &val); + if (ret) + return ret; + + /* + * We compromise and simplify this condition and do not use a lock + * here as the lock type can change. 
+ */ + if (val) + config->use_rtnl_lock = true; + else + config->use_rtnl_lock = false; + + return count; +} + +static ssize_t config_use_rtnl_lock_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + + return snprintf(buf, PAGE_SIZE, "%d\n", config->use_rtnl_lock); +} +static DEVICE_ATTR_RW(config_use_rtnl_lock); + +static ssize_t config_write_delay_msec_y_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + int ret; + int val; + + ret = kstrtoint(buf, 10, &val); + if (ret) + return ret; + + test_dev_config_lock(test_dev); + config->write_delay_msec_y = val; + test_dev_config_unlock(test_dev); + + return count; +} + +static ssize_t config_write_delay_msec_y_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + + return snprintf(buf, PAGE_SIZE, "%d\n", config->write_delay_msec_y); +} +static DEVICE_ATTR_RW(config_write_delay_msec_y); + +static ssize_t config_enable_busy_alloc_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + int ret; + int val; + + ret = kstrtoint(buf, 10, &val); + if (ret) + return ret; + + test_dev_config_lock(test_dev); + config->enable_busy_alloc = val; + test_dev_config_unlock(test_dev); + + return count; +} + +static ssize_t config_enable_busy_alloc_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + + return snprintf(buf, PAGE_SIZE, "%d\n", 
config->enable_busy_alloc); +} +static DEVICE_ATTR_RW(config_enable_busy_alloc); + +#define TEST_SYSFS_DEV_ATTR(name) (&dev_attr_##name.attr) + +static struct attribute *test_dev_attrs[] = { + /* Generic driver knobs go here */ + TEST_SYSFS_DEV_ATTR(config), + TEST_SYSFS_DEV_ATTR(reset), + + /* These are used to test sysfs */ + TEST_SYSFS_DEV_ATTR(test_dev_x), + TEST_SYSFS_DEV_ATTR(test_dev_y), + + /* + * These are configuration knobs to modify how we test sysfs when + * doing reads / stores. + */ + TEST_SYSFS_DEV_ATTR(config_enable_lock), + TEST_SYSFS_DEV_ATTR(config_enable_lock_on_rmmod), + TEST_SYSFS_DEV_ATTR(config_use_rtnl_lock), + TEST_SYSFS_DEV_ATTR(config_write_delay_msec_y), + TEST_SYSFS_DEV_ATTR(config_enable_busy_alloc), + + NULL, +}; + +ATTRIBUTE_GROUPS(test_dev); + +static int sysfs_test_dev_alloc_miscdev(struct sysfs_test_device *test_dev) +{ + struct miscdevice *misc_dev; + + misc_dev = &test_dev->misc_dev; + misc_dev->minor = MISC_DYNAMIC_MINOR; + misc_dev->name = kasprintf(GFP_KERNEL, "test_sysfs%d", test_dev->dev_idx); + if (!misc_dev->name) { + pr_err("Cannot alloc misc_dev->name\n"); + return -ENOMEM; + } + misc_dev->groups = test_dev_groups; + + return 0; +} + +static int testdev_open(struct block_device *bdev, fmode_t mode) +{ + return -EINVAL; +} + +static blk_qc_t testdev_submit_bio(struct bio *bio) +{ + return BLK_QC_T_NONE; +} + +static void testdev_slot_free_notify(struct block_device *bdev, + unsigned long index) +{ +} + +static int testdev_rw_page(struct block_device *bdev, sector_t sector, + struct page *page, unsigned int op) +{ + return -EOPNOTSUPP; +} + +static const struct block_device_operations sysfs_testdev_ops = { + .open = testdev_open, + .submit_bio = testdev_submit_bio, + .swap_slot_free_notify = testdev_slot_free_notify, + .rw_page = testdev_rw_page, + .owner = THIS_MODULE +}; + +static int sysfs_test_dev_alloc_blockdev(struct sysfs_test_device *test_dev) +{ + int ret = -ENOMEM; + + test_dev->disk = 
blk_alloc_disk(NUMA_NO_NODE); + if (!test_dev->disk) { + pr_err("Error allocating disk structure for device %d\n", + test_dev->dev_idx); + goto out; + } + + test_dev->disk->major = sysfs_test_major; + test_dev->disk->first_minor = test_dev->dev_idx + 1; + test_dev->disk->fops = &sysfs_testdev_ops; + test_dev->disk->private_data = test_dev; + snprintf(test_dev->disk->disk_name, 16, "test_sysfs%d", + test_dev->dev_idx); + set_capacity(test_dev->disk, 0); + blk_queue_flag_set(QUEUE_FLAG_NONROT, test_dev->disk->queue); + blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, test_dev->disk->queue); + blk_queue_physical_block_size(test_dev->disk->queue, PAGE_SIZE); + blk_queue_max_discard_sectors(test_dev->disk->queue, UINT_MAX); + blk_queue_flag_set(QUEUE_FLAG_DISCARD, test_dev->disk->queue); + + return 0; +out: + return ret; +} + +static struct sysfs_test_device *alloc_test_dev_sysfs(int idx) +{ + struct sysfs_test_device *test_dev; + int ret; + + switch (test_devtype) { + case TESTDEV_TYPE_MISC: + fallthrough; + case TESTDEV_TYPE_BLOCK: + break; + default: + return NULL; + } + + test_dev = kzalloc(sizeof(struct sysfs_test_device), GFP_KERNEL); + if (!test_dev) + goto err_out; + + mutex_init(&test_dev->config_mutex); + test_dev->dev_idx = idx; + test_dev->devtype = test_devtype; + + if (test_dev->devtype == TESTDEV_TYPE_MISC) { + ret = sysfs_test_dev_alloc_miscdev(test_dev); + if (ret) + goto err_out_free; + } else if (test_dev->devtype == TESTDEV_TYPE_BLOCK) { + ret = sysfs_test_dev_alloc_blockdev(test_dev); + if (ret) + goto err_out_free; + } + return test_dev; + +err_out_free: + kfree(test_dev); + test_dev = NULL; +err_out: + return NULL; +} + +static int register_test_dev_sysfs_misc(struct sysfs_test_device *test_dev) +{ + int ret; + + ret = misc_register(&test_dev->misc_dev); + if (ret) + return ret; + + test_dev->dev = test_dev->misc_dev.this_device; + + return 0; +} + +static int register_test_dev_sysfs_block(struct sysfs_test_device *test_dev) +{ + device_add_disk(NULL, 
test_dev->disk, test_dev_groups); + test_dev->dev = disk_to_dev(test_dev->disk); + + return 0; +} + +static struct sysfs_test_device *register_test_dev_sysfs(void) +{ + struct sysfs_test_device *test_dev = NULL; + int ret; + + test_dev = alloc_test_dev_sysfs(0); + if (!test_dev) + goto out; + + if (test_dev->devtype == TESTDEV_TYPE_MISC) { + ret = register_test_dev_sysfs_misc(test_dev); + if (ret) { + pr_err("could not register misc device: %d\n", ret); + goto out_free_dev; + } + } else if (test_dev->devtype == TESTDEV_TYPE_BLOCK) { + ret = register_test_dev_sysfs_block(test_dev); + if (ret) { + pr_err("could not register block device: %d\n", ret); + goto out_free_dev; + } + } + + dev_info(test_dev->dev, "interface ready\n"); + +out: + return test_dev; +out_free_dev: + free_test_dev_sysfs(test_dev); + return NULL; +} + +static struct sysfs_test_device *register_test_dev_set_config(void) +{ + struct sysfs_test_device *test_dev; + struct test_config *config; + + test_dev = register_test_dev_sysfs(); + if (!test_dev) + return NULL; + + config = &test_dev->config; + + if (enable_lock) + config->enable_lock = true; + if (enable_lock_on_rmmod) + config->enable_lock_on_rmmod = true; + if (use_rtnl_lock) + config->use_rtnl_lock = true; + if (enable_busy_alloc) + config->enable_busy_alloc = true; + + config->write_delay_msec_y = write_delay_msec_y; + test_sysfs_reset_vals(test_dev); + + return test_dev; +} + +static void unregister_test_dev_sysfs_misc(struct sysfs_test_device *test_dev) +{ + misc_deregister(&test_dev->misc_dev); +} + +static void unregister_test_dev_sysfs_block(struct sysfs_test_device *test_dev) +{ + del_gendisk(test_dev->disk); + blk_cleanup_disk(test_dev->disk); +} + +static void unregister_test_dev_sysfs(struct sysfs_test_device *test_dev) +{ + test_dev_config_lock_rmmod(test_dev); + + dev_info(test_dev->dev, "removing interface\n"); + + if (test_dev->devtype == TESTDEV_TYPE_MISC) + unregister_test_dev_sysfs_misc(test_dev); + else if (test_dev->devtype 
== TESTDEV_TYPE_BLOCK) + unregister_test_dev_sysfs_block(test_dev); + + test_dev_config_unlock_rmmod(test_dev); + + free_test_dev_sysfs(test_dev); +} + +static struct dentry *debugfs_dir; + +/* When read represents how many times we have reset the first_test_dev */ +static u8 reset_first_test_dev; + +static ssize_t read_reset_first_test_dev(struct file *file, + char __user *user_buf, + size_t count, loff_t *ppos) +{ + ssize_t len; + char buf[32]; + + reset_first_test_dev++; + len = sprintf(buf, "%d\n", reset_first_test_dev); + return simple_read_from_buffer(user_buf, count, ppos, buf, len); +} + +static ssize_t write_reset_first_test_dev(struct file *file, + const char __user *user_buf, + size_t count, loff_t *ppos) +{ + if (!try_module_get(THIS_MODULE)) + return -ENODEV; + + if (!first_test_dev) { + module_put(THIS_MODULE); + return -ENODEV; + } + + dev_info(first_test_dev->dev, "going to reset first interface ...\n"); + + unregister_test_dev_sysfs(first_test_dev); + first_test_dev = register_test_dev_set_config(); + + dev_info(first_test_dev->dev, "first interface reset complete\n"); + + module_put(THIS_MODULE); + + return count; +} + +static const struct file_operations fops_reset_first_test_dev = { + .read = read_reset_first_test_dev, + .write = write_reset_first_test_dev, + .open = simple_open, + .owner = THIS_MODULE, + .llseek = default_llseek, +}; + +static int __init test_sysfs_init(void) +{ + first_test_dev = register_test_dev_set_config(); + if (!first_test_dev) + return -ENOMEM; + + if (!enable_debugfs) + return 0; + + debugfs_dir = debugfs_create_dir("test_sysfs", NULL); + if (!debugfs_dir) { + unregister_test_dev_sysfs(first_test_dev); + return -ENOMEM; + } + + debugfs_create_file("reset_first_test_dev", 0600, debugfs_dir, + NULL, &fops_reset_first_test_dev); + return 0; +} +module_init(test_sysfs_init); + +static void __exit test_sysfs_exit(void) +{ + if (enable_debugfs) + debugfs_remove(debugfs_dir); + if (delay_rmmod_ms) + msleep(delay_rmmod_ms); + 
unregister_test_dev_sysfs(first_test_dev); + if (enable_verbose_rmmod) + pr_info("unregister_test_dev_sysfs() completed\n"); + first_test_dev = NULL; +} +module_exit(test_sysfs_exit); + +MODULE_AUTHOR("Luis Chamberlain <mcgrof@kernel.org>"); +MODULE_LICENSE("GPL"); diff --git a/tools/testing/selftests/sysfs/Makefile b/tools/testing/selftests/sysfs/Makefile new file mode 100644 index 000000000000..fde99caa2338 --- /dev/null +++ b/tools/testing/selftests/sysfs/Makefile @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0-only +# Makefile for sysfs selftests. + +# No binaries, but make sure arg-less "make" doesn't trigger "run_tests". +all: + +TEST_PROGS := sysfs.sh + +include ../lib.mk + +# Nothing to clean up. +clean: diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config new file mode 100644 index 000000000000..9196f452ecd5 --- /dev/null +++ b/tools/testing/selftests/sysfs/config @@ -0,0 +1,2 @@ +CONFIG_SYSFS=y +CONFIG_TEST_SYSFS=m diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh new file mode 100755 index 000000000000..b3f4c2236c7f --- /dev/null +++ b/tools/testing/selftests/sysfs/sysfs.sh @@ -0,0 +1,1208 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0-or-later +# Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org> +# +# This program is free software; you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by the Free +# Software Foundation; either version 2 of the License, or at your option any +# later version; or, when distributed separately from the Linux kernel or +# when incorporated into other software packages, subject to the following +# license: +# +# This program is free software; you can redistribute it and/or modify it +# under the terms of copyleft-next (version 0.3.1 or later) as published +# at http://copyleft-next.org/. + +# This performs a series of tests against the sysfs filesystem.
+ +# Kselftest framework requirement - SKIP code is 4. +ksft_skip=4 + +TEST_NAME="sysfs" +TEST_DRIVER="test_${TEST_NAME}" +TEST_DIR=$(dirname $0) +TEST_FILE=$(mktemp) + +# This represents +# +# TEST_ID:TEST_COUNT:ENABLED:TARGET +# +# TEST_ID: is the test id number +# TEST_COUNT: number of times we should run the test +# ENABLED: 1 if enabled, 0 otherwise +# TARGET: test target file required on the test_sysfs module +# +# Once these are enabled please leave them as-is. Write your own test, +# we have tons of space. +ALL_TESTS="0001:3:1:test_dev_x:misc" +ALL_TESTS="$ALL_TESTS 0002:3:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0003:3:1:test_dev_x:misc" +ALL_TESTS="$ALL_TESTS 0004:3:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0005:1:1:test_dev_x:misc" +ALL_TESTS="$ALL_TESTS 0006:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0007:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0008:1:1:test_dev_x:misc" +ALL_TESTS="$ALL_TESTS 0009:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0010:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0011:1:1:test_dev_x:misc" +ALL_TESTS="$ALL_TESTS 0012:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0013:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0014:3:1:test_dev_x:block" # block equivalent set +ALL_TESTS="$ALL_TESTS 0015:3:1:test_dev_x:block" +ALL_TESTS="$ALL_TESTS 0016:3:1:test_dev_x:block" +ALL_TESTS="$ALL_TESTS 0017:3:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0018:1:1:test_dev_x:block" +ALL_TESTS="$ALL_TESTS 0019:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0020:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0021:1:1:test_dev_x:block" +ALL_TESTS="$ALL_TESTS 0022:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block" +ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test +ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rtnl_lock + +allow_user_defaults() +{ + if
[ -z $DIR ]; then + case $TEST_DEV_TYPE in + misc) + DIR="/sys/devices/virtual/misc/${TEST_DRIVER}0" + ;; + block) + DIR="/sys/devices/virtual/block/${TEST_DRIVER}0" + ;; + *) + DIR="/sys/devices/virtual/misc/${TEST_DRIVER}0" + ;; + esac + fi + case $TEST_DEV_TYPE in + misc) + MODPROBE_TESTDEV_TYPE="" + ;; + block) + MODPROBE_TESTDEV_TYPE="devtype=1" + ;; + *) + MODPROBE_TESTDEV_TYPE="" + ;; + esac + if [ -z $SYSFS_DEBUGFS_DIR ]; then + SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs" + fi + if [ -z $PAGE_SIZE ]; then + PAGE_SIZE=$(getconf PAGESIZE) + fi + if [ -z $MAX_DIGITS ]; then + MAX_DIGITS=$(($PAGE_SIZE/8)) + fi + if [ -z $INT_MAX ]; then + INT_MAX=$(getconf INT_MAX) + fi + if [ -z $UINT_MAX ]; then + UINT_MAX=$(getconf UINT_MAX) + fi +} + +test_reqs() +{ + uid=$(id -u) + if [ $uid -ne 0 ]; then + echo $msg must be run as root >&2 + exit $ksft_skip + fi + + if ! which modprobe 2> /dev/null > /dev/null; then + echo "$0: You need modprobe installed" >&2 + exit $ksft_skip + fi + if ! which getconf 2> /dev/null > /dev/null; then + echo "$0: You need getconf installed" + exit $ksft_skip + fi + if ! which diff 2> /dev/null > /dev/null; then + echo "$0: You need diff installed" + exit $ksft_skip + fi + if ! which perl 2> /dev/null > /dev/null; then + echo "$0: You need perl installed" + exit $ksft_skip + fi +} + +call_modprobe() +{ + modprobe $TEST_DRIVER $MODPROBE_TESTDEV_TYPE $FIRST_MODPROBE_ARGS $MODPROBE_ARGS + return $? +} + +modprobe_reset() +{ + modprobe -q -r $TEST_DRIVER + call_modprobe + return $? 
+} + +modprobe_reset_enable_debugfs() +{ + FIRST_MODPROBE_ARGS="enable_debugfs=1" + modprobe_reset + unset FIRST_MODPROBE_ARGS +} + +modprobe_reset_enable_lock_on_rmmod() +{ + FIRST_MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 enable_verbose_writes=1" + modprobe_reset + unset FIRST_MODPROBE_ARGS +} + +modprobe_reset_enable_rtnl_lock_on_rmmod() +{ + FIRST_MODPROBE_ARGS="enable_lock=1 use_rtnl_lock=1 enable_lock_on_rmmod=1" + FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_writes=1" + modprobe_reset + unset FIRST_MODPROBE_ARGS +} + +load_req_mod() +{ + modprobe_reset + if [ ! -d $DIR ]; then + if ! modprobe -q -n $TEST_DRIVER; then + echo "$0: module $TEST_DRIVER not found [SKIP]" + echo "You must set CONFIG_TEST_SYSFS=m in your kernel" >&2 + exit $ksft_skip + fi + call_modprobe + if [ $? -ne 0 ]; then + echo "$0: modprobe $TEST_DRIVER failed." + exit 1 + fi + fi +} + +config_reset() +{ + if ! echo -n "1" >"$DIR"/reset; then + echo "$0: reset should have worked" >&2 + exit 1 + fi +} + +debugfs_reset_first_test_dev_ignore_errors() +{ + echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev +} + +set_orig() +{ + if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then + if [ -f ${TARGET} ]; then + echo "${ORIG}" > "${TARGET}" + fi + fi +} + +set_test() +{ + echo "${TEST_STR}" > "${TARGET}" +} + +set_test_ignore_errors() +{ + echo "${TEST_STR}" > "${TARGET}" 2> /dev/null +} + +verify() +{ + local seen + seen=$(cat "$1") + target_short=$(basename $TARGET) + case $target_short in + test_dev_x) + if [ "${seen}" != "${TEST_STR}" ]; then + return 1 + fi + ;; + test_dev_y) + DIRNAME=$(dirname $1) + EXPECTED_RESULT="" + # If our target was the test file then what we write to it + # is the same as what we expect when we read from it. + # When we write to test_dev_y directly though we expect + # a computed value which is driver specific.
+ if [[ "$DIRNAME" == "/tmp" ]]; then + let EXPECTED_RESULT="${TEST_STR}" + else + x=$(cat ${DIR}/test_dev_x) + let EXPECTED_RESULT="$x+${TEST_STR}+7" + fi + + if [[ "${seen}" != "${EXPECTED_RESULT}" ]]; then + return 1 + fi + ;; + *) + echo "Unsupported target type, update test script: $target_short" + exit 1 + esac + return 0 +} + +verify_diff_w() +{ + echo "$TEST_STR" | diff -q -w -u - $1 > /dev/null + return $? +} + +test_rc() +{ + if [[ $rc != 0 ]]; then + echo "Failed test, return value: $rc" >&2 + exit $rc + fi +} + +test_finish() +{ + set_orig + rm -f "${TEST_FILE}" + + if [ ! -z ${old_strict} ]; then + echo ${old_strict} > ${WRITES_STRICT} + fi + exit $rc +} + +# kernfs requires us to write everything we want in one shot because +# there is no easy way for us to know if userspace is only doing a partial +# write, so we don't support them. We expect the entire buffer to come on +# the first write. If you're writing a value, first read the file, +# modify only the value you're changing, then write the entire buffer back. +# Since we are only testing digits we just do full single writes. +# For more details, refer to kernfs_fop_write_iter(). +run_numerictests_single_write() +{ + echo "== Testing sysfs behavior against ${TARGET} ==" + + rc=0 + + echo -n "Writing test file ... " + echo "${TEST_STR}" > "${TEST_FILE}" + if ! verify "${TEST_FILE}"; then + echo "FAIL" >&2 + exit 1 + else + echo "ok" + fi + + echo -n "Checking the sysfs file is not set to test value ... " + if verify "${TARGET}"; then + echo "FAIL" >&2 + exit 1 + else + echo "ok" + fi + + echo -n "Writing to sysfs file from shell ... " + set_test + if ! verify "${TARGET}"; then + echo "FAIL" >&2 + exit 1 + else + echo "ok" + fi + + echo -n "Resetting sysfs file to original value ... 
" + set_orig + if verify "${TARGET}"; then + echo "FAIL" >&2 + exit 1 + else + echo "ok" + fi + + # Now that we've validated the sanity of "set_test" and "set_orig", + # we can use those functions to set starting states before running + # specific behavioral tests. + + echo -n "Writing to the entire sysfs file in a single write ... " + set_orig + dd if="${TEST_FILE}" of="${TARGET}" bs=4096 2>/dev/null + if ! verify "${TARGET}"; then + echo "FAIL" >&2 + rc=1 + else + echo "ok" + fi + + echo -n "Writing to the sysfs file with multiple long writes ... " + set_orig + (perl -e 'print "A" x 50;'; echo "${TEST_STR}") | \ + dd of="${TARGET}" bs=50 2>/dev/null + if verify "${TARGET}"; then + echo "FAIL" >&2 + rc=1 + else + echo "ok" + fi + test_rc +} + +reset_vals() +{ + echo -n 3 > $DIR/test_dev_x + echo -n 4 > $DIR/test_dev_x +} + +check_failure() +{ + echo -n "Testing that $1 fails as expected..." + reset_vals + TEST_STR="$1" + orig="$(cat $TARGET)" + echo -n "$TEST_STR" > $TARGET 2> /dev/null + + # write should fail and $TARGET should retain its original value + if [ $? = 0 ] || [ "$(cat $TARGET)" != "$orig" ]; then + echo "FAIL" >&2 + rc=1 + else + echo "ok" + fi + test_rc +} + +load_modreqs() +{ + export TEST_DEV_TYPE=$(get_test_type $1) + unset DIR + allow_user_defaults + load_req_mod +} + +target_exists() +{ + TARGET="${DIR}/$1" + TEST_ID="$2" + + if [ ! -f ${TARGET} ] ; then + echo "Target for test $TEST_ID: $TARGET does not exist, skipping test ..." + return 0 + fi + return 1 +} + +config_enable_lock() +{ + if ! echo -n 1 > $DIR/config_enable_lock; then + echo "$0: Unable to enable locks" >&2 + exit 1 + fi +} + +config_write_delay_msec_y() +{ + if ! echo -n $1 > $DIR/config_write_delay_msec_y ; then + echo "$0: Unable to set write_delay_msec_y to $1" >&2 + exit 1 + fi +} + +# Default filter for dmesg scanning. 
+# Ignore lockdep complaining about its own bugginess when scanning dmesg
+# output, because we shouldn't be failing filesystem tests on account of
+# lockdep.
+_check_dmesg_filter()
+{
+	egrep -v -e "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low" \
+		-e "BUG: MAX_STACK_TRACE_ENTRIES too low"
+}
+
+check_dmesg()
+{
+	# filter out intentional WARNINGs or Oopses
+	local filter=${1:-_check_dmesg_filter}
+
+	_dmesg_since_test_start | $filter >$seqres.dmesg
+	egrep -q -e "kernel BUG at" \
+		-e "WARNING:" \
+		-e "\bBUG:" \
+		-e "Oops:" \
+		-e "possible recursive locking detected" \
+		-e "Internal error" \
+		-e "(INFO|ERR): suspicious RCU usage" \
+		-e "INFO: possible circular locking dependency detected" \
+		-e "general protection fault:" \
+		-e "BUG .* remaining" \
+		-e "UBSAN:" \
+		$seqres.dmesg
+	if [ $? -eq 0 ]; then
+		echo "something found in dmesg (see $seqres.dmesg)"
+		return 1
+	else
+		if [ "$KEEP_DMESG" != "yes" ]; then
+			rm -f $seqres.dmesg
+		fi
+		return 0
+	fi
+}
+
+log_kernel_fstest_dmesg()
+{
+	export FSTYP="$1"
+	export seqnum="$FSTYP/$2"
+	export date_time=$(date +"%F %T")
+	echo "run fstests $seqnum at $date_time" > /dev/kmsg
+}
+
+modprobe_loop()
+{
+	while true; do
+		call_modprobe > /dev/null 2>&1
+		modprobe -r $TEST_DRIVER > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+write_loop()
+{
+	while true; do
+		set_test_ignore_errors > /dev/null 2>&1
+		TEST_STR=$(( $TEST_STR + 1 ))
+	done > /dev/null 2>&1
+}
+
+write_loop_reset()
+{
+	while true; do
+		set_test_ignore_errors > /dev/null 2>&1
+		debugfs_reset_first_test_dev_ignore_errors > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+write_loop_bg()
+{
+	BG_WRITES=1000 > /dev/null 2>&1
+	while true; do
+		for i in $(seq 1 $BG_WRITES); do
+			set_test_ignore_errors > /dev/null 2>&1 &
+			TEST_STR=$(( $TEST_STR + 1 ))
+		done > /dev/null 2>&1
+		wait
+	done > /dev/null 2>&1
+	wait
+}
+
+reset_loop()
+{
+	while true; do
+		debugfs_reset_first_test_dev_ignore_errors > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+kill_trigger_loop()
+{
+
+	local my_first_loop_pid=$1
+	local my_second_loop_pid=$2
+	local my_sleep_max=$3
+	local my_loop=0
+
+	while true; do
+		sleep 1
+		if [[ $my_loop -ge $my_sleep_max ]]; then
+			break
+		fi
+		let my_loop=$my_loop+1
+	done
+
+	kill -s TERM $my_first_loop_pid > /dev/null 2>&1
+	kill -s TERM $my_second_loop_pid > /dev/null 2>&1
+}
+
+_dmesg_since_test_start()
+{
+	# search the dmesg log of last run of $seqnum for possible failures
+	# use sed \cregexpc address type, since $seqnum contains "/"
+	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac
+}
+
+sysfs_test_0001()
+{
+	TARGET="${DIR}/$(get_test_target 0001)"
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0002()
+{
+	TARGET="${DIR}/$(get_test_target 0002)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0003()
+{
+	TARGET="${DIR}/$(get_test_target 0003)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	config_enable_lock
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0004()
+{
+	TARGET="${DIR}/$(get_test_target 0004)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	config_enable_lock
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0005()
+{
+	TARGET="${DIR}/$(get_test_target 0005)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing x while loading/unloading the module... "
+
+	modprobe_loop &
+	modprobe_pid=$!
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0006()
+{
+	TARGET="${DIR}/$(get_test_target 0006)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y while loading/unloading the module... "
+	modprobe_loop &
+	modprobe_pid=$!
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0007()
+{
+	TARGET="${DIR}/$(get_test_target 0007)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y with a larger delay while loading/unloading the module... "
+
+	MODPROBE_ARGS="write_delay_msec_y=1500"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0008()
+{
+	TARGET="${DIR}/$(get_test_target 0008)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing x while loading/unloading the module... "
+
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0009()
+{
+	TARGET="${DIR}/$(get_test_target 0009)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing y while loading/unloading the module... "
+
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0010()
+{
+	TARGET="${DIR}/$(get_test_target 0010)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing y with a larger delay while loading/unloading the module... "
+	modprobe -q -r $TEST_DRIVER > /dev/null 2>&1
+
+	MODPROBE_ARGS="write_delay_msec_y=1500"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0011()
+{
+	TARGET="${DIR}/$(get_test_target 0011)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing x and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0012()
+{
+	TARGET="${DIR}/$(get_test_target 0012)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0013()
+{
+	TARGET="${DIR}/$(get_test_target 0013)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	config_write_delay_msec_y 1500
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y with a larger delay and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0014()
+{
+	sysfs_test_0001
+}
+
+sysfs_test_0015()
+{
+	sysfs_test_0002
+}
+
+sysfs_test_0016()
+{
+	sysfs_test_0003
+}
+
+sysfs_test_0017()
+{
+	sysfs_test_0004
+}
+
+sysfs_test_0018()
+{
+	sysfs_test_0005
+}
+
+sysfs_test_0019()
+{
+	sysfs_test_0006
+}
+
+sysfs_test_0020()
+{
+	sysfs_test_0007
+}
+
+sysfs_test_0021()
+{
+	sysfs_test_0008
+}
+
+sysfs_test_0022()
+{
+	sysfs_test_0009
+}
+
+sysfs_test_0023()
+{
+	sysfs_test_0010
+}
+
+sysfs_test_0024()
+{
+	sysfs_test_0011
+}
+
+sysfs_test_0025()
+{
+	sysfs_test_0012
+}
+
+sysfs_test_0026()
+{
+	sysfs_test_0013
+}
+
+sysfs_test_0027()
+{
+	TARGET="${DIR}/$(get_test_target 0027)"
+	modprobe_reset_enable_lock_on_rmmod
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Test for possible rmmod deadlock while writing x ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 enable_verbose_writes=1"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0028()
+{
+	TARGET="${DIR}/$(get_test_target 0028)"
+	modprobe_reset_enable_lock_on_rmmod
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Test for possible rmmod deadlock using rtnl_lock while writing x ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 use_rtnl_lock=1 enable_verbose_writes=1"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+test_gen_desc()
+{
+	echo -n "$1 x $(get_test_count $1)"
+}
+
+list_tests()
+{
+	echo "Test ID list:"
+	echo
+	echo "TEST_ID x NUM_TESTS"
+	echo "TEST_ID:   Test ID"
+	echo "NUM_TESTS: Number of recommended times to run the test"
+	echo
+	echo "$(test_gen_desc 0001) - misc test writing x in different ways"
+	echo "$(test_gen_desc 0002) - misc test writing y in different ways"
+	echo "$(test_gen_desc 0003) - misc test writing x in different ways using a mutex lock"
+	echo "$(test_gen_desc 0004) - misc test writing y in different ways using a mutex lock"
+	echo "$(test_gen_desc 0005) - misc test writing x load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0006) - misc test writing y load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0007) - misc test writing y larger delay, load, remove test_sysfs"
+	echo "$(test_gen_desc 0008) - misc test busy writing x remove test_sysfs module"
+	echo "$(test_gen_desc 0009) - misc test busy writing y remove the test_sysfs module"
+	echo "$(test_gen_desc 0010) - misc test busy writing y larger delay, remove test_sysfs"
+	echo "$(test_gen_desc 0011) - misc test writing x and resetting device"
+	echo "$(test_gen_desc 0012) - misc test writing y and resetting device"
+	echo "$(test_gen_desc 0013) - misc test writing y with a larger delay and resetting device"
+	echo "$(test_gen_desc 0014) - block test writing x in different ways"
+	echo "$(test_gen_desc 0015) - block test writing y in different ways"
+	echo "$(test_gen_desc 0016) - block test writing x in different ways using a mutex lock"
+	echo "$(test_gen_desc 0017) - block test writing y in different ways using a mutex lock"
+	echo "$(test_gen_desc 0018) - block test writing x load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0019) - block test writing y load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0020) - block test writing y larger delay, load, remove test_sysfs"
+	echo "$(test_gen_desc 0021) - block test busy writing x remove the test_sysfs module"
+	echo "$(test_gen_desc 0022) - block test busy writing y remove the test_sysfs module"
+	echo "$(test_gen_desc 0023) - block test busy writing y larger delay, remove test_sysfs"
+	echo "$(test_gen_desc 0024) - block test writing x and resetting device"
+	echo "$(test_gen_desc 0025) - block test writing y and resetting device"
+	echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
+	echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
+	echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
+}
+
+usage()
+{
+	NUM_TESTS=$(grep -o ' ' <<<"$ALL_TESTS" | grep -c .)
+	let NUM_TESTS=$NUM_TESTS+1
+	MAX_TEST=$(printf "%04d\n" $NUM_TESTS)
+	echo "Usage: $0 [ -t <4-number-digit> ] | [ -w <4-number-digit> ] |"
+	echo "		 [ -s <4-number-digit> ] | [ -c <4-number-digit> <test-count> ]"
+	echo "		 [ all ] [ -h | --help ] [ -l ]"
+	echo ""
+	echo "Valid tests: 0001-$MAX_TEST"
+	echo ""
+	echo "    all       Runs all tests (default)"
+	echo "    -t        Run test ID the recommended number of times"
+	echo "    -w        Watch test ID run until it runs into an error"
+	echo "    -s        Run test ID once"
+	echo "    -c        Run test ID x test-count number of times"
+	echo "    -l        List all test IDs"
+	echo "    -h|--help Help"
+	echo
+	echo "If an error ever occurs execution will immediately terminate."
+	echo "If you are adding a new test try using -w <test-ID> first to"
+	echo "make sure the test passes a series of tests."
+	echo
+	echo Example uses:
+	echo
+	echo "$TEST_NAME.sh		-- executes all tests"
+	echo "$TEST_NAME.sh -t 0002	-- Executes test ID 0002 the recommended number of times"
+	echo "$TEST_NAME.sh -w 0002	-- Watch test ID 0002 run until an error occurs"
+	echo "$TEST_NAME.sh -s 0002	-- Run test ID 0002 once"
+	echo "$TEST_NAME.sh -c 0002 3	-- Run test ID 0002 three times"
+	echo
+	list_tests
+	exit 1
+}
+
+test_num()
+{
+	re='^[0-9]+$'
+	if ! [[ $1 =~ $re ]]; then
+		usage
+	fi
+}
+
+get_test_count()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $2}'
+}
+
+get_test_enabled()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $3}'
+}
+
+get_test_target()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $4}'
+}
+
+get_test_type()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $5}'
+}
+
+run_all_tests()
+{
+	for i in $ALL_TESTS ; do
+		TEST_ID=$(echo $i | awk -F":" '{print $1}')
+		ENABLED=$(get_test_enabled $TEST_ID)
+		TEST_COUNT=$(get_test_count $TEST_ID)
+		TEST_TARGET=$(get_test_target $TEST_ID)
+		if [[ $ENABLED -eq "1" ]]; then
+			test_case $TEST_ID $TEST_COUNT $TEST_TARGET
+		else
+			echo -n "Skipping test $TEST_ID as it's disabled, likely "
+			echo "could crash your system ..."
+		fi
+	done
+}
+
+watch_log()
+{
+	if [ $# -ne 3 ]; then
+		clear
+	fi
+	echo "Running test: $2 - run #$1"
+}
+
+watch_case()
+{
+	i=0
+	while [ 1 ]; do
+		if [ $# -eq 1 ]; then
+			test_num $1
+			watch_log $i ${TEST_NAME}_test_$1
+			log_kernel_fstest_dmesg sysfs $1
+			RUN_TEST=${TEST_NAME}_test_$1
+			$RUN_TEST
+			check_dmesg
+			if [[ $? -ne 0 ]]; then
+				exit 1
+			fi
+		else
+			watch_log $i all
+			run_all_tests
+		fi
+		let i=$i+1
+	done
+}
+
+test_case()
+{
+	NUM_TESTS=$2
+
+	i=0
+
+	load_modreqs $1
+	if target_exists $3 $1; then
+		return
+	fi
+
+	while [[ $i -lt $NUM_TESTS ]]; do
+		test_num $1
+		watch_log $i ${TEST_NAME}_test_$1 noclear
+		log_kernel_fstest_dmesg sysfs $1
+		RUN_TEST=${TEST_NAME}_test_$1
+		$RUN_TEST
+		let i=$i+1
+	done
+	check_dmesg
+	if [[ $? -ne 0 ]]; then
+		exit 1
+	fi
+}
+
+parse_args()
+{
+	if [ $# -eq 0 ]; then
+		run_all_tests
+	else
+		if [[ "$1" = "all" ]]; then
+			run_all_tests
+		elif [[ "$1" = "-w" ]]; then
+			shift
+			watch_case $@
+		elif [[ "$1" = "-t" ]]; then
+			shift
+			test_num $1
+			test_case $1 $(get_test_count $1) $(get_test_target $1)
+			shift
+		elif [[ "$1" = "-c" ]]; then
+			shift
+			test_num $1
+			test_num $2
+			test_case $1 $2 $(get_test_target $1)
+			shift
+			shift
+		elif [[ "$1" = "-s" ]]; then
+			shift
+			test_case $1 1 $(get_test_target $1)
+			shift
+		elif [[ "$1" = "-l" ]]; then
+			list_tests
+			shift
+		elif [[ "$1" = "-h" || "$1" = "--help" ]]; then
+			usage
+		else
+			usage
+		fi
+	fi
+}
+
+test_reqs
+allow_user_defaults
+
+trap "test_finish" EXIT
+
+parse_args $@
+
+exit 0
On Mon, Sep 27, 2021 at 09:37:56AM -0700, Luis Chamberlain wrote:
--- /dev/null
+++ b/lib/test_sysfs.c
@@ -0,0 +1,921 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+/*
+ * sysfs test driver
+ * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or at your option any
+ * later version; or, when distributed separately from the Linux kernel or
+ * when incorporated into other software packages, subject to the following
+ * license:
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of copyleft-next (version 0.3.1 or later) as published
Independent of the fact that I don't like sysfs code attempting to be accessed in the kernel with licenses other than GPLv2, you do not need the license "boilerplate" text at all in files. That's what the SPDX line is for.
thanks,
greg k-h
-----Original Message-----
From: Greg KH <gregkh@linuxfoundation.org>
Sent: Tuesday, October 5, 2021 8:17 AM
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: tj@kernel.org; akpm@linux-foundation.org; minchan@kernel.org; jeyu@kernel.org; shuah@kernel.org; bvanassche@acm.org; dan.j.williams@intel.com; joe@perches.com; tglx@linutronix.de; keescook@chromium.org; rostedt@goodmis.org; linux-spdx@vger.kernel.org; linux-doc@vger.kernel.org; linux-block@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-kselftest@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH v8 03/12] selftests: add tests_sysfs module
On Mon, Sep 27, 2021 at 09:37:56AM -0700, Luis Chamberlain wrote:
--- /dev/null
+++ b/lib/test_sysfs.c
@@ -0,0 +1,921 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+/*
+ * sysfs test driver
+ * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or at your option any
+ * later version; or, when distributed separately from the Linux kernel or
+ * when incorporated into other software packages, subject to the following
+ * license:
This is a very strange license grant, which I'm not sure is covered by any current SPDX syntax: "when distributed separately from the Linux kernel or when incorporated into other software packages, subject to the following license:"
Why would we care about the license used when the code is used in a non-kernel project? If it is desired for the code to be available outside the kernel under a different license, then surely the easiest thing is to make it available separately under that license. I'm not sure why the kernel needs to carry this license for non-kernel use of the code.
I would recommend giving this a GPLv2 SPDX header, and maybe in the comment at the top of the file put a reference to a git repository where the code can be obtained under a different license.
Just my 2 cents. -- Tim
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of copyleft-next (version 0.3.1 or later) as published
Independent of the fact that I don't like sysfs code attempting to be accessed in the kernel with licenses other than GPLv2, you do not need the license "boilerplate" text at all in files. That's what the SPDX line is for.
thanks,
greg k-h
On Tue, Oct 05, 2021 at 04:57:55PM +0000, Tim.Bird@sony.com wrote:
-----Original Message-----
From: Greg KH <gregkh@linuxfoundation.org>
Sent: Tuesday, October 5, 2021 8:17 AM
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: tj@kernel.org; akpm@linux-foundation.org; minchan@kernel.org; jeyu@kernel.org; shuah@kernel.org; bvanassche@acm.org; dan.j.williams@intel.com; joe@perches.com; tglx@linutronix.de; keescook@chromium.org; rostedt@goodmis.org; linux-spdx@vger.kernel.org; linux-doc@vger.kernel.org; linux-block@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-kselftest@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH v8 03/12] selftests: add tests_sysfs module
On Mon, Sep 27, 2021 at 09:37:56AM -0700, Luis Chamberlain wrote:
--- /dev/null
+++ b/lib/test_sysfs.c
@@ -0,0 +1,921 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+/*
+ * sysfs test driver
+ * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or at your option any
+ * later version; or, when distributed separately from the Linux kernel or
+ * when incorporated into other software packages, subject to the following
+ * license:
This is a very strange license grant, which I'm not sure is covered by any current SPDX syntax: "when distributed separately from the Linux kernel or when incorporated into other software packages, subject to the following license:"
drivers/xen/events/events_fifo.c has that same language.
Why would we care about the license used when the code is used in a non-kernel project? If it is desired for the code to be available outside the kernel under a different license, then surely the easiest thing is to make it available separately under that license. I'm not sure why the kernel needs to carry this license for non-kernel use of the code.
I would recommend giving this a GPLv2 SPDX header, and maybe in the comment at the top of the file put a reference to a git repository where the code can be obtained under a different license.
Keeping the dual license lets new updates made directly in the kernel benefit from that evolution. A fork would stagnate in place and would require separate updates.
Luis
On Tue, Oct 05, 2021 at 04:16:46PM +0200, Greg KH wrote:
On Mon, Sep 27, 2021 at 09:37:56AM -0700, Luis Chamberlain wrote:
--- /dev/null
+++ b/lib/test_sysfs.c
@@ -0,0 +1,921 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+/*
+ * sysfs test driver
+ * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or at your option any
+ * later version; or, when distributed separately from the Linux kernel or
+ * when incorporated into other software packages, subject to the following
+ * license:
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of copyleft-next (version 0.3.1 or later) as published
Independent of the fact that I don't like sysfs code attempting to be accessed in the kernel with licenses other than GPLv2, you do not need the license "boilerplate" text at all in files. That's what the SPDX line is for.
Sure, I'll remove the boilerplate, sorry for missing that again, I thought I had removed it.
Luis
On Mon, 27 Sep 2021, Luis Chamberlain wrote:
This adds a new selftest module which can be used to test sysfs, which would otherwise require using an existing driver. This lets us muck with a template driver to test breaking things without affecting system behaviour or requiring the dependencies of a real device driver.
A series of 28 tests is added. Two device types are supported:
- misc
- block
I suppose the selftests will run for more than 45 seconds (default kselftest timeout), so you probably also want to set timeout to something sensible in tools/testing/selftests/sysfs/settings file (0 would disable it).
Miroslav
On Thu, Oct 07, 2021 at 04:23:22PM +0200, Miroslav Benes wrote:
On Mon, 27 Sep 2021, Luis Chamberlain wrote:
This adds a new selftest module which can be used to test sysfs, which would otherwise require using an existing driver. This lets us muck with a template driver to test breaking things without affecting system behaviour or requiring the dependencies of a real device driver.
A series of 28 tests is added. Two device types are supported:
- misc
- block
I suppose the selftests will run for more than 45 seconds (default kselftest timeout), so you probably also want to set timeout to something sensible in tools/testing/selftests/sysfs/settings file (0 would disable it).
Good catch. I'll use a default of 200. In practice this runs in much less than that for me, about 110 seconds, so 200 should leave good wiggle room.
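For reference, a minimal sketch of what tools/testing/selftests/sysfs/settings could contain, assuming the kselftest runner's usual `timeout=` key with the value in seconds:

```
timeout=200
```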
Luis
This adds initial failure injection support to kernfs. We start off with debug knobs which when enabled allow test drivers, such as test_sysfs, to then make use of these to try to force certain difficult races to take place with a high degree of certainty.
This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is enabled in your kernel. If you don't have this enabled this provides no new functionality. When CONFIG_FAIL_KERNFS_KNOBS is disabled the new routine kernfs_debug_should_wait() ends up being transformed to if (false), and so the compiler should optimize these out as dead code, producing no effective binary changes.
We start off with enabling failure injections in kernfs by allowing us to alter the way kernfs_fop_write_iter() behaves. We allow for the routine kernfs_fop_write_iter() to wait for a certain condition in the kernel to occur, after which it will sleep a predefined amount of time. This lets kernfs users time exactly when they want kernfs_fop_write_iter() to complete, allowing them to construct race conditions and test for correctness in kernfs.
You'd boot with this enabled on your kernel command line:
fail_kernfs_fop_write_iter=1,100,0,1
The values are <interval,probability,space,times>. We don't care about space, so for now we ignore it. With times set to 1, the above ensures a failure will trigger only once.
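To make the field order concrete, here is a small shell sketch that splits such a boot parameter value into its four fields. It is only an illustration: the kernel parses the string itself in setup_fault_attr(), and parse_fault_attr is a hypothetical helper name that is not part of this patch.

```shell
#!/bin/bash
# Illustration only: split a fault-injection boot string into its four
# comma-separated fields. parse_fault_attr is a hypothetical helper.
parse_fault_attr()
{
	local interval probability space times

	IFS=, read -r interval probability space times <<< "$1"
	echo "interval=$interval probability=$probability space=$space times=$times"
}

# prints: interval=1 probability=100 space=0 times=1
parse_fault_attr "1,100,0,1"
```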
*How* we allow for this routine to change behaviour is left to knobs we expose under debugfs:
  # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
  wait_after_active
  wait_after_mutex
  wait_at_start
  wait_before_mutex
A debugfs entry also exists to allow us to sleep a configurable amount of time after the completion:
/sys/kernel/debug/kernfs/sleep_after_wait_ms
These two sets of knobs allow us to construct races and demonstrate how the kernfs active reference should suffice to protect against races.
Enabling CONFIG_FAULT_INJECTION_DEBUG_FS enables us to configure the different fault injection parameters for the new fail_kernfs_fop_write_iter fault injection at run time:
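As a rough sketch of how a test script might drive these knobs, assuming the debugfs paths listed above and a mounted debugfs: kernfs_fail_config and the KERNFS_DEBUGFS override are hypothetical names for illustration, not something this patch adds.

```shell
#!/bin/bash
# Hypothetical helper. KERNFS_DEBUGFS defaults to the path used in this
# patch, but can be overridden, e.g. to dry-run against a scratch directory.
KERNFS_DEBUGFS="${KERNFS_DEBUGFS:-/sys/kernel/debug/kernfs}"

# kernfs_fail_config <knob> <sleep_ms>
# Enable one wait_* knob and set how long to sleep after the wait.
kernfs_fail_config()
{
	local knob="$1" sleep_ms="$2"
	local cfg="$KERNFS_DEBUGFS/config_fail_kernfs_fop_write_iter"

	if [ ! -e "$cfg/$knob" ]; then
		echo "$0: knob $knob not found under $cfg" >&2
		return 1
	fi
	echo 1 > "$cfg/$knob" || return 1
	echo "$sleep_ms" > "$KERNFS_DEBUGFS/sleep_after_wait_ms"
}

# Example, run as root on a kernel built with CONFIG_FAIL_KERNFS_KNOBS:
#   kernfs_fail_config wait_at_start 100
```

Checking that the knob exists first gives a clear error on kernels built without CONFIG_FAIL_KERNFS_KNOBS, rather than a bare redirection failure.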
  ls -1 /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
  interval
  probability
  space
  task-filter
  times
  verbose
  verbose_ratelimit_burst
  verbose_ratelimit_interval_ms
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 .../fault-injection/fault-injection.rst       | 22 +++++
 MAINTAINERS                                   |  2 +-
 fs/kernfs/Makefile                            |  1 +
 fs/kernfs/failure-injection.c                 | 91 +++++++++++++++++++
 fs/kernfs/file.c                              | 13 +++
 fs/kernfs/kernfs-internal.h                   | 72 +++++++++++++++
 include/linux/kernfs.h                        |  5 +
 lib/Kconfig.debug                             | 10 ++
 8 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 fs/kernfs/failure-injection.c
diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst
index 4a25c5eb6f07..d4d34b082f47 100644
--- a/Documentation/fault-injection/fault-injection.rst
+++ b/Documentation/fault-injection/fault-injection.rst
@@ -28,6 +28,28 @@ Available fault injection capabilities

   injects kernel RPC client and server failures.

+- fail_kernfs_fop_write_iter
+
+  Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
+  this does not immediately enable any errors to occur. You must configure
+  how you want this routine to fail or change behaviour by using the debugfs
+  knobs for it:
+
+  # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
+  wait_after_active
+  wait_after_mutex
+  wait_at_start
+  wait_before_mutex
+
+  You can also configure how long to sleep after a wait under
+
+  /sys/kernel/debug/kernfs/sleep_after_wait_ms
+
+  If you enable CONFIG_FAULT_INJECTION_DEBUG_FS the fail_kernfs_fop_write_iter
+  failure injection parameters are placed under:
+
+  /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
+
 - fail_make_request
   injects disk IO errors on devices permitted by setting
diff --git a/MAINTAINERS b/MAINTAINERS
index 1b4cefcb064c..fadfd961ad80 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10384,7 +10384,7 @@ M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 M:	Tejun Heo <tj@kernel.org>
 S:	Supported
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git
-F:	fs/kernfs/
+F:	fs/kernfs/*
 F:	include/linux/kernfs.h

 KEXEC
diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
index 4ca54ff54c98..bc5b32ca39f9 100644
--- a/fs/kernfs/Makefile
+++ b/fs/kernfs/Makefile
@@ -4,3 +4,4 @@
 #

 obj-y		:= mount.o inode.o dir.o file.o symlink.o
+obj-$(CONFIG_FAIL_KERNFS_KNOBS)    += failure-injection.o
diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c
new file mode 100644
index 000000000000..4130d202c13b
--- /dev/null
+++ b/fs/kernfs/failure-injection.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/fault-inject.h>
+#include <linux/delay.h>
+
+#include "kernfs-internal.h"
+
+static DECLARE_FAULT_ATTR(fail_kernfs_fop_write_iter);
+struct kernfs_config_fail kernfs_config_fail;
+
+#define kernfs_config_fail(when) \
+	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+#define kernfs_config_fail(when) \
+	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+static int __init setup_fail_kernfs_fop_write_iter(char *str)
+{
+	return setup_fault_attr(&fail_kernfs_fop_write_iter, str);
+}
+
+__setup("fail_kernfs_fop_write_iter=", setup_fail_kernfs_fop_write_iter);
+
+struct dentry *kernfs_debugfs_root;
+struct dentry *config_fail_kernfs_fop_write_iter;
+
+static int __init kernfs_init_failure_injection(void)
+{
+	kernfs_config_fail.sleep_after_wait_ms = 100;
+	kernfs_debugfs_root = debugfs_create_dir("kernfs", NULL);
+
+	fault_create_debugfs_attr("fail_kernfs_fop_write_iter",
+				  kernfs_debugfs_root, &fail_kernfs_fop_write_iter);
+
+	config_fail_kernfs_fop_write_iter =
+		debugfs_create_dir("config_fail_kernfs_fop_write_iter",
+				   kernfs_debugfs_root);
+
+	debugfs_create_u32("sleep_after_wait_ms", 0600,
+			   kernfs_debugfs_root,
+			   &kernfs_config_fail.sleep_after_wait_ms);
+
+	debugfs_create_bool("wait_at_start", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(at_start));
+	debugfs_create_bool("wait_before_mutex", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(before_mutex));
+	debugfs_create_bool("wait_after_mutex", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(after_mutex));
+	debugfs_create_bool("wait_after_active", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(after_active));
+	return 0;
+}
+late_initcall(kernfs_init_failure_injection);
+
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
+{
+	if (!evaluate)
+		return 0;
+
+	return should_fail(&fail_kernfs_fop_write_iter, 0);
+}
+
+DECLARE_COMPLETION(kernfs_debug_wait_completion);
+EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
+
+void kernfs_debug_wait(void)
+{
+	unsigned long timeout;
+
+	timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
+					      msecs_to_jiffies(3000));
+	if (!timeout)
+		pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
+			__func__);
+	else
+		pr_info("%s received completion with time left on timeout %u ms\n",
+			__func__, jiffies_to_msecs(timeout));
+
+	/**
+	 * The goal is wait for an event, and *then* once we have
+	 * reached it, the other side will try to do something which
+	 * it thinks will break. So we must give it some time to do
+	 * that. The amount of time is configurable.
+	 */
+	msleep(kernfs_config_fail.sleep_after_wait_ms);
+	pr_info("%s ended\n", __func__);
+}
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 60e2a86c535e..4479c6580333 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	const struct kernfs_ops *ops;
 	char *buf;
+ if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start)) + kernfs_debug_wait(); + if (of->atomic_write_len) { if (len > of->atomic_write_len) return -E2BIG; @@ -280,17 +283,27 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter) } buf[len] = '\0'; /* guarantee string termination */
+ if (kernfs_debug_should_wait(kernfs_fop_write_iter, before_mutex)) + kernfs_debug_wait(); + /* * @of->mutex nests outside active ref and is used both to ensure that * the ops aren't called concurrently for the same open file. */ mutex_lock(&of->mutex); + + if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_mutex)) + kernfs_debug_wait(); + if (!kernfs_get_active(of->kn)) { mutex_unlock(&of->mutex); len = -ENODEV; goto out_free; }
+ if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_active)) + kernfs_debug_wait(); + ops = kernfs_ops(of->kn); if (ops->write) len = ops->write(of, buf, len, iocb->ki_pos); diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h index f9cc912c31e1..9e3abf597e2d 100644 --- a/fs/kernfs/kernfs-internal.h +++ b/fs/kernfs/kernfs-internal.h @@ -18,6 +18,7 @@
#include <linux/kernfs.h> #include <linux/fs_context.h> +#include <linux/stringify.h>
struct kernfs_iattrs { kuid_t ia_uid; @@ -147,4 +148,75 @@ void kernfs_drain_open_files(struct kernfs_node *kn); */ extern const struct inode_operations kernfs_symlink_iops;
+/* + * failure-injection.c + */ +#ifdef CONFIG_FAIL_KERNFS_KNOBS + +/** + * struct kernfs_fop_write_iter_fail - how kernfs_fop_write_iter_fail fails + * + * This lets you configure what part of kernfs_fop_write_iter() should behave + * in a specific way to allow userspace to capture possible failures in + * kernfs. The wait knobs are allowed to let you design capture possible + * race conditions which would otherwise be difficult to reproduce. A + * secondary driver would tell kernfs's wait completion when it is done. + * + * The point to the wait completion failure injection tests are to confirm + * that the kernfs active refcount suffice to ensure other objects in other + * layers are also gauranteed to exist, even they are opaque to kernfs. This + * includes kobjects, devices, and other objects built on top of this, like + * the block layer when using sysfs block device attributes. + * + * @wait_at_start: waits for completion from a third party at the start of + * the routine. + * @wait_before_mutex: waits for completion from a third party before we + * are allowed to continue before the of->mutex is held. + * @wait_after_mutex: waits for completion from a third party after we + * have held the of->mutex. + * @wait_after_active: waits for completion from a thid party after we + * have refcounted the struct kernfs_node. + */ +struct kernfs_fop_write_iter_fail { + bool wait_at_start; + bool wait_before_mutex; + bool wait_after_mutex; + bool wait_after_active; +}; + +/** + * struct kernfs_config_fail - kernfs configuration for failure injection + * + * You can kernfs failure injection on boot, and in particular we currently + * only support failures for kernfs_fop_write_iter(). However, we don't + * want to always enable errors on this call when failure injection is enabled + * as this routine is used by many parts of the kernel for proper functionality. 
+ * The compromise we make is we let userspace start enabling which parts it + * wants to fail after boot, if and only if failure injection has been enabled. + * + * @kernfs_fop_write_iter_fail: configuration for how we want to allow + * for failure injection on kernfs_fop_write_iter() + * @sleep_after_wait_ms: how many ms to wait after completion is received. + */ +struct kernfs_config_fail { + struct kernfs_fop_write_iter_fail kernfs_fop_write_iter_fail; + u32 sleep_after_wait_ms; +}; + +extern struct kernfs_config_fail kernfs_config_fail; + +#define __kernfs_config_wait_var(func, when) \ + (kernfs_config_fail. func ## _fail.wait_ ## when) +#define __kernfs_debug_should_wait_func_name(func) __kernfs_debug_should_wait_## func + +#define kernfs_debug_should_wait(func, when) \ + __kernfs_debug_should_wait_func_name(func)(__kernfs_config_wait_var(func, when)) +int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate); +void kernfs_debug_wait(void); +#else +static inline void kernfs_init_failure_injection(void) {} +#define kernfs_debug_should_wait(func, when) (false) +static inline void kernfs_debug_wait(void) {} +#endif + #endif /* __KERNFS_INTERNAL_H */ diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index 3ccce6f24548..cd968ee2b503 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -411,6 +411,11 @@ void kernfs_init(void);
struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root, u64 id); + +#ifdef CONFIG_FAIL_KERNFS_KNOBS +extern struct completion kernfs_debug_wait_completion; +#endif + #else /* CONFIG_KERNFS */
static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ae19bf1a21b8..a29b7d398c4e 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1902,6 +1902,16 @@ config FAULT_INJECTION_USERCOPY Provides fault-injection capability to inject failures in usercopy functions (copy_from_user(), get_user(), ...).
+config FAIL_KERNFS_KNOBS + bool "Fault-injection support in kernfs" + depends on FAULT_INJECTION + help + Provide fault-injection capability for kernfs. This only enables + the error injection functionality. To use it you must configure which + which path you want to trigger on error on using debugfs under + /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/. By + default all of these are disabled. + config FAIL_MAKE_REQUEST bool "Fault-injection capability for disk IO" depends on FAULT_INJECTION && BLOCK
On Mon, Sep 27, 2021 at 09:37:57AM -0700, Luis Chamberlain wrote:
This adds initial failure injection support to kernfs. We start off with debug knobs which when enabled allow test drivers, such as test_sysfs, to then make use of these to try to force certain difficult races to take place with a high degree of certainty.
This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is enabled in your kernel. If you don't have this enabled, this provides no new functionality. When CONFIG_FAIL_KERNFS_KNOBS is disabled, the new routine kernfs_debug_should_wait() expands to (false), so the compiler should optimize each check out as dead code, producing no effective binary changes.
We start off with enabling failure injection in kernfs by allowing us to alter the way kernfs_fop_write_iter() behaves. We allow kernfs_fop_write_iter() to wait for a certain condition in the kernel to occur, after which it will sleep a predefined amount of time. This lets kernfs users time exactly when they want kernfs_fop_write_iter() to complete, which allows them to construct race conditions and test for correctness in kernfs.
You'd boot with this enabled on your kernel command line:
fail_kernfs_fop_write_iter=1,100,0,1
The values are <interval,probability,size,times>. We don't care about size, so for now we ignore it. The above ensures a failure will trigger only once.
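As a rough illustration of the field order described above, here is a minimal shell sketch that splits such a boot value into its four fields. The helper name parse_fault_attr is hypothetical and not part of the patch; the third field is labelled "space" here only because that is the name the debugfs listing further down uses for it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): split a
# fail_kernfs_fop_write_iter= boot value into its four fields.
parse_fault_attr()
{
	# Field order follows <interval,probability,size,times>; the
	# third field shows up as "space" in the debugfs listing.
	IFS=, read -r interval probability space times <<EOF
$1
EOF
	echo "interval=$interval probability=$probability space=$space times=$times"
}

parse_fault_attr "1,100,0,1"
```

With times set to 1 the fault attribute is allowed to trigger once, which matches the "only once" remark above.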
*How* we allow for this routine to change behaviour is left to knobs we expose under debugfs:
# ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
I'd expect this to live under /sys/kernel/debug/fail_kernfs, like the other fault injectors.
wait_after_active
wait_after_mutex
wait_at_start
wait_before_mutex
A debugfs entry also exists to allow us to sleep a configurable amount of time after the completion:
/sys/kernel/debug/kernfs/sleep_after_wait_ms
These two sets of knobs allow us to construct races and demonstrate how the kernfs active reference should suffice to protect against races.
Enabling CONFIG_FAULT_INJECTION_DEBUG_FS lets us configure the different fault injection parameters for the new fail_kernfs_fop_write_iter fault injection at run time:
ls -1 /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
interval
probability
space
times
task-filter
verbose
verbose_ratelimit_burst
verbose_ratelimit_interval_ms
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 .../fault-injection/fault-injection.rst       | 22 +++++
 MAINTAINERS                                   |  2 +-
 fs/kernfs/Makefile                            |  1 +
 fs/kernfs/failure-injection.c                 | 91 +++++++++++++++++++
 fs/kernfs/file.c                              | 13 +++
 fs/kernfs/kernfs-internal.h                   | 72 +++++++++++++++
 include/linux/kernfs.h                        |  5 +
 lib/Kconfig.debug                             | 10 ++
 8 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 fs/kernfs/failure-injection.c
diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst index 4a25c5eb6f07..d4d34b082f47 100644 --- a/Documentation/fault-injection/fault-injection.rst +++ b/Documentation/fault-injection/fault-injection.rst @@ -28,6 +28,28 @@ Available fault injection capabilities injects kernel RPC client and server failures. +- fail_kernfs_fop_write_iter
+
+  Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
+  this does not immediately enable any errors to occur. You must configure
+  how you want this routine to fail or change behaviour by using the debugfs
+  knobs for it:
+
+  # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
+  wait_after_active
+  wait_after_mutex
+  wait_at_start
+  wait_before_mutex
This should be split up and detailed in the "debugfs entries" section below here.
+
+  You can also configure how long to sleep after a wait under
+
+  /sys/kernel/debug/kernfs/sleep_after_wait_ms
+
+  If you enable CONFIG_FAULT_INJECTION_DEBUG_FS the fail_add_disk failure
+  injection parameters are placed under:
+
+  /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
+
- fail_make_request
injects disk IO errors on devices permitted by setting diff --git a/MAINTAINERS b/MAINTAINERS index 1b4cefcb064c..fadfd961ad80 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10384,7 +10384,7 @@ M: Greg Kroah-Hartman gregkh@linuxfoundation.org M: Tejun Heo tj@kernel.org S: Supported T: git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git -F: fs/kernfs/ +F: fs/kernfs/* F: include/linux/kernfs.h KEXEC diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile index 4ca54ff54c98..bc5b32ca39f9 100644 --- a/fs/kernfs/Makefile +++ b/fs/kernfs/Makefile @@ -4,3 +4,4 @@ # obj-y := mount.o inode.o dir.o file.o symlink.o +obj-$(CONFIG_FAIL_KERNFS_KNOBS) += failure-injection.o diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c new file mode 100644 index 000000000000..4130d202c13b --- /dev/null +++ b/fs/kernfs/failure-injection.c
I'd name this fault_inject.c, which matches the more common case:
$ find . -type f -name '*fault*inject*.c' ./fs/nfsd/fault_inject.c ./drivers/nvme/host/fault_inject.c ./drivers/scsi/ufs/ufs-fault-injection.c ./lib/fault-inject.c ./lib/fault-inject-usercopy.c
@@ -0,0 +1,91 @@ +// SPDX-License-Identifier: GPL-2.0
+#include <linux/fault-inject.h> +#include <linux/delay.h>
+#include "kernfs-internal.h"
+static DECLARE_FAULT_ATTR(fail_kernfs_fop_write_iter); +struct kernfs_config_fail kernfs_config_fail;
+#define kernfs_config_fail(when) \
+	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+#define kernfs_config_fail(when) \
+	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+static int __init setup_fail_kernfs_fop_write_iter(char *str)
+{
+	return setup_fault_attr(&fail_kernfs_fop_write_iter, str);
+}
+__setup("fail_kernfs_fop_write_iter=", setup_fail_kernfs_fop_write_iter);
+struct dentry *kernfs_debugfs_root; +struct dentry *config_fail_kernfs_fop_write_iter;
+static int __init kernfs_init_failure_injection(void)
+{
+	kernfs_config_fail.sleep_after_wait_ms = 100;
+	kernfs_debugfs_root = debugfs_create_dir("kernfs", NULL);
+
+	fault_create_debugfs_attr("fail_kernfs_fop_write_iter",
+				  kernfs_debugfs_root, &fail_kernfs_fop_write_iter);
+
+	config_fail_kernfs_fop_write_iter =
+		debugfs_create_dir("config_fail_kernfs_fop_write_iter",
+				   kernfs_debugfs_root);
+
+	debugfs_create_u32("sleep_after_wait_ms", 0600,
+			   kernfs_debugfs_root,
+			   &kernfs_config_fail.sleep_after_wait_ms);
+
+	debugfs_create_bool("wait_at_start", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(at_start));
+	debugfs_create_bool("wait_before_mutex", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(before_mutex));
+	debugfs_create_bool("wait_after_mutex", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(after_mutex));
+	debugfs_create_bool("wait_after_active", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(after_active));
+	return 0;
+}
+late_initcall(kernfs_init_failure_injection);
+
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
+{
+	if (!evaluate)
+		return 0;
+
+	return should_fail(&fail_kernfs_fop_write_iter, 0);
+}
Every caller ends up doing the wait, so how about just including that here instead? It should make things much less intrusive and more readable.
And for the naming, other fault injectors use "should_fail_$topic", so maybe better here would be something like may_wait_kernfs(...).
+DECLARE_COMPLETION(kernfs_debug_wait_completion); +EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
+void kernfs_debug_wait(void)
+{
+	unsigned long timeout;
+
+	timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
+					      msecs_to_jiffies(3000));
+	if (!timeout)
+		pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
+			__func__);
+	else
+		pr_info("%s received completion with time left on timeout %u ms\n",
+			__func__, jiffies_to_msecs(timeout));
+
+	/**
+	 * The goal is wait for an event, and *then* once we have
+	 * reached it, the other side will try to do something which
+	 * it thinks will break. So we must give it some time to do
+	 * that. The amount of time is configurable.
+	 */
+	msleep(kernfs_config_fail.sleep_after_wait_ms);
+	pr_info("%s ended\n", __func__);
+}
All the uses of "__func__" here seem redundant; I would drop them.
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c index 60e2a86c535e..4479c6580333 100644 --- a/fs/kernfs/file.c +++ b/fs/kernfs/file.c @@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter) const struct kernfs_ops *ops; char *buf;
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start))
+		kernfs_debug_wait();
So this could just be:
may_wait_kernfs(kernfs_fop_write_iter, at_start);
+
 	if (of->atomic_write_len) {
 		if (len > of->atomic_write_len)
 			return -E2BIG;
@@ -280,17 +283,27 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter) } buf[len] = '\0'; /* guarantee string termination */
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, before_mutex))
+		kernfs_debug_wait();
+
 	/*
 	 * @of->mutex nests outside active ref and is used both to ensure that
 	 * the ops aren't called concurrently for the same open file.
 	 */
 	mutex_lock(&of->mutex);
+
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_mutex))
+		kernfs_debug_wait();
+
 	if (!kernfs_get_active(of->kn)) {
 		mutex_unlock(&of->mutex);
 		len = -ENODEV;
 		goto out_free;
 	}
 
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_active))
+		kernfs_debug_wait();
+
 	ops = kernfs_ops(of->kn);
 	if (ops->write)
 		len = ops->write(of, buf, len, iocb->ki_pos);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h index f9cc912c31e1..9e3abf597e2d 100644 --- a/fs/kernfs/kernfs-internal.h +++ b/fs/kernfs/kernfs-internal.h @@ -18,6 +18,7 @@ #include <linux/kernfs.h> #include <linux/fs_context.h> +#include <linux/stringify.h> struct kernfs_iattrs { kuid_t ia_uid; @@ -147,4 +148,75 @@ void kernfs_drain_open_files(struct kernfs_node *kn); */ extern const struct inode_operations kernfs_symlink_iops; +/*
+ * failure-injection.c
+ */
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+
+/**
+ * struct kernfs_fop_write_iter_fail - how kernfs_fop_write_iter_fail fails
+ *
+ * This lets you configure what part of kernfs_fop_write_iter() should behave
+ * in a specific way to allow userspace to capture possible failures in
+ * kernfs. The wait knobs are allowed to let you design capture possible
+ * race conditions which would otherwise be difficult to reproduce. A
+ * secondary driver would tell kernfs's wait completion when it is done.
+ *
+ * The point to the wait completion failure injection tests are to confirm
+ * that the kernfs active refcount suffice to ensure other objects in other
+ * layers are also gauranteed to exist, even they are opaque to kernfs. This
+ * includes kobjects, devices, and other objects built on top of this, like
+ * the block layer when using sysfs block device attributes.
+ *
+ * @wait_at_start: waits for completion from a third party at the start of
+ *	the routine.
+ * @wait_before_mutex: waits for completion from a third party before we
+ *	are allowed to continue before the of->mutex is held.
+ * @wait_after_mutex: waits for completion from a third party after we
+ *	have held the of->mutex.
+ * @wait_after_active: waits for completion from a thid party after we
+ *	have refcounted the struct kernfs_node.
+ */
+struct kernfs_fop_write_iter_fail {
+	bool wait_at_start;
+	bool wait_before_mutex;
+	bool wait_after_mutex;
+	bool wait_after_active;
+};
+
+/**
+ * struct kernfs_config_fail - kernfs configuration for failure injection
+ *
+ * You can kernfs failure injection on boot, and in particular we currently
+ * only support failures for kernfs_fop_write_iter(). However, we don't
+ * want to always enable errors on this call when failure injection is enabled
+ * as this routine is used by many parts of the kernel for proper functionality.
+ * The compromise we make is we let userspace start enabling which parts it
+ * wants to fail after boot, if and only if failure injection has been enabled.
+ *
+ * @kernfs_fop_write_iter_fail: configuration for how we want to allow
+ *	for failure injection on kernfs_fop_write_iter()
+ * @sleep_after_wait_ms: how many ms to wait after completion is received.
+ */
+struct kernfs_config_fail {
+	struct kernfs_fop_write_iter_fail kernfs_fop_write_iter_fail;
+	u32 sleep_after_wait_ms;
+};
+
+extern struct kernfs_config_fail kernfs_config_fail;
+#define __kernfs_config_wait_var(func, when) \
+	(kernfs_config_fail. func ## _fail.wait_ ## when)
^^ ^ ^
nit: needless spaces
+#define __kernfs_debug_should_wait_func_name(func) __kernfs_debug_should_wait_## func
+#define kernfs_debug_should_wait(func, when) \
+	__kernfs_debug_should_wait_func_name(func)(__kernfs_config_wait_var(func, when))
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate); +void kernfs_debug_wait(void); +#else +static inline void kernfs_init_failure_injection(void) {} +#define kernfs_debug_should_wait(func, when) (false) +static inline void kernfs_debug_wait(void) {} +#endif
#endif /* __KERNFS_INTERNAL_H */ diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index 3ccce6f24548..cd968ee2b503 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -411,6 +411,11 @@ void kernfs_init(void); struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root, u64 id);
+#ifdef CONFIG_FAIL_KERNFS_KNOBS +extern struct completion kernfs_debug_wait_completion; +#endif
#else /* CONFIG_KERNFS */ static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ae19bf1a21b8..a29b7d398c4e 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1902,6 +1902,16 @@ config FAULT_INJECTION_USERCOPY Provides fault-injection capability to inject failures in usercopy functions (copy_from_user(), get_user(), ...). +config FAIL_KERNFS_KNOBS
+	bool "Fault-injection support in kernfs"
+	depends on FAULT_INJECTION
+	help
+	  Provide fault-injection capability for kernfs. This only enables
+	  the error injection functionality. To use it you must configure which
+	  which path you want to trigger on error on using debugfs under
+	  /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/. By
+	  default all of these are disabled.
+
config FAIL_MAKE_REQUEST bool "Fault-injection capability for disk IO" depends on FAULT_INJECTION && BLOCK -- 2.30.2
On Tue, Oct 05, 2021 at 12:47:22PM -0700, Kees Cook wrote:
On Mon, Sep 27, 2021 at 09:37:57AM -0700, Luis Chamberlain wrote:
This adds initial failure injection support to kernfs. We start off with debug knobs which when enabled allow test drivers, such as test_sysfs, to then make use of these to try to force certain difficult races to take place with a high degree of certainty.
This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is enabled in your kernel. If you don't have this enabled, this provides no new functionality. When CONFIG_FAIL_KERNFS_KNOBS is disabled, the new routine kernfs_debug_should_wait() expands to (false), so the compiler should optimize each check out as dead code, producing no effective binary changes.
We start off with enabling failure injection in kernfs by allowing us to alter the way kernfs_fop_write_iter() behaves. We allow kernfs_fop_write_iter() to wait for a certain condition in the kernel to occur, after which it will sleep a predefined amount of time. This lets kernfs users time exactly when they want kernfs_fop_write_iter() to complete, which allows them to construct race conditions and test for correctness in kernfs.
You'd boot with this enabled on your kernel command line:
fail_kernfs_fop_write_iter=1,100,0,1
The values are <interval,probability,size,times>. We don't care about size, so for now we ignore it. The above ensures a failure will trigger only once.
*How* we allow for this routine to change behaviour is left to knobs we expose under debugfs:
# ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
I'd expect this to live under /sys/kernel/debug/fail_kernfs, like the other fault injectors.
Yes I see, thanks will fix up!
diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst index 4a25c5eb6f07..d4d34b082f47 100644 --- a/Documentation/fault-injection/fault-injection.rst +++ b/Documentation/fault-injection/fault-injection.rst @@ -28,6 +28,28 @@ Available fault injection capabilities injects kernel RPC client and server failures. +- fail_kernfs_fop_write_iter
+
+  Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
+  this does not immediately enable any errors to occur. You must configure
+  how you want this routine to fail or change behaviour by using the debugfs
+  knobs for it:
+
+  # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
+  wait_after_active
+  wait_after_mutex
+  wait_at_start
+  wait_before_mutex
This should be split up and detailed in the "debugfs entries" section below here.
Done!
diff --git a/MAINTAINERS b/MAINTAINERS index 1b4cefcb064c..fadfd961ad80 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10384,7 +10384,7 @@ M: Greg Kroah-Hartman gregkh@linuxfoundation.org M: Tejun Heo tj@kernel.org S: Supported T: git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git -F: fs/kernfs/ +F: fs/kernfs/* F: include/linux/kernfs.h KEXEC diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile index 4ca54ff54c98..bc5b32ca39f9 100644 --- a/fs/kernfs/Makefile +++ b/fs/kernfs/Makefile @@ -4,3 +4,4 @@ # obj-y := mount.o inode.o dir.o file.o symlink.o +obj-$(CONFIG_FAIL_KERNFS_KNOBS) += failure-injection.o diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c new file mode 100644 index 000000000000..4130d202c13b --- /dev/null +++ b/fs/kernfs/failure-injection.c
I'd name this fault_inject.c, which matches the more common case:
$ find . -type f -name '*fault*inject*.c' ./fs/nfsd/fault_inject.c ./drivers/nvme/host/fault_inject.c ./drivers/scsi/ufs/ufs-fault-injection.c ./lib/fault-inject.c ./lib/fault-inject-usercopy.c
Sure, done.
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
+{
+	if (!evaluate)
+		return 0;
+
+	return should_fail(&fail_kernfs_fop_write_iter, 0);
+}
Every caller ends up doing the wait, so how about just including that here instead? It should make things much less intrusive and more readable.
And for the naming, other fault injectors use "should_fail_$topic", so maybe better here would be something like may_wait_kernfs(...).
In case anyone is reading Hail Mary by Andy Weir: "Yes yes yes!"
Indeed, that's a great idea. Changed!
+DECLARE_COMPLETION(kernfs_debug_wait_completion); +EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
+void kernfs_debug_wait(void)
+{
+	unsigned long timeout;
+
+	timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
+					      msecs_to_jiffies(3000));
+	if (!timeout)
+		pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
+			__func__);
+	else
+		pr_info("%s received completion with time left on timeout %u ms\n",
+			__func__, jiffies_to_msecs(timeout));
+
+	/**
+	 * The goal is wait for an event, and *then* once we have
+	 * reached it, the other side will try to do something which
+	 * it thinks will break. So we must give it some time to do
+	 * that. The amount of time is configurable.
+	 */
+	msleep(kernfs_config_fail.sleep_after_wait_ms);
+	pr_info("%s ended\n", __func__);
+}
All the uses of "__func__" here seem redundant; I would drop them.
Alright, and I also added the pr_fmt define which I forgot.
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c index 60e2a86c535e..4479c6580333 100644 --- a/fs/kernfs/file.c +++ b/fs/kernfs/file.c @@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter) const struct kernfs_ops *ops; char *buf;
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start))
+		kernfs_debug_wait();
So this could just be:
may_wait_kernfs(kernfs_fop_write_iter, at_start);
Yup! Thanks!
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h index f9cc912c31e1..9e3abf597e2d 100644 --- a/fs/kernfs/kernfs-internal.h +++ b/fs/kernfs/kernfs-internal.h +#define __kernfs_config_wait_var(func, when) \
+	(kernfs_config_fail. func ## _fail.wait_ ## when)
^^ ^ ^
nit: needless spaces
Trimmed.
Luis
This extends test_sysfs with support for using the failure injection wait completion and knobs to force a few race conditions, which demonstrate that kernfs active reference protection is sufficient for kobject / device protection at higher layers.
This adds 4 new tests which try to race removal of the device against a pending device attribute store operation in 4 different situations:
  1) at the start of kernfs_fop_write_iter()
  2) before the of->mutex is held in kernfs_fop_write_iter()
  3) after the of->mutex is held in kernfs_fop_write_iter()
  4) after the kernfs node active reference is taken
A write fails in all cases except the last one, test number 0032. There is a good explanation for this: *once* kernfs_get_active() succeeds we have a guarantee that the kernfs entry cannot be removed, and so anything trying to remove that entry will have to wait. It is perhaps not obvious, but a sysfs write eventually triggers a kernfs_get_active() call, and *only* if this succeeds will the sysfs op be called. This, together with the fact that you cannot remove the kernfs entry while the kernfs entry is active, implies that a module which created the respective sysfs / kernfs entry *cannot* possibly be removed during a sysfs operation. Test number 0032 provides us with proof of this: if it were not true, test 0032 should crash.
No NULL dereferences are reproduced, even though this has been observed in some complex testing cases [0]. If this issue really exists, we should now have enough tools in the test_sysfs toolbox to reproduce it easily without having to poke around other drivers. It is very likely that the issue reported in [0] was a side effect of the first bug, which was zram specific. This is why it is important to isolate the issue and try to reproduce it in a generic form using the test_sysfs driver.
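To make the relation between the four situations and the debugfs knobs a bit more concrete, here is a small shell sketch. Both helper names are hypothetical and not part of sysfs.sh, and the test-number to knob mapping is only inferred from the ALL_TESTS comments in the diff below (0029 through 0032). KERNFS_DEBUGFS_DIR can be overridden, in the same spirit as allow_user_defaults() in sysfs.sh.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the patch). The default path is the
# one sysfs.sh uses for KERNFS_DEBUGFS_DIR.
: "${KERNFS_DEBUGFS_DIR:=/sys/kernel/debug/kernfs}"

# Map a test number to the wait knob suffix it would exercise.
# Assumption: the mapping follows the ALL_TESTS comments.
wait_point_for_test()
{
	case "$1" in
	0029) echo at_start ;;
	0030) echo before_mutex ;;
	0031) echo after_mutex ;;
	0032) echo after_active ;;
	*) return 1 ;;
	esac
}

# Clear every wait_* knob, then enable only the one for this test.
arm_wait_point_for_test()
{
	point=$(wait_point_for_test "$1") || return 1
	knob_dir="$KERNFS_DEBUGFS_DIR/config_fail_kernfs_fop_write_iter"
	for knob in "$knob_dir"/wait_*; do
		[ -e "$knob" ] && echo 0 > "$knob"
	done
	echo 1 > "$knob_dir/wait_$point"
}
```

Only one wait point is armed at a time in this sketch, so a given run stalls kernfs_fop_write_iter() at exactly one of the four places listed above.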
[0] https://lkml.kernel.org/r/20210623215007.862787-1-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 lib/Kconfig.debug                      |   3 +
 lib/test_sysfs.c                       |  31 +++++
 tools/testing/selftests/sysfs/config   |   3 +
 tools/testing/selftests/sysfs/sysfs.sh | 175 +++++++++++++++++++++++++
 4 files changed, 212 insertions(+)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index a29b7d398c4e..176b822654e5 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2358,6 +2358,9 @@ config TEST_SYSFS depends on SYSFS depends on NET depends on BLOCK + select FAULT_INJECTION + select FAULT_INJECTION_DEBUG_FS + select FAIL_KERNFS_KNOBS help This builds the "test_sysfs" module. This driver enables to test the sysfs file system safely without affecting production knobs which diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c index 2043ca494af8..c6e62de61403 100644 --- a/lib/test_sysfs.c +++ b/lib/test_sysfs.c @@ -38,6 +38,11 @@ #include <linux/rtnetlink.h> #include <linux/genhd.h> #include <linux/blkdev.h> +#include <linux/kernfs.h> + +#ifdef CONFIG_FAIL_KERNFS_KNOBS +MODULE_IMPORT_NS(KERNFS_DEBUG_PRIVATE); +#endif
static bool enable_lock; module_param(enable_lock, bool_enable_only, 0644); @@ -82,6 +87,13 @@ static bool enable_verbose_rmmod; module_param(enable_verbose_rmmod, bool_enable_only, 0644); MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod");
+#ifdef CONFIG_FAIL_KERNFS_KNOBS +static bool enable_completion_on_rmmod; +module_param(enable_completion_on_rmmod, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_completion_on_rmmod, + "enable sending a kernfs completion on rmmod"); +#endif + static int sysfs_test_major;
/** @@ -285,6 +297,12 @@ static ssize_t config_show(struct device *dev, "enable_verbose_writes:\t%s\n", enable_verbose_writes ? "true" : "false");
+#ifdef CONFIG_FAIL_KERNFS_KNOBS + len += snprintf(buf+len, PAGE_SIZE - len, + "enable_completion_on_rmmod:\t%s\n", + enable_completion_on_rmmod ? "true" : "false"); +#endif + test_dev_config_unlock(test_dev);
return len; @@ -904,10 +922,23 @@ static int __init test_sysfs_init(void) } module_init(test_sysfs_init);
+#ifdef CONFIG_FAIL_KERNFS_KNOBS +/* The goal is to race our device removal with a pending kernfs -> store call */ +static void test_sysfs_kernfs_send_completion_rmmod(void) +{ + if (!enable_completion_on_rmmod) + return; + complete(&kernfs_debug_wait_completion); +} +#else +static inline void test_sysfs_kernfs_send_completion_rmmod(void) {} +#endif + static void __exit test_sysfs_exit(void) { if (enable_debugfs) debugfs_remove(debugfs_dir); + test_sysfs_kernfs_send_completion_rmmod(); if (delay_rmmod_ms) msleep(delay_rmmod_ms); unregister_test_dev_sysfs(first_test_dev); diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config index 9196f452ecd5..2876a229f95b 100644 --- a/tools/testing/selftests/sysfs/config +++ b/tools/testing/selftests/sysfs/config @@ -1,2 +1,5 @@ CONFIG_SYSFS=m CONFIG_TEST_SYSFS=m +CONFIG_FAULT_INJECTION=y +CONFIG_FAULT_INJECTION_DEBUG_FS=y +CONFIG_FAIL_KERNFS_KNOBS=y diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh index b3f4c2236c7f..f928635d0e35 100755 --- a/tools/testing/selftests/sysfs/sysfs.sh +++ b/tools/testing/selftests/sysfs/sysfs.sh @@ -62,6 +62,10 @@ ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block" ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block" ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock +ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store +ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex +ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex +ALL_TESTS="$ALL_TESTS 0032:1:1:test_dev_x:block" # kernfs race removal after active

 allow_user_defaults()
 {
@@ -92,6 +96,9 @@ allow_user_defaults()
 	if [ -z $SYSFS_DEBUGFS_DIR ]; then
 		SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs"
 	fi
+	if [ -z $KERNFS_DEBUGFS_DIR ]; then
+		KERNFS_DEBUGFS_DIR="/sys/kernel/debug/kernfs"
+	fi
 	if [ -z $PAGE_SIZE ]; then
 		PAGE_SIZE=$(getconf PAGESIZE)
 	fi
@@ -167,6 +174,14 @@ modprobe_reset_enable_rtnl_lock_on_rmmod()
 	unset FIRST_MODPROBE_ARGS
 }

+modprobe_reset_enable_completion()
+{
+	FIRST_MODPROBE_ARGS="enable_completion_on_rmmod=1 enable_verbose_writes=1"
+	FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_rmmod=1 delay_rmmod_ms=0"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
 load_req_mod()
 {
 	modprobe_reset
@@ -197,6 +212,63 @@ debugfs_reset_first_test_dev_ignore_errors()
 	echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev
 }

+debugfs_kernfs_kernfs_fop_write_iter_exists()
+{
+	KNOB_DIR="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter"
+	if [[ ! -d $KNOB_DIR ]]; then
+		echo "kernfs debugfs does not exist $KNOB_DIR"
+		return 0;
+	fi
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	if [[ ! -d $KNOB_DEBUGFS ]]; then
+		echo -n "kernfs debugfs for configuring fail_kernfs_fop_write_iter "
+		echo "does not exist $KNOB_DIR"
+		return 0;
+	fi
+	return 1
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_once()
+{
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	echo 1 > $KNOB_DEBUGFS/interval
+	echo 100 > $KNOB_DEBUGFS/probability
+	echo 0 > $KNOB_DEBUGFS/space
+	# Disable verbose messages on the kernel ring buffer which may
+	# confuse developers with a kernel panic.
+	echo 0 > $KNOB_DEBUGFS/verbose
+
+	# Fail only once
+	echo 1 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_never()
+{
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	echo 0 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_set_wait_ms()
+{
+	SLEEP_AFTER_WAIT_MS="${KERNFS_DEBUGFS_DIR}/sleep_after_wait_ms"
+	echo $1 > $SLEEP_AFTER_WAIT_MS
+}
+
+debugfs_kernfs_disable_wait_kernfs_fop_write_iter()
+{
+	ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_"
+	for KNOB in ${ENABLE_WAIT_KNOB}*; do
+		echo 0 > $KNOB
+	done
+}
+
+debugfs_kernfs_enable_wait_kernfs_fop_write_iter()
+{
+	ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_$1"
+	echo -n "1" > $ENABLE_WAIT_KNOB
+	return $?
+}
+
 set_orig()
 {
 	if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then
@@ -972,6 +1044,105 @@ sysfs_test_0028()
 	fi
 }

+sysfs_race_kernfs_kernfs_fop_write_iter()
+{
+	TARGET="${DIR}/$(get_test_target $1)"
+	WAIT_AT=$2
+	EXPECT_WRITE_RETURNS=$3
+	MSDELAY=$4
+
+	modprobe_reset_enable_completion
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	echo -n "Test racing removal of sysfs store op with kernfs $WAIT_AT ... "
+
+	if debugfs_kernfs_kernfs_fop_write_iter_exists; then
+		echo -n "skipping test as CONFIG_FAIL_KERNFS_KNOBS "
+		echo " or CONFIG_FAULT_INJECTION_DEBUG_FS is disabled"
+		return $ksft_skip
+	fi
+
+	# Allow for failing the kernfs_fop_write_iter() call once,
+	# we'll provide exact context shortly afterwards.
+	debugfs_kernfs_kernfs_fop_write_iter_set_fail_once
+
+	# First disable all waits
+	debugfs_kernfs_disable_wait_kernfs_fop_write_iter
+
+	# Enable a wait_for_completion(&kernfs_debug_wait_completion) at the
+	# specified location inside the kernfs_fop_write_iter() routine
+	debugfs_kernfs_enable_wait_kernfs_fop_write_iter $WAIT_AT
+
+	# Configure kernfs so that after its wait_for_completion() it
+	# will msleep() this amount of time and schedule(). We figure this
+	# will be sufficient time to allow for our module removal to complete.
+	debugfs_kernfs_set_wait_ms $MSDELAY
+
+	# Now we trigger a kernfs write op, which will run kernfs_fop_write_iter,
+	# but will wait until our driver sends a respective completion
+	set_test_ignore_errors &
+	write_pid=$!
+
+	# At this point kernfs_fop_write_iter() hasn't run our op, it's
+	# waiting for our completion at the specified point $WAIT_AT.
+	# We now remove our module which will send a
+	# complete(&kernfs_debug_wait_completion) right before we deregister
+	# our device and the sysfs device attributes are removed.
+	#
+	# After the completion is sent, the test_sysfs driver races with
+	# kernfs to do the device deregistration with the kernfs msleep
+	# and schedule(). This should mean we've forced trying to remove the
+	# module prior to allowing kernfs to run our store operation. If the
+	# race did happen we'll panic with a null dereference on the store op.
+	#
+	# If no race happens we should see no write operation triggered.
+	modprobe -r $TEST_DRIVER > /dev/null 2>&1
+
+	debugfs_kernfs_kernfs_fop_write_iter_set_fail_never
+
+	wait $write_pid
+	if [[ $? -eq $EXPECT_WRITE_RETURNS ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0029()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0029 at_start 1 $delay
+	done
+}
+
+sysfs_test_0030()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0030 before_mutex 1 $delay
+	done
+}
+
+sysfs_test_0031()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0031 after_mutex 1 $delay
+	done
+}
+
+# A write only succeeds *iff* a module removal happens *after* the
+# kernfs active reference is obtained with kernfs_get_active().
+sysfs_test_0032()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0032 after_active 0 $delay
+	done
+}
+
 test_gen_desc()
 {
 	echo -n "$1 x $(get_test_count $1)"
@@ -1013,6 +1184,10 @@ list_tests()
 	echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
 	echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
 	echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
+	echo "$(test_gen_desc 0029) - racing removal of store op with kernfs at start"
+	echo "$(test_gen_desc 0030) - racing removal of store op with kernfs before mutex"
+	echo "$(test_gen_desc 0031) - racing removal of store op with kernfs after mutex"
+	echo "$(test_gen_desc 0032) - racing removal of store op with kernfs after active"
 }

usage()
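For anyone following the selftest logic, the core of sysfs_race_kernfs_kernfs_fop_write_iter() is a write started in the background whose exit status is collected later with wait and compared against the expected value. A minimal sketch of just that pattern follows. It is only an illustration: expect_bg_status is a hypothetical helper that is not part of sysfs.sh, and plain commands such as false stand in for the racing sysfs write.

```sh
# Hypothetical helper, not part of sysfs.sh: run a command in the
# background, collect its exit status and compare it with the expected one.
expect_bg_status()
{
	expected=$1
	shift

	"$@" &
	bg_pid=$!

	# The real test removes the module at this point, while the
	# background write is still pending inside kernfs.

	status=0
	wait "$bg_pid" || status=$?

	if [ "$status" -eq "$expected" ]; then
		echo "ok"
		return 0
	fi
	echo "FAIL" >&2
	return 1
}

# A failing background job where a failure is expected reports "ok".
expect_bg_status 1 false
```

Tests 0029 to 0031 correspond to an expected status of 1 (the write must fail), test 0032 to an expected status of 0 (the write must succeed).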
On Mon, Sep 27, 2021 at 09:37:58AM -0700, Luis Chamberlain wrote:
This extends test_sysfs with support for using the failure injection wait completion and knobs to force a few race conditions which demonstrate that kernfs active reference protection is sufficient for kobject / device protection at higher layers.
This adds 4 new tests which try to remove the device attribute store operation in 4 different situations:
- at the start of kernfs_fop_write_iter()
- before the of->mutex is held in kernfs_fop_write_iter()
- after the of->mutex is held in kernfs_fop_write_iter()
- after the kernfs node active reference is taken
A write fails in all cases except the last one, test #32. There is a good explanation for this: *once* kernfs_get_active() succeeds we have a guarantee that the kernfs entry cannot be removed, and so anything trying to remove that entry will have to wait. It is perhaps not obvious, but a sysfs write eventually triggers a kernfs_get_active() call, and *only* if this succeeds will the sysfs op be called. This, and the fact that you cannot remove the kernfs entry while the kernfs entry is active, implies that a module that created the respective sysfs / kernfs entry *cannot* possibly be removed during a sysfs operation. Test #32 provides us with proof of this. If it were not true, test #32 should crash.
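The rule described above can be illustrated with a small toy model. This is not kernel code: the function names only mirror kernfs_get_active() and kernfs_put_active(), the start_removal and finish_removal helpers are made up for the illustration, and where the kernel would sleep waiting for active references to be dropped the model merely reports that removal cannot proceed yet.

```sh
# Toy model (not kernel code) of the kernfs active reference rules.
active=0	# number of active references currently held
deactivated=0	# set once removal of the entry has started
removed=0	# set once removal of the entry has completed

# Models kernfs_get_active(): fails once removal has started.
get_active()
{
	[ "$deactivated" -eq 0 ] || return 1
	active=$((active + 1))
}

# Models kernfs_put_active().
put_active()
{
	active=$((active - 1))
}

# Hypothetical helper: removal starts, no new active references allowed.
start_removal()
{
	deactivated=1
}

# Hypothetical helper: removal completes only once no active reference
# is held. The kernel waits here, the model just returns failure.
finish_removal()
{
	[ "$deactivated" -eq 1 ] && [ "$active" -eq 0 ] || return 1
	removed=1
}

# The situation of test #32: the active reference is taken first.
get_active && echo "store op may run"
start_removal
finish_removal || echo "removal must wait for the store op"
put_active
finish_removal && echo "removal completed"
get_active || echo "too late, the entry is gone"
```

In tests 0029 to 0031 removal starts before the active reference is taken, which in this model corresponds to calling start_removal before get_active, so the write fails.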
No null dereferences are reproduced, even though this has been observed in some complex testing cases [0]. If this issue really exists we should now have enough tools in the test_sysfs toolbox to try to reproduce it easily without having to poke around other drivers. It is very likely that the issue reported in [0] was a side issue following the first bug, which was zram specific. This is why it is important to isolate the issue and try to reproduce it in a generic form using the test_sysfs driver.
[0] https://lkml.kernel.org/r/20210623215007.862787-1-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain mcgrof@kernel.org
 lib/Kconfig.debug                      |   3 +
 lib/test_sysfs.c                       |  31 +++++
 tools/testing/selftests/sysfs/config   |   3 +
 tools/testing/selftests/sysfs/sysfs.sh | 175 +++++++++++++++++++++++++
 4 files changed, 212 insertions(+)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a29b7d398c4e..176b822654e5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2358,6 +2358,9 @@ config TEST_SYSFS
 	depends on SYSFS
 	depends on NET
 	depends on BLOCK
+	select FAULT_INJECTION
+	select FAULT_INJECTION_DEBUG_FS
+	select FAIL_KERNFS_KNOBS
I don't like seeing "select" for user-configurable CONFIGs -- things tend to end up weird. This should simply be:
depends on FAIL_KERNFS_KNOBS
 	help
 	  This builds the "test_sysfs" module. This driver enables to test the
 	  sysfs file system safely without affecting production knobs which
diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
index 2043ca494af8..c6e62de61403 100644
--- a/lib/test_sysfs.c
+++ b/lib/test_sysfs.c
@@ -38,6 +38,11 @@
 #include <linux/rtnetlink.h>
 #include <linux/genhd.h>
 #include <linux/blkdev.h>
+#include <linux/kernfs.h>
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
This isn't an optional config here (and following)?
On Tue, Oct 05, 2021 at 12:51:33PM -0700, Kees Cook wrote:
On Mon, Sep 27, 2021 at 09:37:58AM -0700, Luis Chamberlain wrote:
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a29b7d398c4e..176b822654e5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2358,6 +2358,9 @@ config TEST_SYSFS
 	depends on SYSFS
 	depends on NET
 	depends on BLOCK
+	select FAULT_INJECTION
+	select FAULT_INJECTION_DEBUG_FS
+	select FAIL_KERNFS_KNOBS
I don't like seeing "select" for user-configurable CONFIGs -- things tend to end up weird. This should simply be:
depends on FAIL_KERNFS_KNOBS
Sure.
diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
index 2043ca494af8..c6e62de61403 100644
--- a/lib/test_sysfs.c
+++ b/lib/test_sysfs.c
@@ -38,6 +38,11 @@
 #include <linux/rtnetlink.h>
 #include <linux/genhd.h>
 #include <linux/blkdev.h>
+#include <linux/kernfs.h>
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
This isn't an optional config here (and following)?
Sure, with the above change this is no longer needed. Removed all that ifdef'ery.
Luis
There is quite a bit of tribal knowledge around proper use of try_module_get() and that it must be used only in a context which can ensure the module won't be gone during the operation. Document this little bit of tribal knowledge.
I'm extending this tribal knowledge with new developments which it seems some folks do not yet believe to be true: we can be sure a module will exist during the lifetime of a sysfs file operation. For proof, refer to test_sysfs test #32:
./tools/testing/selftests/sysfs/sysfs.sh -t 0032
If this were not true, the write in this test would fail or, worse, a crash would happen. Neither does.
Signed-off-by: Luis Chamberlain mcgrof@kernel.org
---
 include/linux/module.h | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index c9f1200b2312..22eacd5e1e85 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
    to handle the error case (which only happens with rmmod --wait). */
 extern void __module_get(struct module *module);

-/* This is the Right Way to get a module: if it fails, it's being removed,
- * so pretend it's not there. */
+/**
+ * try_module_get() - yields to module removal and bumps refcnt otherwise
+ * @module: the module we should check for
+ *
+ * This can be used to try to bump the reference count of a module, so to
+ * prevent module removal. The reference count of a module is not allowed
+ * to be incremented if the module is already being removed.
+ *
+ * Care must be taken to ensure the module cannot be removed during the call to
+ * try_module_get(). This can be done by having another entity other than the
+ * module itself increment the module reference count, or through some other
+ * means which guarantees the module could not be removed during an operation.
+ * An example of this latter case is using try_module_get() in a sysfs file
+ * which the module created. The sysfs store / read file operations are
+ * guaranteed to exist through the use of kernfs's active reference (see
+ * kernfs_active()). If a sysfs file operation is being run, the module which
+ * created it must still exist as the module is in charge of removing the same
+ * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
+ * unless the same file is not active.
+ *
+ * One of the real values to try_module_get() is the module_is_live() check
+ * which ensures the caller of try_module_get() can yield to userspace
+ * module removal requests and fail whatever it was about to process.
+ */
 extern bool try_module_get(struct module *module);

+/**
+ * module_put() - release a reference count to a module
+ * @module: the module we should release a reference count for
+ *
+ * If you successfully bump a reference count to a module with try_module_get(),
+ * when you are finished you must call module_put() to release that reference
+ * count.
+ */
 extern void module_put(struct module *module);

 #else /*!CONFIG_MODULE_UNLOAD*/
On Mon, Sep 27, 2021 at 09:37:59AM -0700, Luis Chamberlain wrote:
There is quite a bit of tribal knowledge around proper use of try_module_get() and that it must be used only in a context which can ensure the module won't be gone during the operation. Document this little bit of tribal knowledge.
I'm extending this tribal knowledge with new developments which it seems some folks do not yet believe to be true: we can be sure a module will exist during the lifetime of a sysfs file operation. For proof, refer to test_sysfs test #32:
./tools/testing/selftests/sysfs/sysfs.sh -t 0032
Without this being true, the write would fail or worse, a crash would happen, in this test. It does not.
Signed-off-by: Luis Chamberlain mcgrof@kernel.org
 include/linux/module.h | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index c9f1200b2312..22eacd5e1e85 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
    to handle the error case (which only happens with rmmod --wait). */
 extern void __module_get(struct module *module);

-/* This is the Right Way to get a module: if it fails, it's being removed,
- * so pretend it's not there. */
+/**
+ * try_module_get() - yields to module removal and bumps refcnt otherwise
I find this hard to parse. How about: "Take module refcount unless module is being removed"
+ * @module: the module we should check for
+ *
+ * This can be used to try to bump the reference count of a module, so to
+ * prevent module removal. The reference count of a module is not allowed
+ * to be incremented if the module is already being removed.
This I understand.
+ * Care must be taken to ensure the module cannot be removed during the call to
+ * try_module_get(). This can be done by having another entity other than the
+ * module itself increment the module reference count, or through some other
+ * means which guarantees the module could not be removed during an operation.
+ * An example of this latter case is using try_module_get() in a sysfs file
+ * which the module created. The sysfs store / read file operations are
+ * guaranteed to exist through the use of kernfs's active reference (see
+ * kernfs_active()). If a sysfs file operation is being run, the module which
+ * created it must still exist as the module is in charge of removing the same
+ * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
+ * unless the same file is not active.
I can't understand this paragraph at all. "Care must be taken ..."? Why? Shouldn't callers of try_module_get() be satisfied with the results? I don't follow the example at all. It seems to just say "sysfs store/read functions don't need try_module_get() because whatever opened the sysfs file is already keeping the module referenced." ?
+ * One of the real values to try_module_get() is the module_is_live() check
+ * which ensures the caller of try_module_get() can yield to userspace
+ * module removal requests and fail whatever it was about to process.
Please document the return value explicitly.
+ */
 extern bool try_module_get(struct module *module);

+/**
+ * module_put() - release a reference count to a module
+ * @module: the module we should release a reference count for
+ *
+ * If you successfully bump a reference count to a module with try_module_get(),
+ * when you are finished you must call module_put() to release that reference
+ * count.
+ */
 extern void module_put(struct module *module);

 #else /*!CONFIG_MODULE_UNLOAD*/
-- 
2.30.2
On Tue, Oct 05, 2021 at 12:58:47PM -0700, Kees Cook wrote:
On Mon, Sep 27, 2021 at 09:37:59AM -0700, Luis Chamberlain wrote:
diff --git a/include/linux/module.h b/include/linux/module.h
index c9f1200b2312..22eacd5e1e85 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
    to handle the error case (which only happens with rmmod --wait). */
 extern void __module_get(struct module *module);

-/* This is the Right Way to get a module: if it fails, it's being removed,
- * so pretend it's not there. */
+/**
+ * try_module_get() - yields to module removal and bumps refcnt otherwise
I find this hard to parse. How about: "Take module refcount unless module is being removed"
Sure.
+ * @module: the module we should check for
+ *
+ * This can be used to try to bump the reference count of a module, so to
+ * prevent module removal. The reference count of a module is not allowed
+ * to be incremented if the module is already being removed.
This I understand.
+ * Care must be taken to ensure the module cannot be removed during the call to
+ * try_module_get(). This can be done by having another entity other than the
+ * module itself increment the module reference count, or through some other
+ * means which guarantees the module could not be removed during an operation.
+ * An example of this latter case is using try_module_get() in a sysfs file
+ * which the module created. The sysfs store / read file operations are
+ * guaranteed to exist through the use of kernfs's active reference (see
+ * kernfs_active()). If a sysfs file operation is being run, the module which
+ * created it must still exist as the module is in charge of removing the same
+ * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
+ * unless the same file is not active.
I can't understand this paragraph at all. "Care must be taken ..."? Why?
Because the routine try_module_get() assumes the struct module pointer is valid for the entire call. That can only be true if at least one reference is held prior to this call.
Shouldn't callers of try_module_get() be satisfied with the results?
Yes but only with the above care addressed.
I don't follow the example at all. It seems to just say "sysfs store/read functions don't need try_module_get() because whatever opened the sysfs file is already keeping the module referenced." ?
That is exactly what I intended to clarify with that example: yes, a reference is held, but this is done implicitly. *If* a kernfs op is active, module removal waits for that active reference to be dropped. So while a kernfs file is being used it is simply not possible for the module to disappear underneath us. The reason is that the module that created the sysfs file must obviously also destroy that same sysfs file. Since kernfs ensures that a sysfs file cannot be removed while it is being used, an active sysfs operation implicitly holds a module reference.
Let me know if you can think of a better way to phrase this.
+ * One of the real values to try_module_get() is the module_is_live() check
+ * which ensures the caller of try_module_get() can yield to userspace
+ * module removal requests and fail whatever it was about to process.
Please document the return value explicitly.
Sure thing.
Luis
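As a rough aid for the return value discussion above, here is a toy model of the reference rules in shell. It is not kernel code: the names only mirror try_module_get() and module_put(), request_removal is a made-up stand-in for an unforced removal request, and the model says nothing about the pointer validity concern, which is exactly the part that needs the extra care described in the thread.

```sh
# Toy model (not kernel code) of the module reference rules.
mod_state=live	# "live" or "going"
mod_refcnt=0	# references taken through try_module_get()

# Models try_module_get(): succeeds and bumps the reference count only
# if the module is not already being removed.
try_module_get()
{
	[ "$mod_state" = live ] || return 1
	mod_refcnt=$((mod_refcnt + 1))
}

# Models module_put(): drops a reference taken earlier.
module_put()
{
	mod_refcnt=$((mod_refcnt - 1))
}

# Hypothetical stand-in for an unforced removal request: it is refused
# while references are held.
request_removal()
{
	[ "$mod_refcnt" -eq 0 ] || return 1
	mod_state=going
}

try_module_get && echo "got a reference"
request_removal || echo "removal refused, module is in use"
module_put
request_removal && echo "module is going away"
try_module_get || echo "no reference, module is being removed"
```

The last line is the case the proposed summary "Take module refcount unless module is being removed" describes.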
If one ends up extending this line checkpatch will complain about the use of S_IRWXUGO suggesting it is not preferred and that 0777 should be used instead. Take the tip from checkpatch and do that change before we do our subsequent changes.
This makes no functional changes.
Signed-off-by: Luis Chamberlain mcgrof@kernel.org
---
 fs/kernfs/symlink.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index c8f8e41b8411..19a6c71c6ff5 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,8 +36,7 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}

-	kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid,
-			     KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
On Mon, Sep 27, 2021 at 09:38:00AM -0700, Luis Chamberlain wrote:
If one ends up extending this line checkpatch will complain about the use of S_IRWXUGO suggesting it is not preferred and that 0777 should be used instead. Take the tip from checkpatch and do that change before we do our subsequent changes.
This makes no functional changes.
Signed-off-by: Luis Chamberlain mcgrof@kernel.org
Reviewed-by: Kees Cook keescook@chromium.org
 fs/kernfs/symlink.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index c8f8e41b8411..19a6c71c6ff5 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,8 +36,7 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}

-	kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid,
-			     KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
-- 
2.30.2
If one ends up expanding on this line checkpatch will complain that the combination S_IRWXU|S_IRUGO|S_IXUGO should just be replaced with the octal 0755. Do that.
This makes no functional changes.
Signed-off-by: Luis Chamberlain mcgrof@kernel.org
---
 fs/sysfs/dir.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 59dffd5ca517..b6b6796e1616 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -56,8 +56,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)

 	kobject_get_ownership(kobj, &uid, &gid);

-	kn = kernfs_create_dir_ns(parent, kobject_name(kobj),
-				  S_IRWXU | S_IRUGO | S_IXUGO, uid, gid,
+	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
 				  kobj, ns);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
On Mon, Sep 27, 2021 at 09:38:01AM -0700, Luis Chamberlain wrote:
> If one ends up expanding on this line checkpatch will complain that the combination S_IRWXU|S_IRUGO|S_IXUGO should just be replaced with the octal 0755. Do that.
>
> This makes no functional changes.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

It could be helpful to add a link too: https://www.kernel.org/doc/html/latest/dev-tools/checkpatch.html?highlight=n...

Reviewed-by: Kees Cook <keescook@chromium.org>
When a driver's sysfs attributes use a lock that is also used on module removal, we can race into a deadlock. This happens when, for instance, a sysfs file of a driver is in use and module removal is triggered at the same time. The module removal code holds a lock, and the driver's sysfs file operation then waits for that same lock. While holding the lock, module removal tries to remove the sysfs entries, but these cannot be removed yet because one of them is still in use by the operation waiting for the lock. That operation won't complete, as the lock is already held. Likewise module removal cannot complete, and so we deadlock.
This can now be easily reproduced with our sysfs selftest as follows:
./tools/testing/selftests/sysfs/sysfs.sh -t 0027
This uses a local driver lock. Test 0028 can also be used; it uses rtnl_lock():
./tools/testing/selftests/sysfs/sysfs.sh -t 0028
To fix this we extend struct kernfs_node with a module reference and use try_module_get() after kernfs_get_active() is called. As documented in the prior patch, we now know that once kernfs_get_active() is called the module is implicitly guaranteed to exist and cannot be removed. This is because the module is the one in charge of removing the same sysfs file it created, and removal of sysfs files on module exit will wait until they don't have any active references. By using try_module_get() after kernfs_get_active() we yield to let module removal trump calls to process a sysfs operation, while also preventing module removal if a sysfs operation is already in progress. This prevents the deadlock.
This deadlock was first reported with the zram driver, however the live patching folks have acknowledged they have observed this as well with live patching, when a live patch is removed. I was then able to reproduce easily by creating a dedicated selftest for it.
A sketch of how this can happen follows. Consider foo, a local mutex that is part of a driver and is used both in the driver's module exit routine and in one of its sysfs ops:

foo.c:
static DEFINE_MUTEX(foo);
static ssize_t foo_store(struct device *dev,
			 struct device_attribute *attr,
			 const char *buf, size_t count)
{
	...
	mutex_lock(&foo);
	...
	mutex_unlock(&foo);
	...
}
static DEVICE_ATTR_RW(foo);
...
void foo_exit(void)
{
	mutex_lock(&foo);
	...
	mutex_unlock(&foo);
}
module_exit(foo_exit);

And this can lead to this condition:

CPU A                              CPU B
                                   foo_store()
foo_exit()
  mutex_lock(&foo)
                                   mutex_lock(&foo)
   del_gendisk(some_struct->disk);
     device_del()
       device_remove_groups()

In this situation foo_store() is waiting for the mutex foo to become unlocked, but that won't happen until module removal is complete. Module removal, in turn, won't complete until the sysfs operation being poked at completes, and that operation is waiting for a lock which is already held.
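The ordering that the fix relies on can be modelled in a few lines of userspace C. This is a single-threaded toy under stated assumptions, not the kernel implementation: struct toy_module, struct toy_node and all toy_* helpers are hypothetical stand-ins, and toy_start_removal() only approximates the module reference count check done before a module's exit routine runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct module and struct kernfs_node. */
struct toy_module {
	int refcnt;	/* references held by in-flight sysfs ops */
	bool going;	/* set once removal has started */
};

struct toy_node {
	int active;			/* in-flight operations on this file */
	struct toy_module *owner;	/* NULL models built-in code */
};

/* Loosely mirrors try_module_get(): refuse once removal has begun. */
static bool toy_try_module_get(struct toy_module *mod)
{
	if (!mod)
		return true;
	if (mod->going)
		return false;
	mod->refcnt++;
	return true;
}

static void toy_module_put(struct toy_module *mod)
{
	if (!mod)
		return;
	assert(mod->refcnt > 0);
	mod->refcnt--;
}

/* Take the active reference first, then the module reference. */
static bool toy_get_active(struct toy_node *kn)
{
	kn->active++;
	if (!toy_try_module_get(kn->owner)) {
		kn->active--;	/* back out so removal can drain */
		return false;
	}
	return true;
}

static void toy_put_active(struct toy_node *kn)
{
	kn->active--;
	toy_module_put(kn->owner);
}

/* Removal may only start while no sysfs op holds a module reference. */
static bool toy_start_removal(struct toy_module *mod)
{
	if (mod->refcnt > 0)
		return false;
	mod->going = true;
	return true;
}

/* Returns 0 if both orderings behave as described, nonzero otherwise. */
int run_demo(void)
{
	struct toy_module mod = { 0, false };
	struct toy_node kn = { 0, &mod };

	/* Case 1: a sysfs op is in flight, so removal is refused. */
	if (!toy_get_active(&kn))
		return 1;
	if (toy_start_removal(&mod))
		return 2;
	toy_put_active(&kn);

	/* Case 2: removal started first, so the sysfs op backs out. */
	if (!toy_start_removal(&mod))
		return 3;
	if (toy_get_active(&kn))
		return 4;
	if (kn.active != 0)
		return 5;
	return 0;
}
```

In both orderings only one side ever gets as far as taking the shared lock, which is the property the commit log describes: either the sysfs operation holds a module reference and removal does not start, or removal has started and the sysfs operation drops its active reference and returns instead of blocking.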
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
 fs/kernfs/dir.c                        | 44 ++++++++++++++++++----
 fs/kernfs/file.c                       |  6 ++-
 fs/kernfs/kernfs-internal.h            |  3 +-
 fs/kernfs/symlink.c                    |  3 +-
 fs/sysfs/dir.c                         |  2 +-
 fs/sysfs/file.c                        |  6 ++-
 fs/sysfs/group.c                       |  3 +-
 include/linux/kernfs.h                 | 14 ++++---
 include/linux/sysfs.h                  | 52 ++++++++++++++++++++------
 kernel/cgroup/cgroup.c                 |  2 +-
 11 files changed, 105 insertions(+), 34 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index b57b3db9a6a7..4edf3b37fd2c 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, - 0, rft->kf_ops, rft, NULL, NULL); + 0, rft->kf_ops, rft, NULL, NULL, NULL); if (IS_ERR(kn)) return PTR_ERR(kn);
@@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
kn = __kernfs_create_file(parent_kn, name, 0444, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, - &kf_mondata_ops, priv, NULL, NULL); + &kf_mondata_ops, priv, NULL, NULL, NULL); if (IS_ERR(kn)) return PTR_ERR(kn);
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index ba581429bf7b..e841201fd11b 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -14,6 +14,7 @@ #include <linux/slab.h> #include <linux/security.h> #include <linux/hash.h> +#include <linux/module.h>
#include "kernfs-internal.h"
@@ -414,12 +415,29 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
  */
 struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
 {
+	int v;
+
 	if (unlikely(!kn))
 		return NULL;
 
 	if (!atomic_inc_unless_negative(&kn->active))
 		return NULL;
 
+	/*
+	 * If a module created the kernfs_node, the module cannot possibly be
+	 * removed if the above atomic_inc_unless_negative() succeeded. So the
+	 * try_module_get() below is not to protect the lifetime of the module
+	 * as that is already guaranteed. The try_module_get() below is used
+	 * to ensure that we don't deadlock in case a kernfs operation and
+	 * module removal used a shared lock.
+	 */
+	if (!try_module_get(kn->owner)) {
+		v = atomic_dec_return(&kn->active);
+		if (unlikely(v == KN_DEACTIVATED_BIAS))
+			wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+		return NULL;
+	}
+
 	if (kernfs_lockdep(kn))
 		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
 	return kn;
@@ -442,6 +460,13 @@ void kernfs_put_active(struct kernfs_node *kn)
 	if (kernfs_lockdep(kn))
 		rwsem_release(&kn->dep_map, _RET_IP_);
 	v = atomic_dec_return(&kn->active);
+
+	/*
+	 * We prevent module exit *until* we know for sure all possible
+	 * kernfs ops are done.
+	 */
+	module_put(kn->owner);
+
 	if (likely(v != KN_DEACTIVATED_BIAS))
 		return;
 
@@ -572,7 +597,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, struct kernfs_node *parent, const char *name, umode_t mode, kuid_t uid, kgid_t gid, - unsigned flags) + unsigned flags, + struct module *owner) { struct kernfs_node *kn; u32 id_highbits; @@ -607,6 +633,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, kn->name = name; kn->mode = mode; kn->flags = flags; + kn->owner = owner;
if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) { struct iattr iattr = { @@ -640,12 +667,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, struct kernfs_node *kernfs_new_node(struct kernfs_node *parent, const char *name, umode_t mode, kuid_t uid, kgid_t gid, - unsigned flags) + unsigned flags, + struct module *owner) { struct kernfs_node *kn;
kn = __kernfs_new_node(kernfs_root(parent), parent, - name, mode, uid, gid, flags); + name, mode, uid, gid, flags, owner); if (kn) { kernfs_get(parent); kn->parent = parent; @@ -927,7 +955,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, - KERNFS_DIR); + KERNFS_DIR, NULL); if (!kn) { idr_destroy(&root->ino_idr); kfree(root); @@ -969,20 +997,22 @@ void kernfs_destroy_root(struct kernfs_root *root) * @gid: gid of the new directory * @priv: opaque data associated with the new directory * @ns: optional namespace tag of the directory + * @owner: if set, the module owner of this directory * * Returns the created node on success, ERR_PTR() value on failure. */ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, umode_t mode, kuid_t uid, kgid_t gid, - void *priv, const void *ns) + void *priv, const void *ns, + struct module *owner) { struct kernfs_node *kn; int rc;
/* allocate */ kn = kernfs_new_node(parent, name, mode | S_IFDIR, - uid, gid, KERNFS_DIR); + uid, gid, KERNFS_DIR, owner); if (!kn) return ERR_PTR(-ENOMEM);
@@ -1014,7 +1044,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
/* allocate */ kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR, - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR); + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR, NULL); if (!kn) return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c index 4479c6580333..0e125287e050 100644 --- a/fs/kernfs/file.c +++ b/fs/kernfs/file.c @@ -978,6 +978,7 @@ const struct file_operations kernfs_file_fops = { * @priv: private data for the file * @ns: optional namespace tag of the file * @key: lockdep key for the file's active_ref, %NULL to disable lockdep + * @owner: if set, the module owner of the file * * Returns the created node on success, ERR_PTR() value on error. */ @@ -987,7 +988,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, loff_t size, const struct kernfs_ops *ops, void *priv, const void *ns, - struct lock_class_key *key) + struct lock_class_key *key, + struct module *owner) { struct kernfs_node *kn; unsigned flags; @@ -996,7 +998,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, flags = KERNFS_FILE;
kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG, - uid, gid, flags); + uid, gid, flags, owner); if (!kn) return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h index 9e3abf597e2d..6d275d661987 100644 --- a/fs/kernfs/kernfs-internal.h +++ b/fs/kernfs/kernfs-internal.h @@ -134,7 +134,8 @@ int kernfs_add_one(struct kernfs_node *kn); struct kernfs_node *kernfs_new_node(struct kernfs_node *parent, const char *name, umode_t mode, kuid_t uid, kgid_t gid, - unsigned flags); + unsigned flags, + struct module *owner);
/* * file.c diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c index 19a6c71c6ff5..5a053eebee52 100644 --- a/fs/kernfs/symlink.c +++ b/fs/kernfs/symlink.c @@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent, gid = target->iattr->ia_gid; }
- kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK); + kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK, + target->owner); if (!kn) return ERR_PTR(-ENOMEM);
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index b6b6796e1616..9763c2fde3c7 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -57,7 +57,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns) kobject_get_ownership(kobj, &uid, &gid);
kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid, - kobj, ns); + kobj, ns, NULL); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(parent, kobject_name(kobj)); diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c index 42dcf96881b6..af9e91fd1a92 100644 --- a/fs/sysfs/file.c +++ b/fs/sysfs/file.c @@ -292,7 +292,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent, #endif
kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid, - PAGE_SIZE, ops, (void *)attr, ns, key); + PAGE_SIZE, ops, (void *)attr, ns, key, + attr->owner); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(parent, attr->name); @@ -327,7 +328,8 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent, #endif
kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid, - battr->size, ops, (void *)attr, ns, key); + battr->size, ops, (void *)attr, ns, key, + attr->owner); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(parent, attr->name); diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c index eeb0e3099421..372864d1cb54 100644 --- a/fs/sysfs/group.c +++ b/fs/sysfs/group.c @@ -135,7 +135,8 @@ static int internal_create_group(struct kobject *kobj, int update, } else { kn = kernfs_create_dir_ns(kobj->sd, grp->name, S_IRWXU | S_IRUGO | S_IXUGO, - uid, gid, kobj, NULL); + uid, gid, kobj, NULL, + grp->owner); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(kobj->sd, grp->name); diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index cd968ee2b503..818b00ebd107 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -161,6 +161,7 @@ struct kernfs_node { unsigned short flags; umode_t mode; struct kernfs_iattrs *iattr; + struct module *owner; };
/* @@ -370,7 +371,8 @@ void kernfs_destroy_root(struct kernfs_root *root); struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, umode_t mode, kuid_t uid, kgid_t gid, - void *priv, const void *ns); + void *priv, const void *ns, + struct module *owner); struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent, const char *name); struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, @@ -379,7 +381,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, loff_t size, const struct kernfs_ops *ops, void *priv, const void *ns, - struct lock_class_key *key); + struct lock_class_key *key, + struct module *owner); struct kernfs_node *kernfs_create_link(struct kernfs_node *parent, const char *name, struct kernfs_node *target); @@ -472,14 +475,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { } static inline struct kernfs_node * kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, umode_t mode, kuid_t uid, kgid_t gid, - void *priv, const void *ns) + void *priv, const void *ns, struct module *owner) { return ERR_PTR(-ENOSYS); }
static inline struct kernfs_node * __kernfs_create_file(struct kernfs_node *parent, const char *name, umode_t mode, kuid_t uid, kgid_t gid, loff_t size, const struct kernfs_ops *ops, - void *priv, const void *ns, struct lock_class_key *key) + void *priv, const void *ns, struct lock_class_key *key, + struct module *owner) { return ERR_PTR(-ENOSYS); }
static inline struct kernfs_node * @@ -566,7 +570,7 @@ kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode, { return kernfs_create_dir_ns(parent, name, mode, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, - priv, NULL); + priv, NULL, parent->owner); }
static inline int kernfs_remove_by_name(struct kernfs_node *parent, diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h index e3f1e8ac1f85..babbabb460dc 100644 --- a/include/linux/sysfs.h +++ b/include/linux/sysfs.h @@ -30,6 +30,7 @@ enum kobj_ns_type; struct attribute { const char *name; umode_t mode; + struct module *owner; #ifdef CONFIG_DEBUG_LOCK_ALLOC bool ignore_lockdep:1; struct lock_class_key *key; @@ -80,6 +81,7 @@ do { \ * @attrs: Pointer to NULL terminated list of attributes. * @bin_attrs: Pointer to NULL terminated list of binary attributes. * Either attrs or bin_attrs or both must be provided. + * @module: If set, module responsible for this attribute group */ struct attribute_group { const char *name; @@ -89,6 +91,7 @@ struct attribute_group { struct bin_attribute *, int); struct attribute **attrs; struct bin_attribute **bin_attrs; + struct module *owner; };
/* @@ -100,38 +103,52 @@ struct attribute_group {
#define __ATTR(_name, _mode, _show, _store) { \ .attr = {.name = __stringify(_name), \ - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \ + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \ + .owner = THIS_MODULE, \ + }, \ .show = _show, \ .store = _store, \ }
#define __ATTR_PREALLOC(_name, _mode, _show, _store) { \ .attr = {.name = __stringify(_name), \ - .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\ + .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\ + .owner = THIS_MODULE, \ + }, \ .show = _show, \ .store = _store, \ }
#define __ATTR_RO(_name) { \ - .attr = { .name = __stringify(_name), .mode = 0444 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0444, \ + .owner = THIS_MODULE, \ + }, \ .show = _name##_show, \ }
#define __ATTR_RO_MODE(_name, _mode) { \ .attr = { .name = __stringify(_name), \ - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \ + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \ + .owner = THIS_MODULE, \ + }, \ .show = _name##_show, \ }
#define __ATTR_RW_MODE(_name, _mode) { \ .attr = { .name = __stringify(_name), \ - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \ + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \ + .owner = THIS_MODULE, \ + }, \ .show = _name##_show, \ .store = _name##_store, \ }
#define __ATTR_WO(_name) { \ - .attr = { .name = __stringify(_name), .mode = 0200 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0200, \ + .owner = THIS_MODULE, \ + }, \ .store = _name##_store, \ }
@@ -141,8 +158,11 @@ struct attribute_group {
#ifdef CONFIG_DEBUG_LOCK_ALLOC #define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) { \ - .attr = {.name = __stringify(_name), .mode = _mode, \ - .ignore_lockdep = true }, \ + .attr = {.name = __stringify(_name), \ + .mode = _mode, \ + .ignore_lockdep = true, \ + .owner = THIS_MODULE, \ + }, \ .show = _show, \ .store = _store, \ } @@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = { \ #define ATTRIBUTE_GROUPS(_name) \ static const struct attribute_group _name##_group = { \ .attrs = _name##_attrs, \ + .owner = THIS_MODULE, \ }; \ __ATTRIBUTE_GROUPS(_name)
@@ -199,20 +220,29 @@ struct bin_attribute {
/* macros to create static binary attributes easier */ #define __BIN_ATTR(_name, _mode, _read, _write, _size) { \ - .attr = { .name = __stringify(_name), .mode = _mode }, \ + .attr = { .name = __stringify(_name), \ + .mode = _mode, \ + .owner = THIS_MODULE, \ + }, \ .read = _read, \ .write = _write, \ .size = _size, \ }
#define __BIN_ATTR_RO(_name, _size) { \ - .attr = { .name = __stringify(_name), .mode = 0444 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0444, \ + .owner = THIS_MODULE, \ + }, \ .read = _name##_read, \ .size = _size, \ }
#define __BIN_ATTR_WO(_name, _size) { \ - .attr = { .name = __stringify(_name), .mode = 0200 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0200, \ + .owner = THIS_MODULE, \ + }, \ .write = _name##_write, \ .size = _size, \ } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 9e0390000025..c6b0a28f599c 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3975,7 +3975,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp, cgroup_file_mode(cft), GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, cft->kf_ops, cft, - NULL, key); + NULL, key, NULL); if (IS_ERR(kn)) return PTR_ERR(kn);
On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> This deadlock was first reported with the zram driver, however the live

I don't see the lock pattern you mentioned in the zram driver; can you share the related zram code?

> patching folks have acknowledged they have observed this as well with live patching, when a live patch is removed. I was then able to reproduce easily by creating a dedicated selftest for it.
>
> A sketch of how this can happen follows, consider foo a local mutex part of a driver, and used on the driver's module exit routine and on one of its sysfs ops:
>
> foo.c:
> static DEFINE_MUTEX(foo);
> static ssize_t foo_store(struct device *dev,
> 			 struct device_attribute *attr,
> 			 const char *buf, size_t count)
> {
> 	...
> 	mutex_lock(&foo);
> 	...
> 	mutex_unlock(&foo);
> 	...
> }
> static DEVICE_ATTR_RW(foo);
> ...
> void foo_exit(void)
> {
> 	mutex_lock(&foo);
> 	...
> 	mutex_unlock(&foo);
> }
> module_exit(foo_exit);
>
> And this can lead to this condition:
>
> CPU A                              CPU B
>                                    foo_store()
> foo_exit()
>   mutex_lock(&foo)
>                                    mutex_lock(&foo)
>    del_gendisk(some_struct->disk);
>      device_del()
>        device_remove_groups()

I guess the deadlock exists if foo_exit() is called anywhere. If so, it looks like the issue may not be directly related to module removal, right?
Thanks, Ming
On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > This deadlock was first reported with the zram driver, however the live
>
> I don't see the lock pattern you mentioned in the zram driver; can you share the related zram code?

I recommend not looking at the zram driver; instead look at the test_sysfs driver, as that abstracts the issue more clearly and uses two different locks as an example. The point is that if on module removal *any* lock is used which is *also* used by a sysfs file created by the module, you can deadlock.

> > And this can lead to this condition:
> >
> > CPU A                              CPU B
> >                                    foo_store()
> > foo_exit()
> >   mutex_lock(&foo)
> >                                    mutex_lock(&foo)
> >    del_gendisk(some_struct->disk);
> >      device_del()
> >        device_remove_groups()
>
> I guess the deadlock exists if foo_exit() is called anywhere. If so, it looks like the issue may not be directly related to module removal, right?

No, the reason this can deadlock is that the module exit routine will patiently wait for the sysfs / kernfs files to stop being used, but clearly they cannot stop if the exit routine took the mutex also used by the sysfs ops. That is, the special condition here is the removal of the sysfs files, combined with the sysfs files using a lock also used on module exit.
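To make the "special condition" concrete, the two waits can be written down as a tiny wait-for relation and checked for a cycle. This is a hedged illustration only: the task names, struct toy_waits and both helper functions are invented for the sketch and model nothing beyond the two dependencies described above.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_task { TOY_RMMOD, TOY_STORE, TOY_NTASKS };

/* waits_on[t] is the task that must progress before t can, or -1. */
struct toy_waits {
	int waits_on[TOY_NTASKS];
};

/* Follow the wait-for chain from 'start'; returning to it is a cycle. */
bool toy_deadlocked(const struct toy_waits *w, int start)
{
	int cur, steps;

	assert(start >= 0 && start < TOY_NTASKS);
	cur = w->waits_on[start];
	for (steps = 0; steps < TOY_NTASKS && cur >= 0; steps++) {
		if (cur == start)
			return true;
		cur = w->waits_on[cur];
	}
	return false;
}

/* Build the wait-for relation for the scenario discussed in the thread. */
struct toy_waits toy_scenario(bool exit_holds_shared_lock)
{
	struct toy_waits w;

	/* rmmod: sysfs file removal waits for the in-flight store to drain. */
	w.waits_on[TOY_RMMOD] = TOY_STORE;
	/* store: blocked on the mutex only if the exit path holds it. */
	w.waits_on[TOY_STORE] = exit_holds_shared_lock ? TOY_RMMOD : -1;
	return w;
}
```

With exit_holds_shared_lock set, toy_deadlocked() finds the cycle from the rmmod side; without the shared lock the store operation finishes, the active reference drains, and there is no cycle. The first edge exists only because the exit path removes the sysfs files, which is why both conditions are needed.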
Luis
On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
> On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> > On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > > This deadlock was first reported with the zram driver, however the live
> >
> > I don't see the lock pattern you mentioned in the zram driver; can you share the related zram code?
>
> I recommend not looking at the zram driver; instead look at the test_sysfs driver, as that abstracts the issue more clearly and uses

It looks like test_sysfs isn't in Linus' tree; where can I find it? Also, please update your commit log to correct this info if it can't be applied to zram.

> two different locks as an example. The point is that if on module removal *any* lock is used which is *also* used by a sysfs file created by the module, you can deadlock.
>
> > > And this can lead to this condition:
> > >
> > > CPU A                              CPU B
> > >                                    foo_store()
> > > foo_exit()
> > >   mutex_lock(&foo)
> > >                                    mutex_lock(&foo)
> > >    del_gendisk(some_struct->disk);
> > >      device_del()
> > >        device_remove_groups()
> >
> > I guess the deadlock exists if foo_exit() is called anywhere. If so, it looks like the issue may not be directly related to module removal, right?
>
> No, the reason this can deadlock is that the module exit routine will patiently wait for the sysfs / kernfs files to stop being used,

Can you share the code which waits for the sysfs / kernfs files to stop being used? And why does it make a difference when it is called from module_exit()?
Thanks, Ming
On Tue, Oct 12, 2021 at 08:20:46AM +0800, Ming Lei wrote:
> On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
> > On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> > > On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > > > This deadlock was first reported with the zram driver, however the live
> > >
> > > I don't see the lock pattern you mentioned in the zram driver; can you share the related zram code?
> >
> > I recommend not looking at the zram driver; instead look at the test_sysfs driver, as that abstracts the issue more clearly and uses
>
> It looks like test_sysfs isn't in Linus' tree; where can I find it?

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h...

> Also, please update your commit log to correct this info if it can't be applied to zram.

It does apply to zram; it is just that I have other fixes for zram in my pipeline which will change the zram driver further, so it makes more sense to abstract the issue into a selftest driver that demonstrates it more clearly.
To reproduce the deadlock, revert the patch in this thread and then run either of these two tests as root:

./tools/testing/selftests/sysfs/sysfs.sh -w 0027
./tools/testing/selftests/sysfs/sysfs.sh -w 0028

You will need to enable the test_sysfs driver.
> > two different locks as an example. The point is that if on module removal *any* lock is used which is *also* used by a sysfs file created by the module, you can deadlock.
> >
> > > > And this can lead to this condition:
> > > >
> > > > CPU A                              CPU B
> > > >                                    foo_store()
> > > > foo_exit()
> > > >   mutex_lock(&foo)
> > > >                                    mutex_lock(&foo)
> > > >    del_gendisk(some_struct->disk);
> > > >      device_del()
> > > >        device_remove_groups()
> > >
> > > I guess the deadlock exists if foo_exit() is called anywhere. If so, it looks like the issue may not be directly related to module removal, right?
> >
> > No, the reason this can deadlock is that the module exit routine will patiently wait for the sysfs / kernfs files to stop being used,
>
> Can you share the code which waits for the sysfs / kernfs files to stop being used?

How about a call trace of the two tasks which deadlock? Here is one from running test 0027:
kdevops login: [ 363.875459] INFO: task sysfs.sh:1271 blocked for more than 120 seconds.
[ 363.878341] Tainted: G E 5.15.0-rc3-next-20210927+ #83
[ 363.881218] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.882255] task:sysfs.sh state:D stack: 0 pid: 1271 ppid: 1 flags:0x00000004
[ 363.882894] Call Trace:
[ 363.883091] <TASK>
[ 363.883259] __schedule+0x2fd/0x990
[ 363.883551] schedule+0x43/0xe0
[ 363.883800] schedule_preempt_disabled+0x14/0x20
[ 363.884160] __mutex_lock.constprop.0+0x249/0x470
[ 363.884524] test_dev_x_store+0xa5/0xc0 [test_sysfs]
[ 363.884915] kernfs_fop_write_iter+0x177/0x220
[ 363.885257] new_sync_write+0x11c/0x1b0
[ 363.885556] vfs_write+0x20d/0x2a0
[ 363.885821] ksys_write+0x5f/0xe0
[ 363.886081] do_syscall_64+0x38/0xc0
[ 363.886359] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 363.886748] RIP: 0033:0x7fee00f8bf33
[ 363.887029] RSP: 002b:00007ffd372c5d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 363.887633] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fee00f8bf33
[ 363.888217] RDX: 0000000000000003 RSI: 000055a4d14a0db0 RDI: 0000000000000001
[ 363.888761] RBP: 000055a4d14a0db0 R08: 000000000000000a R09: 0000000000000002
[ 363.889267] R10: 000055a4d1554ac0 R11: 0000000000000246 R12: 0000000000000003
[ 363.889983] R13: 00007fee0105c6a0 R14: 0000000000000003 R15: 00007fee0105c8a0
[ 363.890513] </TASK>
[ 363.890709] INFO: task modprobe:1276 blocked for more than 120 seconds.
[ 363.891185] Tainted: G E 5.15.0-rc3-next-20210927+ #83
[ 363.891781] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.892353] task:modprobe state:D stack: 0 pid: 1276 ppid: 1 flags:0x00004000
[ 363.892955] Call Trace:
[ 363.893141] <TASK>
[ 363.893457] __schedule+0x2fd/0x990
[ 363.893865] schedule+0x43/0xe0
[ 363.894246] __kernfs_remove.part.0+0x21e/0x2a0
[ 363.894704] ? do_wait_intr_irq+0xa0/0xa0
[ 363.895142] kernfs_remove_by_name_ns+0x50/0x90
[ 363.895632] remove_files+0x2b/0x60
[ 363.896035] sysfs_remove_group+0x38/0x80
[ 363.896470] sysfs_remove_groups+0x29/0x40
[ 363.896912] device_remove_attrs+0x5b/0x90
[ 363.897352] device_del+0x183/0x400
[ 363.897758] unregister_test_dev_sysfs+0x5b/0xaa [test_sysfs]
[ 363.898317] test_sysfs_exit+0x45/0xfb0 [test_sysfs]
[ 363.898833] __do_sys_delete_module+0x18d/0x2a0
[ 363.899329] ? fpregs_assert_state_consistent+0x1e/0x40
[ 363.899868] ? exit_to_user_mode_prepare+0x3a/0x180
[ 363.900390] do_syscall_64+0x38/0xc0
[ 363.900810] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 363.901330] RIP: 0033:0x7f21915c57d7
[ 363.901747] RSP: 002b:00007ffd90869fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 363.902442] RAX: ffffffffffffffda RBX: 000055ce676ffc30 RCX: 00007f21915c57d7
[ 363.903104] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055ce676ffc98
[ 363.903782] RBP: 000055ce676ffc30 R08: 0000000000000000 R09: 0000000000000000
[ 363.904462] R10: 00007f2191638ac0 R11: 0000000000000206 R12: 000055ce676ffc98
[ 363.905128] R13: 0000000000000000 R14: 0000000000000000 R15: 000055ce676ffdf0
[ 363.905797] </TASK>
And gdb:
(gdb) l *(__kernfs_remove+0x21e)
0xffffffff8139288e is in __kernfs_remove (fs/kernfs/dir.c:476).
471			if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
472				lock_contended(&kn->dep_map, _RET_IP_);
473		}
474
475		/* but everyone should wait for draining */
476		wait_event(root->deactivate_waitq,
477			   atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
478
479		if (kernfs_lockdep(kn)) {
480			lock_acquired(&kn->dep_map, _RET_IP_);

(gdb) l *(kernfs_remove_by_name_ns+0x50)
0xffffffff813938d0 is in kernfs_remove_by_name_ns (fs/kernfs/dir.c:1534).
1529
1530		kn = kernfs_find_ns(parent, name, ns);
1531		if (kn)
1532			__kernfs_remove(kn);
1533
1534		up_write(&kernfs_rwsem);
1535
1536		if (kn)
1537			return 0;
1538		else
The same happens for test 0028 except instead of a mutex lock an rtnl_lock() is used.
Would this be better for the commit log?
And why does it make a difference in case of being called from module_exit()?
Well because that is where we remove the sysfs files. *If* a developer happens to use a lock on a sysfs op but it is also used on module exit, this deadlock is bound to happen.
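To make the interleaving from the two traces concrete, here is a minimal userspace sketch. It is only a model, not kernel code: the struct and every model_* function are hypothetical names made up for illustration. A boolean stands in for the shared driver lock and a counter stands in for the kernfs active references that removal has to wait on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: none of these names are kernel APIs. */
struct deadlock_model {
	bool lock_held;	/* the lock shared by the sysfs op and module exit */
	int active;	/* sysfs ops currently holding an active reference */
};

/* A write() enters the sysfs op: the active reference is taken first. */
static void model_store_enter(struct deadlock_model *m)
{
	m->active++;
}

/* Returns true if the lock was taken, false if the caller would block. */
static bool model_try_lock(struct deadlock_model *m)
{
	if (m->lock_held)
		return false;
	m->lock_held = true;
	return true;
}

/* File removal can only finish once no op holds an active reference. */
static bool model_files_drained(const struct deadlock_model *m)
{
	assert(m->active >= 0);
	return m->active == 0;
}

/* Replays the ordering seen in the traces and reports if both sides stall. */
static bool model_interleaving_deadlocks(void)
{
	struct deadlock_model m = { .lock_held = false, .active = 0 };
	bool exit_has_lock, store_blocked, exit_blocked;

	model_store_enter(&m);			/* sysfs.sh: op is in flight */
	exit_has_lock = model_try_lock(&m);	/* modprobe -r: exit takes the lock */
	store_blocked = !model_try_lock(&m);	/* op now wants the same lock */
	exit_blocked = !model_files_drained(&m);/* exit waits for the op to finish */

	return exit_has_lock && store_blocked && exit_blocked;
}
```

Calling model_interleaving_deadlocks() returns true for this ordering: the op cannot get the lock, and the exit path cannot finish removing the file while the op still holds its active reference.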
Luis
On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
On Tue, Oct 12, 2021 at 08:20:46AM +0800, Ming Lei wrote:
On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
When driver sysfs attributes use a lock also used on module removal we can race to deadlock. This happens when, for instance, a sysfs file on a driver is in use and module removal is triggered at the same time. The module removal code holds a lock, and the driver's sysfs op waits for the same lock. While holding the lock, module removal tries to remove the sysfs entries, but these cannot be removed yet because one of them is in use by the op waiting for the lock. The op cannot complete as the lock is already held, and module removal cannot complete either, so we deadlock.
This can now be easily reproduced with our sysfs selftest as follows:
./tools/testing/selftests/sysfs/sysfs.sh -t 0027
This uses a local driver lock. Test 0028 can also be used; it uses rtnl_lock() instead:
./tools/testing/selftests/sysfs/sysfs.sh -t 0028
To fix this we extend the struct kernfs_node with a module reference and call try_module_get() after kernfs_get_active() is called. As documented in the prior patch, we now know that once kernfs_get_active() is called the module is implicitly guarded to exist and cannot be removed. This is because the module is the one in charge of removing the same sysfs file it created, and removal of sysfs files on module exit will wait until they don't have any active references. By using a try_module_get() after kernfs_get_active() we yield to let module removal trump calls to process a sysfs operation, while also preventing module removal if a sysfs operation is already in progress. This prevents the deadlock.
This deadlock was first reported with the zram driver, however the live
I don't see the lock pattern you mentioned in the zram driver; can you share the related zram code?
I recommend to not look at the zram driver, instead look at the test_sysfs driver as that abstracts the issue more clearly and uses
It looks like test_sysfs isn't in Linus' tree; where can I find it?
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h...
Also, please correct this wrong info in your commit log if it can't be applied to zram.
It does apply to zram, it is just that I have other fixes for zram in my pipeline which will change the zram driver further, and so what makes more sense is to abstract the issue into a selftest driver to demonstrate the issue more clearly.
To reproduce the deadlock revert the patch in this thread and then run either of these two tests as root:
./tools/testing/selftests/sysfs/sysfs.sh -w 0027
./tools/testing/selftests/sysfs/sysfs.sh -w 0028
You will need to enable the test_sysfs driver.
two different locks as an example. The point is that if on module removal *any* lock is used which is *also* used on the sysfs file created by the module, you can deadlock.
And this can lead to this condition:
CPU A                              CPU B
foo_store()
                                   foo_exit()
                                     mutex_lock(&foo)
mutex_lock(&foo)
                                     del_gendisk(some_struct->disk);
                                       device_del()
                                         device_remove_groups()
I guess the deadlock exists if foo_exit() is called anywhere. If so, it looks like the issue may not be directly related to module removal, right?
No, the reason this can deadlock is that the module exit routine will patiently wait for the sysfs / kernfs files to stop being used,
Can you share the code which waits for the sysfs / kernfs files to stop being used?
How about a call trace of the two tasks which deadlock? Here is one from running test 0027:
kdevops login: [ 363.875459] INFO: task sysfs.sh:1271 blocked for more than 120 seconds.
[ 363.878341] Tainted: G E 5.15.0-rc3-next-20210927+ #83
[ 363.881218] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.882255] task:sysfs.sh state:D stack: 0 pid: 1271 ppid: 1 flags:0x00000004
[ 363.882894] Call Trace:
[ 363.883091] <TASK>
[ 363.883259] __schedule+0x2fd/0x990
[ 363.883551] schedule+0x43/0xe0
[ 363.883800] schedule_preempt_disabled+0x14/0x20
[ 363.884160] __mutex_lock.constprop.0+0x249/0x470
[ 363.884524] test_dev_x_store+0xa5/0xc0 [test_sysfs]
[ 363.884915] kernfs_fop_write_iter+0x177/0x220
[ 363.885257] new_sync_write+0x11c/0x1b0
[ 363.885556] vfs_write+0x20d/0x2a0
[ 363.885821] ksys_write+0x5f/0xe0
[ 363.886081] do_syscall_64+0x38/0xc0
[ 363.886359] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 363.886748] RIP: 0033:0x7fee00f8bf33
[ 363.887029] RSP: 002b:00007ffd372c5d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 363.887633] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fee00f8bf33
[ 363.888217] RDX: 0000000000000003 RSI: 000055a4d14a0db0 RDI: 0000000000000001
[ 363.888761] RBP: 000055a4d14a0db0 R08: 000000000000000a R09: 0000000000000002
[ 363.889267] R10: 000055a4d1554ac0 R11: 0000000000000246 R12: 0000000000000003
[ 363.889983] R13: 00007fee0105c6a0 R14: 0000000000000003 R15: 00007fee0105c8a0
[ 363.890513] </TASK>
[ 363.890709] INFO: task modprobe:1276 blocked for more than 120 seconds.
[ 363.891185] Tainted: G E 5.15.0-rc3-next-20210927+ #83
[ 363.891781] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.892353] task:modprobe state:D stack: 0 pid: 1276 ppid: 1 flags:0x00004000
[ 363.892955] Call Trace:
[ 363.893141] <TASK>
[ 363.893457] __schedule+0x2fd/0x990
[ 363.893865] schedule+0x43/0xe0
[ 363.894246] __kernfs_remove.part.0+0x21e/0x2a0
[ 363.894704] ? do_wait_intr_irq+0xa0/0xa0
[ 363.895142] kernfs_remove_by_name_ns+0x50/0x90
[ 363.895632] remove_files+0x2b/0x60
[ 363.896035] sysfs_remove_group+0x38/0x80
[ 363.896470] sysfs_remove_groups+0x29/0x40
[ 363.896912] device_remove_attrs+0x5b/0x90
[ 363.897352] device_del+0x183/0x400
[ 363.897758] unregister_test_dev_sysfs+0x5b/0xaa [test_sysfs]
[ 363.898317] test_sysfs_exit+0x45/0xfb0 [test_sysfs]
[ 363.898833] __do_sys_delete_module+0x18d/0x2a0
[ 363.899329] ? fpregs_assert_state_consistent+0x1e/0x40
[ 363.899868] ? exit_to_user_mode_prepare+0x3a/0x180
[ 363.900390] do_syscall_64+0x38/0xc0
[ 363.900810] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 363.901330] RIP: 0033:0x7f21915c57d7
[ 363.901747] RSP: 002b:00007ffd90869fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 363.902442] RAX: ffffffffffffffda RBX: 000055ce676ffc30 RCX: 00007f21915c57d7
[ 363.903104] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055ce676ffc98
[ 363.903782] RBP: 000055ce676ffc30 R08: 0000000000000000 R09: 0000000000000000
[ 363.904462] R10: 00007f2191638ac0 R11: 0000000000000206 R12: 000055ce676ffc98
[ 363.905128] R13: 0000000000000000 R14: 0000000000000000 R15: 000055ce676ffdf0
[ 363.905797] </TASK>
That doesn't show the deadlock is related with module_exit().
And gdb:
(gdb) l *(__kernfs_remove+0x21e)
0xffffffff8139288e is in __kernfs_remove (fs/kernfs/dir.c:476).
471			if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
472				lock_contended(&kn->dep_map, _RET_IP_);
473		}
474
475		/* but everyone should wait for draining */
476		wait_event(root->deactivate_waitq,
477			   atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
478
479		if (kernfs_lockdep(kn)) {
480			lock_acquired(&kn->dep_map, _RET_IP_);

(gdb) l *(kernfs_remove_by_name_ns+0x50)
0xffffffff813938d0 is in kernfs_remove_by_name_ns (fs/kernfs/dir.c:1534).
1529
1530		kn = kernfs_find_ns(parent, name, ns);
1531		if (kn)
1532			__kernfs_remove(kn);
1533
1534		up_write(&kernfs_rwsem);
1535
1536		if (kn)
1537			return 0;
1538		else
The same happens for test 0028 except instead of a mutex lock an rtnl_lock() is used.
Would this be better for the commit log?
And why does it make a difference in case of being called from module_exit()?
Well because that is where we remove the sysfs files. *If* a developer happens to use a lock on a sysfs op but it is also used on module exit, this deadlock is bound to happen.
It is clearly an AA deadlock. What I meant was that it isn't related to module exit, because the lock & device_del() aren't always done in module exit, so I doubt your fix of grabbing a module refcnt is good or generic enough.
Except for your cooked-up test_sysfs module, how many real drivers suffer from the problem? What are they? Why can't we fix the exact driver?
Thanks, Ming
On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
It looks like test_sysfs isn't in Linus' tree; where can I find it?
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h...
To reproduce the deadlock revert the patch in this thread and then run either of these two tests as root:
./tools/testing/selftests/sysfs/sysfs.sh -w 0027
./tools/testing/selftests/sysfs/sysfs.sh -w 0028
You will need to enable the test_sysfs driver.
Can you share the code which waits for the sysfs / kernfs files to stop being used?
How about a call trace of the two tasks which deadlock? Here is one from running test 0027:
kdevops login: [ 363.875459] INFO: task sysfs.sh:1271 blocked for more than 120 seconds.
<-- snip -->
That doesn't show the deadlock is related with module_exit().
Not directly no.
It is clearly an AA deadlock. What I meant was that it isn't related to module exit, because the lock & device_del() aren't always done in module exit, so I doubt your fix of grabbing a module refcnt is good or generic enough.
A device_del() *can* happen in areas other than module exit, sure, but the issue is if a shared lock is used *before* device_del() and also used on a sysfs op. Typically this can happen on module exit, and the other common use case in my experience is on sysfs ops, as is the case with the zram driver. Both cases are then covered by this fix.
If there are other areas, that is still driver specific, but of the things we *can* generalize, definitely module exit is a common path.
Except for your cooked-up test_sysfs module, how many real drivers suffer from the problem? What are they?
I only really seriously considered trying to generalize this after it was hinted to me live patching was also affected, and so clearly something generic was desirable.
There may be other drivers for sure, but hunting for them semantically would require a somewhat complex Coccinelle patch with iteration support.
Why can't we fix the exact driver?
You can try. The way the lock is used in zram is correct, especially after my other fix in this series which addresses another unrelated bug with cpu hotplug multistate support. So we can then either take the position of saying "Thou shalt not use a shared lock on module exit and a sysfs op" and try to fix all places, or we generalize a fix for this. A generic fix seems more desirable.
Luis
On Wed, Oct 13, 2021 at 05:35:31AM -0700, Luis Chamberlain wrote:
On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
It looks like test_sysfs isn't in Linus' tree; where can I find it?
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h...
To reproduce the deadlock revert the patch in this thread and then run either of these two tests as root:
./tools/testing/selftests/sysfs/sysfs.sh -w 0027
./tools/testing/selftests/sysfs/sysfs.sh -w 0028
You will need to enable the test_sysfs driver.
Can you share the code which waits for the sysfs / kernfs files to stop being used?
How about a call trace of the two tasks which deadlock? Here is one from running test 0027:
kdevops login: [ 363.875459] INFO: task sysfs.sh:1271 blocked for more than 120 seconds.
<-- snip -->
That doesn't show the deadlock is related with module_exit().
Not directly no.
Then the patch title of 'sysfs: fix deadlock race with module removal' is wrong.
It is clearly an AA deadlock. What I meant was that it isn't related to module exit, because the lock & device_del() aren't always done in module exit, so I doubt your fix of grabbing a module refcnt is good or generic enough.
A device_del() *can* happen in areas other than module exit, sure, but the issue is if a shared lock is used *before* device_del() and also used on a sysfs op. Typically this can happen on module exit, and the other common use case in my experience is on sysfs ops, as is the case with the zram driver. Both cases are then covered by this fix.
Again, can you share the related zram code for the issue? In zram_drv.c of Linus' tree or the next tree, I don't see any lock held before calling del_gendisk().
If there are other areas, that is still driver specific, but of the things we *can* generalize, definitely module exit is a common path.
Except for your cooked-up test_sysfs module, how many real drivers suffer from the problem? What are they?
I only really seriously considered trying to generalize this after it
IMO your generalization isn't good or correct because this kind of issue is _not_ related to module exit at all. What matters is just that one lock is held before calling device_del(), while the same lock is required in the device's attribute show/store functions.
There are many cases in which we call device_del() not from module_exit(), such as scsi scan, scsi sysfs store(), or even handling events from the device side, nvme error handling, usb hotplug, ...
was hinted to me live patching was also affected, and so clearly something generic was desirable.
It might be that only these two drivers (zram and live patch) have this bug, and it is simply an AA bug in the driver. Not to mention I don't see such usage in zram_drv.c.
There may be other drivers for sure, but hunting for them semantically would require a somewhat complex Coccinelle patch with iteration support.
Why can't we fix the exact driver?
You can try. The way the lock is used in zram is correct, especially
What is the lock in zram? Again, can you share the related functions?
after my other fix in this series which addresses another unrelated bug with cpu hotplug multistate support. So we then can proceed to either take the position to say: "Thou shalt not use a shared lock on module exit and a sysfs op" and try to fix all places, or we generalize a fix for this. A generic fix seems more desirable.
What matters is that the lock is held before calling device_del() instead of being held in module_exit().
Thanks, Ming
On Wed, Oct 13, 2021 at 11:04:07PM +0800, Ming Lei wrote:
On Wed, Oct 13, 2021 at 05:35:31AM -0700, Luis Chamberlain wrote:
On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
It looks like test_sysfs isn't in Linus' tree; where can I find it?
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h...
To reproduce the deadlock revert the patch in this thread and then run either of these two tests as root:
./tools/testing/selftests/sysfs/sysfs.sh -w 0027
./tools/testing/selftests/sysfs/sysfs.sh -w 0028
You will need to enable the test_sysfs driver.
Can you share the code which waits for the sysfs / kernfs files to stop being used?
How about a call trace of the two tasks which deadlock? Here is one from running test 0027:
kdevops login: [ 363.875459] INFO: task sysfs.sh:1271 blocked for more than 120 seconds.
<-- snip -->
That doesn't show the deadlock is related with module_exit().
Not directly no.
Then the patch title of 'sysfs: fix deadlock race with module removal' is wrong.
Well that is what it does though. The scope of the issue you are raising is beyond module removal, but I do agree such races can exist outside of module removal.
It is clearly an AA deadlock. What I meant was that it isn't related to module exit, because the lock & device_del() aren't always done in module exit, so I doubt your fix of grabbing a module refcnt is good or generic enough.
A device_del() *can* happen in areas other than module exit, sure, but the issue is if a shared lock is used *before* device_del() and also used on a sysfs op. Typically this can happen on module exit, and the other common use case in my experience is on sysfs ops, as is the case with the zram driver. Both cases are then covered by this fix.
Again, can you share the related zram code for the issue? In zram_drv.c of Linus' tree or the next tree, I don't see any lock held before calling del_gendisk().
There is another bug with CPU hotplug multistate support in the zram driver which a patch in this series fixes, refer to the patch titled "zram: fix crashes with cpu hotplug multistate". In zram's case we need to contend a generic lock on certain sysfs attributes due to the way CPU hotplug is used.
If we tried to generalize this on the block layer the closest we get is the disk->fops->owner, however zram is an example driver where the disk->fops is actually even changed *after* module load, and so the original disk->fops->owner can be dynamic. In zram's case the fops->owner is the same, however we have no semantics to ensure this is the case for all block drivers.
In the case of live patching, refer to the use of klp_mutex. That was solved there with a combination of completions and deferred work, so that all kobject_put() calls are outside of the critical sections; refer to commit 3ec24776bfd0 ("livepatch: allow removal of a disabled patch").
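As a rough illustration of the idea of keeping the final put outside of the critical section, here is a small userspace sketch. It does not reproduce the livepatch code from that commit: the struct and the model_* functions are hypothetical, and a boolean stands in for the mutex. The only point is the ordering of unlock and release.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an object guarded by a subsystem-wide lock. */
struct defer_model {
	bool lock_held;			/* stands in for the shared mutex */
	bool on_list;			/* object still visible to the subsystem */
	bool released;			/* final put has run */
	bool released_under_lock;	/* final put ran inside the critical section */
};

/* Stands in for the final put, which may wait for sysfs ops to drain. */
static void model_release(struct defer_model *p)
{
	p->released_under_lock = p->lock_held;
	p->released = true;
}

/* Risky ordering: the release runs while the shared lock is still held. */
static void model_release_under_lock(struct defer_model *p)
{
	p->lock_held = true;
	p->on_list = false;
	model_release(p);
	p->lock_held = false;
}

/* Deferred ordering: unlink under the lock, release only after unlocking. */
static void model_unlink_then_release(struct defer_model *p)
{
	p->lock_held = true;
	p->on_list = false;
	p->lock_held = false;
	assert(!p->lock_held);
	model_release(p);
}
```

With model_unlink_then_release() the release never observes the lock as held, so a sysfs op that needs the same lock can still finish and let the removal drain.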
And so it was encouraged that a generic solution be sought.
If there are other areas, that is still driver specific, but of the things we *can* generalize, definitely module exit is a common path.
Except for your cooked-up test_sysfs module, how many real drivers suffer from the problem? What are they?
I only really seriously considered trying to generalize this after it
IMO your generalization isn't good or correct because this kind of issue is _not_ related to module exit at all. What matters is just that one lock is held before calling device_del(), while the same lock is required in the device's attribute show/store functions.
Your point that a race for a deadlock still can exist beyond module removal is valid but unfortunately there are no possible semantics I can see to fix that generically at this time.
There are many cases in which we call device_del() not from module_exit(), such as scsi scan, scsi sysfs store(), or even handling events from the device side, nvme error handling, usb hotplug, ...
These are really good points.
was hinted to me live patching was also affected, and so clearly something generic was desirable.
It might be that only these two drivers (zram and live patch) have this bug, and it is simply an AA bug in the driver. Not to mention I don't see such usage in zram_drv.c.
Well... given what you say above about use cases other than module removal which can remove sysfs files while they are in use, the possibility of this deadlock existing elsewhere should increase, not decrease.
There may be other drivers for sure, but hunting for them semantically would require a somewhat complex Coccinelle patch with iteration support.
Why can't we fix the exact driver?
You can try. The way the lock is used in zram is correct, especially
What is the lock in zram? Again, can you share the related functions?
If you checked out the tree I mentioned, try looking at the code there with the fix for CPU hotplug multistate in mind.
after my other fix in this series which addresses another unrelated bug with cpu hotplug multistate support. So we then can proceed to either take the position to say: "Thou shalt not use a shared lock on module exit and a sysfs op" and try to fix all places, or we generalize a fix for this. A generic fix seems more desirable.
What matters is that the lock is held before calling device_del() instead of being held in module_exit().
I agree the possibilities can include more than just module exit. Unfortunately I can't see a way to generalize this further. I tried, see below, and this moves the ideas from a module to the kobject, but even with that, it does not get us any closer to fixing this generically. The reason a fix works for module removal is that the try_module_get() call made when getting the kernfs active reference will trump the module exit call completely, and so we *do* prevent the context which will issue the lock in this case if a sysfs operation is in progress.
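The ordering argument for the module case can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the patch itself: struct mod_model and the model_* functions are hypothetical names, the refcount and "going" flag only approximate how module removal refuses to proceed while references are held, and how try_module_get() fails once a module is on its way out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: none of these names are kernel APIs. */
struct mod_model {
	int refs;	/* models the module refcount */
	bool going;	/* set once removal has been committed */
	int active;	/* models active references on the module's sysfs file */
};

/* Models try_module_get(): fails once the module is going away. */
static bool model_try_module_get(struct mod_model *m)
{
	if (m->going)
		return false;
	m->refs++;
	return true;
}

static void model_module_put(struct mod_model *m)
{
	assert(m->refs > 0);
	m->refs--;
}

/* Removal is refused while anybody holds a module reference. */
static bool model_begin_removal(struct mod_model *m)
{
	if (m->refs > 0)
		return false;
	m->going = true;
	return true;
}

/* Op entry: active reference first, then the module reference. */
static bool model_op_enter(struct mod_model *m)
{
	m->active++;
	if (!model_try_module_get(m)) {
		/* Yield to removal: the op never reaches the driver's lock. */
		m->active--;
		return false;
	}
	return true;
}

static void model_op_exit(struct mod_model *m)
{
	model_module_put(m);
	m->active--;
}
```

Either the op gets in first and removal is refused until model_op_exit() runs, or removal gets in first and the op backs out before it can contend for the shared lock. In both orderings the exit path is never left waiting on an op that is itself waiting on the exit path's lock.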
Outside of that call sequence I am afraid we'd need separate solutions or side with the 'thou shalt not use a shared lock on a sysfs op and when issuing a device_del(), other than module exit'.
Below is an attempt to generalize this further, but it does not work, let me know if you have further ideas.
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index b57b3db9a6a7..4edf3b37fd2c 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, - 0, rft->kf_ops, rft, NULL, NULL); + 0, rft->kf_ops, rft, NULL, NULL, NULL); if (IS_ERR(kn)) return PTR_ERR(kn);
@@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
kn = __kernfs_create_file(parent_kn, name, 0444, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, - &kf_mondata_ops, priv, NULL, NULL); + &kf_mondata_ops, priv, NULL, NULL, NULL); if (IS_ERR(kn)) return PTR_ERR(kn);
diff --git a/drivers/base/core.c b/drivers/base/core.c index 7758223f040c..38f07072ab44 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -3507,6 +3507,7 @@ bool kill_device(struct device *dev) if (dev->p->dead) return false; dev->p->dead = true; + kobject_set_being_removed(&dev->kobj); return true; } EXPORT_SYMBOL_GPL(kill_device); diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index ba581429bf7b..7d14f6b2c12d 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -14,6 +14,7 @@ #include <linux/slab.h> #include <linux/security.h> #include <linux/hash.h> +#include <linux/kobject.h>
#include "kernfs-internal.h"
@@ -414,15 +415,38 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn) */ struct kernfs_node *kernfs_get_active(struct kernfs_node *kn) { + int v; + if (unlikely(!kn)) return NULL;
if (!atomic_inc_unless_negative(&kn->active)) return NULL;
+ /* + * If a kobject created the kernfs_node, the kobject cannot possibly be + * removed if the above atomic_inc_unless_negative() succeeded. But we + * need to inspect if its on its way out to ensure that we don't + * deadlock in case a kernfs operation and the code responsible for + * the kobject removal used a shared lock. + */ + if (kn->kobj) { + if (WARN_ON(!kobject_get_unless_zero(kn->kobj))) { + goto fail; + } else if (kobject_being_removed(kn->kobj)) { + kobject_put(kn->kobj); + goto fail; + } + } + if (kernfs_lockdep(kn)) rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_); return kn; +fail: + v = atomic_dec_return(&kn->active); + if (unlikely(v == KN_DEACTIVATED_BIAS)) + wake_up_all(&kernfs_root(kn)->deactivate_waitq); + return NULL; }
/** @@ -442,6 +466,7 @@ void kernfs_put_active(struct kernfs_node *kn) if (kernfs_lockdep(kn)) rwsem_release(&kn->dep_map, _RET_IP_); v = atomic_dec_return(&kn->active); + kobject_put(kn->kobj); if (likely(v != KN_DEACTIVATED_BIAS)) return;
@@ -572,7 +597,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, struct kernfs_node *parent, const char *name, umode_t mode, kuid_t uid, kgid_t gid, - unsigned flags) + unsigned flags, + struct kobject *kobj) { struct kernfs_node *kn; u32 id_highbits; @@ -607,6 +633,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, kn->name = name; kn->mode = mode; kn->flags = flags; + kn->kobj = kobj;
if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) { struct iattr iattr = { @@ -640,12 +667,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, struct kernfs_node *kernfs_new_node(struct kernfs_node *parent, const char *name, umode_t mode, kuid_t uid, kgid_t gid, - unsigned flags) + unsigned flags, + struct kobject *kobj) { struct kernfs_node *kn;
kn = __kernfs_new_node(kernfs_root(parent), parent, - name, mode, uid, gid, flags); + name, mode, uid, gid, flags, kobj); if (kn) { kernfs_get(parent); kn->parent = parent; @@ -927,7 +955,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, - KERNFS_DIR); + KERNFS_DIR, NULL); if (!kn) { idr_destroy(&root->ino_idr); kfree(root); @@ -969,20 +997,22 @@ void kernfs_destroy_root(struct kernfs_root *root) * @gid: gid of the new directory * @priv: opaque data associated with the new directory * @ns: optional namespace tag of the directory + * @kobj: if set, the kobject responsible for this directory * * Returns the created node on success, ERR_PTR() value on failure. */ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, umode_t mode, kuid_t uid, kgid_t gid, - void *priv, const void *ns) + void *priv, const void *ns, + struct kobject *kobj) { struct kernfs_node *kn; int rc;
/* allocate */ kn = kernfs_new_node(parent, name, mode | S_IFDIR, - uid, gid, KERNFS_DIR); + uid, gid, KERNFS_DIR, kobj); if (!kn) return ERR_PTR(-ENOMEM);
@@ -1014,7 +1044,8 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
/* allocate */ kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR, - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR); + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR, + parent->kobj); if (!kn) return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c index 4479c6580333..1b02f3e69c81 100644 --- a/fs/kernfs/file.c +++ b/fs/kernfs/file.c @@ -978,6 +978,7 @@ const struct file_operations kernfs_file_fops = { * @priv: private data for the file * @ns: optional namespace tag of the file * @key: lockdep key for the file's active_ref, %NULL to disable lockdep + * @kobj: if set, the kobject responsible for the file * * Returns the created node on success, ERR_PTR() value on error. */ @@ -987,7 +988,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, loff_t size, const struct kernfs_ops *ops, void *priv, const void *ns, - struct lock_class_key *key) + struct lock_class_key *key, + struct kobject *kobj) { struct kernfs_node *kn; unsigned flags; @@ -996,7 +998,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, flags = KERNFS_FILE;
kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG, - uid, gid, flags); + uid, gid, flags, kobj); if (!kn) return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h index 9e3abf597e2d..44983720d292 100644 --- a/fs/kernfs/kernfs-internal.h +++ b/fs/kernfs/kernfs-internal.h @@ -134,7 +134,8 @@ int kernfs_add_one(struct kernfs_node *kn); struct kernfs_node *kernfs_new_node(struct kernfs_node *parent, const char *name, umode_t mode, kuid_t uid, kgid_t gid, - unsigned flags); + unsigned flags, + struct kobject *kobj);
/* * file.c diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c index 19a6c71c6ff5..c877de06e53a 100644 --- a/fs/kernfs/symlink.c +++ b/fs/kernfs/symlink.c @@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent, gid = target->iattr->ia_gid; }
- kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK); + kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK, + target->kobj); if (!kn) return ERR_PTR(-ENOMEM);
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index b6b6796e1616..9cc159e9fb55 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -57,7 +57,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns) kobject_get_ownership(kobj, &uid, &gid);
kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid, - kobj, ns); + kobj, ns, kobj); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(parent, kobject_name(kobj)); diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c index 42dcf96881b6..e1a3315dba35 100644 --- a/fs/sysfs/file.c +++ b/fs/sysfs/file.c @@ -292,7 +292,7 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent, #endif
kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid, - PAGE_SIZE, ops, (void *)attr, ns, key); + PAGE_SIZE, ops, (void *)attr, ns, key, kobj); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(parent, attr->name); @@ -309,6 +309,7 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent, struct lock_class_key *key = NULL; const struct kernfs_ops *ops; struct kernfs_node *kn; + struct kobject *kobj = parent->priv;
if (battr->mmap) ops = &sysfs_bin_kfops_mmap; @@ -327,7 +328,8 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent, #endif
 	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  battr->size, ops, (void *)attr, ns, key);
+				  battr->size, ops, (void *)attr, ns, key,
+				  kobj);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index eeb0e3099421..36022fe2b21d 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -135,7 +135,8 @@ static int internal_create_group(struct kobject *kobj, int update,
 		} else {
 			kn = kernfs_create_dir_ns(kobj->sd, grp->name,
 						  S_IRWXU | S_IRUGO | S_IXUGO,
-						  uid, gid, kobj, NULL);
+						  uid, gid, kobj, NULL,
+						  kobj);
 			if (IS_ERR(kn)) {
 				if (PTR_ERR(kn) == -EEXIST)
 					sysfs_warn_dup(kobj->sd, grp->name);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cd968ee2b503..38155414e6e5 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -161,6 +161,7 @@ struct kernfs_node {
 	unsigned short		flags;
 	umode_t			mode;
 	struct kernfs_iattrs	*iattr;
+	struct kobject		*kobj;
 };
 
 /*
@@ -370,7 +371,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 kuid_t uid, kgid_t gid,
-					 void *priv, const void *ns);
+					 void *priv, const void *ns,
+					 struct kobject *kobj);
 struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 					    const char *name);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
@@ -379,7 +381,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
-					 struct lock_class_key *key);
+					 struct lock_class_key *key,
+					 struct kobject *kobj);
 struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 				       const char *name,
 				       struct kernfs_node *target);
@@ -472,14 +475,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
 static inline struct kernfs_node *
 kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
-		     void *priv, const void *ns)
+		     void *priv, const void *ns, struct kobject *kobj)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
 __kernfs_create_file(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
 		     loff_t size, const struct kernfs_ops *ops,
-		     void *priv, const void *ns, struct lock_class_key *key)
+		     void *priv, const void *ns, struct lock_class_key *key,
+		     struct kobject *kobj)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
@@ -566,7 +570,7 @@ kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
 {
 	return kernfs_create_dir_ns(parent, name, mode,
 				    GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				    priv, NULL);
+				    priv, NULL, parent->kobj);
 }
 
 static inline int kernfs_remove_by_name(struct kernfs_node *parent,
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index efd56f990a46..cb26ebeb7cf1 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -77,6 +77,7 @@ struct kobject {
 	unsigned int state_add_uevent_sent:1;
 	unsigned int state_remove_uevent_sent:1;
 	unsigned int uevent_suppress:1;
+	unsigned int being_removed:1;
 };
 
 extern __printf(2, 3)
@@ -117,6 +118,15 @@ extern void kobject_get_ownership(struct kobject *kobj,
 				  kuid_t *uid, kgid_t *gid);
 extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
 
+static inline bool kobject_being_removed(const struct kobject *kobj)
+{
+	if (!kobj)
+		return false;
+	return !!kobj->being_removed;
+}
+
+void kobject_set_being_removed(struct kobject *kobj);
+
 /**
  * kobject_has_children - Returns whether a kobject has children.
  * @kobj: the object to test
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9e0390000025..c6b0a28f599c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3975,7 +3975,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 				  cgroup_file_mode(cft),
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				  0, cft->kf_ops, cft,
-				  NULL, key);
+				  NULL, key, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 4a56f519139d..ef89bf2ac218 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -221,6 +221,12 @@ static void kobject_init_internal(struct kobject *kobj)
 	kobj->state_initialized = 1;
 }
 
+void kobject_set_being_removed(struct kobject *kobj)
+{
+	if (!kobj)
+		return;
+	kobj->being_removed = 1;
+}
 
 static int kobject_add_internal(struct kobject *kobj)
 {
On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
When a driver's sysfs attributes use a lock that is also used on module removal, we can race into a deadlock. This happens, for instance, when a sysfs file on a driver is being used at the same time that module removal is triggered. The module removal code holds a lock, and the driver's sysfs file operation then waits for that same lock. While holding the lock, module removal tries to remove the sysfs entries, but these cannot be removed yet because one of them is waiting for the lock. That operation cannot complete because the lock is already held. Likewise module removal cannot complete, and so we deadlock.
This is now easily reproducible with our sysfs selftest as follows:
./tools/testing/selftests/sysfs/sysfs.sh -t 0027
This uses a local driver lock. Test 0028 can also be used; it uses rtnl_lock() instead:
./tools/testing/selftests/sysfs/sysfs.sh -t 0028
To fix this we extend the struct kernfs_node with a module reference and use the try_module_get() after kernfs_get_active() is called. As
I would agree: kernfs must know about the module containing the ops structure it has been given. (Without this, there are, at the very least, removal races for looking at kernfs_ops structures.)
In other places in the kernel, function callback dependencies are more explicit in that if code is holding such things, it has already taken a module reference, etc. But kernfs is special in the sense that just because a kernfs entry exists, we don't want to pin the module use count too.
But simple locking isn't workable to solve this because kernfs_remove() must be able to be called from a module_exit routine without deadlocking. (i.e. we would create exactly the situation that caused this condition to get noticed in the first place.)
documented in the prior patch, we now know that once kernfs_get_active() is called the module is implicitly guaranteed to exist and cannot be removed. This is because the module is the one in charge of removing the same sysfs file it created, and removal of sysfs files on module exit will wait until they don't have any active references. By using a try_module_get() after kernfs_get_active() we yield to let module removal trump calls to process a sysfs operation, while also preventing module removal if a sysfs operation is already in progress. This prevents the deadlock.
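The ordering described above can be sketched in userspace. This is only a model under stated assumptions, not the kernel implementation: `struct model_module`, `struct model_node` and every `model_*` function are hypothetical stand-ins for `struct module`, `struct kernfs_node`, `try_module_get()`, `module_put()`, `kernfs_get_active()` and `kernfs_put_active()`. The module refcount here is a plain int, so the module side is single-threaded by construction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct module: a "going away" flag plus a refcount. */
struct model_module {
	bool going;	/* set once module removal has started */
	int refcnt;
};

/* Hypothetical stand-in for struct kernfs_node. */
struct model_node {
	atomic_int active;		/* active reference count */
	struct model_module *owner;	/* NULL models built-in code */
};

/* Succeeds for a NULL owner, fails once the module is going away. */
static bool model_try_module_get(struct model_module *mod)
{
	if (!mod)
		return true;
	if (mod->going)
		return false;
	mod->refcnt++;
	return true;
}

static void model_module_put(struct model_module *mod)
{
	if (!mod)
		return;
	assert(mod->refcnt > 0);
	mod->refcnt--;
}

/* Increment *v unless it is negative; returns true if it was incremented. */
static bool model_inc_unless_negative(atomic_int *v)
{
	int cur = atomic_load(v);

	while (cur >= 0) {
		if (atomic_compare_exchange_weak(v, &cur, cur + 1))
			return true;
	}
	return false;
}

/* Take the active reference first, then try to pin the owner. */
static struct model_node *model_get_active(struct model_node *kn)
{
	if (!kn)
		return NULL;
	if (!model_inc_unless_negative(&kn->active))
		return NULL;
	if (!model_try_module_get(kn->owner)) {
		/* Removal already started: back out and let it win. */
		atomic_fetch_sub(&kn->active, 1);
		return NULL;
	}
	return kn;
}

static void model_put_active(struct model_node *kn)
{
	if (!kn)
		return;
	atomic_fetch_sub(&kn->active, 1);
	model_module_put(kn->owner);
}
```

In this model a caller that gets NULL back simply does not run the sysfs op, which is the "yield to module removal" behaviour the commit log describes. The wake-up of a waiting remover on the back-out path is left out here.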
This deadlock was first reported with the zram driver, however the live patching folks have acknowledged they have observed this as well with live patching, when a live patch is removed. I was then able to reproduce easily by creating a dedicated selftest for it.
A sketch of how this can happen follows. Consider foo, a mutex local to a driver, used both in the driver's module exit routine and in one of its sysfs ops:

foo.c:
static DEFINE_MUTEX(foo);
static ssize_t foo_store(struct device *dev,
			 struct device_attribute *attr,
			 const char *buf, size_t count)
{
	...
	mutex_lock(&foo);
	...
	mutex_unlock(&foo);
	...
}
static DEVICE_ATTR_RW(foo);
...
void foo_exit(void)
{
	mutex_lock(&foo);
	...
	mutex_unlock(&foo);
}
module_exit(foo_exit);
And this can lead to this condition:
CPU A                              CPU B
                                   foo_store()
foo_exit()
  mutex_lock(&foo)
                                   mutex_lock(&foo)
  del_gendisk(some_struct->disk);
    device_del()
      device_remove_groups()
Please expand this further, where does device_remove_groups() end up waiting for that never happens?
In this situation foo_store() is waiting for the mutex foo to become unlocked, but that won't happen until module removal is complete. But module removal won't complete until the sysfs file being poked at completes which is waiting for a lock already held.
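The circular wait can be modelled deterministically in a single userspace thread. This is a sketch under assumptions, not driver code: `active_refs` stands in for the kernfs active reference count, the `model_*` helpers are hypothetical, and `pthread_mutex_trylock()` is used only to ask "could foo_store() get the lock right now?" without actually blocking.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t foo = PTHREAD_MUTEX_INITIALIZER;
static int active_refs;	/* models the sysfs file's active references */

/* foo_exit() side: take the lock before removing the sysfs files. */
static void model_exit_take_lock(void)
{
	pthread_mutex_lock(&foo);
}

static void model_exit_release_lock(void)
{
	pthread_mutex_unlock(&foo);
}

/* Removal can only finish once no sysfs op holds an active reference. */
static bool model_exit_can_finish(void)
{
	return active_refs == 0;
}

/* foo_store() side: an active reference is held for the whole op. */
static void model_store_enter(void)
{
	active_refs++;
}

static void model_store_leave(void)
{
	assert(active_refs > 0);
	active_refs--;
}

/* The store op can only finish if it can take the mutex. */
static bool model_store_can_finish(void)
{
	if (pthread_mutex_trylock(&foo) != 0)
		return false;
	pthread_mutex_unlock(&foo);
	return true;
}
```

With the store op entered and the exit path holding the lock, both `model_store_can_finish()` and `model_exit_can_finish()` report false: each side waits on something only the other can release.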
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
 fs/kernfs/dir.c                        | 44 ++++++++++++++++++----
 fs/kernfs/file.c                       |  6 ++-
 fs/kernfs/kernfs-internal.h            |  3 +-
 fs/kernfs/symlink.c                    |  3 +-
 fs/sysfs/dir.c                         |  2 +-
 fs/sysfs/file.c                        |  6 ++-
 fs/sysfs/group.c                       |  3 +-
 include/linux/kernfs.h                 | 14 ++++---
 include/linux/sysfs.h                  | 52 ++++++++++++++++++++------
 kernel/cgroup/cgroup.c                 |  2 +-
 11 files changed, 105 insertions(+), 34 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..4edf3b37fd2c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
 
 	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				  0, rft->kf_ops, rft, NULL, NULL);
+				  0, rft->kf_ops, rft, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
@@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
 
 	kn = __kernfs_create_file(parent_kn, name, 0444,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
-				  &kf_mondata_ops, priv, NULL, NULL);
+				  &kf_mondata_ops, priv, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..e841201fd11b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/hash.h>
+#include <linux/module.h>
 
 #include "kernfs-internal.h"
 
@@ -414,12 +415,29 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
  */
 struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
 {
+	int v;
+
 	if (unlikely(!kn))
 		return NULL;
 
 	if (!atomic_inc_unless_negative(&kn->active))
 		return NULL;
 
+	/*
+	 * If a module created the kernfs_node, the module cannot possibly be
+	 * removed if the above atomic_inc_unless_negative() succeeded. So the
+	 * try_module_get() below is not to protect the lifetime of the module
+	 * as that is already guaranteed. The try_module_get() below is used
+	 * to ensure that we don't deadlock in case a kernfs operation and
+	 * module removal used a shared lock.
+	 */
+	if (!try_module_get(kn->owner)) {
+		v = atomic_dec_return(&kn->active);
+		if (unlikely(v == KN_DEACTIVATED_BIAS))
+			wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+		return NULL;
+	}
The special casing in here makes me think this isn't happening in the right place. (i.e. this looks like an open-coded version of kernfs_put_active())
+
 	if (kernfs_lockdep(kn))
 		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
 	return kn;
@@ -442,6 +460,13 @@ void kernfs_put_active(struct kernfs_node *kn)
 	if (kernfs_lockdep(kn))
 		rwsem_release(&kn->dep_map, _RET_IP_);
 	v = atomic_dec_return(&kn->active);
+
+	/*
+	 * We prevent module exit *until* we know for sure all possible
+	 * kernfs ops are done.
+	 */
+	module_put(kn->owner);
+
 	if (likely(v != KN_DEACTIVATED_BIAS))
 		return;
 
What I don't understand, however, is what kernfs_get/put_active() is intending to do -- it looks like it's trying to provide an interruption point for open kernfs file operations?
This all seems extremely complex for what seems like it should just be a global "am I being removed?" bool?
Regardless, while I do see the logic of associating the module get/put with get/put of kernfs "active", why is it not better tied to strictly kernfs open/close? That would seem to be much simpler and not require any special handling?
For example, why does this not work?
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 60e2a86c535e..e44502ac244d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -525,6 +525,9 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 {
 	struct kernfs_open_node *on, *new_on = NULL;
 
+	if (!try_module_get(kn->owner))
+		return -ENODEV;
+
  retry:
 	mutex_lock(&kernfs_open_file_mutex);
 	spin_lock_irq(&kernfs_open_node_lock);
@@ -592,6 +595,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 	mutex_unlock(&kernfs_open_file_mutex);
 
 	kfree(on);
+	module_put(kn->owner);
 }
 
 static int kernfs_fop_open(struct inode *inode, struct file *file)
@@ -719,6 +723,7 @@ static int kernfs_fop_open(struct inode *inode, struct file *file)
 	kfree(of);
 err_out:
 	kernfs_put_active(kn);
+	module_put(kn->owner);
 	return error;
 }
 
@@ -572,7 +597,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 					     struct kernfs_node *parent,
 					     const char *name, umode_t mode,
 					     kuid_t uid, kgid_t gid,
-					     unsigned flags)
+					     unsigned flags,
+					     struct module *owner)
 {
 	struct kernfs_node *kn;
 	u32 id_highbits;
@@ -607,6 +633,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	kn->name = name;
 	kn->mode = mode;
 	kn->flags = flags;
+	kn->owner = owner;
 
 	if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
 		struct iattr iattr = {
@@ -640,12 +667,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    kuid_t uid, kgid_t gid,
-				    unsigned flags)
+				    unsigned flags,
+				    struct module *owner)
 {
 	struct kernfs_node *kn;
 
 	kn = __kernfs_new_node(kernfs_root(parent), parent,
-			       name, mode, uid, gid, flags);
+			       name, mode, uid, gid, flags, owner);
 	if (kn) {
 		kernfs_get(parent);
 		kn->parent = parent;
@@ -927,7 +955,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 
 	kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
 			       GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-			       KERNFS_DIR);
+			       KERNFS_DIR, NULL);
 	if (!kn) {
 		idr_destroy(&root->ino_idr);
 		kfree(root);
@@ -969,20 +997,22 @@ void kernfs_destroy_root(struct kernfs_root *root)
  * @gid: gid of the new directory
  * @priv: opaque data associated with the new directory
  * @ns: optional namespace tag of the directory
+ * @owner: if set, the module owner of this directory
  *
  * Returns the created node on success, ERR_PTR() value on failure.
  */
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 kuid_t uid, kgid_t gid,
-					 void *priv, const void *ns)
+					 void *priv, const void *ns,
+					 struct module *owner)
 {
 	struct kernfs_node *kn;
 	int rc;
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, mode | S_IFDIR,
-			     uid, gid, KERNFS_DIR);
+			     uid, gid, KERNFS_DIR, owner);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
@@ -1014,7 +1044,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
-			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR, NULL);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4479c6580333..0e125287e050 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -978,6 +978,7 @@ const struct file_operations kernfs_file_fops = {
  * @priv: private data for the file
  * @ns: optional namespace tag of the file
  * @key: lockdep key for the file's active_ref, %NULL to disable lockdep
+ * @owner: if set, the module owner of the file
  *
  * Returns the created node on success, ERR_PTR() value on error.
  */
@@ -987,7 +988,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
-					 struct lock_class_key *key)
+					 struct lock_class_key *key,
+					 struct module *owner)
 {
 	struct kernfs_node *kn;
 	unsigned flags;
@@ -996,7 +998,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 	flags = KERNFS_FILE;
 
 	kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
-			     uid, gid, flags);
+			     uid, gid, flags, owner);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 9e3abf597e2d..6d275d661987 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -134,7 +134,8 @@ int kernfs_add_one(struct kernfs_node *kn);
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    kuid_t uid, kgid_t gid,
-				    unsigned flags);
+				    unsigned flags,
+				    struct module *owner);
 
 /*
  * file.c
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 19a6c71c6ff5..5a053eebee52 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}
 
-	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK,
+			     target->owner);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index b6b6796e1616..9763c2fde3c7 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -57,7 +57,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
 	kobject_get_ownership(kobj, &uid, &gid);
 
 	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
-				  kobj, ns);
+				  kobj, ns, NULL);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, kobject_name(kobj));
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 42dcf96881b6..af9e91fd1a92 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -292,7 +292,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
 #endif
 
 	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  PAGE_SIZE, ops, (void *)attr, ns, key);
+				  PAGE_SIZE, ops, (void *)attr, ns, key,
+				  attr->owner);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
@@ -327,7 +328,8 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
 #endif
 
 	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  battr->size, ops, (void *)attr, ns, key);
+				  battr->size, ops, (void *)attr, ns, key,
+				  attr->owner);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index eeb0e3099421..372864d1cb54 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -135,7 +135,8 @@ static int internal_create_group(struct kobject *kobj, int update,
 		} else {
 			kn = kernfs_create_dir_ns(kobj->sd, grp->name,
 						  S_IRWXU | S_IRUGO | S_IXUGO,
-						  uid, gid, kobj, NULL);
+						  uid, gid, kobj, NULL,
+						  grp->owner);
 			if (IS_ERR(kn)) {
 				if (PTR_ERR(kn) == -EEXIST)
 					sysfs_warn_dup(kobj->sd, grp->name);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cd968ee2b503..818b00ebd107 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -161,6 +161,7 @@ struct kernfs_node {
 	unsigned short		flags;
 	umode_t			mode;
 	struct kernfs_iattrs	*iattr;
+	struct module		*owner;
 };
 
 /*
@@ -370,7 +371,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 kuid_t uid, kgid_t gid,
-					 void *priv, const void *ns);
+					 void *priv, const void *ns,
+					 struct module *owner);
 struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 					    const char *name);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
@@ -379,7 +381,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
-					 struct lock_class_key *key);
+					 struct lock_class_key *key,
+					 struct module *owner);
 struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 				       const char *name,
 				       struct kernfs_node *target);
@@ -472,14 +475,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
 static inline struct kernfs_node *
 kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
-		     void *priv, const void *ns)
+		     void *priv, const void *ns, struct module *owner)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
 __kernfs_create_file(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
 		     loff_t size, const struct kernfs_ops *ops,
-		     void *priv, const void *ns, struct lock_class_key *key)
+		     void *priv, const void *ns, struct lock_class_key *key,
+		     struct module *owner)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
@@ -566,7 +570,7 @@ kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
 {
 	return kernfs_create_dir_ns(parent, name, mode,
 				    GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				    priv, NULL);
+				    priv, NULL, parent->owner);
 }
 
 static inline int kernfs_remove_by_name(struct kernfs_node *parent,
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index e3f1e8ac1f85..babbabb460dc 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -30,6 +30,7 @@ enum kobj_ns_type;
 struct attribute {
 	const char		*name;
 	umode_t			mode;
+	struct module		*owner;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	bool			ignore_lockdep:1;
 	struct lock_class_key	*key;
@@ -80,6 +81,7 @@ do {							\
  * @attrs:	Pointer to NULL terminated list of attributes.
  * @bin_attrs:	Pointer to NULL terminated list of binary attributes.
  *		Either attrs or bin_attrs or both must be provided.
+ * @module:	If set, module responsible for this attribute group
  */
 struct attribute_group {
 	const char		*name;
@@ -89,6 +91,7 @@ struct attribute_group {
 						  struct bin_attribute *, int);
 	struct attribute	**attrs;
 	struct bin_attribute	**bin_attrs;
+	struct module		*owner;
 };
 
 /*
@@ -100,38 +103,52 @@ struct attribute_group {
 
 #define __ATTR(_name, _mode, _show, _store) {				\
 	.attr = {.name = __stringify(_name),				\
-		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode),		\
+		 .owner = THIS_MODULE,					\
+	},								\
 	.show	= _show,						\
 	.store	= _store,						\
 }
 
 #define __ATTR_PREALLOC(_name, _mode, _show, _store) {			\
 	.attr = {.name = __stringify(_name),				\
-		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
+		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
+		 .owner = THIS_MODULE,					\
+	},								\
 	.show	= _show,						\
 	.store	= _store,						\
 }
 
 #define __ATTR_RO(_name) {						\
-	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
+	.attr	= { .name = __stringify(_name),				\
+		    .mode = 0444,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.show	= _name##_show,						\
 }
 
 #define __ATTR_RO_MODE(_name, _mode) {					\
 	.attr	= { .name = __stringify(_name),				\
-		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),		\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.show	= _name##_show,						\
 }
 
 #define __ATTR_RW_MODE(_name, _mode) {					\
 	.attr	= { .name = __stringify(_name),				\
-		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),		\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.show	= _name##_show,						\
 	.store	= _name##_store,					\
 }
 
 #define __ATTR_WO(_name) {						\
-	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
+	.attr	= { .name = __stringify(_name),				\
+		    .mode = 0200,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.store	= _name##_store,					\
 }
 
@@ -141,8 +158,11 @@ struct attribute_group {
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) {	\
-	.attr = {.name = __stringify(_name), .mode = _mode,	\
-			.ignore_lockdep = true },		\
+	.attr = {.name = __stringify(_name),			\
+		 .mode = _mode,					\
+		 .ignore_lockdep = true,			\
+		 .owner = THIS_MODULE,				\
+	},							\
 	.show		= _show,				\
 	.store		= _store,				\
 }
@@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = {	\
 #define ATTRIBUTE_GROUPS(_name)					\
 static const struct attribute_group _name##_group = {		\
 	.attrs = _name##_attrs,					\
+	.owner = THIS_MODULE,					\
 };								\
 __ATTRIBUTE_GROUPS(_name)
 
@@ -199,20 +220,29 @@ struct bin_attribute {
 
 /* macros to create static binary attributes easier */
 #define __BIN_ATTR(_name, _mode, _read, _write, _size) {		\
-	.attr = { .name = __stringify(_name), .mode = _mode },		\
+	.attr = { .name = __stringify(_name),				\
+		  .mode = _mode,					\
+		  .owner = THIS_MODULE,					\
+	},								\
 	.read	= _read,						\
 	.write	= _write,						\
 	.size	= _size,						\
 }
 
 #define __BIN_ATTR_RO(_name, _size) {					\
-	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
+	.attr	= { .name = __stringify(_name),				\
+		    .mode = 0444,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.read	= _name##_read,						\
 	.size	= _size,						\
 }
 
 #define __BIN_ATTR_WO(_name, _size) {					\
-	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
+	.attr	= { .name = __stringify(_name),				\
+		    .mode = 0200,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.write	= _name##_write,					\
 	.size	= _size,						\
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9e0390000025..c6b0a28f599c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3975,7 +3975,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 				  cgroup_file_mode(cft),
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				  0, cft->kf_ops, cft,
-				  NULL, key);
+				  NULL, key, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
2.30.2
On Tue, Oct 05, 2021 at 01:50:31PM -0700, Kees Cook wrote:
On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
A sketch of how this can happen follows. Consider foo, a mutex local to a driver, used both in the driver's module exit routine and in one of its sysfs ops:

foo.c:
static DEFINE_MUTEX(foo);
static ssize_t foo_store(struct device *dev,
			 struct device_attribute *attr,
			 const char *buf, size_t count)
{
	...
	mutex_lock(&foo);
	...
	mutex_unlock(&foo);
	...
}
static DEVICE_ATTR_RW(foo);
...
void foo_exit(void)
{
	mutex_lock(&foo);
	...
	mutex_unlock(&foo);
}
module_exit(foo_exit);
And this can lead to this condition:
CPU A                              CPU B
                                   foo_store()
foo_exit()
  mutex_lock(&foo)
                                   mutex_lock(&foo)
  del_gendisk(some_struct->disk);
    device_del()
      device_remove_groups()
Please expand this further, where does device_remove_groups() end up waiting for that never happens?
Sure. How about:
Furthermore, device_remove_groups() will just go on trying to remove the sysfs files, which are kernfs entries. The way kernfs deals with removal is that it waits until all active references for the files being removed are done. The active reference is obtained through kernfs_get_active(). Removal ends up waiting in kernfs_drain() for the active references to be dropped, and that only happens if the kernfs file ops can complete. If these kernfs ops / sysfs files are waiting for a mutex which was taken by the module's exit routine prior to trying to remove the sysfs files, we deadlock.
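To make the "removal waits for active references" step concrete, here is a small userspace model of the bias-based counter. It is a sketch, assuming the scheme where deactivation adds a large negative bias to the active count and draining is complete once the count equals the bias exactly. The `drain_*` names are hypothetical, and the real kernfs_drain() sleeps on a waitqueue rather than being polled.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* A large negative bias: any biased count reads as "deactivated". */
#define KN_DEACTIVATED_BIAS	(INT_MIN + 1)

struct drain_node {
	atomic_int active;
};

/* Take an active reference unless the node has been deactivated. */
static bool drain_get_active(struct drain_node *kn)
{
	int cur = atomic_load(&kn->active);

	while (cur >= 0) {
		if (atomic_compare_exchange_weak(&kn->active, &cur, cur + 1))
			return true;
	}
	return false;
}

/* Drop an active reference taken with drain_get_active(). */
static void drain_put_active(struct drain_node *kn)
{
	int cur = atomic_load(&kn->active);

	/* There must be at least one outstanding reference. */
	assert(cur != 0 && cur != KN_DEACTIVATED_BIAS);
	atomic_fetch_sub(&kn->active, 1);
}

/* Removal side: block new references from now on. */
static void drain_deactivate(struct drain_node *kn)
{
	atomic_fetch_add(&kn->active, KN_DEACTIVATED_BIAS);
}

/* Removal may proceed only once every outstanding reference is gone. */
static bool drain_done(struct drain_node *kn)
{
	return atomic_load(&kn->active) == KN_DEACTIVATED_BIAS;
}
```

If the op holding the last reference is itself blocked on a mutex owned by the remover, `drain_done()` never becomes true, which is the wait that never ends in the scenario above.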
In this situation foo_store() is waiting for the mutex foo to become unlocked, but that won't happen until module removal is complete. But module removal won't complete until the sysfs file being poked at completes which is waiting for a lock already held.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
 fs/kernfs/dir.c                        | 44 ++++++++++++++++++----
 fs/kernfs/file.c                       |  6 ++-
 fs/kernfs/kernfs-internal.h            |  3 +-
 fs/kernfs/symlink.c                    |  3 +-
 fs/sysfs/dir.c                         |  2 +-
 fs/sysfs/file.c                        |  6 ++-
 fs/sysfs/group.c                       |  3 +-
 include/linux/kernfs.h                 | 14 ++++---
 include/linux/sysfs.h                  | 52 ++++++++++++++++++++------
 kernel/cgroup/cgroup.c                 |  2 +-
 11 files changed, 105 insertions(+), 34 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..4edf3b37fd2c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
 
 	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				  0, rft->kf_ops, rft, NULL, NULL);
+				  0, rft->kf_ops, rft, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
@@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
 
 	kn = __kernfs_create_file(parent_kn, name, 0444,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
-				  &kf_mondata_ops, priv, NULL, NULL);
+				  &kf_mondata_ops, priv, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..e841201fd11b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/hash.h>
+#include <linux/module.h>
 
 #include "kernfs-internal.h"
 
@@ -414,12 +415,29 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
  */
 struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
 {
+	int v;
+
 	if (unlikely(!kn))
 		return NULL;
 
 	if (!atomic_inc_unless_negative(&kn->active))
 		return NULL;
 
+	/*
+	 * If a module created the kernfs_node, the module cannot possibly be
+	 * removed if the above atomic_inc_unless_negative() succeeded. So the
+	 * try_module_get() below is not to protect the lifetime of the module
+	 * as that is already guaranteed. The try_module_get() below is used
+	 * to ensure that we don't deadlock in case a kernfs operation and
+	 * module removal used a shared lock.
+	 */
+	if (!try_module_get(kn->owner)) {
+		v = atomic_dec_return(&kn->active);
+		if (unlikely(v == KN_DEACTIVATED_BIAS))
+			wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+		return NULL;
+	}
The special casing in here makes me think this isn't happening in the right place. (i.e. this looks like an open-coded version of kernfs_put_active())
No, well you see, in effect the special care taken in kernfs_put_active() *is* the right way to inform a waiter that the *taken* reference right above *also* is no longer active.
The special casing here is because we took the active reference before the try_module_get() in the above atomic_inc_unless_negative() call. Outside callers deal with this through kernfs_put_active().
We are special casing to deal with the deadlock case.
+
 	if (kernfs_lockdep(kn))
 		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
 	return kn;
@@ -442,6 +460,13 @@ void kernfs_put_active(struct kernfs_node *kn)
 	if (kernfs_lockdep(kn))
 		rwsem_release(&kn->dep_map, _RET_IP_);
 	v = atomic_dec_return(&kn->active);
+
+	/*
+	 * We prevent module exit *until* we know for sure all possible
+	 * kernfs ops are done.
+	 */
+	module_put(kn->owner);
+
 	if (likely(v != KN_DEACTIVATED_BIAS))
 		return;
 
What I don't understand, however, is what kernfs_get/put_active() is intending to do -- it looks like it's trying to provide an interruption point for open kernfs file operations?
It is essentially ensuring that removal does not happen if any ops are being used.
This all seems extremely complex for what seems like it should just be a global "am I being removed?" bool?
It used to be worse :) And Tejun has cleaned this up over time. Yes, perhaps we can improve that more but, given how sensitive this code is, I think such improvements should be made separately.
Regardless, while I do see the logic of associating the module get/put with get/put of kernfs "active", why is it not better tied to strictly kernfs open/close?
It's not just files, consider kernfs_iop_mkdir() which also calls kernfs_get_active(). How about kernfs_fop_mmap()? And so, the common denominator is actually kernfs_get_active().
That would seem to be much simpler and not require any special handling?
Yes, true, but I think this would still leave open some other possible deadlocks.
For example, why does this not work?
It does for the write case for sure. I haven't written tests for the other odd cases, but I suspect those would deadlock as well.
Luis
On Mon, Oct 11, 2021 at 03:26:02PM -0700, Luis Chamberlain wrote:
On Tue, Oct 05, 2021 at 01:50:31PM -0700, Kees Cook wrote:
For example, why does this not work?
It does for the write case for sure,
I misspoke. Just for the record, the changes you mentioned actually don't suffice for the test cases in question for test_sysfs; the deadlock still occurs with those changes. At first I thought they did, but I had failed to remove my own fix from fs/kernfs/dir.c first. After removing that and trying just the proposed changes, I can confirm they do not fix the deadlock.
Luis
Now that sysfs has the deadlock race with module removal fixed, enable the module removal deadlock tests. They were left disabled by default, as otherwise you would deadlock your system.

./tools/testing/selftests/sysfs/sysfs.sh -t 0027
Running test: sysfs_test_0027 - run #0
Test for possible rmmod deadlock while writing x ... ok

./tools/testing/selftests/sysfs/sysfs.sh -t 0028
Running test: sysfs_test_0028 - run #0
Test for possible rmmod deadlock using rtnl_lock while writing x ... ok

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 tools/testing/selftests/sysfs/sysfs.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
index f928635d0e35..4047ac48e764 100755
--- a/tools/testing/selftests/sysfs/sysfs.sh
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -60,8 +60,8 @@ ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block"
 ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
-ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
-ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+ALL_TESTS="$ALL_TESTS 0027:1:1:test_dev_x:block" # deadlock test
+ALL_TESTS="$ALL_TESTS 0028:1:1:test_dev_x:block" # deadlock test with rntl_lock
 ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
 ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
 ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
Provide a simple state machine to fix races between driver exit, where we remove the CPU hotplug multistate callbacks, and the re-initialization / creation of new per-CPU instances, which should be managed by these callbacks.
The zram driver makes use of CPU hotplug multistate support, whereby it associates a struct zcomp per CPU. Each struct zcomp represents a compression algorithm in charge of managing compression streams per CPU. Although a compiled zram driver only supports a fixed set of compression algorithms, each zram device gets a struct zcomp allocated per CPU. The "multi" in CPU hotplug multistate refers to these per-CPU struct zcomp instances. Each of these will have the CPU hotplug callback called for it on CPU plug / unplug. The kernel's CPU hotplug multistate keeps a linked list of these different structures so that it will iterate over them on CPU transitions.
By default at driver initialization we will create just one zram device (num_devices=1), and a zcomp structure is then set up for the now default lzo-rle compression algorithm. At driver removal we first remove each zram device, and so we destroy the associated struct zcomp per CPU. But since we expose sysfs attributes to create new devices or reset / initialize existing zram devices, we can easily end up re-initializing a struct zcomp for a zram device before the exit routine of the module removes the CPU hotplug callback. When this happens the kernel's CPU hotplug will detect that at least one instance (struct zcomp for us) exists. This can happen in the following situation:
CPU 1                            CPU 2

                                 disksize_store(...);
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();

idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);
The warning comes up on cpuhp_remove_multi_state() when it sees that the state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list. In this case a struct zcomp still exists: the driver allowed its creation per CPU even though we could have just freed them per CPU through a call on another CPU, and we are then later trying to remove the hotplug callback.
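To make the failure mode concrete, here is a minimal userspace sketch of a multistate entry that keeps a linked list of instances and complains when it is removed while the list is not empty. All names here (struct multi_state, state_add_instance(), state_remove_instance(), state_remove()) are hypothetical stand-ins for illustration only; this is not the kernel's CPU hotplug code and it ignores locking entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct instance {
	struct instance *next;
};

struct multi_state {
	struct instance *instances;	/* singly linked list of live instances */
	int registered;			/* non-zero while the state is set up */
};

/* Link one instance (think: one struct zcomp) into the state. */
static void state_add_instance(struct multi_state *st, struct instance *inst)
{
	assert(st && inst);
	inst->next = st->instances;
	st->instances = inst;
}

/* Unlink one instance if it is on the list; otherwise do nothing. */
static void state_remove_instance(struct multi_state *st, struct instance *inst)
{
	struct instance **pp;

	for (pp = &st->instances; *pp; pp = &(*pp)->next) {
		if (*pp == inst) {
			*pp = inst->next;
			inst->next = NULL;
			return;
		}
	}
}

/*
 * Remove the state itself. Returns 0 on a clean removal and -1 if
 * instances were still linked, which is the situation the warning in
 * this commit log describes.
 */
static int state_remove(struct multi_state *st)
{
	int ret = 0;

	if (st->instances) {
		fprintf(stderr, "Error: Removing state which has instances left.\n");
		ret = -1;
	}
	st->registered = 0;
	return ret;
}
```

In this model the race corresponds to a late state_add_instance() call landing after the exit path has already removed every instance it knew about, but before it calls state_remove().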
Fix all this by providing a zram initialization boolean, protected by the driver's shared zram_index_mutex, which we can use to annotate when sysfs attributes are safe to use or not -- once the driver is properly initialized. When the driver is going down we also make sure not to let userspace muck with attributes which may affect each per cpu struct zcomp.
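The guard described above can be sketched in userspace as follows. This is only a model under stated assumptions: a pthread mutex stands in for zram_index_mutex, driver_up stands in for zram_up, and live_instances is a hypothetical counter standing in for the per CPU struct zcomp instances. The function names are made up for the example and are not part of the driver.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; this is not driver code. */
static pthread_mutex_t index_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool driver_up;		/* models zram_up, protected by index_mutex */
static int live_instances;	/* models allocated struct zcomp instances */

/* Models a sysfs store such as disksize_store(): may create an instance. */
static int model_store(void)
{
	int err = 0;

	pthread_mutex_lock(&index_mutex);
	if (!driver_up) {
		err = -ENODEV;	/* not initialized yet, or going away */
		goto out;
	}
	live_instances++;
out:
	pthread_mutex_unlock(&index_mutex);
	return err;
}

/* Models the end of module init: attributes become usable. */
static void model_init(void)
{
	pthread_mutex_lock(&index_mutex);
	driver_up = true;
	pthread_mutex_unlock(&index_mutex);
}

/* Models the exit path: flip the flag and tear down under the same lock. */
static void model_exit(void)
{
	pthread_mutex_lock(&index_mutex);
	driver_up = false;	/* stores arriving later fail with -ENODEV */
	live_instances = 0;	/* remove every instance we know about */
	pthread_mutex_unlock(&index_mutex);
}
```

Because the flag is checked and cleared under the same mutex, a store either completes before teardown starts or observes the cleared flag and bails out, so in this model no instance can appear after teardown.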
This also fixes a series of possible memory leaks. The crashes and memory leaks can easily be caused by issuing the zram02.sh script from the LTP project [0] in a loop in two separate windows:
cd testcases/kernel/device-drivers/zram
while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
You end up with a splat as follows:
kernel: zram: Removed device: zram0
kernel: zram: Added device: zram0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: Adding 104857596k swap on /dev/zram0. <etc>
kernel: zram0: detected capacity change from 209715200 to 0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at \
	kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G \
	E 5.12.0-rc1-next-20210304 #3
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
	BIOS 1.14.0-2 04/01/2014
kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Code: <etc>
kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffa800c139bcb8
kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: __cpuhp_remove_state+0x2e/0x80
kernel: __do_sys_delete_module+0x190/0x2a0
kernel: do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
The "Error: Removing state 63 which has instances left" refers to the zram per CPU struct zcomp instances left.
[0] https://github.com/linux-test-project/ltp.git
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
 1 file changed, 55 insertions(+), 8 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f61910c65f0f..b26abcb955cc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
 static int zram_major;
 static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;

+static bool zram_up;
+
 /* Module params (documentation at end) */
 static unsigned int num_devices = 1;
 /*
@@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
 	comp = zram->comp;
 	disksize = zram->disksize;
 	zram->disksize = 0;
+	zram->comp = NULL;

 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);
@@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
 	struct zram *zram = dev_to_zram(dev);
 	int err;

+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		err = -ENODEV;
+		goto out;
+	}
+
 	disksize = memparse(buf, NULL);
-	if (!disksize)
-		return -EINVAL;
+	if (!disksize) {
+		err = -EINVAL;
+		goto out;
+	}

 	down_write(&zram->init_lock);
 	if (init_done(zram)) {
@@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
 	set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
 	up_write(&zram->init_lock);

+	mutex_unlock(&zram_index_mutex);
+
 	return len;

 out_free_meta:
 	zram_meta_free(zram, disksize);
 out_unlock:
 	up_write(&zram->init_lock);
+out:
+	mutex_unlock(&zram_index_mutex);
 	return err;
 }

@@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
 	if (ret)
 		return ret;

-	if (!do_reset)
-		return -EINVAL;
+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		len = -ENODEV;
+		goto out;
+	}
+
+	if (!do_reset) {
+		len = -EINVAL;
+		goto out;
+	}

 	zram = dev_to_zram(dev);
 	bdev = zram->disk->part0;
@@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
 	/* Do not reset an active device or claimed device */
 	if (bdev->bd_openers || zram->claim) {
 		mutex_unlock(&bdev->bd_disk->open_mutex);
-		return -EBUSY;
+		len = -EBUSY;
+		goto out;
 	}

 	/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
 	zram->claim = false;
 	mutex_unlock(&bdev->bd_disk->open_mutex);

+out:
+	mutex_unlock(&zram_index_mutex);
 	return len;
 }

@@ -2010,6 +2038,10 @@ static ssize_t hot_add_show(struct class *class,
 	int ret;

 	mutex_lock(&zram_index_mutex);
+	if (!zram_up) {
+		mutex_unlock(&zram_index_mutex);
+		return -ENODEV;
+	}
 	ret = zram_add();
 	mutex_unlock(&zram_index_mutex);

@@ -2037,6 +2069,11 @@ static ssize_t hot_remove_store(struct class *class,

 	mutex_lock(&zram_index_mutex);

+	if (!zram_up) {
+		ret = -ENODEV;
+		goto out;
+	}
+
 	zram = idr_find(&zram_index_idr, dev_id);
 	if (zram) {
 		ret = zram_remove(zram);
@@ -2046,6 +2083,7 @@ static ssize_t hot_remove_store(struct class *class,
 		ret = -ENODEV;
 	}

+out:
 	mutex_unlock(&zram_index_mutex);
 	return ret ? ret : count;
 }
@@ -2072,12 +2110,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)

 static void destroy_devices(void)
 {
+	mutex_lock(&zram_index_mutex);
+	zram_up = false;
 	class_unregister(&zram_control_class);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
 	zram_debugfs_destroy();
 	idr_destroy(&zram_index_idr);
 	unregister_blkdev(zram_major, "zram");
 	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+	mutex_unlock(&zram_index_mutex);
 }

 static int __init zram_init(void)
@@ -2105,15 +2146,21 @@ static int __init zram_init(void)
 		return -EBUSY;
 	}

+	mutex_lock(&zram_index_mutex);
+
 	while (num_devices != 0) {
-		mutex_lock(&zram_index_mutex);
 		ret = zram_add();
-		mutex_unlock(&zram_index_mutex);
-		if (ret < 0)
+		if (ret < 0) {
+			mutex_unlock(&zram_index_mutex);
 			goto out_error;
+		}
 		num_devices--;
 	}

+	zram_up = true;
+
+	mutex_unlock(&zram_index_mutex);
+
 	return 0;

 out_error:
On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
...
CPU 1                            CPU 2

                                 disksize_store(...);
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();

idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);
So this is strictly separate from the sysfs/module unloading race?
-Kees
On Tue, Oct 05, 2021 at 01:55:35PM -0700, Kees Cook wrote:
On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
...
CPU 1                            CPU 2

                                 disksize_store(...);
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();

idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);
So this is strictly separate from the sysfs/module unloading race?
It is only related in the sense that the sysfs/module unloading race happened *after* this other issue, but addressing these through separate threads created a break in conversation and focus. For instance, a theoretical race was mentioned in one thread, which I worked to prove or disprove, and in the end I showed it was not possible.
But at this point, yes, this is a purely separate issue, and this patch *should* be picked up already.
Andrew, can you merge this? It already has the respective maintainer Ack, and I can continue to work on the rest of the patches. The only issue I can think of would be a conflict with the last patch, but that's a one-liner. I think chances are low that it would create a conflict if it's all merged separately, and if it does, it should be an easy merge conflict to fix.
Luis
On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
...
Hello Luis,
Can you test the following patch and see if the issue can be addressed?
Please see the idea from the inline comment.
Also, compared with your patch, zram_index_mutex isn't needed in the zram disk's store(), so the deadlock issue you are addressing in this series can be avoided.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..3c17927d23a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)

 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
-	zram_reset_device(zram);

 	pr_info("Removed device: %s\n", zram->disk->disk_name);

 	del_gendisk(zram->disk);
+
+	/*
+	 * reset device after gendisk is removed, so any change from sysfs
+	 * store won't come in, then we can really reset device here
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;
@@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
 static void destroy_devices(void)
 {
 	class_unregister(&zram_control_class);
+
+	/* hold the global lock so new device can't be added */
+	mutex_lock(&zram_index_mutex);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
+	mutex_unlock(&zram_index_mutex);
+
 	zram_debugfs_destroy();
 	idr_destroy(&zram_index_idr);
 	unregister_blkdev(zram_major, "zram");
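The ordering idea in the inline comment of the patch above can be illustrated with a small userspace model. It is only a sketch under loud assumptions: struct model_dev, late_store(), model_reset(), model_del_disk() and model_remove() are hypothetical names, a "racer" callback invoked between steps stands in for a concurrent sysfs store, and nothing here says whether the real driver is fully fixed by this reordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of one device; this is not driver code. */
struct model_dev {
	bool attrs_visible;	/* sysfs attributes still reachable */
	bool initialized;	/* models an allocated struct zcomp */
};

/* Models a late disksize store arriving from userspace. */
static void late_store(struct model_dev *dev)
{
	if (dev->attrs_visible)
		dev->initialized = true;
}

/* Models the reset step: drop whatever was initialized. */
static void model_reset(struct model_dev *dev)
{
	dev->initialized = false;
}

/* Models removing the disk: its attributes are no longer reachable. */
static void model_del_disk(struct model_dev *dev)
{
	dev->attrs_visible = false;
}

/*
 * Run the two teardown steps in the requested order, calling racer()
 * after each step to model a store landing at that point.
 * Returns true if the device is still initialized afterwards.
 */
static bool model_remove(struct model_dev *dev, bool reset_first,
			 void (*racer)(struct model_dev *))
{
	if (reset_first) {
		model_reset(dev);
		racer(dev);		/* can re-initialize: still visible */
		model_del_disk(dev);
	} else {
		model_del_disk(dev);
		racer(dev);		/* cannot get in: already hidden */
		model_reset(dev);
	}
	racer(dev);
	return dev->initialized;
}
```

With reset_first set, the model ends with the device still initialized, which mirrors the leftover instance; with the reset moved after the disk removal, the model ends clean.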
Thanks, Ming
On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
...
Hello Luis,
Can you test the following patch and see if the issue can be addressed?
Please see the idea from the inline comment.
Also, compared with your patch, zram_index_mutex isn't needed in the zram disk's store(), so the deadlock issue you are addressing in this series can be avoided.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..3c17927d23a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)

 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
-	zram_reset_device(zram);

 	pr_info("Removed device: %s\n", zram->disk->disk_name);

 	del_gendisk(zram->disk);
+
+	/*
+	 * reset device after gendisk is removed, so any change from sysfs
+	 * store won't come in, then we can really reset device here
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;
@@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
 static void destroy_devices(void)
 {
 	class_unregister(&zram_control_class);
+
+	/* hold the global lock so new device can't be added */
+	mutex_lock(&zram_index_mutex);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
+	mutex_unlock(&zram_index_mutex);
Actually zram_index_mutex isn't needed when calling zram_remove_cb(), since the zram-control sysfs interface has already been removed and userspace can't add new devices any more. The issue should then be fixed by the following one-line change, please test it:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..96dd641de233 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)

 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
-	zram_reset_device(zram);

 	pr_info("Removed device: %s\n", zram->disk->disk_name);

 	del_gendisk(zram->disk);
+
+	/*
+	 * reset device after gendisk is removed, so any change from sysfs
+	 * store won't come in, then we can really reset device here
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;
Thanks, Ming
On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
...
Sorry but nope, the cpu multistate issue is still present and we end up eventually with page faults. I tried with both patches.
Oct 14 20:21:34 kdevops kernel: ------------[ cut here ]------------
Oct 14 20:21:34 kdevops kernel: Error: Removing state 65 which has instances left.
Oct 14 20:21:34 kdevops kernel: WARNING: CPU: 4 PID: 3358 at kernel/cpu.c:2151 __cpuhp_remove_state_cpuslocked+0xf9/0x100
Oct 14 20:21:34 kdevops kernel: Modules linked in: zram(E-) zstd(E) zsmalloc(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) >
Oct 14 20:21:34 kdevops kernel: CPU: 4 PID: 3358 Comm: rmmod Tainted: G E 5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:34 kdevops kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:34 kdevops kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
Oct 14 20:21:34 kdevops kernel: Code: 21 00 48 c7 43 18 00 00 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f e9 d8 17 84 00 0f 0b 44 89 e6 48 c7 c7 78 0c 8b ad e8 56 92 7f 00 <0f> 0b >
Oct 14 20:21:34 kdevops kernel: RSP: 0018:ffffaac980a1fe90 EFLAGS: 00010286
Oct 14 20:21:34 kdevops kernel: RAX: 0000000000000000 RBX: ffffffffada3e208 RCX: 0000000000000000
Oct 14 20:21:34 kdevops kernel: RDX: 0000000000000001 RSI: ffffffffad8efdb6 RDI: 00000000ffffffff
Oct 14 20:21:34 kdevops kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaac980a1fcc0
Oct 14 20:21:34 kdevops kernel: R10: ffffaac980a1fcb8 R11: ffffffffadac3c68 R12: 0000000000000041
Oct 14 20:21:34 kdevops kernel: R13: 0000000000000a28 R14: 0000000000000000 R15: 0000000000000000
Oct 14 20:21:34 kdevops kernel: FS: 00007fc0c2882580(0000) GS:ffff9ed6f7d00000(0000) knlGS:0000000000000000
Oct 14 20:21:34 kdevops kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 14 20:21:34 kdevops kernel: CR2: 00005621b0490b78 CR3: 000000011a538005 CR4: 0000000000370ee0
Oct 14 20:21:34 kdevops kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 14 20:21:34 kdevops kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:34 kdevops kernel: Call Trace:
Oct 14 20:21:34 kdevops kernel: <TASK>
Oct 14 20:21:34 kdevops kernel: __cpuhp_remove_state+0x4d/0xc0
Oct 14 20:21:34 kdevops kernel: __do_sys_delete_module+0x18d/0x2a0
Oct 14 20:21:34 kdevops kernel: ? fpregs_assert_state_consistent+0x1e/0x40
Oct 14 20:21:34 kdevops kernel: ? exit_to_user_mode_prepare+0x3a/0x180
Oct 14 20:21:34 kdevops kernel: do_syscall_64+0x38/0xc0
Oct 14 20:21:34 kdevops kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:34 kdevops kernel: RIP: 0033:0x7fc0c29a84a7
<etc>
Oct 14 20:21:35 kdevops kernel: sysfs: cannot create duplicate filename '/devices/virtual/block/zram0'
Oct 14 20:21:35 kdevops kernel: CPU: 5 PID: 3388 Comm: modprobe Tainted: G W E 5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel: <TASK>
Oct 14 20:21:35 kdevops kernel: dump_stack_lvl+0x48/0x5e
Oct 14 20:21:35 kdevops kernel: sysfs_warn_dup.cold+0x17/0x24
Oct 14 20:21:35 kdevops kernel: sysfs_create_dir_ns+0xbc/0xd0
Oct 14 20:21:35 kdevops kernel: kobject_add_internal+0xbd/0x2b0
Oct 14 20:21:35 kdevops kernel: kobject_add+0x7e/0xb0
Oct 14 20:21:35 kdevops kernel: ? _raw_spin_unlock_irqrestore+0x25/0x40
Oct 14 20:21:35 kdevops kernel: ? preempt_count_add+0x68/0xa0
Oct 14 20:21:35 kdevops kernel: device_add+0x11a/0x980
Oct 14 20:21:35 kdevops kernel: ? dev_set_name+0x53/0x70
Oct 14 20:21:35 kdevops kernel: device_add_disk+0x9d/0x3a0
Oct 14 20:21:35 kdevops kernel: zram_add+0x1ad/0x200 [zram]
Oct 14 20:21:35 kdevops kernel: ? 0xffffffffc0c10000
Oct 14 20:21:35 kdevops kernel: zram_init+0xd7/0x1000 [zram]
Oct 14 20:21:35 kdevops kernel: do_one_initcall+0x41/0x200
Oct 14 20:21:35 kdevops kernel: ? _raw_spin_unlock_irqrestore+0x25/0x40
Oct 14 20:21:35 kdevops kernel: ? kmem_cache_alloc_trace+0x2ab/0x420
Oct 14 20:21:35 kdevops kernel: do_init_module+0x5c/0x270
Oct 14 20:21:35 kdevops kernel: __do_sys_finit_module+0xae/0x110
Oct 14 20:21:35 kdevops kernel: do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7fca3aa555e9
Oct 14 20:21:35 kdevops kernel: Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d >
Oct 14 20:21:35 kdevops kernel: RSP: 002b:00007fff142417b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
Oct 14 20:21:35 kdevops kernel: RAX: ffffffffffffffda RBX: 0000558ba9491bd0 RCX: 00007fca3aa555e9
Oct 14 20:21:35 kdevops kernel: RDX: 0000000000000000 RSI: 0000558ba9491f60 RDI: 0000000000000003
Oct 14 20:21:35 kdevops kernel: RBP: 0000000000040000 R08: 0000000000000000 R09: 0000558ba9491db0
Oct 14 20:21:35 kdevops kernel: R10: 0000000000000003 R11: 0000000000000246 R12: 0000558ba9491f60
Oct 14 20:21:35 kdevops kernel: R13: 0000000000000000 R14: 0000558ba9491d00 R15: 0000558ba9491bd0
Oct 14 20:21:35 kdevops kernel: </TASK>
<etc>
Oct 14 20:21:35 kdevops kernel: kobject_add_internal failed for zram0 with -EEXIST, don't try to register things with the same name in the same directory.
Oct 14 20:21:35 kdevops kernel: ------------[ cut here ]------------
Oct 14 20:21:35 kdevops kernel: WARNING: CPU: 5 PID: 3388 at block/genhd.c:537 device_add_disk+0x1b9/0x3a0
Oct 14 20:21:35 kdevops kernel: Modules linked in: zram(E+) zstd(E) zsmalloc(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) >
Oct 14 20:21:35 kdevops kernel: CPU: 5 PID: 3388 Comm: modprobe Tainted: G W E 5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: RIP: 0010:device_add_disk+0x1b9/0x3a0
Oct 14 20:21:35 kdevops kernel: Code: 00 03 01 00 00 0f 85 32 ff ff ff e9 1e ff ff ff 0f 0b 41 bc ea ff ff ff e9 29 ff ff ff 4c 89 ff e8 5c 45 1c 00 e9 ef fe ff ff <0f> 0b >
Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac980607d90 EFLAGS: 00010287
Oct 14 20:21:35 kdevops kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000023005
Oct 14 20:21:35 kdevops kernel: RDX: 0000000000022e05 RSI: ffffffffacc4b710 RDI: 0000000000000000
Oct 14 20:21:35 kdevops kernel: RBP: ffff9ed5d788a600 R08: 0000000000000000 R09: ffffaac980607a98
Oct 14 20:21:35 kdevops kernel: R10: ffff9ed5c795ef00 R11: ffffffffadac3c68 R12: 00000000ffffffef
Oct 14 20:21:35 kdevops kernel: R13: ffff9ed5d5600000 R14: ffffffffc0a52100 R15: ffff9ed5d5600040
Oct 14 20:21:35 kdevops kernel: FS: 00007fca3a935580(0000) GS:ffff9ed6f7d40000(0000) knlGS:0000000000000000
Oct 14 20:21:35 kdevops kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 14 20:21:35 kdevops kernel: CR2: 00007fff1423e6d8 CR3: 0000000136752002 CR4: 0000000000370ee0
Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel: <TASK>
Oct 14 20:21:35 kdevops kernel: zram_add+0x1ad/0x200
[zram] Oct 14 20:21:35 kdevops kernel: ? 0xffffffffc0c10000 Oct 14 20:21:35 kdevops kernel: zram_init+0xd7/0x1000 [zram] Oct 14 20:21:35 kdevops kernel: do_one_initcall+0x41/0x200 Oct 14 20:21:35 kdevops kernel: ? _raw_spin_unlock_irqrestore+0x25/0x40 Oct 14 20:21:35 kdevops kernel: ? kmem_cache_alloc_trace+0x2ab/0x420 Oct 14 20:21:35 kdevops kernel: do_init_module+0x5c/0x270 Oct 14 20:21:35 kdevops kernel: __do_sys_finit_module+0xae/0x110 Oct 14 20:21:35 kdevops kernel: do_syscall_64+0x38/0xc0 Oct 14 20:21:35 kdevops kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7fca3aa555e9 <etc> Oct 14 20:21:35 kdevops kernel: ------------[ cut here ]------------ Oct 14 20:21:35 kdevops kernel: WARNING: CPU: 2 PID: 3457 at block/genhd.c:564 del_gendisk+0x1a2/0x1d0 Oct 14 20:21:35 kdevops kernel: Modules linked in: 842(E) 842_decompress(E) 842_compress(E) zram(E-) zstd(E) zsmalloc(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E> Oct 14 20:21:35 kdevops kernel: CPU: 2 PID: 3457 Comm: rmmod Tainted: G W E 5.15.0-rc3-next-20210927+ #89 Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Oct 14 20:21:35 kdevops kernel: RIP: 0010:del_gendisk+0x1a2/0x1d0 Oct 14 20:21:35 kdevops kernel: Code: 48 8d 78 40 e8 8f 87 1d 00 48 8b 7b 40 5b 5d 41 5c 48 83 c7 40 e9 4e 47 1c 00 48 8b 70 40 eb ce f6 43 61 04 0f 85 85 fe ff ff <0f> 0b > Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac9807cfe30 EFLAGS: 00010246 Oct 14 20:21:35 kdevops kernel: RAX: ffff9ed5d5600380 RBX: ffff9ed5d788a600 RCX: 0000000000000000 Oct 14 20:21:35 kdevops kernel: RDX: 0000000000000000 RSI: ffffffffad8efdb6 RDI: ffff9ed5d788a600 Oct 14 20:21:35 kdevops kernel: RBP: ffff9ed5d788b600 R08: 0000000000000000 R09: ffffaac9807cfc88 Oct 14 20:21:35 kdevops kernel: R10: ffffaac9807cfc80 R11: ffffffffadac3c68 R12: ffff9ed5d5600000 Oct 14 20:21:35 kdevops kernel: R13: 0000000000000000 R14: ffffffffc0a52360 R15: 
ffff9ed5c4a87b78 Oct 14 20:21:35 kdevops kernel: FS: 00007f292a2bb580(0000) GS:ffff9ed6f7c80000(0000) knlGS:0000000000000000 Oct 14 20:21:35 kdevops kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Oct 14 20:21:35 kdevops kernel: CR2: 000056161b453b78 CR3: 000000013213e002 CR4: 0000000000370ee0 Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Oct 14 20:21:35 kdevops kernel: Call Trace: Oct 14 20:21:35 kdevops kernel: <TASK> Oct 14 20:21:35 kdevops kernel: zram_remove+0x96/0xc0 [zram] Oct 14 20:21:35 kdevops kernel: ? hot_remove_store+0xe0/0xe0 [zram] Oct 14 20:21:35 kdevops kernel: zram_remove_cb+0xd/0x10 [zram] Oct 14 20:21:35 kdevops kernel: idr_for_each+0x5b/0xd0 Oct 14 20:21:35 kdevops kernel: destroy_devices+0x32/0x68 [zram] Oct 14 20:21:35 kdevops kernel: __do_sys_delete_module+0x18d/0x2a0 Oct 14 20:21:35 kdevops kernel: ? fpregs_assert_state_consistent+0x1e/0x40 Oct 14 20:21:35 kdevops kernel: ? 
exit_to_user_mode_prepare+0x3a/0x180 Oct 14 20:21:35 kdevops kernel: do_syscall_64+0x38/0xc0 Oct 14 20:21:35 kdevops kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7f292a3e14a7 <etc> Oct 14 20:21:35 kdevops kernel: BUG: unable to handle page fault for address: ffffffffc0a4e0ae Oct 14 20:21:35 kdevops kernel: #PF: supervisor instruction fetch in kernel mode Oct 14 20:21:35 kdevops kernel: #PF: error_code(0x0010) - not-present page Oct 14 20:21:35 kdevops kernel: PGD 3ba0e067 P4D 3ba0e067 PUD 3ba10067 PMD 10526c067 PTE 0 Oct 14 20:21:35 kdevops kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI Oct 14 20:21:35 kdevops kernel: CPU: 6 PID: 3655 Comm: zram02.sh Tainted: G W E 5.15.0-rc3-next-20210927+ #89 Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Oct 14 20:21:35 kdevops kernel: RIP: 0010:0xffffffffc0a4e0ae Oct 14 20:21:35 kdevops kernel: Code: Unable to access opcode bytes at RIP 0xffffffffc0a4e084. 
Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac980687da8 EFLAGS: 00010286 Oct 14 20:21:35 kdevops kernel: RAX: 0000000000000000 RBX: ffff9ed5c40be400 RCX: 0000000080400035 Oct 14 20:21:35 kdevops kernel: RDX: 0000000080400036 RSI: fffffa3544561080 RDI: 0000000040000000 Oct 14 20:21:35 kdevops kernel: RBP: 0000000001900000 R08: ffff9ed5d5842cc0 R09: 0000000080400035 Oct 14 20:21:35 kdevops kernel: R10: ffff9ed5d5842c00 R11: ffff9ed5f1341350 R12: 0000000001900000 Oct 14 20:21:35 kdevops kernel: R13: ffff9ed5d5666c00 R14: ffff9ed5c40be420 R15: ffff9ed5dfa8c8c0 Oct 14 20:21:35 kdevops kernel: FS: 00007f978fe2d5c0(0000) GS:ffff9ed6f7d80000(0000) knlGS:0000000000000000 Oct 14 20:21:35 kdevops kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Oct 14 20:21:35 kdevops kernel: CR2: ffffffffc0a4e084 CR3: 0000000133fd4006 CR4: 0000000000370ee0 Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Oct 14 20:21:35 kdevops kernel: Call Trace: Oct 14 20:21:35 kdevops kernel: <TASK> Oct 14 20:21:35 kdevops kernel: ? kernfs_fop_write_iter+0x177/0x220 Oct 14 20:21:35 kdevops kernel: ? new_sync_write+0x11c/0x1b0 Oct 14 20:21:35 kdevops kernel: ? vfs_write+0x20d/0x2a0 Oct 14 20:21:35 kdevops kernel: ? ksys_write+0x5f/0xe0 Oct 14 20:21:35 kdevops kernel: ? do_syscall_64+0x38/0xc0 Oct 14 20:21:35 kdevops kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae Oct 14 20:21:35 kdevops kernel: </TASK> <etc, etc, etc, this goes on and on>
Luis
On Thu, Oct 14, 2021 at 01:24:32PM -0700, Luis Chamberlain wrote:
On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
...
Hello Luis,
Can you test the following patch and see if the issue can be addressed?
Please see the idea from the inline comment.
Also zram_index_mutex isn't needed in zram disk's store() compared with your patch, then the deadlock issue you are addressing in this series can be avoided.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..3c17927d23a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
-	zram_reset_device(zram);

 	pr_info("Removed device: %s\n", zram->disk->disk_name);

 	del_gendisk(zram->disk);
+
+	/*
+	 * reset device after gendisk is removed, so any change from sysfs
+	 * store won't come in, then we can really reset device here
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;
@@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
 static void destroy_devices(void)
 {
 	class_unregister(&zram_control_class);
+
+	/* hold the global lock so new device can't be added */
+	mutex_lock(&zram_index_mutex);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
+	mutex_unlock(&zram_index_mutex);
Actually zram_index_mutex isn't needed when calling zram_remove_cb() since the zram-control sysfs interface has been removed, so userspace can't add new device any more, then the issue is supposed to be fixed by the following one line change, please test it:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..96dd641de233 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
-	zram_reset_device(zram);

 	pr_info("Removed device: %s\n", zram->disk->disk_name);

 	del_gendisk(zram->disk);
+
+	/*
+	 * reset device after gendisk is removed, so any change from sysfs
+	 * store won't come in, then we can really reset device here
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;
Sorry but nope, the cpu multistate issue is still present and we end up eventually with page faults. I tried with both patches.
In theory disksize_store() can't come in after del_gendisk() returns, then zram_reset_device() should cleanup everything, that is the issue you described in commit log.
We need to understand the exact reason why there is still cpuhp node left, can you share us the exact steps for reproducing the issue? Otherwise we may have to trace and narrow down the reason.
thanks, Ming
On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
On Thu, Oct 14, 2021 at 01:24:32PM -0700, Luis Chamberlain wrote:
On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
...
Hello Luis,
Can you test the following patch and see if the issue can be addressed?
Please see the idea from the inline comment.
Also zram_index_mutex isn't needed in zram disk's store() compared with your patch, then the deadlock issue you are addressing in this series can be avoided.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..3c17927d23a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
-	zram_reset_device(zram);

 	pr_info("Removed device: %s\n", zram->disk->disk_name);

 	del_gendisk(zram->disk);
+
+	/*
+	 * reset device after gendisk is removed, so any change from sysfs
+	 * store won't come in, then we can really reset device here
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;
@@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
 static void destroy_devices(void)
 {
 	class_unregister(&zram_control_class);
+
+	/* hold the global lock so new device can't be added */
+	mutex_lock(&zram_index_mutex);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
+	mutex_unlock(&zram_index_mutex);
Actually zram_index_mutex isn't needed when calling zram_remove_cb() since the zram-control sysfs interface has been removed, so userspace can't add new device any more, then the issue is supposed to be fixed by the following one line change, please test it:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..96dd641de233 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
-	zram_reset_device(zram);

 	pr_info("Removed device: %s\n", zram->disk->disk_name);

 	del_gendisk(zram->disk);
+
+	/*
+	 * reset device after gendisk is removed, so any change from sysfs
+	 * store won't come in, then we can really reset device here
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;
Sorry but nope, the cpu multistate issue is still present and we end up eventually with page faults. I tried with both patches.
In theory disksize_store() can't come in after del_gendisk() returns, then zram_reset_device() should cleanup everything, that is the issue you described in commit log.
We need to understand the exact reason why there is still cpuhp node left, can you share us the exact steps for reproducing the issue? Otherwise we may have to trace and narrow down the reason.
See my commit log for my own fix for this issue.
Luis
On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
...
We need to understand the exact reason why there is still cpuhp node left, can you share us the exact steps for reproducing the issue? Otherwise we may have to trace and narrow down the reason.
See my commit log for my own fix for this issue.
OK, thanks!
I can reproduce the issue, and the reason is that reset_store fails zram_remove() when unloading module, then the warning is caused.
The top 3 patches in the following tree can fix the issue:
https://github.com/ming1/linux/commits/my_v5.15-blk-dev
Thanks, Ming
On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
...
We need to understand the exact reason why there is still cpuhp node left, can you share us the exact steps for reproducing the issue? Otherwise we may have to trace and narrow down the reason.
See my commit log for my own fix for this issue.
OK, thanks!
I can reproduce the issue, and the reason is that reset_store fails zram_remove() when unloading module, then the warning is caused.
The top 3 patches in the following tree can fix the issue:
At a quick glance, those look sane to me, nice work.
greg k-h
On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
...
We need to understand the exact reason why there is still cpuhp node left, can you share us the exact steps for reproducing the issue? Otherwise we may have to trace and narrow down the reason.
See my commit log for my own fix for this issue.
OK, thanks!
I can reproduce the issue, and the reason is that reset_store fails zram_remove() when unloading module, then the warning is caused.
The top 3 patches in the following tree can fix the issue:
Thanks for trying an alternative fix! A crash stops yes, however this also ends up leaving the driver in an unrecoverable state after a few tries. Ie, you CTRL-C the scripts and try again over and over again and the driver ends up in a situation where it just says:
zram: Can't change algorithm for initialized device
And the zram module can't be removed at that point.
Luis
On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
...
We need to understand the exact reason why there is still cpuhp node left, can you share us the exact steps for reproducing the issue? Otherwise we may have to trace and narrow down the reason.
See my commit log for my own fix for this issue.
OK, thanks!
I can reproduce the issue, and the reason is that reset_store fails zram_remove() when unloading module, then the warning is caused.
The top 3 patches in the following tree can fix the issue:
Thanks for trying an alternative fix! A crash stops yes, however this
I doubt it is alternative since your patchset doesn't mention the exact reason of 'Error: Removing state 63 which has instances left.', that is simply caused by failing to remove zram because ->claim is set during unloading module.
Yeah, you mentioned the race between disksize_store() vs. zram_remove(), however I don't think it is reproduced easily in the test because the race window is pretty small, also it can be fixed easily in my 3rd patch without any complicated tricks.
Not dig into details of your patchset via grabbing module reference count during show/store attribute of kernfs which is done in your patch 9, but IMO this way isn't necessary:
1) any driver module has to cleanup anything which may refer to symbols or data defined in module_exit of this driver
2) device_del() is often done in module_exit(), once device_del() returns, no any new show/store on the device's kobject attribute is possible.
3) it is _not_ a must or pattern for fixing bugs to hold one lock before calling device_del(), meantime the lock is required in the device's attribute show()/store(), which causes AA deadlock easily. Your approach just avoids the issue by not releasing module until all show/store are done.
Also the model of using module refcount is usually that if anyone will use the module, grab one extra ref, and once the use is done, release it. For example of block device, the driver's module refcnt is grabbed when the disk/part is opened, and released when the disk/part is closed.
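To make the refcount model above concrete, here is a minimal user-space sketch. It is not kernel code: toy_module, toy_try_get(), toy_put() and toy_try_unload() are hypothetical names that only mimic the roles try_module_get(), module_put() and module removal play in the kernel, under the assumption that an unload is refused while any user still holds a reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a module and its reference count. */
struct toy_module {
	int refcnt;	/* number of active users, e.g. open disks */
	bool going;	/* set once unload has started */
};

/* Like opening a disk: take a reference unless the module is going away. */
static bool toy_try_get(struct toy_module *m)
{
	if (m->going)
		return false;
	m->refcnt++;
	return true;
}

/* Like closing a disk: drop the reference taken at open time. */
static void toy_put(struct toy_module *m)
{
	assert(m->refcnt > 0);
	m->refcnt--;
}

/* Unload only succeeds once every user has dropped its reference. */
static bool toy_try_unload(struct toy_module *m)
{
	if (m->refcnt > 0)
		return false;
	m->going = true;
	return true;
}
```

In this model an unload attempt between toy_try_get() and toy_put() fails, and a toy_try_get() after a successful unload fails as well, which is the open/close pairing described above.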
also ends up leaving the driver in an unrecoverable state after a few tries. Ie, you CTRL-C the scripts and try again over and over again and the driver ends up in a situation where it just says:
zram: Can't change algorithm for initialized device
It means the algorithm can't be changed for one initialized device at the exact time. That is understandable because two zram02.sh are running concurrently.
Your test script just runs two ./zram02.sh tasks concurrently forever, so what is your expected result for the test? Of course, it can't be over.
I can't reproduce the 'unrecoverable' state in my test, can you share the stack trace log after that happens?
Is the zram02.sh still running or sleeping somewhere in the 'unrecoverable' state? If it is still running, it means the current sleep point isn't interruptible when running 'CTRL-C'. In my test, after several 'CTRL-C', both the two zram02.sh started from two terminals can be terminated. If it is sleeping somewhere forever, that can be a problem.
And the zram module can't be removed at that point.
It is just that systemd opens the zram or the disk is opened as swap disk, and once systemd closes it or after you run swapoff, it can be unloaded.
Thanks, Ming
On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
...
We need to understand the exact reason why there is still cpuhp node left, can you share us the exact steps for reproducing the issue? Otherwise we may have to trace and narrow down the reason.
See my commit log for my own fix for this issue.
OK, thanks!
I can reproduce the issue, and the reason is that reset_store fails zram_remove() when unloading module, then the warning is caused.
The top 3 patches in the following tree can fix the issue:
Thanks for trying an alternative fix! A crash stops yes, however this
I doubt it is alternative since your patchset doesn't mention the exact reason of 'Error: Removing state 63 which has instances left.', that is simply caused by failing to remove zram because ->claim is set during unloading module.
Well I disagree because it does explain how the race can happen, and it also explains how since the sysfs interface is exposed until module removal completes, it leaves exposed knobs to allow re-initializing of a struct zcomp for a zram device before the exit.
Yeah, you mentioned the race between disksize_store() vs. zram_remove(), however I don't think it is reproduced easily in the test because the race window is pretty small, also it can be fixed easily in my 3rd patch without any complicated tricks.
Reproducing for me is... extremely easy.
Not dig into details of your patchset via grabbing module reference count during show/store attribute of kernfs which is done in your patch 9, but IMO this way isn't necessary:
That's to address the deadlock only.
1) any driver module has to cleanup anything which may refer to symbols or data defined in module_exit of this driver
Yes, and the cpu multistate hotplug documentation warns (although such documentation is kind of hidden) that driver authors need to be careful with module removal too; refer to the warning at the end of __cpuhp_remove_state_cpuslocked() about module removal.
2) device_del() is often done in module_exit(), once device_del() returns, no any new show/store on the device's kobject attribute is possible.
Right, and if a sysfs knob is exposed before device_del() completes and is allowed to do things, the driver should take care to prevent races for CPU multistate support. The small state machine I added ensures we don't run over any expectations from cpu hotplug multistate support.
I've *never* suggested there cannot be alternatives to my solution with the small state machine, but for you to say it is incorrect is simply not right either.
3) it is _not_ a must or pattern for fixing bugs to hold one lock before calling device_del(), meantime the lock is required in the device's attribute show()/store(), which causes AA deadlock easily. Your approach just avoids the issue by not releasing module until all show/store are done.
Right, there are two approaches here:
a) Your approach is to accept the deadlock as a requirement and so you would prefer to implement an alternative to using a shared lock on module exit and sysfs op.
b) While I address such a deadlock head on, as I think this sort of locking should be allowed for two reasons: b1) we never documented such a requirement otherwise. b2) There is a possibility that other drivers already exist too which *do* use a shared lock on module removal and sysfs ops (and I just confirmed this to be true)
By you only addressing the deadlock as a requirement on approach a) you are forgetting that there *may* already be present drivers which *do* implement such patterns in the kernel. I worked on addressing the deadlock because I was informed livepatching *did* have that issue as well and so very likely a generic solution to the deadlock could be beneficial to other random drivers.
So I *really* don't think it is wise for us to simply accept this newly found deadlock as a *new* requirement, especially if we can fix it easily.
A cursory review using Coccinelle for potential issues with a mutex lock directly used on module exit (so this doesn't cover drivers like zram which use a routine and then grab the lock through indirection) and a sysfs op shows these drivers are also affected by this deadlock:
* arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
* lib/test_firmware.c
Note that this cursory review does not cover spin_lock uses, and other forms of locks. Consider the case where a routine is used and then that routine grabs a lock, so one level of indirection. There are many levels of indirection possible here. And likewise there are different types of locks.
also ends up leaving the driver in an unrecoverable state after a few tries. Ie, you CTRL-C the scripts and try again over and over again and the driver ends up in a situation where it just says:
zram: Can't change algorithm for initialized device
It means the algorithm can't be changed for one initialized device at the exact time. That is understandable because two zram02.sh are running concurrently.
Indeed but with your patch it can get stuck and cannot be taken out of this state.
Your test script just runs two ./zram02.sh tasks concurrently forever, so what is your expected result for the test? Of course, it can't be over.
I can't reproduce the 'unrecoverable' state in my test, can you share the stack trace log after that happens?
Try a bit harder, cancel the scripts after running for a while randomly (CTRL C a few times until the script finishes) and have them race again. Do this a few times.
And the zram module can't be removed at that point.
It is just that systemd opens the zram or the disk is opened as swap disk, and once systemd closes it or after you run swapoff, it can be unloaded.
With my patch this issue does not happen.
Luis
On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
...
We need to understand the exact reason why there is still cpuhp node left, can you share us the exact steps for reproducing the issue? Otherwise we may have to trace and narrow down the reason.
See my commit log for my own fix for this issue.
OK, thanks!
I can reproduce the issue, and the reason is that reset_store fails zram_remove() when unloading module, then the warning is caused.
The top 3 patches in the following tree can fix the issue:
Thanks for trying an alternative fix! A crash stops yes, however this
I doubt it is alternative since your patchset doesn't mention the exact reason of 'Error: Removing state 63 which has instances left.', that is simply caused by failing to remove zram because ->claim is set during unloading module.
Well I disagree because it does explain how the race can happen, and it also explains how since the sysfs interface is exposed until module removal completes, it leaves exposed knobs to allow re-initializing of a struct zcomp for a zram device before the exit.
Yeah, you mentioned the race between disksize_store() vs. zram_remove(), however I don't think it is reproduced easily in the test because the race window is pretty small, also it can be fixed easily in my 3rd patch without any complicated tricks.
Reproducing for me is... extremely easy.
In my observation, failing zram_remove() is extremely easy to trigger; it is caused by reset_store(), which sets ->claim to true, so zram_remove() fails and zram_reset_device() is bypassed, and then the failure 'Error: Removing state 63 which has instances left.' is caused.
We are in same page?
Not dig into details of your patchset via grabbing module reference count during show/store attribute of kernfs which is done in your patch 9, but IMO this way isn't necessary:
That's to address the deadlock only.
1) any driver module has to cleanup anything which may refer to symbols or data defined in module_exit of this driver
Yes, and the cpu multistate hotplug documentation warns (although such documentation is kind of hidden) that driver authors need to be careful with module removal too; refer to the warning at the end of __cpuhp_remove_state_cpuslocked() about module removal.
It is zram's bug. zram has to clean everything in module_exit(), unfortunately zram_remove() can be failed when calling from module_exit() because ->claim is set as true by reset_store(), then zram_reset_device()(->zcomp_destroy) isn't called, and this failure should not happen when unloading module, should it?
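The failure path described here can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the driver's code: toy_zram, toy_zram_remove() and toy_module_exit() are hypothetical names, comp_registered stands in for a zcomp instance still registered with the CPU hotplug multistate, and the only behaviour borrowed from zram_remove() is that it bails out with -EBUSY when ->claim is already set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one zram device. */
struct toy_zram {
	bool claim;		/* set while a reset or removal owns the device */
	bool comp_registered;	/* models an instance left on the cpuhp state */
};

/* Models the early return: a claimed device is never reset. */
static int toy_zram_remove(struct toy_zram *z)
{
	if (z->claim)
		return -EBUSY;
	z->claim = true;
	z->comp_registered = false;	/* models zram_reset_device() */
	return 0;
}

/*
 * Models module exit: try to remove every device, ignoring errors,
 * and report how many instances would still be registered when the
 * hotplug state itself is removed afterwards.
 */
static size_t toy_module_exit(struct toy_zram *devs, size_t n)
{
	size_t left = 0;

	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		(void)toy_zram_remove(&devs[i]);
		if (devs[i].comp_registered)
			left++;
	}
	return left;
}
```

A non-zero return from toy_module_exit() corresponds to the "Removing state ... which has instances left" warning: a device whose claim flag was set by a concurrent reset is skipped, so its instance is never torn down.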
2) device_del() is often done in module_exit(), once device_del() returns, no any new show/store on the device's kobject attribute is possible.
Right, and if a sysfs knob is exposed before device_del() completes and is allowed to do things, the driver should take care to prevent races for CPU multistate support. The small state machine I added ensures
What is the race for CPU multistate support? If you mean 'Error: Removing state 63 which has instances left.', it is zram's bug since zram has to cleanup everything in module_exit().
we don't run over any expectations from cpu hotplug multistate support.
I've *never* suggested there cannot be alternatives to my solution with the small state machine, but for you to say it is incorrect is simply not right either.
3) it is _not_ a must or pattern for fixing bugs to hold one lock before calling device_del(), meantime the lock is required in the device's attribute show()/store(), which causes AA deadlock easily. Your approach just avoids the issue by not releasing module until all show/store are done.
Right, there are two approaches here:
a) Your approach is to accept the deadlock as a requirement and so you would prefer to implement an alternative to using a shared lock on module exit and sysfs op.
wrt. in-tree zram, there is neither any deadlock in linus tree, nor after applying my 3 patches. If you think there is, please share us the code or lockdep warning.
b) While I address such a deadlock head on, as I think this sort of locking should be allowed for two reasons: b1) we never documented such a requirement otherwise. b2) There is a possibility that other drivers already exist too which *do* use a shared lock on module removal and sysfs ops (and I just confirmed this to be true)
The 'deadlock' is actually caused by your out-of-tree patch of 'zram: fix crashes with cpu hotplug multistate' which adds mutex_lock(zram_index_mutex) in destroy_devices().
We can fix this issue easily without needing the global lock, please see the attached(pre-V2) patch.
By you only addressing the deadlock as a requirement on approach a) you are forgetting that there *may* already be present drivers which *do* implement such patterns in the kernel. I worked on addressing the deadlock because I was informed livepatching *did* have that issue as well and so very likely a generic solution to the deadlock could be beneficial to other random drivers.
In-tree zram doesn't have such a deadlock. If livepatching has such an AA deadlock, just fix it; it seems it has already been fixed by 3ec24776bfd0.
So I *really* don't think it is wise for us to simply accept this newfound deadlock as a *new* requirement, especially if we can fix it easily.
A cursory review using Coccinelle for potential issues with a mutex lock used directly on module exit (so this doesn't cover drivers like zram, which uses a routine and then grabs the lock through indirection) and in a sysfs op shows these drivers are also affected by this deadlock:
- arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
In fsl_wakeup_sys_exit(), device_remove_file() is called before acquiring &sysfs_lock, so there shouldn't be such AA deadlock.
- lib/test_firmware.c
Yeah, there is the AA deadlock risk, but it should be fixed by moving misc_deregister() out of &test_fw_mutex.
Note that this cursory review does not cover spin_lock uses or other forms of locks. Consider the case where a routine is used and then that routine grabs a lock, so one level of indirection. There are many levels of indirection possible here. And likewise there are different types of locks.
also ends up leaving the driver in an unrecoverable state after a few tries. Ie, you CTRL-C the scripts and try again over and over again and the driver ends up in a situation where it just says:
zram: Can't change algorithm for initialized device
It means the algorithm can't be changed for one initialized device at the exact time. That is understandable because two zram02.sh are running concurrently.
Indeed but with your patch it can get stuck and cannot be taken out of this state.
OK, I can keep current behavior: fail open() in case of removing or resetting, meantime not hold open_mutex when sync bdev and reset device, see attached patch.
Your test script just runs two ./zram02.sh tasks concurrently forever, so what is your expected result for the test? Of course, it can't be over.
I can't reproduce the 'unrecoverable' state in my test, can you share the stack trace log after that happens?
Try a bit harder, cancel the scripts after running for a while randomly (CTRL C a few times until the script finishes) and have them race again. Do this a few times.
And the zram module can't be removed at that point.
It is just that systemd opens the zram or the disk is opened as swap disk, and once systemd closes it or after you run swapoff, it can be unloaded.
With my patch this issue does not happen.
It is because the patch 2 holds ->open_mutex() for sync bdev and reset zram, so several 'CTRL-C' is needed for terminating the test script, then zram02.sh's cleanup handler can be interrupted too. We can keep current behavior easily.
Please try the following patch against upstream(linus or next) tree(basically fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in module_exit(), race between zram_remove() and disksize_store()), and see if everything is fine for you:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a68297fb51a2..320822a80b64 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1967,25 +1967,45 @@ static int zram_add(void)
 static int zram_remove(struct zram *zram)
 {
 	struct block_device *bdev = zram->disk->part0;
+	bool claimed;

 	mutex_lock(&bdev->bd_disk->open_mutex);
-	if (bdev->bd_openers || zram->claim) {
+	if (bdev->bd_openers) {
 		mutex_unlock(&bdev->bd_disk->open_mutex);
 		return -EBUSY;
 	}

-	zram->claim = true;
+	claimed = zram->claim;
+	if (!claimed)
+		zram->claim = true;
 	mutex_unlock(&bdev->bd_disk->open_mutex);

 	zram_debugfs_unregister(zram);

-	/* Make sure all the pending I/O are finished */
-	fsync_bdev(bdev);
-	zram_reset_device(zram);
+	if (claimed) {
+		/*
+		 * If we were claimed by reset_store(), del_gendisk() will
+		 * wait until sync & reset is completed, so do nothing here.
+		 */
+		;
+	} else {
+		/* Make sure all the pending I/O are finished */
+		sync_blockdev(bdev);
+		zram_reset_device(zram);
+	}

 	pr_info("Removed device: %s\n", zram->disk->disk_name);

 	del_gendisk(zram->disk);
+
+	WARN_ON_ONCE(claimed && zram->claim);
+
+	/*
+	 * disksize store may come after the above zram_reset_device
+	 * returns, so run the last reset to avoid the race
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;
Thanks, Ming
In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock, just fixed it, and seems it has been fixed by 3ec24776bfd0.
I would not call it a fix. It is a kind of ugly workaround because the generic infrastructure lacked (lacks) the proper support in my opinion. Luis is trying to fix that.
Just my two cents.
Miroslav
On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
I would not call it a fix. It is a kind of ugly workaround because the generic infrastructure lacked (lacks) the proper support in my opinion. Luis is trying to fix that.
What is the proper support of the generic infrastructure? I am not familiar with livepatching's model(especially with module unload), you mean livepatching have to do the following way from sysfs:
1) during module exit:
	mutex_lock(lp_lock);
	kobject_put(lp_kobj);
	mutex_unlock(lp_lock);
2) show()/store() method of attributes of lp_kobj
	mutex_lock(lp_lock)
	...
	mutex_unlock(lp_lock)
IMO, the above usage simply caused AA deadlock. Even in Luis's patch 'zram: fix crashes with cpu hotplug multistate', new/same AA deadlock (hot_remove_store() vs. disksize_store() or reset_store()) is added because hot_remove_store() isn't called from module_exit().
Luis tries to delay unloading module until all show()/store() are done. But that can be obtained by the following way simply during module_exit():
	kobject_del(lp_kobj); //all pending store()/show() from lp_kobj are done,
	                      //no new store()/show() can come after
	                      //kobject_del() returns
	mutex_lock(lp_lock);
	kobject_put(lp_kobj);
	mutex_unlock(lp_lock);
Or can you explain your requirement on kobject/module unload in a bit more detail?
Thanks, Ming
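The ordering argued for in this message can be modelled in userspace. What follows is a minimal sketch under stated assumptions, not kernel code: model_show(), model_kobject_del() and model_module_exit() are hypothetical stand-ins for an attribute's show(), for kobject_del() waiting on in-flight callbacks, and for module_exit(), and a pthread mutex named lp_lock plays the shared lock. It exercises the "delete first, lock second" ordering with a show() call in flight.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace model only: none of these names are real kernel APIs. */
static pthread_mutex_t lp_lock = PTHREAD_MUTEX_INITIALIZER;  /* shared driver lock */
static pthread_mutex_t ref_lock = PTHREAD_MUTEX_INITIALIZER; /* guards the three fields below */
static pthread_cond_t ref_cond = PTHREAD_COND_INITIALIZER;
static int active_refs;     /* show()/store() calls currently in flight */
static bool show_started;   /* a show() has pinned the attribute */
static bool deleted;        /* model_kobject_del() has begun */

static bool data_valid = true; /* driver data, protected by lp_lock */
static int shows_completed;    /* protected by lp_lock */

/* Models an attribute show(): pin the attribute, then take lp_lock. */
static int model_show(void)
{
	pthread_mutex_lock(&ref_lock);
	if (deleted) {
		pthread_mutex_unlock(&ref_lock);
		return -1; /* attribute already removed */
	}
	active_refs++;
	show_started = true;
	pthread_cond_broadcast(&ref_cond);
	pthread_mutex_unlock(&ref_lock);

	pthread_mutex_lock(&lp_lock);
	assert(data_valid); /* teardown must not have happened yet */
	shows_completed++;
	pthread_mutex_unlock(&lp_lock);

	pthread_mutex_lock(&ref_lock);
	active_refs--;
	pthread_cond_broadcast(&ref_cond);
	pthread_mutex_unlock(&ref_lock);
	return 0;
}

static void *show_thread(void *unused)
{
	(void)unused;
	model_show();
	return NULL;
}

/* Models kobject_del(): refuse new callers and wait for in-flight ones. */
static void model_kobject_del(void)
{
	pthread_mutex_lock(&ref_lock);
	deleted = true;
	while (active_refs > 0)
		pthread_cond_wait(&ref_cond, &ref_lock);
	pthread_mutex_unlock(&ref_lock);
}

/*
 * Models module_exit() with the safe ordering: delete first, lock second.
 * Calling model_kobject_del() with lp_lock held can hang, because an
 * in-flight show() needs lp_lock before it can drop its reference.
 * Returns the number of show() calls that completed, or -1 on error.
 */
static int model_module_exit(void)
{
	pthread_t t;

	if (pthread_create(&t, NULL, show_thread, NULL) != 0)
		return -1;

	/* Make sure a show() is in flight so the race is exercised. */
	pthread_mutex_lock(&ref_lock);
	while (!show_started)
		pthread_cond_wait(&ref_cond, &ref_lock);
	pthread_mutex_unlock(&ref_lock);

	model_kobject_del();          /* drains the pending show() */

	pthread_mutex_lock(&lp_lock); /* no callback can contend now */
	data_valid = false;           /* stands in for freeing driver data */
	pthread_mutex_unlock(&lp_lock);

	pthread_join(t, NULL);
	return shows_completed;
}
```

In this model the pending show() always finishes before the data is torn down, and any later model_show() is refused. Moving the pthread_mutex_lock(&lp_lock) call above model_kobject_del() reproduces the AA deadlock described above whenever the show() has not yet taken lp_lock.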
On Tue, 19 Oct 2021, Ming Lei wrote:
What is the proper support of the generic infrastructure? I am not familiar with livepatching's model(especially with module unload), you mean livepatching have to do the following way from sysfs:
1) during module exit:
	mutex_lock(lp_lock);
	kobject_put(lp_kobj);
	mutex_unlock(lp_lock);
2) show()/store() method of attributes of lp_kobj
	mutex_lock(lp_lock)
	...
	mutex_unlock(lp_lock)
Yes, this was exactly the case. We then reworked it a lot (see 958ef1e39d24 ("livepatch: Simplify API by removing registration step")), so now the call sequence is different. kobject_put() is basically offloaded to a workqueue scheduled right from the store() method. Meaning that Luis's work would probably not help us currently, but on the other hand the issues with AA deadlock were one of the main drivers of the redesign (if I remember correctly). There were other reasons too as the changelog of the commit describes.
So, from my perspective, if there was a way to easily synchronize between a data cleanup from module_exit callback and sysfs/kernfs operations, it could spare people many headaches.
IMO, the above usage simply caused AA deadlock. Even in Luis's patch 'zram: fix crashes with cpu hotplug multistate', new/same AA deadlock (hot_remove_store() vs. disksize_store() or reset_store()) is added because hot_remove_store() isn't called from module_exit().
Luis tries to delay unloading module until all show()/store() are done. But that can be obtained by the following way simply during module_exit():
	kobject_del(lp_kobj); //all pending store()/show() from lp_kobj are done,
	                      //no new store()/show() can come after
	                      //kobject_del() returns
	mutex_lock(lp_lock);
	kobject_put(lp_kobj);
	mutex_unlock(lp_lock);
kobject_del() already calls kobject_put(). Did you mean __kobject_del()? That one is internal though.
Or can you explain your requirement on kobject/module unload in a bit more detail?
Does the above make sense?
Thanks
Miroslav
On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
Yes, this was exactly the case. We then reworked it a lot (see 958ef1e39d24 ("livepatch: Simplify API by removing registration step")), so now the call sequence is different. kobject_put() is basically offloaded to a workqueue scheduled right from the store() method. Meaning that Luis's work would probably not help us currently, but on the other hand the issues with AA deadlock were one of the main drivers of the redesign (if I remember correctly). There were other reasons too as the changelog of the commit describes.
So, from my perspective, if there was a way to easily synchronize between a data cleanup from module_exit callback and sysfs/kernfs operations, it could spare people many headaches.
kobject_del() is supposed to do so, but you can't hold a shared lock which is required in the show()/store() methods. Once kobject_del() returns, there are no pending show()/store() calls any more.
The question is why a shared lock is required for livepatching to delete the kobject. What are you protecting when you delete a kobject?
kobject_del() already calls kobject_put(). Did you mean __kobject_del()? That one is internal though.
kobject_del() is the counterpart of kobject_add(), and kobject_put() will call kobject_del() automatically if the kobject isn't deleted yet, but usually kobject_put() is for releasing the object only. It is more common to release a kobject by calling kobject_del() and then kobject_put().
Does the above make sense?
I think the focus now is the shared lock between kobject_del() and show()/store() of the kobject's attributes.
Thanks, Ming
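The relationship between deleting and putting a kobject described in this message can be illustrated with a small model. This is only a sketch with hypothetical names (struct model_kobj and the model_kobj_*() helpers are not the kernel's structures or functions); it assumes a single reference taken at add time and shows that deleting hides the object without releasing it, while the final put deletes it if that has not happened yet and then releases it.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model; field and function names are not the kernel's. */
struct model_kobj {
	int refcount;   /* references held on the object */
	bool in_sysfs;  /* true between add and del */
	bool released;  /* the release callback has run */
};

/* Models init + add: one initial reference, visible in sysfs. */
static void model_kobj_add(struct model_kobj *k)
{
	k->refcount = 1;
	k->in_sysfs = true;
	k->released = false;
}

/* Counterpart of add: hides the object from sysfs, releases nothing. */
static void model_kobj_del(struct model_kobj *k)
{
	k->in_sysfs = false;
}

/* Drops a reference; the last put deletes if needed, then releases. */
static void model_kobj_put(struct model_kobj *k)
{
	assert(k->refcount > 0);
	if (--k->refcount > 0)
		return;
	if (k->in_sysfs)
		model_kobj_del(k); /* automatic delete on the final put */
	k->released = true;
}
```

Calling model_kobj_del() and then model_kobj_put() gives the caller a point between the two calls where the object is no longer reachable through sysfs but still exists, which is where a shared lock could be taken safely.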
On Wed, 20 Oct 2021, Ming Lei wrote:
kobject_del() is supposed to do so, but you can't hold a shared lock which is required in the show()/store() methods. Once kobject_del() returns, there are no pending show()/store() calls any more.
The question is why a shared lock is required for livepatching to delete the kobject. What are you protecting when you delete a kobject?
I think it boils down to the fact that we embed kobject statically to structures which livepatch uses to maintain data. That is discouraged generally, but all the attempts to implement it correctly were utter failures.
Miroslav
On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
I think it boils down to the fact that we embed kobject statically to structures which livepatch uses to maintain data. That is discouraged generally, but all the attempts to implement it correctly were utter failures.
Sounds like this is the real problem that needs to be fixed. kobjects should always control the lifespan of the structure they are embedded in. If not, then that is a design flaw of the user of the kobject :(
Where in the kernel is this happening? And where have been the attempts to fix this up?
thanks,
greg k-h
On Wed, 20 Oct 2021, Greg KH wrote:
Sounds like this is the real problem that needs to be fixed. kobjects should always control the lifespan of the structure they are embedded in. If not, then that is a design flaw of the user of the kobject :(
Right, and you've already told us. A couple of times.
For example here https://lore.kernel.org/all/20190502074230.GA27847@kroah.com/
:)
Where in the kernel is this happening? And where have been the attempts to fix this up?
include/linux/livepatch.h and kernel/livepatch/core.c. See klp_{patch,object,func}.
It took some archeology, but I think https://lore.kernel.org/all/1464018848-4303-1-git-send-email-pmladek@suse.co... is it. Petr might correct me.
It was long before we added some important features to the code, so it might be even more difficult today.
It resurfaced later when Tobin tried to fix some of kobject call sites in the kernel...
https://lore.kernel.org/all/20190430001534.26246-1-tobin@kernel.org/
https://lore.kernel.org/all/20190430233803.GB10777@eros.localdomain/
https://lore.kernel.org/all/20190502023142.20139-6-tobin@kernel.org/
There are probably more references.
Anyway, the current code works fine (well, one could argue about that). If someone wants to take a (another) stab at this, then why not, but it seemed like a rabbit hole without a substantial gain in the past. On the other hand, we currently misuse the API to some extent.
/me scratches head
Miroslav
On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
I think it boils down to the fact that we embed kobject statically to structures which livepatch uses to maintain data. That is discouraged generally, but all the attempts to implement it correctly were utter failures.
OK, then it isn't one common usage, in which kobject covers the release of the external object. What is the exact kobject in livepatching?
But kobject_del() won't release the kobject, you shouldn't need the lock to delete kobject first. After the kobject is deleted, no any show() and store() any more, isn't such sync[1] you expected?
Thanks, Ming
On Wed 2021-10-20 18:09:51, Ming Lei wrote:
OK, then it isn't one common usage, in which kobject covers the release of the external object. What is the exact kobject in livepatching?
Below are more details about the livepatch code. I hope that it will help you to see if zram has similar problems or not.
We have kobject in three structures: klp_func, klp_object, and klp_patch, see include/linux/livepatch.h.
These structures have to be statically defined in the module sources because they define what is livepatched, see samples/livepatch/livepatch-sample.c
The kobject is used there to show information about the patch, patched objects, and patched functions, in sysfs. And most importantly, the sysfs interface can be used to disable the livepatch.
The problem with static structures is that the module must stay in the memory as long as the sysfs interface exists. It can be solved in module_exit() callback. It could wait until the sysfs interface is destroyed.
The kobject API does not support this scenario. The release() callbacks are called asynchronously. It expects that the structure is bundled in a dynamically allocated structure. As a result, the sysfs interface can be removed even after the module removal.
The livepatching might create the dynamic structures by duplicating the structures defined in the module statically. It might save us some headaches with kobject release. But it would also need extra code that would need to be maintained. The structures contain strings that need to be duplicated and later freed...
But kobject_del() won't release the kobject, you shouldn't need the lock to delete kobject first. After the kobject is deleted, no any show() and store() any more, isn't such sync[1] you expected?
Livepatch code never called kobject_del() under a lock. It would cause the obvious deadlock. The historic code only waited in the module_exit() callback until the sysfs interface was removed.
That changed in commit 958ef1e39d24d6cb8bf2a740 ("livepatch: Simplify API by removing registration step"). A livepatch can no longer be enabled again after it has been disabled. The sysfs interface is removed when the livepatch gets disabled. The module can be removed only after the sysfs interface is destroyed, see the module_put() in klp_free_patch_finish().
The livepatch code uses workqueue because the livepatch can be disabled via sysfs interface. It obviously could not wait until the sysfs interface is removed in the sysfs write() callback that triggered the removal.
HTH, Petr
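The reason for the workqueue can be shown with a single-threaded model. This is a rough sketch under assumptions, not the livepatch implementation: model_enabled_store(), model_free_patch_work() and the one-slot pending_work queue are hypothetical names. It shows a store() callback that only queues the teardown, so the removal of the sysfs interface never has to wait for the callback that requested it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative single-threaded model; names are not kernel APIs. */
typedef void (*model_work_fn)(void);

static model_work_fn pending_work; /* one-slot stand-in for a workqueue */
static int active_refs;            /* sysfs callbacks in flight */
static bool sysfs_present = true;  /* the patch's sysfs interface exists */
static bool module_pinned = true;  /* stands in for the module reference */

/* Deferred teardown: runs only after the store() callback has returned. */
static void model_free_patch_work(void)
{
	assert(active_refs == 0); /* removal would otherwise wait on itself */
	sysfs_present = false;
	module_pinned = false;    /* the module may be unloaded from here on */
}

/* Models a write to the sysfs attribute that disables the patch. */
static int model_enabled_store(bool enable)
{
	active_refs++; /* the callback is now in flight */
	if (!enable)
		pending_work = model_free_patch_work; /* defer, do not wait */
	active_refs--;
	return 0;
}

/* Models the worker picking up queued work later on. */
static void model_run_pending_work(void)
{
	model_work_fn fn = pending_work;

	pending_work = NULL;
	if (fn)
		fn();
}
```

Had model_enabled_store() called model_free_patch_work() directly, the assertion on active_refs would fire, which is the model's equivalent of the callback waiting for its own completion.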
On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
Below are more details about the livepatch code. I hope that it will help you to see if zram has similar problems or not.
We have kobject in three structures: klp_func, klp_object, and klp_patch, see include/linux/livepatch.h.
These structures have to be statically defined in the module sources because they define what is livepatched, see samples/livepatch/livepatch-sample.c
The kobject is used there to show information about the patch, patched objects, and patched functions, in sysfs. And most importantly, the sysfs interface can be used to disable the livepatch.
The problem with static structures is that the module must stay in the memory as long as the sysfs interface exists. It can be solved in module_exit() callback. It could wait until the sysfs interface is destroyed.
kobject API does not support this scenario. The release() callbacks
kobject_delete() is for supporting this scenario, that is why we don't need to grab module refcnt before calling show()/store() of the kobject's attributes.
kobject_delete() can be called in module_exit(), then any show()/store() will be done after kobject_delete() returns.
are called asynchronously. It expects that the structure is bundled in a dynamically allocated structure. As a result, the sysfs interface can be removed even after the module removal.
That should be one bug, otherwise store()/show() method could be called into after the module is unloaded.
The livepatching might create the dynamic structures by duplicating the structures defined in the module statically. It might save us some headaches with kobject release. But it would also need extra code that would need to be maintained. The structure contains strings that need to be duplicated and later freed...
But kobject_del() won't release the kobject, you shouldn't need the lock to delete kobject first. After the kobject is deleted, no any show() and store() any more, isn't such sync[1] you expected?
Livepatch code never called kobject_del() under a lock. It would cause the obvious deadlock. The historic code only waited in the module_exit() callback until the sysfs interface was removed.
OK, then Luis shouldn't consider livepatching as one such issue to solve with one generic solution.
It has changed in the commit 958ef1e39d24d6cb8bf2a740 ("livepatch: Simplify API by removing registration step"). The livepatch could never get enabled again after it was disabled now. The sysfs interface is removed when the livepatch gets disabled. The module could be removed only after the sysfs interface is destroyed, see the module_put() in klp_free_patch_finish().
OK, that is livepatching's implementation: all the kobjects are deleted & freed after disabling the livepatch module, that looks one kill-me operation, instead of disabling, so this way isn't a normal usage, scsi has similar sysfs interface of delete. Also kobjects can't be removed in enable's store() directly, since deadlock could be caused, looks wq has to be used here for avoiding deadlock.
BTW, what is the livepatching module use model? try_module_get() is called in klp_init_patch_early()<-klp_enable_patch()<-module_init(), module_put() is called in klp_free_patch_finish() which seems only be called after 'echo 0 > /sys/kernel/livepatch/$lp_mod/enabled'.
Usually when the module isn't used, module_exit() gets chance to be called by userspace rmmod, then all kobjects created in this module can be deleted in module_exit().
The livepatch code uses workqueue because the livepatch can be disabled via sysfs interface. It obviously could not wait until the sysfs interface is removed in the sysfs write() callback that triggered the removal.
If klp_free_patch_* is moved into module_exit() and not let enable store() to kill kobjects, all kobjects can be deleted in module_exit(), then wait_for_completion(patch->finish) may be removed, also wq isn't required for the async cleanup.
Thanks, Ming
On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
Livepatch code never called kobject_del() under a lock. It would cause the obvious deadlock.
Never?
The historic code only waited in the module_exit() callback until the sysfs interface was removed.
OK, then Luis shouldn't consider livepatching as one such issue to solve with one generic solution.
It's not what I was told when the deadlock was found with zram, so I was informed quite the contrary.
I'm working on a generic coccinelle patch which hunts for actual cases using iteration (a feature of coccinelle for complex searches). The search is pretty involved, so I don't think I'll have an answer to this soon.
Since the question of how generic this deadlock is remains questionable, I think it makes sense to put the generic deadlock fix off the table for now, and we address this once we have a more concrete search with coccinelle.
But to say we *don't* have drivers which can cause this is obviously wrong as well, from a cursory search so far. But let's wait and see how big this list actually is.
I'll drop the deadlock generic fixes and move on with at least a starter kernfs / sysfs tests.
Luis
On Tue, 26 Oct 2021, Luis Chamberlain wrote:
On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
Livepatch code never called kobject_del() under a lock. It would cause the obvious deadlock.
Never?
kobject_put() to be precise.
When I started working on the support for module/live patches removal, calling kobject_put() under our klp_mutex lock was the obvious first choice given how the code was structured, but I ran into problems with deadlocks immediately. So it was changed to async approach with the workqueue. Thus the mainline code has never suffered from this, but we knew about the issues.
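The deferral described above can be modelled in plain userspace C. This is only a sketch under stated assumptions: `defer_model`, `remove_sysfs()`, `queue_work_model()` and `store_disable()` are hypothetical names, not kernel or livepatch APIs, and the "workqueue" is a simple array drained after the callback has returned. It shows why removal attempted from inside store() can never finish, while queued removal can.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MAX_WORK 4

typedef void (*work_fn)(void *);

struct work {
	work_fn fn;
	void *arg;
};

struct defer_model {
	bool in_store;       /* a store() callback is currently running */
	bool sysfs_present;  /* the sysfs entry still exists */
	bool deadlocked;     /* removal was attempted from inside store() */
	struct work queue[MAX_WORK];
	size_t nr_work;
};

/*
 * Removal has to wait for running callbacks. Called from inside store()
 * it would wait for itself, which the model records as a deadlock.
 */
static void remove_sysfs(void *arg)
{
	struct defer_model *m = arg;

	if (m->in_store) {
		m->deadlocked = true;
		return;
	}
	m->sysfs_present = false;
}

/* Queue a function to run later, outside of any callback. */
static void queue_work_model(struct defer_model *m, work_fn fn, void *arg)
{
	assert(m->nr_work < MAX_WORK);
	m->queue[m->nr_work].fn = fn;
	m->queue[m->nr_work].arg = arg;
	m->nr_work++;
}

/* Run everything that was queued; only called after store() returned. */
static void run_pending_work(struct defer_model *m)
{
	for (size_t i = 0; i < m->nr_work; i++)
		m->queue[i].fn(m->queue[i].arg);
	m->nr_work = 0;
}

/* store() that disables the patch, either inline or via the queue. */
static void store_disable(struct defer_model *m, bool defer)
{
	m->in_store = true;
	if (defer)
		queue_work_model(m, remove_sysfs, m);
	else
		remove_sysfs(m);
	m->in_store = false;
}
```

Calling `store_disable(&m, false)` leaves `deadlocked` set and the entry in place; `store_disable(&m, true)` followed by `run_pending_work(&m)` removes the entry without the callback ever waiting on itself.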
The historic code only waited in the module_exit() callback until the sysfs interface was removed.
OK, then Luis shouldn't consider livepatching as one such issue to solve with one generic solution.
It's not what I was told when the deadlock was found with zram, so I was informed quite the contrary.
From my perspective, it is quite easy to get it wrong due to either a lack
of generic support, or missing rules/documentation. So if this thread leads to "do not share locks between a module removal and a sysfs operation" strict rule, it would be at least something. In the same manner as Luis proposed to document try_module_get() expectations.
I'm working on a generic coccinelle patch which hunts for actual cases using iteration (a feature of coccinelle for complex searches). The search is pretty involved, so I don't think I'll have an answer to this soon.
Since the question of how generic this deadlock is remains questionable, I think it makes sense to put the generic deadlock fix off the table for now, and we address this once we have a more concrete search with coccinelle.
But to say we *don't* have drivers which can cause this is obviously wrong as well, from a cursory search so far. But let's wait and see how big this list actually is.
I'll drop the deadlock generic fixes and move on with at least a starter kernfs / sysfs tests.
It makes sense to me.
Thanks, Luis, for pursuing it.
Miroslav
On Wed, Oct 27, 2021 at 01:57:40PM +0200, Miroslav Benes wrote:
On Tue, 26 Oct 2021, Luis Chamberlain wrote:
On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
OK, then Luis shouldn't consider livepatching as one such issue to solve with one generic solution.
It's not what I was told when the deadlock was found with zram, so I was informed quite the contrary.
From my perspective, it is quite easy to get it wrong due to either a lack of generic support, or missing rules/documentation.
Indeed. I agree some level of guidance is needed, even if subtle, rather than tribal knowledge. I'll start off with the test_sysfs demo'ing what not to do and documenting this there. I don't think it makes sense to formalize yet documentation for "thou shalt not do this" generically until a full depth search is done with Coccinelle.
So if this thread leads to "do not share locks between a module removal and a sysfs operation" strict rule, it would be at least something.
I think that's where we are at. I'll wait to complete my coccinelle deadlock hunt patch to complete the full search, and that could be useful to *warn* about new use cases, so as to prevent this deadlock in the future. Until then I agree that the complexity introduced is not worth it given the evidence of users, but the full evidence of actual users still remains to be determined. A perfect job left to advances with Coccinelle.
In the same manner as Luis proposed to document try_module_get() expectations.
Right and so sysfs ops using try_module_get() *still* remains safe, and so I will keep that patch in my next iteration because there *are* *many* use cases for that.
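A rough userspace model of why guarding a sysfs op with try_module_get() stays safe. The assumptions are loud ones: `mod_model`, `mod_try_get()`, `mod_put_model()`, `begin_removal()` and `guarded_show()` are hypothetical stand-ins, not the kernel functions, and returning -ENODEV on failure is an illustrative choice rather than what any particular patch does. The point is only that an op either pins the module for its whole duration or bails out once removal has begun.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a module refcount; not the kernel's struct module. */
struct mod_model {
	int refcnt;  /* users currently pinning the module */
	bool going;  /* removal has started */
};

/* Fails once removal has started, so no new user can enter. */
static bool mod_try_get(struct mod_model *m)
{
	if (m->going)
		return false;
	m->refcnt++;
	return true;
}

static void mod_put_model(struct mod_model *m)
{
	assert(m->refcnt > 0);
	m->refcnt--;
}

/* Removal may only begin while nobody holds a reference. */
static bool begin_removal(struct mod_model *m)
{
	if (m->refcnt > 0)
		return false;
	m->going = true;
	return true;
}

/* Wrap a show()-like op: pin the module, or bail out without touching it. */
static int guarded_show(struct mod_model *m, int (*show)(void))
{
	int ret;

	if (!mod_try_get(m))
		return -ENODEV; /* assumption: error code chosen for the sketch */
	ret = show();
	mod_put_model(m);
	return ret;
}

/* Trivial op used to exercise the wrapper. */
static int sample_show(void)
{
	return 42;
}
```

Because the op never blocks waiting for removal and removal never starts while an op holds a reference, neither side can end up waiting on the other in this model.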
Luis
On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
On Tue, 26 Oct 2021, Luis Chamberlain wrote:
On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
Livepatch code never called kobject_del() under a lock. It would cause the obvious deadlock.
I have to correct myself. IMHO, the deadlock is far from obvious. I always get lost in the code and the documentation is not clear.
Never?
kobject_put() to be precise.
IMHO, the problem is actually with kobject_del() that gets blocked until the sysfs interface gets removed. kobject_put() will have the same problem only when the clean up is not delayed.
When I started working on the support for module/live patches removal, calling kobject_put() under our klp_mutex lock was the obvious first choice given how the code was structured, but I ran into problems with deadlocks immediately. So it was changed to async approach with the workqueue. Thus the mainline code has never suffered from this, but we knew about the issues.
The historic code only waited in the module_exit() callback until the sysfs interface was removed.
OK, then Luis shouldn't consider livepatching as one such issue to solve with one generic solution.
It's not what I was told when the deadlock was found with zram, so I was informed quite the contrary.
From my perspective, it is quite easy to get it wrong due to either a lack
of generic support, or missing rules/documentation. So if this thread leads to "do not share locks between a module removal and a sysfs operation" strict rule, it would be at least something. In the same manner as Luis proposed to document try_module_get() expectations.
The rule "do not share locks between a module removal and a sysfs operation" is not clear to me.
IMHO, there are the following rules:
1. rule: kobject_del() or kobject_put() must not be called under a lock that is used by store()/show() callbacks.
reason: kobject_del() waits until the sysfs interface is destroyed. It has to wait until all store()/show() callbacks are finished.
2. rule: kobject_del()/kobject_put() must not be called from the related store() callbacks.
reason: same as in 1st rule.
3. rule: module_exit() must wait until all release() callbacks are called when kobject are static.
reason: kobject_put() must be called to clean up internal dependencies. The clean up might be done asynchronously and need access to the kobject structure.
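Rule 1 can be illustrated with a small deterministic model in userspace C. It is a sketch, not kernel code: `lock_model`, `remove_and_drain()`, `remove_under_lock()` and `remove_after_unlock()` are hypothetical names, and "blocking forever" is represented by a function returning false. An in-flight callback can only finish once it gets the shared lock, and removal can only finish once no callback is in flight.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of rule 1; not kernel code. */
struct lock_model {
	bool lock_held_by_remover; /* the shared lock, as seen by the remover */
	int active_callbacks;      /* show()/store() calls already entered */
	bool removed;              /* sysfs interface is gone */
};

/* An in-flight callback needs the shared lock before it can finish. */
static void callbacks_make_progress(struct lock_model *m)
{
	if (!m->lock_held_by_remover)
		m->active_callbacks = 0;
}

/*
 * Models the blocking part of removal: it completes only once no callback
 * is in flight. Returns false where real code would block forever.
 */
static bool remove_and_drain(struct lock_model *m)
{
	assert(m->active_callbacks >= 0);
	callbacks_make_progress(m);
	if (m->active_callbacks > 0)
		return false;
	m->removed = true;
	return true;
}

/* Violates rule 1: drains while still holding the shared lock. */
static bool remove_under_lock(struct lock_model *m)
{
	bool ok;

	m->lock_held_by_remover = true;
	ok = remove_and_drain(m);
	m->lock_held_by_remover = false;
	return ok;
}

/* Follows rule 1: touch shared state under the lock, drop it, then drain. */
static bool remove_after_unlock(struct lock_model *m)
{
	m->lock_held_by_remover = true;
	/* ... detach the object from shared lists here ... */
	m->lock_held_by_remover = false;
	return remove_and_drain(m);
}
```

With one callback in flight, `remove_under_lock()` reports the stuck state while `remove_after_unlock()` completes, which is the ordering the rule asks for.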
Best Regards, Petr
PS: I am sorry if I am messing things. I want to be sure that we are all talking about the same and understand it the same way.
On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
From my perspective, it is quite easy to get it wrong due to either a lack
of generic support, or missing rules/documentation. So if this thread leads to "do not share locks between a module removal and a sysfs operation" strict rule, it would be at least something. In the same manner as Luis proposed to document try_module_get() expectations.
The rule "do not share locks between a module removal and a sysfs operation" is not clear to me.
That's exactly it. It *is* not. The test_sysfs selftest will hopefully help with this. But I'll wait to take a final position on whether or not a generic fix should be merged until the Coccinelle patch which looks for all use cases completes.
So I think that once that Coccinelle hunt is done for the deadlock, we should also remind folks of the potential deadlock and some of the rules you mentioned below so that if we take a position that we don't support this, we at least inform developers why and what to avoid. If Coccinelle finds quite a few cases, then the generic fix might be worth evaluating.
IMHO, there are the following rules:
rule: kobject_del() or kobject_put() must not be called under a lock that is used by store()/show() callbacks.
reason: kobject_del() waits until the sysfs interface is destroyed. It has to wait until all store()/show() callbacks are finished.
Right, this is what actually started this entire conversation.
Note that as Ming pointed out, the generic kernfs fix I proposed would only cover the case when kobject_del() ends up being called on module exit, so it would not cover the cases where perhaps kobject_del() might be called outside of module exit, and so the scope of the possible deadlock then increases.
Likewise, the Coccinelle hunt I'm trying would only cover the module exit case. I'm a bit afraid of the complexity of a generic hunt as expressed in rule 1.
rule: kobject_del()/kobject_put() must not be called from the related store() callbacks.
reason: same as in 1st rule.
Sensible corollary.
Given that the exact kobject_del() / kobject_put() which must not be called from the respective sysfs ops depends on which kobject is underneath the device for which the sysfs ops is being created, it would make this hunt in Coccinelle a bit tricky. My current iteration of a coccinelle hunt cheats and looks at any sysfs looking op and ensures a module exit exists.
rule: module_exit() must wait until all release() callbacks are called when kobject are static.
reason: kobject_put() must be called to clean up internal dependencies. The clean up might be done asynchronously and need access to the kobject structure.
This might be an easier rule to implement a respective Coccinelle rule for.
Luis
On Tue, Nov 02, 2021 at 09:25:44AM -0700, Luis Chamberlain wrote:
On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
From my perspective, it is quite easy to get it wrong due to either a lack
of generic support, or missing rules/documentation. So if this thread leads to "do not share locks between a module removal and a sysfs operation" strict rule, it would be at least something. In the same manner as Luis proposed to document try_module_get() expectations.
The rule "do not share locks between a module removal and a sysfs operation" is not clear to me.
That's exactly it. It *is* not. The test_sysfs selftest will hopefully help with this. But I'll wait to take a final position on whether or not a generic fix should be merged until the Coccinelle patch which looks for all use cases completes.
So I think that once that Coccinelle hunt is done for the deadlock, we should also remind folks of the potential deadlock and some of the rules you mentioned below so that if we take a position that we don't support this, we at least inform developers why and what to avoid. If Coccinelle finds quite a few cases, then the generic fix might be worth evaluating.
IMHO, there are the following rules:
rule: kobject_del() or kobject_put() must not be called under a lock that is used by store()/show() callbacks.
reason: kobject_del() waits until the sysfs interface is destroyed. It has to wait until all store()/show() callbacks are finished.
Right, this is what actually started this entire conversation.
Note that as Ming pointed out, the generic kernfs fix I proposed would only cover the case when kobject_del() ends up being called on module exit, so it would not cover the cases where perhaps kobject_del() might be called outside of module exit, and so the scope of the possible deadlock then increases.
Likewise, the Coccinelle hunt I'm trying would only cover the module exit case. I'm a bit afraid of the complexity of a generic hunt as expressed in rule 1.
Question is that why one shared lock is required between kobject_del() and its show()/store(), both zram and livepatch needn't that. Is it one common usage?
rule: kobject_del()/kobject_put() must not be called from the related store() callbacks.
reason: same as in 1st rule.
Sensible corollary.
Given that the exact kobject_del() / kobject_put() which must not be called from the respective sysfs ops depends on which kobject is underneath the device for which the sysfs ops is being created, it would make this hunt in Coccinelle a bit tricky. My current iteration of a coccinelle hunt cheats and looks at any sysfs looking op and ensures a module exit exists.
Actually kernfs/sysfs provides interface for supporting deleting kobject/attr from the attr's show()/store(), see example of sdev_store_delete(), and the livepatch example:
https://lore.kernel.org/lkml/20211102145932.3623108-4-ming.lei@redhat.com/
rule: module_exit() must wait until all release() callbacks are called when kobject are static.
reason: kobject_put() must be called to clean up internal dependencies. The clean up might be done asynchronously and need access to the kobject structure.
This might be an easier rule to implement a respective Coccinelle rule for.
If kobject_del() is done in module_exit() or before module_exit(), kobject should have been freed in module_exit() via kobject_put().
But yes, it can be asynchronously because of CONFIG_DEBUG_KOBJECT_RELEASE, seems like one real issue.
Thanks, Ming
On Wed, Nov 03, 2021 at 08:01:45AM +0800, Ming Lei wrote:
On Tue, Nov 02, 2021 at 09:25:44AM -0700, Luis Chamberlain wrote:
On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
From my perspective, it is quite easy to get it wrong due to either a lack
of generic support, or missing rules/documentation. So if this thread leads to "do not share locks between a module removal and a sysfs operation" strict rule, it would be at least something. In the same manner as Luis proposed to document try_module_get() expectations.
The rule "do not share locks between a module removal and a sysfs operation" is not clear to me.
That's exactly it. It *is* not. The test_sysfs selftest will hopefully help with this. But I'll wait to take a final position on whether or not a generic fix should be merged until the Coccinelle patch which looks for all use cases completes.
So I think that once that Coccinelle hunt is done for the deadlock, we should also remind folks of the potential deadlock and some of the rules you mentioned below so that if we take a position that we don't support this, we at least inform developers why and what to avoid. If Coccinelle finds quite a few cases, then the generic fix might be worth evaluating.
IMHO, there are the following rules:
rule: kobject_del() or kobject_put() must not be called under a lock that is used by store()/show() callbacks.
reason: kobject_del() waits until the sysfs interface is destroyed. It has to wait until all store()/show() callbacks are finished.
Right, this is what actually started this entire conversation.
Note that as Ming pointed out, the generic kernfs fix I proposed would only cover the case when kobject_del() ends up being called on module exit, so it would not cover the cases where perhaps kobject_del() might be called outside of module exit, and so the scope of the possible deadlock then increases.
Likewise, the Coccinelle hunt I'm trying would only cover the module exit case. I'm a bit afraid of the complexity of a generic hunt as expressed in rule 1.
Question is that why one shared lock is required between kobject_del() and its show()/store(), both zram and livepatch needn't that. Is it one common usage?
That is the question the coccinelle hunt is aimed at finding. Answering that in the context of module removal is easier than the generic case.
But also note that I had mentioned before that we have semantics to check *when* we're in the module removal case, and as such can address that case. For the other cases we have no possible semantics to be able to address a generic fix. I tried though, refer to my reply in this thread and refer to the new kobject_being_removed() I'm adding:
https://lkml.kernel.org/r/YWdMpv8lAFYtc18c@bombadil.infradead.org
So we have semantics for knowing when we are about to remove a module, but my attempt with kobject_being_removed() isn't sufficient to address this generically.
In either case, having a gauge of how common this is, either on module removal or generally, would be wonderful. It is easier to answer the question from a module removal perspective though.
rule: kobject_del()/kobject_put() must not be called from the related store() callbacks.
reason: same as in 1st rule.
Sensible corollary.
Given that the exact kobject_del() / kobject_put() which must not be called from the respective sysfs ops depends on which kobject is underneath the device for which the sysfs ops is being created, it would make this hunt in Coccinelle a bit tricky. My current iteration of a coccinelle hunt cheats and looks at any sysfs looking op and ensures a module exit exists.
Actually kernfs/sysfs provides interface for supporting deleting kobject/attr from the attr's show()/store(), see example of sdev_store_delete(), and the livepatch example:
https://lore.kernel.org/lkml/20211102145932.3623108-4-ming.lei@redhat.com/
Imagine that.. is that the suicidal thing?
rule: module_exit() must wait until all release() callbacks are called when kobject are static.
reason: kobject_put() must be called to clean up internal dependencies. The clean up might be done asynchronously and need access to the kobject structure.
This might be an easier rule to implement a respective Coccinelle rule for.
If kobject_del() is done in module_exit() or before module_exit(), kobject should have been freed in module_exit() via kobject_put().
But yes, it can be asynchronously because of CONFIG_DEBUG_KOBJECT_RELEASE, seems like one real issue.
Alright thanks for confirming.
Luis
The livepatch code uses workqueue because the livepatch can be disabled via sysfs interface. It obviously could not wait until the sysfs interface is removed in the sysfs write() callback that triggered the removal.
If klp_free_patch_* is moved into module_exit() and not let enable store() to kill kobjects, all kobjects can be deleted in module_exit(), then wait_for_completion(patch->finish) may be removed, also wq isn't required for the async cleanup.
It sounds like a nice cleanup. If we combine kobject_del() to prevent any show()/store() accesses and free everything later in module_exit(), it could work. If I am not missing something around how we maintain internal lists of live patches and their modules.
Thanks
Miroslav
On Tue 2021-10-26 23:37:30, Ming Lei wrote:
On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
Below are more details about the livepatch code. I hope that it will help you to see if zram has similar problems or not.
We have kobject in three structures: klp_func, klp_object, and klp_patch, see include/linux/livepatch.h.
These structures have to be statically defined in the module sources because they define what is livepatched, see samples/livepatch/livepatch-sample.c
The kobject is used there to show information about the patch, patched objects, and patched functions, in sysfs. And most importantly, the sysfs interface can be used to disable the livepatch.
The problem with static structures is that the module must stay in the memory as long as the sysfs interface exists. It can be solved in module_exit() callback. It could wait until the sysfs interface is destroyed.
kobject API does not support this scenario. The release() callbacks
kobject_delete() is for supporting this scenario, that is why we don't need to grab module refcnt before calling show()/store() of the kobject's attributes.
kobject_delete() can be called in module_exit(), then any show()/store() will be done after kobject_delete() returns.
I am a bit confused. I do not see kobject_delete() anywhere in kernel sources.
I see only kobject_del() and kobject_put(). AFAIK, they do _not_ guarantee that either the sysfs interface was destroyed or the release callbacks were called. For example, see schedule_delayed_work(&kobj->release, delay) in kobject_release().
In other words, anyone could still be using either the sysfs interface or the related structures after kobject_del() or kobject_put() returns.
IMHO, kobject API does not support static structures and module removal.
Best Regards, Petr
On Tue 2021-11-02 15:15:19, Petr Mladek wrote:
On Tue 2021-10-26 23:37:30, Ming Lei wrote:
On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
Below are more details about the livepatch code. I hope that it will help you to see if zram has similar problems or not.
We have kobject in three structures: klp_func, klp_object, and klp_patch, see include/linux/livepatch.h.
These structures have to be statically defined in the module sources because they define what is livepatched, see samples/livepatch/livepatch-sample.c
The kobject is used there to show information about the patch, patched objects, and patched functions, in sysfs. And most importantly, the sysfs interface can be used to disable the livepatch.
The problem with static structures is that the module must stay in the memory as long as the sysfs interface exists. It can be solved in module_exit() callback. It could wait until the sysfs interface is destroyed.
kobject API does not support this scenario. The release() callbacks
kobject_delete() is for supporting this scenario, that is why we don't need to grab module refcnt before calling show()/store() of the kobject's attributes.
kobject_delete() can be called in module_exit(), then any show()/store() will be done after kobject_delete() returns.
I am a bit confused. I do not see kobject_delete() anywhere in kernel sources.
I see only kobject_del() and kobject_put(). AFAIK, they do _not_ guarantee that either the sysfs interface was destroyed or the release callbacks were called. For example, see schedule_delayed_work(&kobj->release, delay) in kobject_release().
Grr, I always get confused by the code. kobject_del() actually waits until the sysfs interface gets destroyed. This is why there is the deadlock.
But kobject_put() is _not_ synchronous. And the comment above kobject_add() repeats 3 times that kobject_put() must be called on success:

 * Return: If this function returns an error, kobject_put() must be
 *         called to properly clean up the memory associated with the
 *         object.  Under no instance should the kobject that is passed
 *         to this function be directly freed with a call to kfree(),
 *         that can leak memory.
 *
 *         If this function returns success, kobject_put() must also be called
 *         in order to properly clean up the memory associated with the object.
 *
 *         In short, once this function is called, kobject_put() MUST be called
 *         when the use of the object is finished in order to properly free
 *         everything.
and similar text in Documentation/core-api/kobject.rst
After a kobject has been registered with the kobject core successfully, it must be cleaned up when the code is finished with it. To do that, call kobject_put().
If I read the code correctly then kobject_put() calls kref_put() that might call kobject_delayed_cleanup(). This function does a lot of things and needs to access struct kobject.
IMHO, kobject API does not support static structures and module removal.
If kobject_put() has to be called also for static structures then module_exit() must explicitly wait until the clean up is finished.
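The requirement that the exit path wait for the clean up can be sketched as a small userspace model. The names `static_obj`, `obj_put()`, `run_delayed_release()` and `exit_may_finish()` are hypothetical, and the `released` flag merely plays the role that a completion signalled from the release() callback would play in real code. The final put only schedules the release, so the memory behind a static structure must not go away until the release has actually run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a statically allocated object with delayed release. */
struct static_obj {
	int refcount;
	bool release_pending; /* final put happened, release() has not run yet */
	bool released;        /* release() has run; stands in for a completion */
};

/* The release callback: the last code that touches the object. */
static void obj_release(struct static_obj *o)
{
	o->release_pending = false;
	o->released = true;
}

/* The last put schedules the release instead of running it inline. */
static void obj_put(struct static_obj *o)
{
	assert(o->refcount > 0);
	if (--o->refcount == 0)
		o->release_pending = true;
}

/* Stands in for the delayed work running at some later point. */
static void run_delayed_release(struct static_obj *o)
{
	if (o->release_pending)
		obj_release(o);
}

/* Exit path check: true only once the object's memory may go away. */
static bool exit_may_finish(const struct static_obj *o)
{
	return o->released;
}
```

Right after `obj_put()` the exit path still has to wait; only after `run_delayed_release()` does `exit_may_finish()` report that the static structure is no longer in use.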
Best Regards, Petr
On Tue, Nov 02, 2021 at 03:51:33PM +0100, Petr Mladek wrote:
On Tue 2021-11-02 15:15:19, Petr Mladek wrote:
On Tue 2021-10-26 23:37:30, Ming Lei wrote:
On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
Below are more details about the livepatch code. I hope that it will help you to see if zram has similar problems or not.
We have kobject in three structures: klp_func, klp_object, and klp_patch, see include/linux/livepatch.h.
These structures have to be statically defined in the module sources because they define what is livepatched, see samples/livepatch/livepatch-sample.c
The kobject is used there to show information about the patch, patched objects, and patched functions, in sysfs. And most importantly, the sysfs interface can be used to disable the livepatch.
The problem with static structures is that the module must stay in the memory as long as the sysfs interface exists. It can be solved in module_exit() callback. It could wait until the sysfs interface is destroyed.
kobject API does not support this scenario. The release() callbacks
kobject_delete() is for supporting this scenario, that is why we don't need to grab module refcnt before calling show()/store() of the kobject's attributes.
kobject_delete() can be called in module_exit(), then any show()/store() will be done after kobject_delete() returns.
I am a bit confused. I do not see kobject_delete() anywhere in kernel sources.
I see only kobject_del() and kobject_put(). AFAIK, they do _not_ guarantee that either the sysfs interface was destroyed or the release callbacks were called. For example, see schedule_delayed_work(&kobj->release, delay) in kobject_release().
Grr, I always get confused by the code. kobject_del() actually waits until the sysfs interface gets destroyed. This is why there is the deadlock.
Right.
But kobject_put() is _not_ synchronous. And the comment above kobject_add() repeats 3 times that kobject_put() must be called on success:

 * Return: If this function returns an error, kobject_put() must be
 *         called to properly clean up the memory associated with the
 *         object.  Under no instance should the kobject that is passed
 *         to this function be directly freed with a call to kfree(),
 *         that can leak memory.
 *
 *         If this function returns success, kobject_put() must also be called
 *         in order to properly clean up the memory associated with the object.
 *
 *         In short, once this function is called, kobject_put() MUST be called
 *         when the use of the object is finished in order to properly free
 *         everything.
and similar text in Documentation/core-api/kobject.rst
After a kobject has been registered with the kobject core successfully, it must be cleaned up when the code is finished with it. To do that, call kobject_put().
If I read the code correctly then kobject_put() calls kref_put() that might call kobject_delayed_cleanup(). This function does a lot of things and needs to access struct kobject.
Yes, then what is the problem here wrt. kobject_put() which may not be synchronous?
IMHO, kobject API does not support static structures and module removal.
If kobject_put() has to be called also for static structures then module_exit() must explicitly wait until the clean up is finished.
Right, that is exactly how klp_patch kobject is implemented. klp_patch kobject has to be disabled first, then module refcnt can be dropped after the klp_patch kobject is released. Then module_exit() is possible.
Thanks, Ming
On Tue, Nov 02, 2021 at 03:15:15PM +0100, Petr Mladek wrote:
On Tue 2021-10-26 23:37:30, Ming Lei wrote:
On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
Below are more details about the livepatch code. I hope that it will help you to see if zram has similar problems or not.
We have kobject in three structures: klp_func, klp_object, and klp_patch, see include/linux/livepatch.h.
These structures have to be statically defined in the module sources because they define what is livepatched, see samples/livepatch/livepatch-sample.c
The kobject is used there to show information about the patch, patched objects, and patched functions, in sysfs. And most importantly, the sysfs interface can be used to disable the livepatch.
The problem with static structures is that the module must stay in the memory as long as the sysfs interface exists. It can be solved in module_exit() callback. It could wait until the sysfs interface is destroyed.
kobject API does not support this scenario. The release() callbacks
kobject_delete() is for supporting this scenario, that is why we don't need to grab module refcnt before calling show()/store() of the kobject's attributes.
kobject_delete() can be called in module_exit(), then any show()/store() will be done after kobject_delete() returns.
I am a bit confused. I do not see kobject_delete() anywhere in kernel sources.
I see only kobject_del() and kobject_put(). AFAIK, they do _not_ guarantee that either the sysfs interface was destroyed or the release callbacks were called. For example, see schedule_delayed_work(&kobj->release, delay) in kobject_release().
After kobject_del() returns, no one can run into show()/store(), and all pending show()/store() are drained in the meantime. But yes, the release handler may still be called later, and the kobject has to be freed during or before module_exit().
https://lore.kernel.org/lkml/20211101112548.3364086-2-ming.lei@redhat.com/
In other words, anyone could still be using either the sysfs interface or the related structures after kobject_del() or kobject_put() returns.
No, no one can do that after kobject_del() returns.
IMHO, kobject API does not support static structures and module removal.
But so far klp_patch can only be defined as static instance, and it depends on the implementation, especially the release handler.
Thanks, Ming
On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
Please try the following patch against the upstream (linus or next) tree (it basically folds revised patches 2 and 3 of V1, and covers two issues: not failing zram_remove() in module_exit(), and the race between zram_remove() and disksize_store()), and see if everything is fine for you:
Page fault ...
[ 18.284256] zram: Removed device: zram0
[ 18.312974] BUG: unable to handle page fault for address: ffffad86de903008
[ 18.313707] #PF: supervisor read access in kernel mode
[ 18.314248] #PF: error_code(0x0000) - not-present page
[ 18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067 PTE 0
[ 18.315538] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 18.316012] CPU: 3 PID: 1198 Comm: rmmod Tainted: G E 5.15.0-rc3-next-20210927+ #89
[ 18.316979] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[ 18.317876] RIP: 0010:zram_free_page+0x1b/0xf0 [zram]
[ 18.318430] Code: 1f 44 00 00 48 89 c8 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 54 49 89 f4 55 89 f5 53 48 8b 17 48 c1 e5 04 48 89 fb 48 01 ea <48> 8b 42 08 a9 00 00 00 20 74 14 48 25 ff ff ff df 48 89 42 08 48
[ 18.320412] RSP: 0018:ffffad86f8013df8 EFLAGS: 00010286
[ 18.320978] RAX: 0000000000000001 RBX: ffff9b7b435c7800 RCX: 0000000000000200
[ 18.321758] RDX: ffffad86de903000 RSI: 0000000000000000 RDI: ffff9b7b435c7800
[ 18.322524] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000
[ 18.323299] R10: 0000000000000200 R11: 0000000000000000 R12: 0000000000000000
[ 18.324030] R13: ffff9b7b55191800 R14: ffff9b7b435c7820 R15: ffff9b7b4677f960
[ 18.324784] FS: 00007fc8e4c90580(0000) GS:ffff9b7c77cc0000(0000) knlGS:0000000000000000
[ 18.325651] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.326272] CR2: ffffad86de903008 CR3: 000000014f1de003 CR4: 0000000000370ee0
[ 18.327047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 18.327818] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 18.328586] Call Trace:
[ 18.328852] <TASK>
[ 18.329284] zram_reset_device+0xd8/0x140 [zram]
[ 18.329983] zram_remove.cold+0xa/0x20 [zram]
[ 18.330644] ? hot_remove_store+0xe0/0xe0 [zram]
[ 18.331367] zram_remove_cb+0xd/0x10 [zram]
[ 18.332010] idr_for_each+0x5b/0xd0
[ 18.332578] destroy_devices+0x26/0x50 [zram]
[ 18.333238] __do_sys_delete_module+0x18d/0x2a0
[ 18.333913] ? fpregs_assert_state_consistent+0x1e/0x40
[ 18.334665] ? exit_to_user_mode_prepare+0x3a/0x180
[ 18.335395] do_syscall_64+0x38/0xc0
[ 18.335966] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 18.336681] RIP: 0033:0x7fc8e4db64a7
On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
Please try the following patch against the upstream (linus or next) tree (it basically folds revised patches 2 and 3 of V1, and covers two issues: not failing zram_remove() in module_exit(), and the race between zram_remove() and disksize_store()), and see if everything is fine for you:
Page fault ...
[ 18.284256] zram: Removed device: zram0
[ 18.312974] BUG: unable to handle page fault for address: ffffad86de903008
[ 18.313707] #PF: supervisor read access in kernel mode
[ 18.314248] #PF: error_code(0x0000) - not-present page
[ 18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067
That is another race between zram_reset_device() and disksize_store(), which is supposed to be covered by ->init_lock. The delta fix against the last patch I posted follows, and the whole patch can be found at the GitHub link:
https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb168...
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d0cae7a42f4d..a14ba3d350ea 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);

-	up_write(&zram->init_lock);
 	/* I/O operation under all of CPU are done so let's free */
 	zram_meta_free(zram, disksize);
 	memset(&zram->stats, 0, sizeof(zram->stats));
 	zcomp_destroy(comp);
 	reset_bdev(zram);
+	up_write(&zram->init_lock);
 }

 static ssize_t disksize_store(struct device *dev,
On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
Please try the following patch against the upstream (linus or next) tree (it basically folds revised patches 2 and 3 of V1, and covers two issues: not failing zram_remove() in module_exit(), and the race between zram_remove() and disksize_store()), and see if everything is fine for you:
Page fault ...
[ 18.284256] zram: Removed device: zram0
[ 18.312974] BUG: unable to handle page fault for address: ffffad86de903008
[ 18.313707] #PF: supervisor read access in kernel mode
[ 18.314248] #PF: error_code(0x0000) - not-present page
[ 18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067
That is another race between zram_reset_device() and disksize_store(), which is supposed to be covered by ->init_lock. The delta fix against the last patch I posted follows, and the whole patch can be found at the GitHub link:
https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb168...
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d0cae7a42f4d..a14ba3d350ea 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);

-	up_write(&zram->init_lock);
 	/* I/O operation under all of CPU are done so let's free */
 	zram_meta_free(zram, disksize);
 	memset(&zram->stats, 0, sizeof(zram->stats));
 	zcomp_destroy(comp);
 	reset_bdev(zram);
+	up_write(&zram->init_lock);
 }

 static ssize_t disksize_store(struct device *dev,
With this, it still ends up in a state where we loop and can't get out of:
zram: Can't change algorithm for initialized device
Luis
On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
Please try the following patch against the upstream (linus or next) tree (it basically folds revised patches 2 and 3 of V1, and covers two issues: not failing zram_remove() in module_exit(), and the race between zram_remove() and disksize_store()), and see if everything is fine for you:
Page fault ...
[ 18.284256] zram: Removed device: zram0
[ 18.312974] BUG: unable to handle page fault for address: ffffad86de903008
[ 18.313707] #PF: supervisor read access in kernel mode
[ 18.314248] #PF: error_code(0x0000) - not-present page
[ 18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067
That is another race between zram_reset_device() and disksize_store(), which is supposed to be covered by ->init_lock. The delta fix against the last patch I posted follows, and the whole patch can be found at the GitHub link:
https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb168...
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d0cae7a42f4d..a14ba3d350ea 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);

-	up_write(&zram->init_lock);
 	/* I/O operation under all of CPU are done so let's free */
 	zram_meta_free(zram, disksize);
 	memset(&zram->stats, 0, sizeof(zram->stats));
 	zcomp_destroy(comp);
 	reset_bdev(zram);
+	up_write(&zram->init_lock);
 }

 static ssize_t disksize_store(struct device *dev,
With this, it still ends up in a state where we loop and can't get out of:
zram: Can't change algorithm for initialized device
Again, you are running two instances of zram02.sh[1] on /dev/zram0, so that isn't unexpected behavior. The only difference here is timing. In my test VM, this message shows up for a while in one task, then it may switch to the other task.
Just running your patches for a while shows no real difference here, and the following message can be dumped from one task for a long time:
can't set '107374182400' to /sys/block/zram0/disksize
Also, you did not answer my question about the expected result of your test when running the following script from two terminals concurrently:
while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
Thanks, Ming
On Wed, Oct 20, 2021 at 09:15:20AM +0800, Ming Lei wrote:
On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d0cae7a42f4d..a14ba3d350ea 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);

-	up_write(&zram->init_lock);
 	/* I/O operation under all of CPU are done so let's free */
 	zram_meta_free(zram, disksize);
 	memset(&zram->stats, 0, sizeof(zram->stats));
 	zcomp_destroy(comp);
 	reset_bdev(zram);
+	up_write(&zram->init_lock);
 }

 static ssize_t disksize_store(struct device *dev,
With this, it still ends up in a state where we loop and can't get out of:
zram: Can't change algorithm for initialized device
Again, you are running two zram02.sh[1] on /dev/zram0, that isn't unexpected
You mean that it is not expected? If so then yes, of course.
behavior. Here the difference is just timing.
Right, but that is what helped reproduce a difficult-to-reproduce customer bug. Once you find an easy way to reproduce a reported issue you stick with it and try to make the situation worse to ensure no more bugs are present.
Also, you did not answer my question about the expected result of your test when running the following script from two terminals concurrently:
while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
If you run this, you should see no failures.
Once you start a second script that one should cause odd issues on both sides but never crash or stall the module.
A second series of tests is hitting CTRL-C on either one randomly and restarting testing once again randomly.
Again, neither should crash the kernel or stall the module.
In the end of these tests you should be able to run the script alone just once and not see issues.
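As a rough sketch of the expectation above, a bounded harness could count failures instead of looping forever, with two background runners standing in for the two terminals. The names run_bounded and run_pair are hypothetical helpers for illustration, not part of LTP, and the command under test is simply passed as arguments:

```shell
# Hypothetical helper: run a command MAX times and print how many runs failed.
run_bounded() {
    local max="$1"
    shift
    local cnt=0
    local fail=0
    while [ "$cnt" -lt "$max" ]; do
        # Count a non-zero exit status as one failure.
        "$@" >/dev/null 2>&1 || fail=$((fail + 1))
        cnt=$((cnt + 1))
    done
    echo "$fail"
}

# Hypothetical helper: two concurrent runners, standing in for two terminals.
# Prints the failure count of each runner, separated by a space.
run_pair() {
    local max="$1"
    shift
    local out1 out2 p1 p2
    out1=$(mktemp) || return 1
    out2=$(mktemp) || return 1
    run_bounded "$max" "$@" >"$out1" &
    p1=$!
    run_bounded "$max" "$@" >"$out2" &
    p2=$!
    wait "$p1"
    wait "$p2"
    echo "$(cat "$out1") $(cat "$out2")"
    rm -f "$out1" "$out2"
}
```

For example, `run_pair 1000 ./zram02.sh` (with PATH set up as in the loop quoted above) would exercise the concurrent case for a bounded number of iterations, and a final `run_bounded 1 ./zram02.sh` is expected to print 0 if the single run passes.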
Luis
On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
On Wed, Oct 20, 2021 at 09:15:20AM +0800, Ming Lei wrote:
On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d0cae7a42f4d..a14ba3d350ea 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);

-	up_write(&zram->init_lock);
 	/* I/O operation under all of CPU are done so let's free */
 	zram_meta_free(zram, disksize);
 	memset(&zram->stats, 0, sizeof(zram->stats));
 	zcomp_destroy(comp);
 	reset_bdev(zram);
+	up_write(&zram->init_lock);
 }

 static ssize_t disksize_store(struct device *dev,
With this, it still ends up in a state where we loop and can't get out of:
zram: Can't change algorithm for initialized device
Again, you are running two zram02.sh[1] on /dev/zram0, that isn't unexpected
You mean that it is not expected? If so then yes, of course.
My meaning is clear: it is not unexpected, so it is expected.
behavior. Here the difference is just timing.
Right, but that is what helped reproduce a difficult-to-reproduce customer bug. Once you find an easy way to reproduce a reported issue you stick with it and try to make the situation worse to ensure no more bugs are present.
Also, you did not answer my question about the expected result of your test when running the following script from two terminals concurrently:
while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
If you run this, you should see no failures.
OK, I don't see any failure when running a single zram02.sh after applying my patch V2.
Once you start a second script that one should cause odd issues on both sides but never crash or stall the module.
A crash can't be observed with my patch V2. What do you mean by 'stall' the module? Is it that 'zram' can't be unloaded after the test is terminated via multiple 'ctrl-c'?
A second series of tests is hitting CTRL-C on either one randomly and restarting testing once again randomly.
ltp/zram02.sh has a cleanup handler, installed via trap, that cleans up everything (swapoff/umount/reset/rmmod). ctrl-c terminates the current foreground task and causes the shell to run the cleanup handler first, but a further ctrl-c terminates the cleanup handler itself, so the cleanup is not completed: for example, the zram disk is left as a swap device and zram can't be unloaded. The idea can be observed via the following script:
#!/bin/bash
trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
sleep 30
After the above script is run in the foreground, pressing ctrl-c the first time terminates 'sleep 30', then the trap command runs, so you can see "enter trap" printed. If you then press ctrl-c a second time, 'sleep 20' is terminated immediately. So 'swapoff' in zram02.sh's trap function can be terminated in this way.
The zram disk being left as a swap disk can be observed with your patch too after terminating via multiple ctrl-c, which has to be done this way because the test is an infinite loop.
So it is hard to clean up everything completely once multiple ctrl-c are involved, and it should be impossible. It takes multiple violent ctrl-c presses to terminate the dead-loop test.
So it isn't reasonable to expect that zram can be always unloaded successfully after the test script is terminated via multiple ctrl-c.
But zram can be unloaded after running swapoff manually, from driver viewpoint, nothing is wrong.
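The manual recovery mentioned above could look roughly like the following sketch. zram_manual_cleanup is a hypothetical helper name, and DRY_RUN=1 is an assumed convention that only prints the commands; swapoff and rmmod are the standard tools and need root when actually run:

```shell
# Hypothetical helper: release a zram disk left behind as swap, then unload zram.
# With DRY_RUN=1 the commands are printed instead of executed.
zram_manual_cleanup() {
    local dev="${1:-zram0}"
    local run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    # Stop using the zram disk as swap first; rmmod fails while it is in use.
    $run swapoff "/dev/$dev" || return 1
    $run rmmod zram
}
```

Running `DRY_RUN=1 zram_manual_cleanup zram0` first shows what would be executed without touching the system.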
Again, neither should crash the kernel or stall the module.
In the end of these tests you should be able to run the script alone just once and not see issues.
Thanks, Ming
On Thu, Oct 21, 2021 at 08:39:05AM +0800, Ming Lei wrote:
On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
A second series of tests is hitting CTRL-C on either one randomly and restarting testing once again randomly.
ltp/zram02.sh has a cleanup handler, installed via trap, that cleans up everything (swapoff/umount/reset/rmmod). ctrl-c terminates the current foreground task and causes the shell to run the cleanup handler first, but a further ctrl-c terminates the cleanup handler itself, so the cleanup is not completed: for example, the zram disk is left as a swap device and zram can't be unloaded. The idea can be observed via the following script:
#!/bin/bash
trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
sleep 30
After the above script is run in the foreground, pressing ctrl-c the first time terminates 'sleep 30', then the trap command runs, so you can see "enter trap" printed. If you then press ctrl-c a second time, 'sleep 20' is terminated immediately. So 'swapoff' in zram02.sh's trap function can be terminated in this way.
The zram disk being left as a swap disk can be observed with your patch too after terminating via multiple ctrl-c, which has to be done this way because the test is an infinite loop.
So it is hard to clean up everything completely once multiple ctrl-c are involved, and it should be impossible. It takes multiple violent ctrl-c presses to terminate the dead-loop test.
So it isn't reasonable to expect that zram can be always unloaded successfully after the test script is terminated via multiple ctrl-c.
For the life of me, I do not run into these issues with my patch. But with yours I did.
To be clear, I run zram02.sh on two terminals. Then to interrupt I just leave CTRL-C pressed to issue multiple terminations until the script is done on each terminal at a time, until I see both have completed.
I repeat the same test, noting always that when I start on one terminal the test succeeds. And also when I completely cancel one script the test continues fine without issue.
But zram can be unloaded after running swapoff manually, from driver viewpoint, nothing is wrong.
I had not run into that issue with my patch FWIW.
Luis
On Thu, Oct 21, 2021 at 10:18:47AM -0700, Luis Chamberlain wrote:
On Thu, Oct 21, 2021 at 08:39:05AM +0800, Ming Lei wrote:
On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
A second series of tests is hitting CTRL-C on either one randomly and restarting testing once again randomly.
ltp/zram02.sh has a cleanup handler, installed via trap, that cleans up everything (swapoff/umount/reset/rmmod). ctrl-c terminates the current foreground task and causes the shell to run the cleanup handler first, but a further ctrl-c terminates the cleanup handler itself, so the cleanup is not completed: for example, the zram disk is left as a swap device and zram can't be unloaded. The idea can be observed via the following script:
#!/bin/bash
trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
sleep 30
After the above script is run in the foreground, pressing ctrl-c the first time terminates 'sleep 30', then the trap command runs, so you can see "enter trap" printed. If you then press ctrl-c a second time, 'sleep 20' is terminated immediately. So 'swapoff' in zram02.sh's trap function can be terminated in this way.
The zram disk being left as a swap disk can be observed with your patch too after terminating via multiple ctrl-c, which has to be done this way because the test is an infinite loop.
So it is hard to clean up everything completely once multiple ctrl-c are involved, and it should be impossible. It takes multiple violent ctrl-c presses to terminate the dead-loop test.
So it isn't reasonable to expect that zram can be always unloaded successfully after the test script is terminated via multiple ctrl-c.
For the life of me, I do not run into these issues with my patch. But with yours I did.
To be clear, I run zram02.sh on two terminals. Then to interrupt I just leave CTRL-C pressed to issue multiple terminations until the script is done on each terminal at a time, until I see both have completed.
I repeat the same test, noting always that when I start on one terminal the test succeeds. And also when I completely cancel one script the test continues fine without issue.
As I explained wrt. shell's trap, this issue won't be avoided from userspace because trap function can be terminated by ctrl-c too, otherwise one shell script may not be terminated at all.
The unclean shutdown can be observed in single 'while true; do zram02.sh; done' too on both your patches and mine.
Also it is insane to write a test as a dead loop, and people seldom do that; I have not seen it done that way in either blktests or xfstests.
If you limit this test to a long enough run time (one or several hours) or a big enough loop count, I believe it can be terminated cleanly, such as:
cnt=0
MAX=10000
while [ $cnt -lt $MAX ]; do
	PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh
	cnt=$((cnt + 1))
done
Thanks, Ming
On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
...
We need to understand the exact reason why there is still a cpuhp node left. Can you share with us the exact steps for reproducing the issue? Otherwise we may have to trace and narrow down the reason.
See my commit log for my own fix for this issue.
OK, thanks!
I can reproduce the issue, and the reason is that reset_store fails zram_remove() when unloading module, then the warning is caused.
The top 3 patches in the following tree can fix the issue:
Thanks for trying an alternative fix! A crash stops yes, however this
I doubt it is an alternative, since your patchset doesn't mention the exact reason for 'Error: Removing state 63 which has instances left.', which is simply caused by failing to remove zram because ->claim is set while unloading the module.
Well I disagree because it does explain how the race can happen, and it also explains how since the sysfs interface is exposed until module removal completes, it leaves exposed knobs to allow re-initializing of a struct zcomp for a zram device before the exit.
Yeah, you mentioned the race between disksize_store() and zram_remove(); however, I don't think it is reproduced easily in the test because the race window is pretty small. It can also be fixed easily in my 3rd patch without any complicated tricks.
Reproducing for me is... extremely easy.
In my observation, failing zram_remove() is extremely easy to trigger. It is caused by reset_store(), which sets ->claim to true, so zram_remove() fails and zram_reset_device() is bypassed; that in turn causes the failure 'Error: Removing state 63 which has instances left.'
Are we on the same page?
The actual first issue is that the CPU hotplug remove callback is long gone and in the meantime we allow a race to add a new "instance", in the zram driver's case a cpu struct zcomp instance, through the sysfs interface.
Regardless of if zram_remove() can fail or not, the above race needs to be addressed.
I have not dug into the details of your patchset, which grabs a module reference count during show/store of kernfs attributes (done in your patch 9), but IMO this way isn't necessary:
That's to address the deadlock only.
- any driver module has to clean up anything which may refer to symbols or data defined in module_exit of this driver
Yes, and as the cpu multistate hotplug documentation warns (although such documentation is kind of hidden) that driver authors need to be careful with module removal too, refer to the warning at the end of __cpuhp_remove_state_cpuslocked() about module removal.
It is zram's bug. zram has to clean up everything in module_exit(); unfortunately zram_remove() can fail when called from module_exit() because ->claim is set to true by reset_store(), so zram_reset_device() (->zcomp_destroy) isn't called. This failure should not happen when unloading the module, should it?
You're addressing a possibly failing zram_remove() while I address not allowing entry to muck with the zram driver at all once we're bailing on module removal.
- device_del() is often done in module_exit(); once device_del() returns, no new show/store on the device's kobject attributes is possible.
Right and if a sysfs knob is exposed before device_del() completes and is allowed to do things, the driver should take care to prevent races for CPU multistate support. The small state machine I added ensures
What is the race for CPU multistate support? If you mean 'Error: Removing state 63 which has instances left.', it is zram's bug since zram has to cleanup everything in module_exit().
Yes. And it is what my out of tree yet Acked patch, 'zram: fix crashes with cpu hotplug multistate' does.
we don't run over any expectations from cpu hotplug multistate support.
I've *never* suggested there cannot be alternatives to my solution with the small state machine, but for you to say it is incorrect is simply not right either.
- it is _not_ a must or a pattern for fixing bugs to hold a lock before calling device_del() while the same lock is required in the device's attribute show()/store(), which easily causes an AA deadlock. Your approach just avoids the issue by not releasing the module until all show/store are done.
Right, there are two approaches here:
a) Your approach is to accept the deadlock as a requirement and so you would prefer to implement an alternative to using a shared lock on module exit and sysfs op.
wrt. in-tree zram, there is no deadlock either in linus tree or after applying my 3 patches. If you think there is, please share with us the code or the lockdep warning.
Right, 'zram: fix crashes with cpu hotplug multistate' is not yet merged. My approach to fixing that does add a lock use on module removal, which does introduce a possible deadlock with sysfs, which is later addressed generically between sysfs and module removal for all drivers.
b) While I address such a deadlock head on, as I think this sort of locking should be allowed for two reasons:
b1) we never documented such a requirement otherwise.
b2) There is a possibility that other drivers already exist which *do* use a shared lock on module removal and sysfs ops (and I just confirmed this to be true)
The 'deadlock' is actually caused by your out-of-tree patch of 'zram: fix crashes with cpu hotplug multistate' which adds mutex_lock(zram_index_mutex) in destroy_devices().
Yes yes, but you are completely throwing out the window that other possible deadlocks can exist in the kernel *and* that *new* cases of the deadlock can easily also be added!
We can fix this issue easily without needing the global lock, please see the attached(pre-V2) patch.
So far your patches do not fix the issues though...
So I *really* don't think it is wise for us to simply accept this new found deadlock as a *new* requirement, specially if we can fix it easily.
A cursory review using Coccinelle of potential issues with a mutex lock directly used on module exit (so this doesn't cover drivers like zram which use a routine and then grab the lock through indirection) and in a sysfs op shows these drivers are also affected by this deadlock:
- arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
In fsl_wakeup_sys_exit(), device_remove_file() is called before acquiring &sysfs_lock, so there shouldn't be such AA deadlock.
- lib/test_firmware.c
Yeah, there is the AA deadlock risk, but it should be fixed by moving misc_deregister() out of &test_fw_mutex.
And just like that you are ignoring other possible uses in the kernel which might have similar deadlocks.
So do you want to take the position:
Hey driver authors: you cannot use any shared lock on module removal and on sysfs ops?
Luis
On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
So do you want to take the position:
Hey driver authors: you cannot use any shared lock on module removal and on sysfs ops?
Yes, I would not recommend using such a lock at all. sysfs operations happen on a per-device basis, so you can lock the device structure. Module removal happens on a driver basis, and I have no idea what you want to lock there, but odds are it is NOT shared with your per-device structures either, right?
If so, then yes, that is a bug, but a very rare one as drivers should do almost nothing except register/unregister_driver() in their module init/exit calls.
zram is not a "normal" driver at all here, so fixing this type of problem up should be done in the zram code, it is not a generic module/sysfs issue at all.
thanks,
greg k-h
On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
So do you want to take the position:
Hey driver authors: you cannot use any shared lock on module removal and on sysfs ops?
Yes, I would not recommend using such a lock at all. sysfs operations happen on a per-device basis, so you can lock the device structure.
All devices are going to be removed on module removal and so cannot be locked.
Luis
On Tue, Oct 19, 2021 at 09:30:05AM -0700, Luis Chamberlain wrote:
On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
So do you want to take the position:
Hey driver authors: you cannot use any shared lock on module removal and on sysfs ops?
Yes, I would not recommend using such a lock at all. sysfs operations happen on a per-device basis, so you can lock the device structure.
All devices are going to be removed on module removal and so cannot be locked.
devices are not normally created by a driver, that is up to the bus controller logic. A module will just disconnect itself from the device, the device does not go away.
But yes, there are exceptions, and if you are doing something odd like that, then you need to be aware of crazy things like this, so be careful. But for all normal drivers, they do not have to worry about this.
thanks,
greg k-h
On Tue, Oct 19, 2021 at 07:28:35PM +0200, Greg KH wrote:
On Tue, Oct 19, 2021 at 09:30:05AM -0700, Luis Chamberlain wrote:
On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
So do you want to take the position:
Hey driver authors: you cannot use any shared lock on module removal and on sysfs ops?
Yes, I would not recommend using such a lock at all. sysfs operations happen on a per-device basis, so you can lock the device structure.
All devices are going to be removed on module removal and so cannot be locked.
devices are not normally created by a driver, that is up to the bus controller logic. A module will just disconnect itself from the device, the device does not go away.
But yes, there are exceptions, and if you are doing something odd like that, then you need to be aware of crazy things like this, so be careful. But for all normal drivers, they do not have to worry about this.
"Recommend" is a weak position to take given a possible deadlock with sysfs.
Do we want to at the very least document this is not a supported scheme?
If so I can also add a simple 1 level indirection coccinelle patch to detect these schemes and complain about them as well, if we are going to take this position.
But to simply disregard this as "not an issue", or to say we won't do anything, seems pretty counterproductive given we *did* have drivers with this issue before *and* still have them upstream, and can end up with more drivers like this later.
Luis
On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
...
> >
> > We need to understand the exact reason why there is still cpuhp node
> > left, can you share us the exact steps for reproducing the issue?
> > Otherwise we may have to trace and narrow down the reason.
>
> See my commit log for my own fix for this issue.
OK, thanks!
I can reproduce the issue, and the reason is that reset_store fails zram_remove() when unloading module, then the warning is caused.
The top 3 patches in the following tree can fix the issue:
Thanks for trying an alternative fix! A crash stops yes, however this
I doubt it is alternative since your patchset doesn't mention the exact reason of 'Error: Removing state 63 which has instances left.', that is simply caused by failing to remove zram because ->claim is set during unloading module.
Well I disagree because it does explain how the race can happen, and it also explains how since the sysfs interface is exposed until module removal completes, it leaves exposed knobs to allow re-initializing of a struct zcomp for a zram device before the exit.
Yeah, you mentioned the race between disksize_store() vs. zram_remove(), however I don't think it is reproduced easily in the test because the race window is pretty small, also it can be fixed easily in my 3rd patch without any complicated tricks.
Reproducing for me is... extremely easy.
In my observation, failing zram_remove() is extremely easy to trigger: it is caused by reset_store() setting ->claim to true, so zram_remove() fails and zram_reset_device() is bypassed, and then the failure 'Error: Removing state 63 which has instances left.' follows.
Are we on the same page?
The actual first issue is that the CPU hotplug remove callback is long gone and in the meantime we allow a race to add a new "instance", in the zram driver's case a per-cpu struct zcomp instance, through the sysfs interface.
Regardless of if zram_remove() can fail or not, the above race needs to be addressed.
Not dig into details of your patchset via grabbing module reference count during show/store attribute of kernfs which is done in your patch 9, but IMO this way isn't necessary:
That's to address the deadlock only.
- any driver module has to cleanup anything which may refer to symbols
or data defined in module_exit of this driver
Yes, and as the cpu multistate hotplug documentation warns (although such documentation is kind of hidden) that driver authors need to be careful with module removal too, refer to the warning at the end of __cpuhp_remove_state_cpuslocked() about module removal.
It is zram's bug. zram has to clean everything in module_exit(), unfortunately zram_remove() can be failed when calling from module_exit() because ->claim is set as true by reset_store(), then zram_reset_device()(->zcomp_destroy) isn't called, and this failure should not happen when unloading module, should it?
You're addressing a possibly failing zram_remove() while I address not allowing entry to muck with the zram driver at all once we're bailing on module removal.
- device_del() is often done in module_exit(), once device_del()
returns, no any new show/store on the device's kobject attribute is possible.
Right and if a sysfs knob is exposed before device_del() completes and is allowed to do things, the driver should take care to prevent races for CPU multistate support. The small state machine I added ensures
What is the race for CPU multistate support? If you mean 'Error: Removing state 63 which has instances left.', it is zram's bug since zram has to cleanup everything in module_exit().
Yes. And it is what my out of tree yet Acked patch, 'zram: fix crashes with cpu hotplug multistate' does.
Unfortunately that patch adds new deadlock between hot_remove_store() and disksize_store() & others, see my below comment.
we don't run over any expectations from cpu hotplug multistate support.
I've *never* suggested there cannot be alternatives to my solution with the small state machine, but for you to say it is incorrect is simply not right either.
- it is _not_ a must or pattern for fixing bugs to hold one lock before
calling device_del(), meantime the lock is required in the device's attribute show()/store(), which causes AA deadlock easily. Your approach just avoids the issue by not releasing module until all show/store are done.
Right, there are two approaches here:
a) Your approach is to accept the deadlock as a requirement and so you would prefer to implement an alternative to using a shared lock on module exit and sysfs op.
wrt. in-tree zram, there is neither any deadlock in linus tree, nor after applying my 3 patches. If you think there is, please share us the code or lockdep warning.
Right, 'zram: fix crashes with cpu hotplug multistate' is not yet merged. My approach to fixing that does add a lock use on module removal, which does introduce a possible deadlock with sysfs, which is later addressed generically between sysfs and module removal for all drivers.
b) While I address such a deadlock head on, as I think this sort of locking should be allowed for two reasons: b1) we never documented such a requirement otherwise. b2) There is a possibility that other drivers already exist too which *do* use a shared lock on module removal and sysfs ops (and I just confirmed this to be true)
The 'deadlock' is actually caused by your out-of-tree patch of 'zram: fix crashes with cpu hotplug multistate' which adds mutex_lock(zram_index_mutex) in destroy_devices().
Yes yes, but you are completely throwing out the window that other possible deadlocks can exist in the kernel *and* that *new* cases of the deadlock can easily also be added!
We can fix this issue easily without needing the global lock, please see the attached(pre-V2) patch.
So far your patches do not fix the issues though...
So I *really* don't think it is wise for us to simply accept this new found deadlock as a *new* requirement, specially if we can fix it easily.
A cursory review using Coccinelle for potential issues with a mutex lock used directly on module exit (so this doesn't cover drivers like zram, which uses a routine and then grabs the lock through indirection) and in a sysfs op shows these drivers are also affected by this deadlock:
- arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
In fsl_wakeup_sys_exit(), device_remove_file() is called before acquiring &sysfs_lock, so there shouldn't be such AA deadlock.
- lib/test_firmware.c
Yeah, there is the AA deadlock risk, but it should be fixed by moving misc_deregister() out of &test_fw_mutex.
And just like that you are ignoring other possible uses in the kernel which might have similar deadlocks.
So do you want to take the position:
Hey driver authors: you cannot use any shared lock on module removal and on sysfs ops?
IMO, yes, in your patch of 'zram: fix crashes with cpu hotplug multistate', when you added mutex_lock(zram_index_mutex) to disksize_store() and other attribute show() or store() method. You have added new deadlock between hot_remove_store() and disksize_store() & others, which can't be addressed by your approach of holding module refcnt.
So far not see ltp tests covers hot add/remove interface yet.
Thanks, Ming
On Wed, Oct 20, 2021 at 12:39:22AM +0800, Ming Lei wrote:
On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
So do you want to take the position:
Hey driver authors: you cannot use any shared lock on module removal and on sysfs ops?
IMO, yes, in your patch of 'zram: fix crashes with cpu hotplug multistate', when you added mutex_lock(zram_index_mutex) to disksize_store() and other attribute show() or store() method. You have added new deadlock between hot_remove_store() and disksize_store() & others, which can't be addressed by your approach of holding module refcnt.
So far not see ltp tests covers hot add/remove interface yet.
Care to show what commands to use to cause this deadlock with my patches?
Luis
On Tue, Oct 19, 2021 at 12:38:42PM -0700, Luis Chamberlain wrote:
On Wed, Oct 20, 2021 at 12:39:22AM +0800, Ming Lei wrote:
On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
So do you want to take the position:
Hey driver authors: you cannot use any shared lock on module removal and on sysfs ops?
IMO, yes, in your patch of 'zram: fix crashes with cpu hotplug multistate', when you added mutex_lock(zram_index_mutex) to disksize_store() and other attribute show() or store() method. You have added new deadlock between hot_remove_store() and disksize_store() & others, which can't be addressed by your approach of holding module refcnt.
So far not see ltp tests covers hot add/remove interface yet.
Care to show what commands to use to cause this deadlock with my patches?
Build a kernel with your patch 4,7,8,9,11 and 12(all others are test module or document change), with lockdep enabled, run the following command, then you will see the warning, and it is one real deadlock, not false warning.
BTW, your patch 9 can't be applied cleanly against both linus and next tree, so I edited it manually, but that can't make difference wrt. this issue.
[root@ktest-09 ~]# lsblk | grep zram
zram0 253:0 0 0B 0 disk
cat /sys/class/zram-control/hot_add
[root@ktest-09 ~]# lsblk | grep zram
zram0 253:0 0 0B 0 disk
zram1 253:1 0 0B 0 disk
[root@ktest-09 ~]# echo 256M > /sys/block/zram1/disksize
[root@ktest-09 ~]# echo 1 > /sys/class/zram-control/hot_remove
[root@ktest-09 ~]# dmesg
...
[ 75.599882] ======================================================
[ 75.601355] WARNING: possible circular locking dependency detected
[ 75.602818] 5.15.0-rc3_zram_fix_luis+ #24 Not tainted
[ 75.604038] ------------------------------------------------------
[ 75.605512] bash/1154 is trying to acquire lock:
[ 75.606634] ffff91ce026cd428 (kn->active#237){++++}-{0:0}, at: __kernfs_remove+0x1ab/0x1e0
[ 75.608570] but task is already holding lock:
[ 75.609955] ffffffff839e3ef0 (zram_index_mutex){+.+.}-{3:3}, at: hot_remove_store+0x52/0xf0
[ 75.611910] which lock already depends on the new lock.
[ 75.613896] the existing dependency chain (in reverse order) is:
[ 75.615830] -> #1 (zram_index_mutex){+.+.}-{3:3}:
[ 75.617483]  __lock_acquire+0x4d2/0x930
[ 75.618650]  lock_acquire+0xbb/0x2d0
[ 75.619748]  __mutex_lock+0x8e/0x8a0
[ 75.620854]  disksize_store+0x38/0x180
[ 75.621996]  kernfs_fop_write_iter+0x134/0x1d0
[ 75.623287]  new_sync_write+0x122/0x1b0
[ 75.624442]  vfs_write+0x23e/0x350
[ 75.625506]  ksys_write+0x68/0xe0
[ 75.626550]  do_syscall_64+0x3b/0x90
[ 75.627649]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 75.629070] -> #0 (kn->active#237){++++}-{0:0}:
[ 75.630677]  check_prev_add+0x91/0xc10
[ 75.631816]  validate_chain+0x474/0x500
[ 75.632972]  __lock_acquire+0x4d2/0x930
[ 75.634131]  lock_acquire+0xbb/0x2d0
[ 75.635234]  kernfs_drain+0x139/0x190
[ 75.636355]  __kernfs_remove+0x1ab/0x1e0
[ 75.637532]  kernfs_remove_by_name_ns+0x3f/0x80
[ 75.638843]  remove_files+0x2b/0x60
[ 75.639926]  sysfs_remove_group+0x38/0x80
[ 75.641120]  sysfs_remove_groups+0x29/0x40
[ 75.642334]  device_remove_attrs+0x5b/0x90
[ 75.643552]  device_del+0x184/0x400
[ 75.644635]  zram_remove+0xac/0xc0
[ 75.645700]  hot_remove_store+0xa3/0xf0
[ 75.646856]  kernfs_fop_write_iter+0x134/0x1d0
[ 75.648147]  new_sync_write+0x122/0x1b0
[ 75.649311]  vfs_write+0x23e/0x350
[ 75.650372]  ksys_write+0x68/0xe0
[ 75.651412]  do_syscall_64+0x3b/0x90
[ 75.652512]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 75.653929] other info that might help us debug this:
[ 75.656054] Possible unsafe locking scenario:
[ 75.657637]        CPU0                    CPU1
[ 75.658833]        ----                    ----
[ 75.660020]   lock(zram_index_mutex);
[ 75.661024]                                lock(kn->active#237);
[ 75.662549]                                lock(zram_index_mutex);
[ 75.664103]   lock(kn->active#237);
[ 75.665072]  *** DEADLOCK ***
[ 75.666736] 4 locks held by bash/1154:
[ 75.667767]  #0: ffff91ce06983470 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x68/0xe0
[ 75.669802]  #1: ffff91ce4123d290 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x100/0x1d0
[ 75.672050]  #2: ffff91ce05a7ac40 (kn->active#238){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x108/0x1d0
[ 75.674383]  #3: ffffffff839e3ef0 (zram_index_mutex){+.+.}-{3:3}, at: hot_remove_store+0x52/0xf0
[ 75.676595] stack backtrace:
[ 75.677835] CPU: 2 PID: 1154 Comm: bash Not tainted 5.15.0-rc3_zram_fix_luis+ #24
[ 75.679768] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[ 75.681927] Call Trace:
[ 75.682674]  dump_stack_lvl+0x57/0x7d
[ 75.683680]  check_noncircular+0xff/0x110
[ 75.684758]  ? stack_trace_save+0x4b/0x70
[ 75.685843]  check_prev_add+0x91/0xc10
[ 75.686867]  ? add_chain_cache+0x112/0x2d0
[ 75.687965]  validate_chain+0x474/0x500
[ 75.689005]  __lock_acquire+0x4d2/0x930
[ 75.690054]  lock_acquire+0xbb/0x2d0
[ 75.691038]  ? __kernfs_remove+0x1ab/0x1e0
[ 75.692131]  ? __lock_release+0x179/0x2c0
[ 75.693212]  ? kernfs_drain+0x5b/0x190
[ 75.694239]  kernfs_drain+0x139/0x190
[ 75.695240]  ? __kernfs_remove+0x1ab/0x1e0
[ 75.696341]  __kernfs_remove+0x1ab/0x1e0
[ 75.697408]  kernfs_remove_by_name_ns+0x3f/0x80
[ 75.698607]  remove_files+0x2b/0x60
[ 75.699576]  sysfs_remove_group+0x38/0x80
[ 75.700661]  sysfs_remove_groups+0x29/0x40
[ 75.701770]  device_remove_attrs+0x5b/0x90
[ 75.702870]  device_del+0x184/0x400
[ 75.703835]  zram_remove+0xac/0xc0
[ 75.704785]  hot_remove_store+0xa3/0xf0
[ 75.705831]  kernfs_fop_write_iter+0x134/0x1d0
[ 75.707004]  new_sync_write+0x122/0x1b0
[ 75.708048]  ? __do_fast_syscall_32+0xe0/0xf0
[ 75.709214]  vfs_write+0x23e/0x350
[ 75.710161]  ksys_write+0x68/0xe0
[ 75.711088]  do_syscall_64+0x3b/0x90
[ 75.712078]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 75.713389] RIP: 0033:0x7fcc1893f927
[ 75.714381] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 75.718879] RSP: 002b:00007ffcd56d91a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 75.720832] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fcc1893f927
[ 75.722592] RDX: 0000000000000002 RSI: 000055d7d33f78c0 RDI: 0000000000000001
[ 75.724352] RBP: 000055d7d33f78c0 R08: 0000000000000000 R09: 00007fcc189f44e0
[ 75.726123] R10: 00007fcc189f43e0 R11: 0000000000000246 R12: 0000000000000002
[ 75.727884] R13: 00007fcc18a395a0 R14: 0000000000000002 R15: 00007fcc18a397a0
Thanks, Ming
ATTRIBUTE_GROUPS is typically used to avoid boilerplate code which is repeated in many drivers. Embracing ATTRIBUTE_GROUPS was long overdue on the zram driver; in addition, a recent fix for sysfs allows users of ATTRIBUTE_GROUPS to also associate a module with the group attribute.
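To make the boilerplate savings concrete, here is a minimal userspace sketch of what ATTRIBUTE_GROUPS generates. The two macros have the same shape as the helpers in include/linux/sysfs.h, but struct attribute and struct attribute_group are cut-down stand-ins, the two attributes are hypothetical placeholders for zram's real list, and count_group_attrs() is an illustrative helper that is not part of the kernel. The module association added by the sysfs fix in this series is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; the real ones carry more fields. */
struct attribute {
	const char *name;
};

struct attribute_group {
	struct attribute **attrs;
};

/* Same shape as the helpers in include/linux/sysfs.h. */
#define __ATTRIBUTE_GROUPS(_name)				\
static const struct attribute_group *_name##_groups[] = {	\
	&_name##_group,						\
	NULL,							\
}

#define ATTRIBUTE_GROUPS(_name)					\
static const struct attribute_group _name##_group = {		\
	.attrs = _name##_attrs,					\
};								\
__ATTRIBUTE_GROUPS(_name)

/* Two hypothetical attributes, standing in for zram's real list. */
static struct attribute attr_disksize = { .name = "disksize" };
static struct attribute attr_reset = { .name = "reset" };

static struct attribute *zram_disk_attrs[] = {
	&attr_disksize,
	&attr_reset,
	NULL,
};

/* Generates zram_disk_group and the NULL-terminated zram_disk_groups[]. */
ATTRIBUTE_GROUPS(zram_disk);

/* Illustrative helper: count attributes reachable through a groups array. */
static size_t count_group_attrs(const struct attribute_group **groups)
{
	size_t n = 0;

	assert(groups != NULL);
	for (; *groups; groups++)
		for (struct attribute **a = (*groups)->attrs; *a; a++)
			n++;
	return n;
}
```

The single ATTRIBUTE_GROUPS(zram_disk) line replaces the two open-coded definitions, and the generated array is named zram_disk_groups, which is why the device_add_disk() call has to be updated to use that name.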
In zram's case this also lets us fix a race which triggers a deadlock in the zram driver. The deadlock happens when a sysfs attribute uses a lock that is also used on module removal, for instance when a sysfs file of the driver is being accessed at the same time as module removal is triggered. The module removal code holds the lock, and the sysfs attribute op waits for the same lock. While holding the lock, module removal tries to remove the sysfs entries, but these cannot be removed while an op on one of them is still in flight, and that op is waiting for the lock. The op won't complete as the lock is already held, so module removal cannot complete either, and we deadlock.
Sysfs fixes this when the group attributes have a module associated with them: sysfs will *try* to get a refcount on the module prior to mucking with a sysfs attribute, which covers the case where a shared lock is used. If this fails we just give up right away.
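The try-then-give-up idea can be sketched in userspace as follows. This is only a model under stated assumptions, not the sysfs code: struct fake_module and all fake_* names are hypothetical, the refcount scheme loosely mirrors try_module_get() (succeed only while the count is non-zero) and module removal (proceed only when no extra references are held), and the -ENODEV return value is an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

#define FAKE_REF_BASE 1	/* count held by a live, idle module */

/* Hypothetical stand-in for the module refcount. */
struct fake_module {
	atomic_int refcnt;	/* 0 means removal has started */
};

/* Take a reference unless removal already started (increment if non-zero). */
static bool fake_try_module_get(struct fake_module *mod)
{
	int old = atomic_load(&mod->refcnt);

	while (old != 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&mod->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void fake_module_put(struct fake_module *mod)
{
	atomic_fetch_sub(&mod->refcnt, 1);
}

/* Removal only starts if nobody else holds a reference. */
static int fake_try_stop_module(struct fake_module *mod)
{
	int expected = FAKE_REF_BASE;

	if (!atomic_compare_exchange_strong(&mod->refcnt, &expected, 0))
		return -EWOULDBLOCK;
	return 0;
}

/* Placeholder for a driver's store op that would take the shared lock. */
static int fake_store_ok(void)
{
	return 0;
}

/*
 * Model of the attribute path: pin the module first and bail out right
 * away if that fails, so the op never waits on a lock held by removal.
 */
static int fake_attr_store(struct fake_module *mod, int (*store)(void))
{
	int ret;

	if (!fake_try_module_get(mod))
		return -ENODEV;	/* assumed error code */
	ret = store();
	fake_module_put(mod);
	return ret;
}
```

In this model either the attribute op pins the module and removal backs off, or removal wins and the op gives up before touching any driver lock, so the two paths never wait on each other.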
This deadlock was first reported with the zram driver, a sketch of how this can happen follows:
CPU A                              CPU B
whatever_store()
                                   module_unload
                                     mutex_lock(foo)
mutex_lock(foo)
                                     del_gendisk(zram->disk);
                                       device_del()
                                         device_remove_groups()
In this situation whatever_store() is waiting for the mutex foo to become unlocked, but that won't happen until module removal is complete. But module removal won't complete until the sysfs file being poked completes which is waiting for a lock already held.
This issue can be reproduced easily on the zram driver as follows:
Loop 1 on one terminal:
while true; do modprobe zram; modprobe -r zram; done
Loop 2 on a second terminal:
while true; do echo 1024 > /sys/block/zram0/disksize; echo 1 > /sys/block/zram0/reset; done
Without this patch we end up in a deadlock, and the following stack trace is produced which hints to us what the issue was:
INFO: task bash:888 blocked for more than 120 seconds.
      Tainted: G E 5.12.0-rc1-next-20210304+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash state:D stack: 0 pid: 888 ppid: 887 flags:<etc>
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.constprop.0+0x2c3/0x490
 ? _kstrtoull+0x35/0xd0
 reset_store+0x6c/0x160 [zram]
 kernfs_fop_write_iter+0x124/0x1b0
 new_sync_write+0x11c/0x1b0
 vfs_write+0x1c2/0x260
 ksys_write+0x5f/0xe0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f34f2c3df33
RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
INFO: task modprobe:1104 can't die for more than 120 seconds.
task:modprobe state:D stack: 0 pid: 1104 ppid: 916 flags:<etc>
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 __kernfs_remove.part.0+0x228/0x2b0
 ? finish_wait+0x80/0x80
 kernfs_remove_by_name_ns+0x50/0x90
 remove_files+0x2b/0x60
 sysfs_remove_group+0x38/0x80
 sysfs_remove_groups+0x29/0x40
 device_remove_attrs+0x4a/0x80
 device_del+0x183/0x3e0
 ? mutex_lock+0xe/0x30
 del_gendisk+0x27a/0x2d0
 zram_remove+0x8a/0xb0 [zram]
 ? hot_remove_store+0xf0/0xf0 [zram]
 zram_remove_cb+0xd/0x10 [zram]
 idr_for_each+0x5e/0xd0
 destroy_devices+0x39/0x6f [zram]
 __do_sys_delete_module+0x190/0x2a0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f32adf727d7
RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20
[0] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/zram/zram_drv.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b26abcb955cc..60a55ae8cd91 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1902,14 +1902,7 @@ static struct attribute *zram_disk_attrs[] = {
 	NULL,
 };
 
-static const struct attribute_group zram_disk_attr_group = {
-	.attrs = zram_disk_attrs,
-};
-
-static const struct attribute_group *zram_disk_attr_groups[] = {
-	&zram_disk_attr_group,
-	NULL,
-};
+ATTRIBUTE_GROUPS(zram_disk);
 
 /*
  * Allocate and initialize new zram device. the function returns
@@ -1981,7 +1974,7 @@ static int zram_add(void)
 	blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
 	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
-	device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
+	device_add_disk(NULL, zram->disk, zram_disk_groups);
 
 	strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
On Mon, Sep 27, 2021 at 09:38:05AM -0700, Luis Chamberlain wrote:
The ATTRIBUTE_GROUPS is typically used to avoid boiler plate code which is used in many drivers. Embracing ATTRIBUTE_GROUPS was long due on the zram driver, however a recent fix for sysfs allows users of ATTRIBUTE_GROUPS to also associate a module to the group attribute.
Does this mean that other modules using sysfs but _not_ ATTRIBUTE_GROUPS() are still vulnerable to potential use-after-free of the kernfs fops?
-Kees
On Tue, Oct 05, 2021 at 01:57:00PM -0700, Kees Cook wrote:
On Mon, Sep 27, 2021 at 09:38:05AM -0700, Luis Chamberlain wrote:
The ATTRIBUTE_GROUPS is typically used to avoid boiler plate code which is used in many drivers. Embracing ATTRIBUTE_GROUPS was long due on the zram driver, however a recent fix for sysfs allows users of ATTRIBUTE_GROUPS to also associate a module to the group attribute.
Does this mean that other modules using sysfs but _not_ ATTRIBUTE_GROUPS() are still vulnerable to potential use-after-free of the kernfs fops?
The issue is not UAF, it's the possible deadlock, but in that sense, yes. If they don't use ATTRIBUTE_GROUPS() then there is no information being provided to sysfs about the module owner.
Luis