TPM 1's support for its hardware RNG is broken across system suspends, due to races or locking issues or something else that haven't been diagnosed or fixed yet. These issues prevent the system from actually suspending. So disable the driver in this case. Later, when this is fixed properly, we can remove this.
Current breakage amounts to something like:
tpm tpm0: A TPM error (28) occurred continue selftest ... tpm tpm0: A TPM error (28) occurred attempting get random ... tpm tpm0: Error (28) sending savestate before suspend tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28 tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28 tpm_tis 00:08: PM: failed to suspend: error 28 PM: Some devices failed to suspend, or early wake event detected
This issue was partially fixed by 23393c646142 ("char: tpm: Protect tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took directly because the TPM maintainers weren't available. However, it seems like this just addresses the most common cases of the bug, rather than addressing it entirely. So there are more things to fix still, apparently.
The hwrng driver appears already to be occasionally disabled due to other conditions, so this shouldn't be too large of a surprise.
Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/ Cc: stable@vger.kernel.org # 6.1+ Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com --- drivers/char/tpm/tpm-chip.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c index 741d8f3e8fb3..eed67ea8d3a7 100644 --- a/drivers/char/tpm/tpm-chip.c +++ b/drivers/char/tpm/tpm-chip.c @@ -524,6 +524,14 @@ static int tpm_add_hwrng(struct tpm_chip *chip) if (!IS_ENABLED(CONFIG_HW_RANDOM_TPM) || tpm_is_firmware_upgrade(chip)) return 0;
+ /* + * This driver's support for using the RNG across suspend is broken on + * TPM1. Until somebody fixes this, just stop registering a HWRNG in + * that case. + */ + if (!(chip->flags & TPM_CHIP_FLAG_TPM2) && IS_ENABLED(CONFIG_PM_SLEEP)) + return 0; + snprintf(chip->hwrng_name, sizeof(chip->hwrng_name), "tpm-rng-%d", chip->dev_num); chip->hwrng.name = chip->hwrng_name;
On Thu, Jan 05, 2023 at 03:47:42PM +0100, Jason A. Donenfeld wrote:
TPM 1's support for its hardware RNG is broken across system suspends, due to races or locking issues or something else that haven't been diagnosed or fixed yet. These issues prevent the system from actually suspending. So disable the driver in this case. Later, when this is fixed properly, we can remove this.
Current breakage amounts to something like:
tpm tpm0: A TPM error (28) occurred continue selftest ... tpm tpm0: A TPM error (28) occurred attempting get random ... tpm tpm0: Error (28) sending savestate before suspend tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28 tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28 tpm_tis 00:08: PM: failed to suspend: error 28 PM: Some devices failed to suspend, or early wake event detected
This issue was partially fixed by 23393c646142 ("char: tpm: Protect tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took directly because the TPM maintainers weren't available. However, it seems like this just addresses the most common cases of the bug, rather than addressing it entirely. So there are more things to fix still, apparently.
The hwrng driver appears already to be occasionally disabled due to other conditions, so this shouldn't be too large of a surprise.
Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/ Cc: stable@vger.kernel.org # 6.1+ Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
Quoting from my previous email:
| I spent a long time working through the TPM code when this came up | during 6.1. I set up the TPM emulator with QEMU and reproduced this and | had a whole test setup and S3 fuzzer. It took a long time, and when I was | done, I paged it all out of my brain. When I found that patch from Jan | that fixed the problem most of the time (but not all the time), I wasted | tons of time trying to get the TPM maintainers to take the patch and | send it to Linus as part of rc7 or rc8. But they all ignored me, and | eventually Linus just took that patch directly. | | So I don't think I want to go down another rabbit hole here, having | experienced the TPM maintainers not really caring much, and that sucking | away the remaining energy I had before to keep looking at the issue and | its edge cases not handled by Jan's patch. | | On the contrary, it'd make a big difference if the TPM maintainers could | actually help analyze the code that they're most familiar with, so that | we can get to the bottom of this. That's a lot better than some random | drive-by patches from a non-TPM person like me; before the 6.1 bug, I'd | never even looked at these drivers. | | My plan is to therefore be available to help and analyze and test and | maybe even write some code, if the TPM maintainers take the lead on | getting to the bottom of this. But if this hits neglect again like last | time, I'll just send a `depends on BROKEN if PM` patch to the TPM | hw_random driver and see what happens... That's a really awful solution | though, so I hope the maintainers will wake up this cycle.
Seeing as there's still no life from the TPM maintainers, here's the patch to make the problem go away until they wake up. When they do wake up, though, I will be available to start looking into this again in whatever capacity I might be useful.
Jason
On Thu, Jan 5, 2023 at 6:48 AM Jason A. Donenfeld Jason@zx2c4.com wrote:
TPM 1's support for its hardware RNG is broken across system suspends, due to races or locking issues or something else that haven't been diagnosed or fixed yet. These issues prevent the system from actually suspending. So disable the driver in this case. Later, when this is fixed properly, we can remove this.
How about just keeping it enabled, but not making it a fatal error if the TPM saving doesn't work? IOW, just print the warning, and then "return 0" from the suspend function.
I doubt anybody cares, but your patch disables that TPM device just because PM is *enabled*. That's basically "all the time".
Imagine being on a desktop with a distro kernel that enables suspend - because that kernel obviously is expected to work on laptops too. You're never actually going to suspend things on that machine, but maybe you still want to register it as a source of hw random data?
Linus
On Thu, Jan 05, 2023 at 01:58:48PM -0800, Linus Torvalds wrote:
On Thu, Jan 5, 2023 at 6:48 AM Jason A. Donenfeld Jason@zx2c4.com wrote:
TPM 1's support for its hardware RNG is broken across system suspends, due to races or locking issues or something else that haven't been diagnosed or fixed yet. These issues prevent the system from actually suspending. So disable the driver in this case. Later, when this is fixed properly, we can remove this.
How about just keeping it enabled, but not making it a fatal error if the TPM saving doesn't work? IOW, just print the warning, and then "return 0" from the suspend function.
You're right that returning 0 from the pm notifier would make the problem that users actually care about -- laptop doesn't sleep when you close the lid -- go away.
From a random.c perspective, the RNG is already initialized when the driver loads, which will be before suspend bricks the driver. So even if the behavior afterwards is a buggy driver handing all zeros to random.c, it won't really matter much; random.c can deal with that cryptographically. I have no idea if this is actually the case with the driver's error condition. But if it is, it's good that it doesn't matter.
So okay, I'll roll a patch to do that when I get home. I'm writing on my phone now, but from memory it's just changing a 'return rc;' into 'return 0;'.
Then the TPM folks can fix the underlying issue at their leisure whenever.
Jason
TPM 1 is sometimes broken across system suspends, due to races or locking issues or something else that haven't been diagnosed or fixed yet, most likely having to do with concurrent reads from the TPM's hardware random number generator driver. These issues prevent the system from actually suspending, with errors like:
tpm tpm0: A TPM error (28) occurred continue selftest ... tpm tpm0: A TPM error (28) occurred attempting get random ... tpm tpm0: Error (28) sending savestate before suspend tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28 tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28 tpm_tis 00:08: PM: failed to suspend: error 28 PM: Some devices failed to suspend, or early wake event detected
This issue was partially fixed by 23393c646142 ("char: tpm: Protect tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took directly because the TPM maintainers weren't available. However, it seems like this just addresses the most common cases of the bug, rather than addressing it entirely. So there are more things to fix still, apparently.
In lieu of actually fixing the underlying bug, just allow system suspend to continue, so that laptops still go to sleep fine. Later, this can be reverted when the real bug is fixed.
Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/ Cc: stable@vger.kernel.org # 6.1+ Reported-by: Vlastimil Babka vbabka@suse.cz Suggested-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com --- This is basically untested and I haven't worked out if there are any awful implications of letting the system sleep when TPM suspend fails. Maybe some PCRs get cleared and that will make everything explode on resume? Maybe it doesn't matter? Somebody well versed in TPMology should probably [n]ack this approach.
drivers/char/tpm/tpm-interface.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c index d69905233aff..6df9067ef7f9 100644 --- a/drivers/char/tpm/tpm-interface.c +++ b/drivers/char/tpm/tpm-interface.c @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev) }
suspended: - return rc; + if (rc) + pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n", + chip->dev_num, rc); + return 0; } EXPORT_SYMBOL_GPL(tpm_pm_suspend);
Hi Todd & ChromeOS folks,
On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
TPM 1 is sometimes broken across system suspends, due to races or locking issues or something else that haven't been diagnosed or fixed yet, most likely having to do with concurrent reads from the TPM's hardware random number generator driver. These issues prevent the system from actually suspending, with errors like:
tpm tpm0: A TPM error (28) occurred continue selftest ... tpm tpm0: A TPM error (28) occurred attempting get random ... tpm tpm0: Error (28) sending savestate before suspend tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28 tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28 tpm_tis 00:08: PM: failed to suspend: error 28 PM: Some devices failed to suspend, or early wake event detected
This issue was partially fixed by 23393c646142 ("char: tpm: Protect tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took directly because the TPM maintainers weren't available. However, it seems like this just addresses the most common cases of the bug, rather than addressing it entirely. So there are more things to fix still, apparently.
In lieu of actually fixing the underlying bug, just allow system suspend to continue, so that laptops still go to sleep fine. Later, this can be reverted when the real bug is fixed.
Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/ Cc: stable@vger.kernel.org # 6.1+ Reported-by: Vlastimil Babka vbabka@suse.cz Suggested-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
This is basically untested and I haven't worked out if there are any awful implications of letting the system sleep when TPM suspend fails. Maybe some PCRs get cleared and that will make everything explode on resume? Maybe it doesn't matter? Somebody well versed in TPMology should probably [n]ack this approach.
When idling scrolling on my telephone to try to see what the implications of skipping TPM_ORD_SAVESTATE could be, I stumbled across some ChromeOS commits related to it, and realized that, ah-hah, finally there's an obvious group of stakeholders who make heavy use of the TPM and have likely amassed some expertise on it.
So I was wondering if you'd take a look at this patch briefly to make sure it won't break ChromeOS laptops.
Jason
On Thu, Jan 5, 2023 at 7:02 PM Jason A. Donenfeld Jason@zx2c4.com wrote:
In lieu of actually fixing the underlying bug, just allow system suspend to continue, so that laptops still go to sleep fine. Later, this can be reverted when the real bug is fixed.
So the patch looks fine to me, but since there's still the ChromeOS discussion pending I'll wait for that to finish.
Perhaps re-send or at least remind me if/when it does?
Also, a query about the printout:
if (rc)
pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
chip->dev_num, rc);
so I suspect that 99% of the time the dev_num isn't actually all that useful, but what *might* be useful is which tpm driver it is.
Just comparing the error dmesg output you had:
.. tpm tpm0: Error (28) sending savestate before suspend tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28 ..
that "tpm tpm0" output is kind of useless compared to the "tpm_tis 00:08" one.
So I think "dev_err(dev, ...)" would be more useful here.
Finally - and maybe I'm just being difficult here, I will note here again that TPM2 devices don't have this issue, because the TPM2 path for suspend doesn't do any of this at all.
It just does
tpm_transmit_cmd(..);
with a TPM2_CC_SHUTDOWN TPM_SU_STATE command, and doesn't even check the return value. In fact, the tpm2 code *used* to have this comment:
/* In places where shutdown command is sent there's no much we can do * except print the error code on a system failure. */ if (rc < 0 && rc != -EPIPE) dev_warn(&chip->dev, "transmit returned %d while stopping the TPM", rc);
but it was summarily removed when doing some re-organization around buffer handling.
So just by looking at what tpm2 does, I'm not 100% convinced that tpm1 should do this dance at all.
But having a dev_err() is probably a good idea at least as a transitional thing.
Linus
I think it's fine to go ahead with your change, for multiple reasons.
1. I doubt that any of the ChromeOS devices using TPM 1.2 are still being updated. 2. If the SAVESTATE command fails, it is probably better to continue the transition to S3, and fail at resume, than to block the suspend. The suspend is often triggered by closing the lid, so users would not see what's going on and might put their running laptop in a backpack, where it could overheat. 3. I don't recall bugs due to failures of TPM suspend, and I didn't find any such bug in our database. Many (most?) ChromeOS devices left the TPM powered on in S3, so didn't use the suspend/resume path.
Thank you for asking!
On Fri, Jan 6, 2023 at 11:00 AM Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Jan 5, 2023 at 7:02 PM Jason A. Donenfeld Jason@zx2c4.com wrote:
In lieu of actually fixing the underlying bug, just allow system suspend to continue, so that laptops still go to sleep fine. Later, this can be reverted when the real bug is fixed.
So the patch looks fine to me, but since there's still the ChromeOS discussion pending I'll wait for that to finish.
Perhaps re-send or at least remind me if/when it does?
Also, a query about the printout:
if (rc)
pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
chip->dev_num, rc);
so I suspect that 99% of the time the dev_num isn't actually all that useful, but what *might* be useful is which tpm driver it is.
Just comparing the error dmesg output you had:
.. tpm tpm0: Error (28) sending savestate before suspend tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28 ..
that "tpm tpm0" output is kind of useless compared to the "tpm_tis 00:08" one.
So I think "dev_err(dev, ...)" would be more useful here.
Finally - and maybe I'm just being difficult here, I will note here again that TPM2 devices don't have this issue, because the TPM2 path for suspend doesn't do any of this at all.
It just does
tpm_transmit_cmd(..);
with a TPM2_CC_SHUTDOWN TPM_SU_STATE command, and doesn't even check the return value. In fact, the tpm2 code *used* to have this comment:
/* In places where shutdown command is sent there's no much we can do * except print the error code on a system failure. */ if (rc < 0 && rc != -EPIPE) dev_warn(&chip->dev, "transmit returned %d while
stopping the TPM", rc);
but it was summarily removed when doing some re-organization around buffer handling.
So just by looking at what tpm2 does, I'm not 100% convinced that tpm1 should do this dance at all.
But having a dev_err() is probably a good idea at least as a transitional thing.
Linus
On Fri, Jan 6, 2023 at 12:04 PM Luigi Semenzato semenzato@chromium.org wrote:
I think it's fine to go ahead with your change, for multiple reasons.
Ok, I've applied the patch (although I did end up editing it to use dev_err() before doing that just to make myself happier about the printout).
Linus
On Fri, Jan 06, 2023 at 02:28:00PM -0800, Linus Torvalds wrote:
On Fri, Jan 6, 2023 at 12:04 PM Luigi Semenzato semenzato@chromium.org wrote:
I think it's fine to go ahead with your change, for multiple reasons.
Ok, I've applied the patch (although I did end up editing it to use dev_err() before doing that just to make myself happier about the printout).
Thanks for fixing it up to use dev_err(). Final commit looks good to me.
Jason
On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
TPM 1 is sometimes broken across system suspends, due to races or locking issues or something else that haven't been diagnosed or fixed yet, most likely having to do with concurrent reads from the TPM's hardware random number generator driver. These issues prevent the system from actually suspending, with errors like:
tpm tpm0: A TPM error (28) occurred continue selftest ... tpm tpm0: A TPM error (28) occurred attempting get random ... tpm tpm0: Error (28) sending savestate before suspend tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28 tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28 tpm_tis 00:08: PM: failed to suspend: error 28 PM: Some devices failed to suspend, or early wake event detected
This issue was partially fixed by 23393c646142 ("char: tpm: Protect tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took directly because the TPM maintainers weren't available. However, it seems like this just addresses the most common cases of the bug, rather than addressing it entirely. So there are more things to fix still, apparently.
In lieu of actually fixing the underlying bug, just allow system suspend to continue, so that laptops still go to sleep fine. Later, this can be reverted when the real bug is fixed.
Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/ Cc: stable@vger.kernel.org # 6.1+ Reported-by: Vlastimil Babka vbabka@suse.cz Suggested-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
This is basically untested and I haven't worked out if there are any awful implications of letting the system sleep when TPM suspend fails. Maybe some PCRs get cleared and that will make everything explode on resume? Maybe it doesn't matter? Somebody well versed in TPMology should probably [n]ack this approach.
drivers/char/tpm/tpm-interface.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c index d69905233aff..6df9067ef7f9 100644 --- a/drivers/char/tpm/tpm-interface.c +++ b/drivers/char/tpm/tpm-interface.c @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev) } suspended:
- return rc;
- if (rc)
pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
chip->dev_num, rc);
- return 0;
} EXPORT_SYMBOL_GPL(tpm_pm_suspend); -- 2.39.0
Let me read all the threads through starting from the original report. I've had emails piling up because of getting sick before holiday, and holiday season after that.
This looks sane
Apologies for the lack of response.
BR, Jarkko
Hi Jarkko,
On Mon, Jan 16, 2023 at 9:12 AM Jarkko Sakkinen jarkko@kernel.org wrote:
diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c index d69905233aff..6df9067ef7f9 100644 --- a/drivers/char/tpm/tpm-interface.c +++ b/drivers/char/tpm/tpm-interface.c @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev) }
suspended:
return rc;
if (rc)
pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
chip->dev_num, rc);
return 0;
} EXPORT_SYMBOL_GPL(tpm_pm_suspend);
-- 2.39.0
Let me read all the threads through starting from the original report. I've had emails piling up because of getting sick before holiday, and holiday season after that.
This looks sane
No, not really. I mean, it was sane under the circumstances of, "I'm not going to spend time fixing this for real if the maintainers aren't around," and it fixed the suspend issue. But it doesn't actually fix any real tpm issue. The real issue, AFAICT, is there's some sort of race between the tpm rng read command and either suspend or wakeup or selftest. One of these is missing some locking. And then commands step on each other and the tpm gets upset. This is probably something that should be fixed. I assume the "Fixes: ..." tag will actually go quite far back, with recent things only unearthing a somewhat old bug. But just a hunch.
Jason
On Mon, Jan 16, 2023 at 03:03:17PM +0100, Jason A. Donenfeld wrote:
Hi Jarkko,
On Mon, Jan 16, 2023 at 9:12 AM Jarkko Sakkinen jarkko@kernel.org wrote:
diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c index d69905233aff..6df9067ef7f9 100644 --- a/drivers/char/tpm/tpm-interface.c +++ b/drivers/char/tpm/tpm-interface.c @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev) }
suspended:
return rc;
if (rc)
pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
chip->dev_num, rc);
return 0;
} EXPORT_SYMBOL_GPL(tpm_pm_suspend);
-- 2.39.0
Let me read all the threads through starting from the original report. I've had emails piling up because of getting sick before holiday, and holiday season after that.
This looks sane
No, not really. I mean, it was sane under the circumstances of, "I'm not going to spend time fixing this for real if the maintainers aren't around," and it fixed the suspend issue. But it doesn't actually fix any real tpm issue. The real issue, AFAICT, is there's some sort of race between the tpm rng read command and either suspend or wakeup or selftest. One of these is missing some locking. And then commands step on each other and the tpm gets upset. This is probably something that should be fixed. I assume the "Fixes: ..." tag will actually go quite far back, with recent things only unearthing a somewhat old bug. But just a hunch.
Jason
See my response to Vlastimil:
https://lore.kernel.org/linux-integrity/Y8sr7YJ8e8eSpPFv@kernel.org/
Can you try what happens if you do not call tpm_add_hwrng()?
BR, Jarkko
On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
TPM 1 is sometimes broken across system suspends, due to races or locking issues or something else that haven't been diagnosed or fixed yet, most likely having to do with concurrent reads from the TPM's hardware random number generator driver. These issues prevent the system from actually suspending, with errors like:
tpm tpm0: A TPM error (28) occurred continue selftest ...
<REMOVE>
tpm tpm0: A TPM error (28) occurred attempting get random ... tpm tpm0: Error (28) sending savestate before suspend tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28 tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28 tpm_tis 00:08: PM: failed to suspend: error 28 PM: Some devices failed to suspend, or early wake event detected
</REMOVE>
Unrelated to thix particular fix.
This issue was partially fixed by 23393c646142 ("char: tpm: Protect tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took directly because the TPM maintainers weren't available. However, it seems like this just addresses the most common cases of the bug, rather than addressing it entirely. So there are more things to fix still, apparently.
In lieu of actually fixing the underlying bug, just allow system suspend to continue, so that laptops still go to sleep fine. Later, this can be reverted when the real bug is fixed.
Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/ Cc: stable@vger.kernel.org # 6.1+ Reported-by: Vlastimil Babka vbabka@suse.cz Suggested-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
This is basically untested and I haven't worked out if there are any awful implications of letting the system sleep when TPM suspend fails. Maybe some PCRs get cleared and that will make everything explode on resume? Maybe it doesn't matter? Somebody well versed in TPMology should probably [n]ack this approach.
drivers/char/tpm/tpm-interface.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c index d69905233aff..6df9067ef7f9 100644 --- a/drivers/char/tpm/tpm-interface.c +++ b/drivers/char/tpm/tpm-interface.c @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev) } suspended:
- return rc;
- if (rc)
pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
chip->dev_num, rc);
- return 0;
} EXPORT_SYMBOL_GPL(tpm_pm_suspend); -- 2.39.0
This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and the call is located in tpm_tis_resume() [*].
[*] https://lore.kernel.org/lkml/Y8U1QxA4GYvPWDky@kernel.org/
BR, Jarkko
On 1/16/23 12:44, Jarkko Sakkinen wrote:
On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
TPM 1 is sometimes broken across system suspends, due to races or locking issues or something else that haven't been diagnosed or fixed yet, most likely having to do with concurrent reads from the TPM's hardware random number generator driver. These issues prevent the system from actually suspending, with errors like:
tpm tpm0: A TPM error (28) occurred continue selftest ...
<REMOVE>
tpm tpm0: A TPM error (28) occurred attempting get random ... tpm tpm0: Error (28) sending savestate before suspend tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28 tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28 tpm_tis 00:08: PM: failed to suspend: error 28 PM: Some devices failed to suspend, or early wake event detected
</REMOVE>
Unrelated to thix particular fix.
Not sure I understand. AFAIK this is not a proper fix, but a workaround for when laptop suspend no longer works because TPM fails to suspend. The error messages quoted above are very much related to the problem of suspend not working, and this patch did work as advertised at least for me. I see errors but they don't prevent suspend anymore:
https://lore.kernel.org/all/58d7a42c-9e6b-ab2a-617f-d5e373bf63cb@suse.cz/
This issue was partially fixed by 23393c646142 ("char: tpm: Protect tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took directly because the TPM maintainers weren't available. However, it seems like this just addresses the most common cases of the bug, rather than addressing it entirely. So there are more things to fix still, apparently.
In lieu of actually fixing the underlying bug, just allow system suspend to continue, so that laptops still go to sleep fine. Later, this can be reverted when the real bug is fixed.
Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/ Cc: stable@vger.kernel.org # 6.1+ Reported-by: Vlastimil Babka vbabka@suse.cz Suggested-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
This is basically untested and I haven't worked out if there are any awful implications of letting the system sleep when TPM suspend fails. Maybe some PCRs get cleared and that will make everything explode on resume? Maybe it doesn't matter? Somebody well versed in TPMology should probably [n]ack this approach.
drivers/char/tpm/tpm-interface.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c index d69905233aff..6df9067ef7f9 100644 --- a/drivers/char/tpm/tpm-interface.c +++ b/drivers/char/tpm/tpm-interface.c @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev) } suspended:
- return rc;
- if (rc)
pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
chip->dev_num, rc);
- return 0;
} EXPORT_SYMBOL_GPL(tpm_pm_suspend); -- 2.39.0
This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and the call is located in tpm_tis_resume() [*].
[*] https://lore.kernel.org/lkml/Y8U1QxA4GYvPWDky@kernel.org/
Yes the changelog at the top does say "due to races or locking issues or something else that haven't been diagnosed or fixed yet"
I don't know what causes the TPM to start returning error 28 on resume and never recover from it. But it didn't happen before hwrng started using the TPM. Before that, it was probably just the selftest ever doing anything with the TPM, and on its own I don't recall it ever (before 6.1) failing and preventing further suspend/resume.
BR, Jarkko
On Mon, Jan 16, 2023 at 03:00:03PM +0100, Vlastimil Babka wrote:
On 1/16/23 12:44, Jarkko Sakkinen wrote:
On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
TPM 1 is sometimes broken across system suspends, due to races or locking issues or something else that haven't been diagnosed or fixed yet, most likely having to do with concurrent reads from the TPM's hardware random number generator driver. These issues prevent the system from actually suspending, with errors like:
tpm tpm0: A TPM error (28) occurred continue selftest ...
<REMOVE>
tpm tpm0: A TPM error (28) occurred attempting get random ... tpm tpm0: Error (28) sending savestate before suspend tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28 tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28 tpm_tis 00:08: PM: failed to suspend: error 28 PM: Some devices failed to suspend, or early wake event detected
</REMOVE>
Unrelated to thix particular fix.
Not sure I understand. AFAIK this is not a proper fix, but a workaround for when laptop suspend no longer works because TPM fails to suspend. The error messages quoted above are very much related to the problem of suspend not working, and this patch did work as advertised at least for me. I see errors but they don't prevent suspend anymore:
https://lore.kernel.org/all/58d7a42c-9e6b-ab2a-617f-d5e373bf63cb@suse.cz/
This issue was partially fixed by 23393c646142 ("char: tpm: Protect tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took directly because the TPM maintainers weren't available. However, it seems like this just addresses the most common cases of the bug, rather than addressing it entirely. So there are more things to fix still, apparently.
In lieu of actually fixing the underlying bug, just allow system suspend to continue, so that laptops still go to sleep fine. Later, this can be reverted when the real bug is fixed.
Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/ Cc: stable@vger.kernel.org # 6.1+ Reported-by: Vlastimil Babka vbabka@suse.cz Suggested-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
This is basically untested and I haven't worked out if there are any awful implications of letting the system sleep when TPM suspend fails. Maybe some PCRs get cleared and that will make everything explode on resume? Maybe it doesn't matter? Somebody well versed in TPMology should probably [n]ack this approach.
drivers/char/tpm/tpm-interface.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c index d69905233aff..6df9067ef7f9 100644 --- a/drivers/char/tpm/tpm-interface.c +++ b/drivers/char/tpm/tpm-interface.c @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev) } suspended:
- return rc;
- if (rc)
pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
chip->dev_num, rc);
- return 0;
} EXPORT_SYMBOL_GPL(tpm_pm_suspend); -- 2.39.0
This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and the call is located in tpm_tis_resume() [*].
[*] https://lore.kernel.org/lkml/Y8U1QxA4GYvPWDky@kernel.org/
Yes the changelog at the top does say "due to races or locking issues or something else that haven't been diagnosed or fixed yet"
I don't know what causes the TPM to start returning error 28 on resume and never recover from it. But it didn't happen before hwrng started using the TPM. Before that, it was probably just the selftest ever doing anything with the TPM, and on its own I don't recall it ever (before 6.1) failing and preventing further suspend/resume.
Would it be possible to test this theory by commenting out tpm_add_hwrng() call from tpm_chip_register()?
Since they are called sequentially any sort of concurrency issue can be probably ruled out.
One thing that I noticed is that probably it would be more safe-play to move tpm_add_hwrng() call after creating the character device, as there's no need to do it before anything else.
BR, Jarkko
linux-stable-mirror@lists.linaro.org