The arm beagleboard x15 devices failed to boot Linux mainline and linux-next kernels due to the error below. This error occurred across all x15 devices for these kernel versions.
This regression started happening on x15 from this commit onwards (27th Jan):
git branch: master
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git commit: aae1464f46a2403565f75717438118691d31ccf1
git describe: v5.5-489-gaae1464f46a2
Test output log,
[ 37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 37.611850] mmc1: cache flush error -110
[ 37.615883] blk_update_request: I/O error, dev mmcblk1, sector 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 37.627387] Aborting journal on device mmcblk1p9-8.
[ 37.635448] systemd[1]: Installed transient /etc/machine-id file.
[ 37.659283] systemd[1]: Couldn't move remaining userspace processes, ignoring: Input/output error
[ 37.744027] EXT4-fs error (device mmcblk1p9): ext4_journal_check_start:61: Detected aborted journal
[ 37.753322] EXT4-fs (mmcblk1p9): Remounting filesystem read-only
[ 37.917486] systemd-gpt-auto-generator[108]: Failed to dissect: Input/output error
[ 37.927825] systemd[104]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
<>
[ 68.856307] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 68.861838] mmc1: cache flush error -110
[ 68.865812] blk_update_request: I/O error, dev mmcblk1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
<>
[ 98.906243] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 98.911774] mmc1: cache flush error -110
[ 98.915747] blk_update_request: I/O error, dev mmcblk1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
<>
Dependency failed for Serial Getty on ttyS2.
[ 128.946258] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 128.951786] mmc1: cache flush error -110
[ 128.955756] blk_update_request: I/O error, dev mmcblk1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[FAILED] Failed to start File System Check on Root Device.
See 'systemctl status systemd-fsck-root.service' for details.
[ OK ] Started Apply Kernel Variables.
[ OK ] Reached target Login Prompts.
Starting Remount Root and Kernel File Systems...
[ OK ] Reached target Timers.
[ OK ] Closed Syslog Socket.
[ OK ] Started Emergency Shell.
[ 129.227328] EXT4-fs error (device mmcblk1p9): ext4_remount:5354: Abort forced by user
[ OK ] Reached target Emergency Mode.
[ OK ] Reached target Sockets.
[FAILED] Failed to start Remount Root and Kernel File Systems.
<>
You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, "systemctl default" or "exit" to boot into default mode.
Press Enter for maintenance
auto-login-action timed out after 874 seconds
ref:
https://lkft.validation.linaro.org/scheduler/job/1137693#L4034
https://lkft.validation.linaro.org/scheduler/job/1158106#L4048
https://lkft.validation.linaro.org/scheduler/job/1137690#L3985
https://lkft.validation.linaro.org/scheduler/job/1137691#L4012
https://lkft.validation.linaro.org/scheduler/job/1137696#L4043
https://lkft.validation.linaro.org/scheduler/job/1137699#L4153
On Thu, Feb 13, 2020 at 4:43 PM Naresh Kamboju naresh.kamboju@linaro.org wrote:
The arm beagleboard x15 devices failed to boot Linux mainline and linux-next kernels due to the error below. This error occurred across all x15 devices for these kernel versions.
This regression started happening on x15 from this commit onwards (27th Jan):
git branch: master
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git commit: aae1464f46a2403565f75717438118691d31ccf1
git describe: v5.5-489-gaae1464f46a2
Is it only the merge that introduced the issue, or is the branch that got merged already broken?
If it's easy for you to reproduce, please run the same test on commit e4e4c2ff78ed from Mark's regulator tree to narrow it down further.
Added Mark to Cc as well, in case it is indeed one of those.
Arnd
8<---
Test output log,
[ 37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 37.611850] mmc1: cache flush error -110
[ 37.615883] blk_update_request: I/O error, dev mmcblk1, sector 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 37.627387] Aborting journal on device mmcblk1p9-8.
[ 37.635448] systemd[1]: Installed transient /etc/machine-id file.
[ 37.659283] systemd[1]: Couldn't move remaining userspace processes, ignoring: Input/output error
[ 37.744027] EXT4-fs error (device mmcblk1p9): ext4_journal_check_start:61: Detected aborted journal
[ 37.753322] EXT4-fs (mmcblk1p9): Remounting filesystem read-only
[ 37.917486] systemd-gpt-auto-generator[108]: Failed to dissect: Input/output error
[ 37.927825] systemd[104]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
<>
[ 68.856307] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 68.861838] mmc1: cache flush error -110
[ 68.865812] blk_update_request: I/O error, dev mmcblk1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
<>
[ 98.906243] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 98.911774] mmc1: cache flush error -110
[ 98.915747] blk_update_request: I/O error, dev mmcblk1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
<>
Dependency failed for Serial Getty on ttyS2.
[ 128.946258] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 128.951786] mmc1: cache flush error -110
[ 128.955756] blk_update_request: I/O error, dev mmcblk1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[FAILED] Failed to start File System Check on Root Device.
See 'systemctl status systemd-fsck-root.service' for details.
[ OK ] Started Apply Kernel Variables.
[ OK ] Reached target Login Prompts.
Starting Remount Root and Kernel File Systems...
[ OK ] Reached target Timers.
[ OK ] Closed Syslog Socket.
[ OK ] Started Emergency Shell.
[ 129.227328] EXT4-fs error (device mmcblk1p9): ext4_remount:5354: Abort forced by user
[ OK ] Reached target Emergency Mode.
[ OK ] Reached target Sockets.
[FAILED] Failed to start Remount Root and Kernel File Systems.
<>
You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, "systemctl default" or "exit" to boot into default mode.
Press Enter for maintenance
auto-login-action timed out after 874 seconds
ref:
https://lkft.validation.linaro.org/scheduler/job/1137693#L4034
https://lkft.validation.linaro.org/scheduler/job/1158106#L4048
https://lkft.validation.linaro.org/scheduler/job/1137690#L3985
https://lkft.validation.linaro.org/scheduler/job/1137691#L4012
https://lkft.validation.linaro.org/scheduler/job/1137696#L4043
https://lkft.validation.linaro.org/scheduler/job/1137699#L4153
On Fri, Feb 14, 2020 at 10:09:34AM +0100, Arnd Bergmann wrote:
On Thu, Feb 13, 2020 at 4:43 PM Naresh Kamboju
The arm beagleboard x15 devices failed to boot Linux mainline and linux-next kernels due to the error below. This error occurred across all x15 devices for these kernel versions.
This regression started happening on x15 from this commit onwards (27th Jan):
git branch: master
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git commit: aae1464f46a2403565f75717438118691d31ccf1
git describe: v5.5-489-gaae1464f46a2
That's Linus' commit:
Merge tag 'regulator-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Please include human readable descriptions of things like commits and issues being discussed in your mails; this makes them much easier for humans to read, especially when they have no internet access. I frequently catch up on my mail on flights or while otherwise travelling, so this is even more pressing for me than just making things a bit easier to read.
Is it only the merge that introduced the issue, or is the branch that got merged already broken?
Or is this just the first commit where the problem happened to be noticed? Commits done just before that on the same day include the dmaengine and MMC merges which seem potentially relevant:
[ 37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 37.611850] mmc1: cache flush error -110
[ 37.615883] blk_update_request: I/O error, dev mmcblk1, sector 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju naresh.kamboju@linaro.org wrote:
The arm beagleboard x15 devices failed to boot Linux mainline and linux-next kernels due to the error below. This error occurred across all x15 devices for these kernel versions.
This regression started happening on x15 from this commit onwards (27th Jan):
git branch: master
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git commit: aae1464f46a2403565f75717438118691d31ccf1
git describe: v5.5-489-gaae1464f46a2
Test output log,
[ 37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 37.611850] mmc1: cache flush error -110
[ 37.615883] blk_update_request: I/O error, dev mmcblk1, sector 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 37.627387] Aborting journal on device mmcblk1p9-8.
[ 37.635448] systemd[1]: Installed transient /etc/machine-id file.
[ 37.659283] systemd[1]: Couldn't move remaining userspace processes, ignoring: Input/output error
[ 37.744027] EXT4-fs error (device mmcblk1p9): ext4_journal_check_start:61: Detected aborted journal
[ 37.753322] EXT4-fs (mmcblk1p9): Remounting filesystem read-only
[ 37.917486] systemd-gpt-auto-generator[108]: Failed to dissect: Input/output error
[ 37.927825] systemd[104]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
<>
Try to restore the value for the cache flush timeout by updating the define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000.
The offending commit could perhaps be this one.
commit 24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9
Author: Ulf Hansson ulf.hansson@linaro.org
Date:   Wed Jan 22 15:27:45 2020 +0100

    mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
[...]
Kind regards Uffe
On Wed, 19 Feb 2020 at 21:54, Ulf Hansson ulf.hansson@linaro.org wrote:
On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju naresh.kamboju@linaro.org wrote:
Try to restore the value for the cache flush timeout by updating the define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000.
I have increased the timeout to 10 minutes but it did not help; the same error is found.
[ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 608.684964] mmc1: cache flush error -110
[ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
OTOH, the best I could do for my own experiment was to revert all three patches; now the reported error is gone and the device mounts successfully [1].
List of patches reverted:
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
mmc: block: Use generic_cmd6_time when modifying INAND_CMD38_ARG_EXT_CSD
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
[1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
- Naresh
On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Wed, 19 Feb 2020 at 21:54, Ulf Hansson ulf.hansson@linaro.org wrote:
On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju naresh.kamboju@linaro.org wrote:
Try to restore the value for the cache flush timeout by updating the define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000.
I have increased the timeout to 10 minutes but it did not help; the same error is found.
[ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 608.684964] mmc1: cache flush error -110
[ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
OTOH, the best I could do for my own experiment was to revert all three patches; now the reported error is gone and the device mounts successfully [1].
List of patches reverted:
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
mmc: block: Use generic_cmd6_time when modifying INAND_CMD38_ARG_EXT_CSD
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
[1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
- Naresh
Thanks for testing!
This sounds a bit weird, I must say. Also, while looking into the logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but maybe I didn't read the logs carefully enough.
In any case, I am looking into creating a debug patch so we can narrow down the problem a bit further.
Kind regards Uffe
On 2/21/20 1:48 AM, Ulf Hansson wrote:
External email: Use caution opening links or attachments
On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Wed, 19 Feb 2020 at 21:54, Ulf Hansson ulf.hansson@linaro.org wrote:
On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju naresh.kamboju@linaro.org wrote:
Try to restore the value for the cache flush timeout by updating the define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000.
I have increased the timeout to 10 minutes but it did not help; the same error is found.
[ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 608.684964] mmc1: cache flush error -110
[ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
OTOH, the best I could do for my own experiment was to revert all three patches; now the reported error is gone and the device mounts successfully [1].
List of patches reverted:
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
mmc: block: Use generic_cmd6_time when modifying INAND_CMD38_ARG_EXT_CSD
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
[1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
- Naresh
Thanks for testing!
This sounds a bit weird, I must say. Also, while looking into the logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but maybe I didn't read the logs carefully enough.
In any case, I am looking into creating a debug patch so we can narrow down the problem a bit further.
Kind regards Uffe
Hi Ulf,
I see that the Jetson-TX2 / Jetson-Xavier suspend test is aborted and the error below is seen due to commit 24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9 ("mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC"):
##
[ 268.976197] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 268.984414] OOM killer disabled.
[ 268.987635] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
[ 269.217471] PM: dpm_run_callback(): mmc_bus_suspend+0x0/0x58 returns -110
[ 269.224255] PM: Device mmc1:0001 failed to suspend: error -110
[ 269.230080] PM: Some devices failed to suspend, or early wake event detected
##
I find that, from the commit, the changes in mmc_flush_cache below are the cause.
##
@@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
             (card->ext_csd.cache_size > 0) &&
             (card->ext_csd.cache_ctrl & 1)) {
                 err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
-                                EXT_CSD_FLUSH_CACHE, 1, 0);
+                                EXT_CSD_FLUSH_CACHE, 1,
+                                MMC_CACHE_FLUSH_TIMEOUT_MS);
##
Do you have a suggestion to try for the suspend errors?
-regards, Bitan
+ Adrian
On Fri, 21 Feb 2020 at 20:44, Bitan Biswas bbiswas@nvidia.com wrote:
On 2/21/20 1:48 AM, Ulf Hansson wrote:
On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Wed, 19 Feb 2020 at 21:54, Ulf Hansson ulf.hansson@linaro.org wrote:
On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju naresh.kamboju@linaro.org wrote:
Try to restore the value for the cache flush timeout by updating the define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000.
I have increased the timeout to 10 minutes but it did not help; the same error is found.
[ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 608.684964] mmc1: cache flush error -110
[ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
OTOH, the best I could do for my own experiment was to revert all three patches; now the reported error is gone and the device mounts successfully [1].
List of patches reverted:
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
mmc: block: Use generic_cmd6_time when modifying INAND_CMD38_ARG_EXT_CSD
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
[1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
- Naresh
Thanks for testing!
This sounds a bit weird, I must say. Also, while looking into the logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but maybe I didn't read the logs carefully enough.
In any case, I am looking into creating a debug patch so we can narrow down the problem a bit further.
Kind regards Uffe
Hi Ulf,
I see that the Jetson-TX2 / Jetson-Xavier suspend test is aborted and the error below is seen due to commit 24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9 ("mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC"):
##
[ 268.976197] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 268.984414] OOM killer disabled.
[ 268.987635] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
[ 269.217471] PM: dpm_run_callback(): mmc_bus_suspend+0x0/0x58 returns -110
[ 269.224255] PM: Device mmc1:0001 failed to suspend: error -110
[ 269.230080] PM: Some devices failed to suspend, or early wake event detected
Is there also a print in the log about "Card stuck being busy!"?
In any case, it seems like the timeout error (-110) is happening way too soon. The cache flush timeout is now 30s, but the timeout seems to fire only a few hundred ms (at most) after the cache flush command has been sent.
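As a rough sanity check of that claim (my own back-of-the-envelope sketch, assuming the cache flush was issued no earlier than the last freeze message in the quoted Jetson log; the actual issue time is not in the log):

```python
# Timestamps taken from the quoted suspend log above.
freeze_done_s = 268.987635    # "Freezing remaining freezable tasks ... done."
suspend_error_s = 269.217471  # dpm_run_callback(): mmc_bus_suspend returns -110
cache_flush_timeout_s = 30.0  # MMC_CACHE_FLUSH_TIMEOUT_MS at this point (30s)

# Upper bound on how long the flush could have been outstanding.
elapsed_s = suspend_error_s - freeze_done_s
print(f"-110 returned after at most {elapsed_s:.3f} s "
      f"(configured timeout: {cache_flush_timeout_s:.0f} s)")

# A few hundred ms at most -- nowhere near the 30 s timeout.
assert elapsed_s < 0.5
```

So, taking the log at face value, the error surfaces roughly two orders of magnitude sooner than the configured timeout would allow.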
##
I find that, from the commit, the changes in mmc_flush_cache below are the cause.
##
@@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
             (card->ext_csd.cache_size > 0) &&
             (card->ext_csd.cache_ctrl & 1)) {
                 err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
-                                EXT_CSD_FLUSH_CACHE, 1, 0);
+                                EXT_CSD_FLUSH_CACHE, 1,
+                                MMC_CACHE_FLUSH_TIMEOUT_MS);
##
Do you have a suggestion to try for the suspend errors?
Just as a quick sanity test, please try the below patch, which restores the old cache flush timeout to 10min.
However, as I indicated above, this seems to be a problem that needs to be fixed on the host driver side. For the sdhci driver, there is a bit of tricky logic around how to deal with timeouts in sdhci_send_command(). My best guess is that's where we should look more closely (and I am doing that).
From: Ulf Hansson ulf.hansson@linaro.org
Date: Mon, 24 Feb 2020 11:43:33 +0100
Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing

Signed-off-by: Ulf Hansson ulf.hansson@linaro.org
---
 drivers/mmc/core/mmc_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index da425ee2d9bf..713e7dd6d028 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -21,7 +21,7 @@

 #define MMC_OPS_TIMEOUT_MS          (10 * 60 * 1000) /* 10min */
 #define MMC_BKOPS_TIMEOUT_MS        (120 * 1000) /* 120s */
-#define MMC_CACHE_FLUSH_TIMEOUT_MS  (30 * 1000) /* 30s */
+#define MMC_CACHE_FLUSH_TIMEOUT_MS  (10 * 60 * 1000) /* 10min */

 static const u8 tuning_blk_pattern_4bit[] = {
         0xff, 0x0f, 0xff, 0x00, 0xff, 0xcc, 0xc3, 0xcc,
On 24/02/20 1:16 pm, Ulf Hansson wrote:
- Adrian
On Fri, 21 Feb 2020 at 20:44, Bitan Biswas bbiswas@nvidia.com wrote:
On 2/21/20 1:48 AM, Ulf Hansson wrote:
On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Wed, 19 Feb 2020 at 21:54, Ulf Hansson ulf.hansson@linaro.org wrote:
On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju naresh.kamboju@linaro.org wrote:
Try to restore the value for the cache flush timeout by updating the define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000.
I have increased the timeout to 10 minutes but it did not help; the same error is found.
[ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 608.684964] mmc1: cache flush error -110
[ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
OTOH, the best I could do for my own experiment was to revert all three patches; now the reported error is gone and the device mounts successfully [1].
List of patches reverted:
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
mmc: block: Use generic_cmd6_time when modifying INAND_CMD38_ARG_EXT_CSD
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
[1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
- Naresh
Thanks for testing!
This sounds a bit weird, I must say. Also, while looking into the logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but maybe I didn't read the logs carefully enough.
In any case, I am looking into creating a debug patch so we can narrow down the problem a bit further.
Kind regards Uffe
Hi Ulf,
I see that the Jetson-TX2 / Jetson-Xavier suspend test is aborted and the error below is seen due to commit 24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9 ("mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC"):
##
[ 268.976197] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 268.984414] OOM killer disabled.
[ 268.987635] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
[ 269.217471] PM: dpm_run_callback(): mmc_bus_suspend+0x0/0x58 returns -110
[ 269.224255] PM: Device mmc1:0001 failed to suspend: error -110
[ 269.230080] PM: Some devices failed to suspend, or early wake event detected
Is there also a print in the log about "Card stuck being busy!"?
In any case, it seems like the timeout error (-110) is happening way too soon. The cache flush timeout is now 30s, but the timeout seems to fire only a few hundred ms (at most) after the cache flush command has been sent.
##
I find that, from the commit, the changes in mmc_flush_cache below are the cause.
##
@@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
             (card->ext_csd.cache_size > 0) &&
             (card->ext_csd.cache_ctrl & 1)) {
                 err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
-                                EXT_CSD_FLUSH_CACHE, 1, 0);
+                                EXT_CSD_FLUSH_CACHE, 1,
+                                MMC_CACHE_FLUSH_TIMEOUT_MS);
##
Do you have a suggestion to try for the suspend errors?
Just as a quick sanity test, please try the below patch, which restores the old cache flush timeout to 10min.
However, as I indicated above, this seems to be a problem that needs to be fixed on the host driver side. For the sdhci driver, there is a bit of tricky logic around how to deal with timeouts in sdhci_send_command(). My best guess is that's where we should look more closely (and I am doing that).
Could be, but hard to narrow down without more debug. I would enable dynamic debug and mmc messages e.g.
kernel config
CONFIG_DYNAMIC_DEBUG=y
kernel command line option
dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p"
But that might lead to a very large number of messages.
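For what it's worth, when CONFIG_DYNAMIC_DEBUG=y the same selection can also be toggled at runtime through debugfs, which avoids rebuilding the command line (a sketch; assumes debugfs is mounted at the usual /sys/kernel/debug):

```shell
# Enable all pr_debug()/dev_dbg() messages in the MMC core and host drivers.
echo 'file drivers/mmc/core/* +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file drivers/mmc/host/* +p' > /sys/kernel/debug/dynamic_debug/control

# Disable again once the failing case has been captured, to limit log volume.
echo 'file drivers/mmc/core/* -p' > /sys/kernel/debug/dynamic_debug/control
echo 'file drivers/mmc/host/* -p' > /sys/kernel/debug/dynamic_debug/control
```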
From: Ulf Hansson ulf.hansson@linaro.org
Date: Mon, 24 Feb 2020 11:43:33 +0100
Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing

Signed-off-by: Ulf Hansson ulf.hansson@linaro.org
---
 drivers/mmc/core/mmc_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index da425ee2d9bf..713e7dd6d028 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -21,7 +21,7 @@

 #define MMC_OPS_TIMEOUT_MS          (10 * 60 * 1000) /* 10min */
 #define MMC_BKOPS_TIMEOUT_MS        (120 * 1000) /* 120s */
-#define MMC_CACHE_FLUSH_TIMEOUT_MS  (30 * 1000) /* 30s */
+#define MMC_CACHE_FLUSH_TIMEOUT_MS  (10 * 60 * 1000) /* 10min */

 static const u8 tuning_blk_pattern_4bit[] = {
         0xff, 0x0f, 0xff, 0x00, 0xff, 0xcc, 0xc3, 0xcc,
On 24/02/2020 11:16, Ulf Hansson wrote:
- Adrian
On Fri, 21 Feb 2020 at 20:44, Bitan Biswas bbiswas@nvidia.com wrote:
On 2/21/20 1:48 AM, Ulf Hansson wrote:
On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Wed, 19 Feb 2020 at 21:54, Ulf Hansson ulf.hansson@linaro.org wrote:
On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju naresh.kamboju@linaro.org wrote:
Try to restore the value for the cache flush timeout by updating the define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000.
I have increased the timeout to 10 minutes but it did not help; the same error is found.
[ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 608.684964] mmc1: cache flush error -110
[ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
OTOH, the best I could do for my own experiment was to revert all three patches; now the reported error is gone and the device mounts successfully [1].
List of patches reverted:
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
mmc: block: Use generic_cmd6_time when modifying INAND_CMD38_ARG_EXT_CSD
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
Reverting all the above also fixes the problem for me.
I find that, from the commit, the changes in mmc_flush_cache below are the cause.
##
@@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
             (card->ext_csd.cache_size > 0) &&
             (card->ext_csd.cache_ctrl & 1)) {
                 err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
-                                EXT_CSD_FLUSH_CACHE, 1, 0);
+                                EXT_CSD_FLUSH_CACHE, 1,
+                                MMC_CACHE_FLUSH_TIMEOUT_MS);
I no longer see the issue after reverting the above hunk as Bitan suggested, but now I see the following (which is expected) ...
WARNING KERN mmc1: unspecified timeout for CMD6 - use generic
Just as a quick sanity test, please try the below patch, which restores the old cache flush timeout to 10min.
However, as I indicated above, this seems to be a problem that needs to be fixed on the host driver side. For the sdhci driver, there is a bit of tricky logic around how to deal with timeouts in sdhci_send_command(). My best guess is that's where we should look more closely (and I am doing that).
From: Ulf Hansson ulf.hansson@linaro.org
Date: Mon, 24 Feb 2020 11:43:33 +0100
Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing

Signed-off-by: Ulf Hansson ulf.hansson@linaro.org
---
 drivers/mmc/core/mmc_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index da425ee2d9bf..713e7dd6d028 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -21,7 +21,7 @@

 #define MMC_OPS_TIMEOUT_MS          (10 * 60 * 1000) /* 10min */
 #define MMC_BKOPS_TIMEOUT_MS        (120 * 1000) /* 120s */
-#define MMC_CACHE_FLUSH_TIMEOUT_MS  (30 * 1000) /* 30s */
+#define MMC_CACHE_FLUSH_TIMEOUT_MS  (10 * 60 * 1000) /* 10min */
This does not fix the problem for me.
Jon
On Tue, 25 Feb 2020 at 11:04, Jon Hunter jonathanh@nvidia.com wrote:
On 24/02/2020 11:16, Ulf Hansson wrote:
- Adrian
On Fri, 21 Feb 2020 at 20:44, Bitan Biswas bbiswas@nvidia.com wrote:
On 2/21/20 1:48 AM, Ulf Hansson wrote:
On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Wed, 19 Feb 2020 at 21:54, Ulf Hansson ulf.hansson@linaro.org wrote:
On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju naresh.kamboju@linaro.org wrote:
Try to restore the value for the cache flush timeout by updating the define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000.
I have increased the timeout to 10 minutes but it did not help; the same error is found.
[ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 608.684964] mmc1: cache flush error -110
[ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
OTOH, the best I could do for my own experiment was to revert all three patches; now the reported error is gone and the device mounts successfully [1].
List of patches reverted:
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
mmc: block: Use generic_cmd6_time when modifying INAND_CMD38_ARG_EXT_CSD
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
Reverting all the above also fixes the problem for me.
I find that, from the commit, the changes in mmc_flush_cache below are the cause.
##
@@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
             (card->ext_csd.cache_size > 0) &&
             (card->ext_csd.cache_ctrl & 1)) {
                 err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
-                                EXT_CSD_FLUSH_CACHE, 1, 0);
+                                EXT_CSD_FLUSH_CACHE, 1,
+                                MMC_CACHE_FLUSH_TIMEOUT_MS);
I no longer see the issue after reverting the above hunk as Bitan suggested, but now I see the following (which is expected) ...
WARNING KERN mmc1: unspecified timeout for CMD6 - use generic
Just as a quick sanity test, please try the below patch, which restores the old cache flush timeout to 10min.
However, as I indicated above, this seems to be a problem that needs to be fixed on the host driver side. For the sdhci driver, there is a bit of tricky logic around how to deal with timeouts in sdhci_send_command(). My best guess is that's where we should look more closely (and I am doing that).
From: Ulf Hansson ulf.hansson@linaro.org
Date: Mon, 24 Feb 2020 11:43:33 +0100
Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing

Signed-off-by: Ulf Hansson ulf.hansson@linaro.org
---
 drivers/mmc/core/mmc_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index da425ee2d9bf..713e7dd6d028 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -21,7 +21,7 @@

 #define MMC_OPS_TIMEOUT_MS          (10 * 60 * 1000) /* 10min */
 #define MMC_BKOPS_TIMEOUT_MS        (120 * 1000) /* 120s */
-#define MMC_CACHE_FLUSH_TIMEOUT_MS  (30 * 1000) /* 30s */
+#define MMC_CACHE_FLUSH_TIMEOUT_MS  (10 * 60 * 1000) /* 10min */
This does not fix the problem for me.
Thanks for testing and the report - the results are not surprising.
I am getting back with an update asap, just want to complete my tests over here before I say more.
Kind regards Uffe
On 25/02/2020 10:04, Jon Hunter wrote:
...
I find that, from the commit, the changes in mmc_flush_cache below are the cause.
##
@@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
             (card->ext_csd.cache_size > 0) &&
             (card->ext_csd.cache_ctrl & 1)) {
                 err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
-                                EXT_CSD_FLUSH_CACHE, 1, 0);
+                                EXT_CSD_FLUSH_CACHE, 1,
+                                MMC_CACHE_FLUSH_TIMEOUT_MS);
I no longer see the issue after reverting the above hunk as Bitan suggested, but now I see the following (which is expected) ...
WARNING KERN mmc1: unspecified timeout for CMD6 - use generic
For Tegra, the default timeout used when no timeout is specified for CMD6 is 100 ms. So hard-coding the following also appears to work around the problem on Tegra ...
diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 868653bc1555..5155e0240fca 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -992,7 +992,7 @@ int mmc_flush_cache(struct mmc_card *card)
             (card->ext_csd.cache_size > 0) &&
             (card->ext_csd.cache_ctrl & 1)) {
                 err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
-                                EXT_CSD_FLUSH_CACHE, 1, 0);
+                                EXT_CSD_FLUSH_CACHE, 1, 100);
                 if (err)
                         pr_err("%s: cache flush error %d\n",
                                mmc_hostname(card->host), err);
So the problem appears to be caused by the timeout being too long rather than not long enough.
Looking more at the code, I think now that we are hitting the condition ...
diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 868653bc1555..feae82b1ff35 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -579,8 +579,10 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
          * the host to avoid HW busy detection, by converting to a R1 response
          * instead of a R1B.
          */
-        if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout))
+        if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout)) {
+                pr_warn("%s: timeout (%d) > max busy timeout (%d)", mmc_hostname(host), timeout_ms, host->max_busy_timeout);
                 use_r1b_resp = false;
+        }
With the above I see ...
WARNING KERN mmc1: timeout (1600) > max busy timeout (672)
So with the longer timeout we are no longer requesting an R1B response.
Cheers Jon
+ Faiz Abbas
On Tue, 25 Feb 2020 at 12:41, Jon Hunter jonathanh@nvidia.com wrote:
On 25/02/2020 10:04, Jon Hunter wrote:
...
I find that, from that commit, the change in mmc_flush_cache() below is the cause.
@@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
 	    (card->ext_csd.cache_size > 0) &&
 	    (card->ext_csd.cache_ctrl & 1)) {
 		err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
-				EXT_CSD_FLUSH_CACHE, 1, 0);
+				EXT_CSD_FLUSH_CACHE, 1,
+				MMC_CACHE_FLUSH_TIMEOUT_MS);
Reverting the above hunk, as Bitan suggested, makes the issue go away, but then I see the following (which is expected) ...
WARNING KERN mmc1: unspecified timeout for CMD6 - use generic
For Tegra, the default timeout used when no timeout is specified for CMD6 is 100 ms. So hard-coding the following also appears to work around the problem on Tegra ...
Interesting.
diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 868653bc1555..5155e0240fca 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -992,7 +992,7 @@ int mmc_flush_cache(struct mmc_card *card)
 	    (card->ext_csd.cache_size > 0) &&
 	    (card->ext_csd.cache_ctrl & 1)) {
 		err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
-				EXT_CSD_FLUSH_CACHE, 1, 0);
+				EXT_CSD_FLUSH_CACHE, 1, 100);
 		if (err)
 			pr_err("%s: cache flush error %d\n",
 			       mmc_hostname(card->host), err);
So the problem appears to be caused by the timeout being too long rather than not long enough.
Looking more at the code, I think now that we are hitting the condition ...
diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 868653bc1555..feae82b1ff35 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -579,8 +579,10 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
 	 * the host to avoid HW busy detection, by converting to a R1 response
 	 * instead of a R1B.
 	 */
-	if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout))
+	if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout)) {
+		pr_warn("%s: timeout (%d) > max busy timeout (%d)",
+			mmc_hostname(host), timeout_ms, host->max_busy_timeout);
 		use_r1b_resp = false;
+	}
With the above I see ...
WARNING KERN mmc1: timeout (1600) > max busy timeout (672)
So with the longer timeout we are no longer requesting an R1B response.
You are most likely correct.
However, from the core point of view, the response is still requested, only that we don't want the driver to wait for the card to stop signaling busy. Instead we want to deal with that via "polling" from the core.
This is rather worrying behaviour, as it seems the host driver doesn't really follow these expectations from the core's point of view. And mmc_flush_cache() is not the only case, as we have erase, bkops, sanitize, etc. Are all of these working, or are they not really well tested?
Earlier, before my three patches, if the provided timeout_ms parameter to __mmc_switch() was zero, which was the case for mmc_flush_cache(), __mmc_switch() simply skipped validating against host->max_busy_timeout, which was wrong. In any case, this also meant that an R1B response was always used for mmc_flush_cache(), as you also indicated above. Perhaps this is the critical part where things can go wrong.
BTW, have you tried erase commands for sdhci tegra driver? If those are working fine, do you have any special treatments for these?
I have looped in Faiz, as sdhci-omap seems to suffer from very similar problems. One thing I noted for sdhci-omap is that MMC_ERASE commands are treated in a special manner in sdhci_omap_set_timeout(). This indicates that something fishy is going on.
Faiz, can you please comment on this?
Kind regards Uffe
On 25/02/2020 14:26, Ulf Hansson wrote:
...
However, from the core point of view, the response is still requested, only that we don't want the driver to wait for the card to stop signaling busy. Instead we want to deal with that via "polling" from the core.
This is rather worrying behaviour, as it seems the host driver doesn't really follow these expectations from the core's point of view. And mmc_flush_cache() is not the only case, as we have erase, bkops, sanitize, etc. Are all of these working, or are they not really well tested?
I don't believe that they are well tested. We have a simple test to mount an eMMC partition, create a file, check the contents, remove the file and unmount. The timeouts always occur during unmounting.
Earlier, before my three patches, if the provided timeout_ms parameter to __mmc_switch() was zero, which was the case for mmc_flush_cache(), __mmc_switch() simply skipped validating against host->max_busy_timeout, which was wrong. In any case, this also meant that an R1B response was always used for mmc_flush_cache(), as you also indicated above. Perhaps this is the critical part where things can go wrong.
BTW, have you tried erase commands for sdhci tegra driver? If those are working fine, do you have any special treatments for these?
That I am not sure, but I will check.
Cheers Jon
+ Anders, Kishon
On Tue, 25 Feb 2020 at 17:24, Jon Hunter jonathanh@nvidia.com wrote:
On 25/02/2020 14:26, Ulf Hansson wrote:
...
However, from the core point of view, the response is still requested, only that we don't want the driver to wait for the card to stop signaling busy. Instead we want to deal with that via "polling" from the core.
This is rather worrying behaviour, as it seems the host driver doesn't really follow these expectations from the core's point of view. And mmc_flush_cache() is not the only case, as we have erase, bkops, sanitize, etc. Are all of these working, or are they not really well tested?
I don't believe that they are well tested. We have a simple test to mount an eMMC partition, create a file, check the contents, remove the file and unmount. The timeouts always occur during unmounting.
Earlier, before my three patches, if the provided timeout_ms parameter to __mmc_switch() was zero, which was the case for mmc_flush_cache(), __mmc_switch() simply skipped validating against host->max_busy_timeout, which was wrong. In any case, this also meant that an R1B response was always used for mmc_flush_cache(), as you also indicated above. Perhaps this is the critical part where things can go wrong.
BTW, have you tried erase commands for sdhci tegra driver? If those are working fine, do you have any special treatments for these?
That I am not sure, but I will check.
Great, thanks. Looking forward to your report.
So, from my side, Anders Roxell and I have been collaborating on testing the behaviour on a TI Beagleboard x15 (remotely, with limited debug options), which is using the sdhci-omap variant. I am trying to get hold of an Nvidia Jetson TX2, but have not found one yet. These are the conclusions from the observed behaviour on the Beagleboard for the CMD6 cache flush command.
First, the reported host->max_busy_timeout is 2581 (ms) for the sdhci-omap driver in this configuration.
1. As we all know by now, the cache flush command (CMD6) fails with -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 * 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag from the command.
2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s) means that the MMC_RSP_BUSY flag becomes set by __mmc_switch, because the timeout_ms parameter is less than max_busy_timeout (2000 < 2581). Then everything works fine.
3. Updating the code to again use 30s as the MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be set, even when the timeout_ms becomes greater than max_busy_timeout. This also works fine.
Clearly this indicates a problem that I think needs to be addressed in the sdhci driver. However, of course I can revert the three discussed patches to fix the problem, but that would only hide the issues and I am sure we would then get back to this issue, sooner or later.
To fix the problem in the sdhci driver, I would appreciate if someone from TI and Nvidia can step in to help, as I don't have the HW on my desk.
Comments or other ideas of how to move forward?
Kind regards Uffe
On 26/02/2020 15:21, Ulf Hansson wrote:
- Anders, Kishon
On Tue, 25 Feb 2020 at 17:24, Jon Hunter jonathanh@nvidia.com wrote:
On 25/02/2020 14:26, Ulf Hansson wrote:
...
However, from the core point of view, the response is still requested, only that we don't want the driver to wait for the card to stop signaling busy. Instead we want to deal with that via "polling" from the core.
This is rather worrying behaviour, as it seems the host driver doesn't really follow these expectations from the core's point of view. And mmc_flush_cache() is not the only case, as we have erase, bkops, sanitize, etc. Are all of these working, or are they not really well tested?
I don't believe that they are well tested. We have a simple test to mount an eMMC partition, create a file, check the contents, remove the file and unmount. The timeouts always occur during unmounting.
Earlier, before my three patches, if the provided timeout_ms parameter to __mmc_switch() was zero, which was the case for mmc_flush_cache(), __mmc_switch() simply skipped validating against host->max_busy_timeout, which was wrong. In any case, this also meant that an R1B response was always used for mmc_flush_cache(), as you also indicated above. Perhaps this is the critical part where things can go wrong.
BTW, have you tried erase commands for sdhci tegra driver? If those are working fine, do you have any special treatments for these?
That I am not sure, but I will check.
Great, thanks. Looking forward to your report.
So, no, I don't believe that we have explicitly tried erase commands, and no, there is nothing special that we are doing for erase.
So, from my side, Anders Roxell and I have been collaborating on testing the behaviour on a TI Beagleboard x15 (remotely, with limited debug options), which is using the sdhci-omap variant. I am trying to get hold of an Nvidia Jetson TX2, but have not found one yet. These are the conclusions from the observed behaviour on the Beagleboard for the CMD6 cache flush command.
First, the reported host->max_busy_timeout is 2581 (ms) for the sdhci-omap driver in this configuration.
1. As we all know by now, the cache flush command (CMD6) fails with -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 * 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag from the command.
2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s) means that the MMC_RSP_BUSY flag becomes set by __mmc_switch, because the timeout_ms parameter is less than max_busy_timeout (2000 < 2581). Then everything works fine.
3. Updating the code to again use 30s as the MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be set, even when the timeout_ms becomes greater than max_busy_timeout. This also works fine.
Clearly this indicates a problem that I think needs to be addressed in the sdhci driver. However, of course I can revert the three discussed patches to fix the problem, but that would only hide the issues and I am sure we would then get back to this issue, sooner or later.
To fix the problem in the sdhci driver, I would appreciate if someone from TI and Nvidia can step in to help, as I don't have the HW on my desk.
OK, let me check to see who can help from our side.
Cheers Jon
Uffe,
On 26/02/20 8:51 pm, Ulf Hansson wrote:
- Anders, Kishon
On Tue, 25 Feb 2020 at 17:24, Jon Hunter jonathanh@nvidia.com wrote:
On 25/02/2020 14:26, Ulf Hansson wrote:
...
However, from the core point of view, the response is still requested, only that we don't want the driver to wait for the card to stop signaling busy. Instead we want to deal with that via "polling" from the core.
This is rather worrying behaviour, as it seems the host driver doesn't really follow these expectations from the core's point of view. And mmc_flush_cache() is not the only case, as we have erase, bkops, sanitize, etc. Are all of these working, or are they not really well tested?
I don't believe that they are well tested. We have a simple test to mount an eMMC partition, create a file, check the contents, remove the file and unmount. The timeouts always occur during unmounting.
Earlier, before my three patches, if the provided timeout_ms parameter to __mmc_switch() was zero, which was the case for mmc_flush_cache(), __mmc_switch() simply skipped validating against host->max_busy_timeout, which was wrong. In any case, this also meant that an R1B response was always used for mmc_flush_cache(), as you also indicated above. Perhaps this is the critical part where things can go wrong.
BTW, have you tried erase commands for sdhci tegra driver? If those are working fine, do you have any special treatments for these?
That I am not sure, but I will check.
Great, thanks. Looking forward to your report.
So, from my side, Anders Roxell and I have been collaborating on testing the behaviour on a TI Beagleboard x15 (remotely, with limited debug options), which is using the sdhci-omap variant. I am trying to get hold of an Nvidia Jetson TX2, but have not found one yet. These are the conclusions from the observed behaviour on the Beagleboard for the CMD6 cache flush command.
First, the reported host->max_busy_timeout is 2581 (ms) for the sdhci-omap driver in this configuration.
1. As we all know by now, the cache flush command (CMD6) fails with -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 * 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag from the command.
2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s) means that the MMC_RSP_BUSY flag becomes set by __mmc_switch, because the timeout_ms parameter is less than max_busy_timeout (2000 < 2581). Then everything works fine.
3. Updating the code to again use 30s as the MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be set, even when the timeout_ms becomes greater than max_busy_timeout. This also works fine.
Clearly this indicates a problem that I think needs to be addressed in the sdhci driver. However, of course I can revert the three discussed patches to fix the problem, but that would only hide the issues and I am sure we would then get back to this issue, sooner or later.
To fix the problem in the sdhci driver, I would appreciate if someone from TI and Nvidia can step in to help, as I don't have the HW on my desk.
Comments or other ideas of how to move forward?
Sorry I missed this earlier.
I don't have an X15 with me here, but I'm trying to set one up in our remote farm. In the meantime, I tried to reproduce this issue on two platforms (dra72-evm and am57xx-evm) and wasn't able to see it, because those eMMCs don't even have a cache. I will keep you updated when I do get a board with an eMMC that has a cache.
Is there a way to reproduce this CMD6 issue with another operation?
Thanks, Faiz
On Mon, 2 Mar 2020 at 14:11, Faiz Abbas faiz_abbas@ti.com wrote:
Uffe,
On 26/02/20 8:51 pm, Ulf Hansson wrote:
- Anders, Kishon
On Tue, 25 Feb 2020 at 17:24, Jon Hunter jonathanh@nvidia.com wrote:
On 25/02/2020 14:26, Ulf Hansson wrote:
...
However, from the core point of view, the response is still requested, only that we don't want the driver to wait for the card to stop signaling busy. Instead we want to deal with that via "polling" from the core.
This is rather worrying behaviour, as it seems the host driver doesn't really follow these expectations from the core's point of view. And mmc_flush_cache() is not the only case, as we have erase, bkops, sanitize, etc. Are all of these working, or are they not really well tested?
I don't believe that they are well tested. We have a simple test to mount an eMMC partition, create a file, check the contents, remove the file and unmount. The timeouts always occur during unmounting.
Earlier, before my three patches, if the provided timeout_ms parameter to __mmc_switch() was zero, which was the case for mmc_flush_cache(), __mmc_switch() simply skipped validating against host->max_busy_timeout, which was wrong. In any case, this also meant that an R1B response was always used for mmc_flush_cache(), as you also indicated above. Perhaps this is the critical part where things can go wrong.
BTW, have you tried erase commands for sdhci tegra driver? If those are working fine, do you have any special treatments for these?
That I am not sure, but I will check.
Great, thanks. Looking forward to your report.
So, from my side, Anders Roxell and I have been collaborating on testing the behaviour on a TI Beagleboard x15 (remotely, with limited debug options), which is using the sdhci-omap variant. I am trying to get hold of an Nvidia Jetson TX2, but have not found one yet. These are the conclusions from the observed behaviour on the Beagleboard for the CMD6 cache flush command.
First, the reported host->max_busy_timeout is 2581 (ms) for the sdhci-omap driver in this configuration.
1. As we all know by now, the cache flush command (CMD6) fails with -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 * 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag from the command.
2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s) means that the MMC_RSP_BUSY flag becomes set by __mmc_switch, because the timeout_ms parameter is less than max_busy_timeout (2000 < 2581). Then everything works fine.
3. Updating the code to again use 30s as the MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be set, even when the timeout_ms becomes greater than max_busy_timeout. This also works fine.
Clearly this indicates a problem that I think needs to be addressed in the sdhci driver. However, of course I can revert the three discussed patches to fix the problem, but that would only hide the issues and I am sure we would then get back to this issue, sooner or later.
To fix the problem in the sdhci driver, I would appreciate if someone from TI and Nvidia can step in to help, as I don't have the HW on my desk.
Comments or other ideas of how to move forward?
Sorry I missed this earlier.
I don't have an X15 with me here, but I'm trying to set one up in our remote farm. In the meantime, I tried to reproduce this issue on two platforms (dra72-evm and am57xx-evm) and wasn't able to see it, because those eMMCs don't even have a cache. I will keep you updated when I do get a board with an eMMC that has a cache.
Is there a way to reproduce this CMD6 issue with another operation?
Yes, most definitely.
Let me cook a debug patch for you that should trigger the problem for another CMD6 operation. I will post something later this evening or in the morning (Swedish timezone).
Kind regards Uffe
On 3/2/20 8:50 AM, Ulf Hansson wrote:
On Mon, 2 Mar 2020 at 14:11, Faiz Abbas faiz_abbas@ti.com wrote:
Uffe,
On 26/02/20 8:51 pm, Ulf Hansson wrote:
- Anders, Kishon
On Tue, 25 Feb 2020 at 17:24, Jon Hunter jonathanh@nvidia.com wrote:
On 25/02/2020 14:26, Ulf Hansson wrote:
...
However, from the core point of view, the response is still requested, only that we don't want the driver to wait for the card to stop signaling busy. Instead we want to deal with that via "polling" from the core.
This is rather worrying behaviour, as it seems the host driver doesn't really follow these expectations from the core's point of view. And mmc_flush_cache() is not the only case, as we have erase, bkops, sanitize, etc. Are all of these working, or are they not really well tested?
I don't believe that they are well tested. We have a simple test to mount an eMMC partition, create a file, check the contents, remove the file and unmount. The timeouts always occur during unmounting.
Earlier, before my three patches, if the provided timeout_ms parameter to __mmc_switch() was zero, which was the case for mmc_flush_cache(), __mmc_switch() simply skipped validating against host->max_busy_timeout, which was wrong. In any case, this also meant that an R1B response was always used for mmc_flush_cache(), as you also indicated above. Perhaps this is the critical part where things can go wrong.
BTW, have you tried erase commands for sdhci tegra driver? If those are working fine, do you have any special treatments for these?
That I am not sure, but I will check.
Great, thanks. Looking forward to your report.
So, from my side, Anders Roxell and I have been collaborating on testing the behaviour on a TI Beagleboard x15 (remotely, with limited debug options), which is using the sdhci-omap variant. I am trying to get hold of an Nvidia Jetson TX2, but have not found one yet. These are the conclusions from the observed behaviour on the Beagleboard for the CMD6 cache flush command.
First, the reported host->max_busy_timeout is 2581 (ms) for the sdhci-omap driver in this configuration.
1. As we all know by now, the cache flush command (CMD6) fails with -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 * 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag from the command.
2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s) means that the MMC_RSP_BUSY flag becomes set by __mmc_switch, because the timeout_ms parameter is less than max_busy_timeout (2000 < 2581). Then everything works fine.
3. Updating the code to again use 30s as the MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be set, even when the timeout_ms becomes greater than max_busy_timeout. This also works fine.
Clearly this indicates a problem that I think needs to be addressed in the sdhci driver. However, of course I can revert the three discussed patches to fix the problem, but that would only hide the issues and I am sure we would then get back to this issue, sooner or later.
To fix the problem in the sdhci driver, I would appreciate if someone from TI and Nvidia can step in to help, as I don't have the HW on my desk.
Comments or other ideas of how to move forward?
Sorry I missed this earlier.
I don't have an X15 with me here, but I'm trying to set one up in our remote farm. In the meantime, I tried to reproduce this issue on two platforms (dra72-evm and am57xx-evm) and wasn't able to see it, because those eMMCs don't even have a cache. I will keep you updated when I do get a board with an eMMC that has a cache.
Is there a way to reproduce this CMD6 issue with another operation?
Yes, most definitely.
Let me cook a debug patch for you that should trigger the problem for another CMD6 operation. I will post something later this evening or in the morning (Swedish timezone).
Kind regards Uffe
Hi Ulf,
I could reproduce this during suspend on Jetson TX1/TX2, when it does an mmc cache flush.
The timeout I see is for the switch status CMD13 issued after CMD6: the device-side CMD6 is still in flight when the host sends CMD13, because we are using the R1 response type now that timeout_ms has changed to 30s.
Earlier we used a timeout_ms of 0 for the CMD6 cache flush, which uses the R1B response type: the host waits for the busy state, followed by the device's response to CMD6, and then the data lines go high.
Now, with timeout_ms changed to 30s, we use an R1 response and software waits for busy by checking for the DAT0 line to go high.
With the R1B type, the host design, after sending the command, waits 2 cycles after the end bit for the data line to go low (busy state from the device) and then for the response cycles, after which the data lines go back high and we issue the switch status CMD13.
With the R1 type, after the host sends the command and the end bit completes, the DATA lines go high immediately (since it is R1) and the switch status CMD13 gets issued, but by this time it looks like CMD6 on the device side is still in flight, sending status and data.
The 30s timeout is the wait time for the DAT0 line to go high, but mmc_busy_status() returns success right away with the R1 response type and software sends the switch status CMD13, while the device side looks to be still processing CMD6, because we are not waiting long enough when we use the R1 response type.
Actually we always use R1B with CMD6 as per spec.
Thanks
Sowjanya
[...]
So, from my side, Anders Roxell and I have been collaborating on testing the behaviour on a TI Beagleboard x15 (remotely, with limited debug options), which is using the sdhci-omap variant. I am trying to get hold of an Nvidia Jetson TX2, but have not found one yet. These are the conclusions from the observed behaviour on the Beagleboard for the CMD6 cache flush command.
First, the reported host->max_busy_timeout is 2581 (ms) for the sdhci-omap driver in this configuration.
1. As we all know by now, the cache flush command (CMD6) fails with -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 * 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag from the command.
2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s) means that the MMC_RSP_BUSY flag becomes set by __mmc_switch, because the timeout_ms parameter is less than max_busy_timeout (2000 < 2581). Then everything works fine.
3. Updating the code to again use 30s as the MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be set, even when the timeout_ms becomes greater than max_busy_timeout. This also works fine.
Clearly this indicates a problem that I think needs to be addressed in the sdhci driver. However, of course I can revert the three discussed patches to fix the problem, but that would only hide the issues and I am sure we would then get back to this issue, sooner or later.
To fix the problem in the sdhci driver, I would appreciate if someone from TI and Nvidia can step in to help, as I don't have the HW on my desk.
Comments or other ideas of how to move forward?
[...]
Hi Ulf,
I could reproduce this during suspend on Jetson TX1/TX2, when it does an mmc cache flush.
Okay, great.
The timeout I see is for the switch status CMD13 issued after CMD6: the device-side CMD6 is still in flight when the host sends CMD13, because we are using the R1 response type now that timeout_ms has changed to 30s.
Earlier we used a timeout_ms of 0 for the CMD6 cache flush, which uses the R1B response type: the host waits for the busy state, followed by the device's response to CMD6, and then the data lines go high.
Now, with timeout_ms changed to 30s, we use an R1 response and software waits for busy by checking for the DAT0 line to go high.
If I understand correctly, because of the timeout now set to 30s, MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in sdhci-tegra as well?
In other words, mmc_poll_for_busy() is being called, which in your case means the ->card_busy() host ops (set to sdhci_card_busy() in your case) will be invoked to wait for the card to stop signaling busy on DAT0.
This indicates to me that the ->card_busy() ops returns zero to report that the card is *not* busy, even while the card is actually signaling busy? Is that correct?
With the R1B type, the host design, after sending the command, waits 2 cycles after the end bit for the data line to go low (busy state from the device) and then for the response cycles, after which the data lines go back high and we issue the switch status CMD13.
With the R1 type, after the host sends the command and the end bit completes, the DATA lines go high immediately (since it is R1) and the switch status CMD13 gets issued, but by this time it looks like CMD6 on the device side is still in flight, sending status and data.
So, yes, using R1 instead of R1B triggers a different behaviour, but according to the eMMC spec it's perfectly allowed to issue a CMD13 even if the card signals busy on DAT0. The CMD13 is not using the DATA lines, so this should work.
If I understand correctly, your driver (and controller?) has issues coping with this scenario. Is it something that can be fixed?
The 30s timeout is the wait time for the DAT0 line to go high, but mmc_busy_status() returns success right away with the R1 response type and software sends the switch status CMD13, while the device side looks to be still processing CMD6, because we are not waiting long enough when we use the R1 response type.
Right, as stated above, isn't sdhci_card_busy() working for your case? Can we fix it?
Actually we always use R1B with CMD6 as per spec.
I fully agree that R1B is preferable, but it's not against the spec to send CMD13 to poll for busy.
Moreover, we need to cope with the scenario where the host has specified a maximum timeout that isn't sufficiently long for the requested operation. Do you have another proposal for how to manage this, other than disabling MMC_RSP_BUSY?
Let's assume your driver would get an R1B for the CMD6 (we force it); what timeout would the driver then use if we set cmd.busy_timeout to 30ms?
Kind regards Uffe
[...]
Actually we always use R1B with CMD6 as per spec.
I fully agree that R1B is preferable, but it's not against the spec to send CMD13 to poll for busy.
Moreover, we need to cope with the scenario where the host has specified a maximum timeout that isn't sufficiently long for the requested operation. Do you have another proposal for how to manage this, other than disabling MMC_RSP_BUSY?
Let's assume your driver would get an R1B for the CMD6 (we force it); what timeout would the driver then use if we set cmd.busy_timeout to 30ms?
/s/30ms/30s
Kind regards Uffe
On 3/4/20 2:18 AM, Ulf Hansson wrote:
[...]
So, from my side, Anders Roxell and I have been collaborating on testing the behaviour on a TI Beagleboard x15 (remotely, with limited debug options), which is using the sdhci-omap variant. I am trying to get hold of an Nvidia Jetson TX2, but have not found one yet. These are the conclusions from the observed behaviour on the Beagleboard for the CMD6 cache flush command.
First, the reported host->max_busy_timeout is 2581 (ms) for the sdhci-omap driver in this configuration.
1. As we all know by now, the cache flush command (CMD6) fails with -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 * 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag from the command.
2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s) means that the MMC_RSP_BUSY flag becomes set by __mmc_switch, because the timeout_ms parameter is less than max_busy_timeout (2000 < 2581). Then everything works fine.
3. Updating the code to again use 30s as the MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be set, even when the timeout_ms becomes greater than max_busy_timeout. This also works fine.
Clearly this indicates a problem that I think needs to be addressed in the sdhci driver. However, of course I can revert the three discussed patches to fix the problem, but that would only hide the issues and I am sure we would then get back to this issue, sooner or later.
To fix the problem in the sdhci driver, I would appreciate if someone from TI and Nvidia can step in to help, as I don't have the HW on my desk.
Comments or other ideas of how to move forward?
[...]
Hi Ulf,
I could reproduce this during suspend on Jetson TX1/TX2, when it does an mmc cache flush.
Okay, great.
The timeout I see is for the switch status CMD13 issued after CMD6: the device-side CMD6 is still in flight when the host sends CMD13, because we are using the R1 response type now that timeout_ms has changed to 30s.
Earlier we used a timeout_ms of 0 for the CMD6 cache flush, which uses the R1B response type: the host waits for the busy state, followed by the device's response to CMD6, and then the data lines go high.
Now, with timeout_ms changed to 30s, we use an R1 response and software waits for busy by checking for the DAT0 line to go high.
If I understand correctly, because of the timeout now set to 30s, MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in sdhci-tegra as well?
Yes
In other words, mmc_poll_for_busy() is being called, which in your case means the ->card_busy() host ops (set to sdhci_card_busy() in your case) will be invoked to wait for the card to stop signaling busy on DAT0.
This indicates to me that the ->card_busy() ops returns zero to report that the card is *not* busy, even while the card is actually signaling busy? Is that correct?
Yes
With the R1B type, after sending the command, at completion of the end bit the host design waits 2 cycles for the data line to go low (the busy state from the device) and then waits for the response cycles, after which the data lines go back high and we issue the switch status CMD13.
With the R1 type, after sending the command, at completion of the end bit the DATA lines go high immediately (since it is R1) and the switch status CMD13 gets issued, but by this time it looks like CMD6 on the device side is still in flight, sending status and data.
So, yes, using R1 instead of R1B triggers a different behaviour, but according to the eMMC spec it's perfectly allowed to issue a CMD13 even if the card signals busy on DAT0. CMD13 does not use the DATA lines, so this should work.
If I understand correctly, your driver (and controller?) has issues with coping with this scenario. Is it something that can be fixed?
The 30s timeout is the wait time for the DAT0 line to go high, but with the R1 response type mmc_busy_status() returns success right away and SW sends the switch status CMD13; during that time the device side looks like it is still processing CMD6, as we are not waiting long enough when we use the R1 response type.
Right, as stated above, isn't sdhci_card_busy() working for your case? Can we fix it?
sdhci_card_busy() returned 0, indicating it is not busy.
Based on our host design, when CMD6 is issued with the R1 type, we program it as NO_RESPONSE; with this, the command complete interrupt happens right at the end bit of the command, and there will be no transfer complete interrupt.
When CMD6 is issued with the R1B type, we program it as R1B RESP_SHORT; with this, command complete is at the end bit of the device response, and the transfer complete interrupt comes when DAT0 goes LOW -> HIGH.
Regardless of R1/R1B, the device side of CMD6 will always have a busy state on DAT0 and a response on the CMD line.
There is a 2 clock cycle period after sending CMD6 before the device signals the busy state on DAT0.
In the R1 case, after sending the command DAT will stay high, and it looks like we are polling for busy too early, before the busy state has started; sending CMD13 while the device is busy and sending its response on the CMD line is causing the timeout.
Probably, for this specific case of CMD6 with the R1 type, to wait for card busy we should poll for DAT0 to go low first and then to go high?
Actually we always use R1B with CMD6 as per spec.
I fully agree that R1B is preferable, but it's not against the spec to send CMD13 to poll for busy.
Moreover, we need to cope with the scenario where the host has specified a maximum timeout that isn't sufficiently long for the requested operation. Do you have another proposal for how to manage this, other than disabling MMC_RSP_BUSY?
Let's assume your driver would get an R1B for the CMD6 (we force it); what timeout would the driver then be using if we set cmd.busy_timeout to 30ms?
Kind regards Uffe
On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
On 3/4/20 2:18 AM, Ulf Hansson wrote:
[...]
*[Correction] Based on our host design, when CMD6 is issued with the R1 type, we program it as NO_RESPONSE; with this, the command complete interrupt happens right at the end bit of the command, and there will be no transfer complete interrupt.
On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
[...]
Sorry, to correct the wording: I meant that the sdhci driver programs the response type as NO_RESPONSE for CMD6.
When CMD6 is issued with the R1 type and as NO_RESPONSE, based on our host design the command complete interrupt happens right at the end bit of the command, and there will be no transfer complete interrupt.
On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
[...]
Sorry for the confusion. Please ignore the above regarding the response: it is using a SHORT response for R1, so the SW poll should be working.
Will get back after checking internally on the host design side.
On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
[...]
Hi Ulf,
We verified internally regarding the busy state on DAT0. Our host design has a known minor bug where the command complete interrupt is asserted only after waiting out the busy cycles from the device. Because of this, polling card_busy() returns 0 (DAT0 line high) immediately, as the wait for busy is already handled during the command complete interrupt in the host design. This behavior is the same for R1 and R1B.
On 3/4/20 2:35 PM, Sowjanya Komatineni wrote:
On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
On 3/4/20 2:18 AM, Ulf Hansson wrote:
External email: Use caution opening links or attachments
[...]
> So, from my side, me and Anders Roxell, have been collaborating on > testing the behaviour on a TI Beagleboard x15 (remotely with > limited > debug options), which is using the sdhci-omap variant. I am > trying to > get hold of an Nvidia jetson-TX2, but not found one yet. These > are the > conclusions from the observed behaviour on the Beagleboard for the > CMD6 cache flush command. > > First, the reported host->max_busy_timeout is 2581 (ms) for the > sdhci-omap driver in this configuration. > > 1. As we all know by now, the cache flush command (CMD6) fails with > -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set > to 30 * > 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag > from the command. > > 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that > the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the > timeout_ms parameter is less than max_busy_timeout (2000 < 2581). > Then everything works fine. > > 3. Updating the code to again use 30s as the > MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY > to be > set, even when the timeout_ms becomes greater than > max_busy_timeout. > This also works fine. > > Clearly this indicates a problem that I think needs to be > addressed in > the sdhci driver. However, of course I can revert the three > discussed > patches to fix the problem, but that would only hide the issues > and I > am sure we would then get back to this issue, sooner or later. > > To fix the problem in the sdhci driver, I would appreciate if > someone > from TI and Nvidia can step in to help, as I don't have the HW > on my > desk. > > Comments or other ideas of how to move forward? [...]
> Hi Ulf, > > I could repro during suspend on Jetson TX1/TX2 as when it does > mmc flush cache. Okay, great.
> > Timeout I see is for switch status CMD13 after sending CMD6 as > device side CMD6 is still inflight while host sends CMD13 as we > are using R1 response type with timeout_ms changes to 30s. > > > > Earlier we used timeout_ms of 0 for CMD6 flush cache, and with > it uses R1B response type and host will wait for busy state > followed by response from device for CMD6 and then data lines go > High. > > > > Now with timeout_ms changed to 30s, we use R1 response and SW > waits for busy by checking for DAT0 line to go High. If I understand correctly, because of the timeout now set to 30s, MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in sdhci-tegra as well?
Yes
In other words, mmc_poll_for_busy() is being called, which in your case means the ->card_busy() host ops (set to sdhci_card_busy() in your case) will be invoked to wait for the card to stop signal busy on DAT0.
This indicates to me, that the ->card_busy() ops returns zero to inform that the card is *not* busy, even if the card actually signals busy? Is that correct?
Yes
> > > With R1B type, host design after sending command at end of > completion after end bit waits for 2 cycles for data line to go > low (busy state from device) and waits for response cycles after > which data lines will go back high and then we issue switch > status CMD13. > > > > With R1 type, host after sending command and at end of > completion after end bit, DATA lines will go high immediately as > its R1 type and switch status CMD13 gets issued but by this time > it looks like CMD6 on device side is still in flight for sending > status and data. So, yes, using R1 instead of R1B triggers a different behaviour, but according to the eMMC spec it's perfectly allowed to issue a CMD13 even if the card signals busy on DAT0. The CMD13 is not using the DATA lines, so this should work.
If I understand correctly, your driver (and controller?) has issues with coping with this scenario. Is it something that can be fixed?
> The 30s timeout is the wait time for the DAT0 line to go high; mmc_busy_status() returns success right away with the R1 response type, and SW sends the switch status CMD13, but at that point the device side looks like it is still processing CMD6, as we are not waiting long enough when we use the R1 response type.

Right. As stated above, isn't sdhci_card_busy() working for your case? Can we fix it?
sdhci_card_busy() returned 0, indicating it's not busy.
Based on our host design, when CMD6 is issued with R1 type, the sdhci driver programs it as NO_RESPONSE; with that, the command complete interrupt happens right at the end bit of the command, and there is no transfer complete interrupt.

Sorry for the confusion, please ignore the above on the response type: it is actually using a SHORT response for R1, so the SW poll should be working.
Will get back after checking on the host design side internally.
Hi Ulf,
Verified internally regarding the busy state over DAT0: our host design has a known minor bug where the command complete interrupt is asserted only after waiting out the busy cycles from the device. Because of this, polling card_busy() returns 0 (DAT0 line high) immediately, as waiting for busy is already taken care of during the command complete interrupt in the host design. This behavior is the same for R1 and R1B.
When CMD6 is issued with R1B type, we program it as R1B RESP_SHORT; with this, command complete is at the end bit of the device response, and the transfer complete interrupt fires when DAT0 goes LOW -> HIGH.
Regardless of R1/R1B, on the device side CMD6 will always have a busy state on DAT0 and a response on the CMD line.

There is a 2 clock cycle period after sending CMD6 before the device drives the busy state on DAT0.

In the R1 case, after sending the command DAT0 stays high, and it looks like we are polling for busy too early, before the busy state has actually started; sending CMD13 while the device is busy and sending its response on the CMD line is what causes the timeout.

Probably, for this specific case of CMD6 with R1 type, to wait for card busy we should poll for DAT0 to go low first and then to go high?
> Actually we always use R1B with CMD6 as per spec.

I fully agree that R1B is preferable, but it's not against the spec to send CMD13 to poll for busy.

Moreover, we need to cope with the scenario where the host has specified a maximum timeout that isn't sufficiently long for the requested operation. Do you have another proposal for how to manage this, other than disabling MMC_RSP_BUSY?

Let's assume your driver would get an R1B for the CMD6 (we force it); what timeout would the driver then be using if we set cmd.busy_timeout to 30s?
Sorry, I didn't understand clearly. Are you asking, with the 30s timeout, what data timeout counter is used?
Because of the above mentioned issue on our host, where the CMD interrupt happens after the busy state, the poll for busy returns right away as not busy.
So issuing CMD13 after CMD6 with R1, followed by a busy poll, should be working. But it's weird that with a small 1ms delay, or a debug print, before CMD13 it doesn't time out and works every time.
Kind regards Uffe
On 3/4/20 4:20 PM, Sowjanya Komatineni wrote:
With R1B for CMD6, the busy detection timeout on our host is 11s (data timeout count = 0xE).
On Thu, 5 Mar 2020 at 01:20, Sowjanya Komatineni skomatineni@nvidia.com wrote:
> Sorry, I didn't understand clearly. Are you asking, with the 30s timeout, what data timeout counter is used?
Yes. It seems like it will pick the maximum, which is 11s?
> Because of the above mentioned issue on our host, where the CMD interrupt happens after the busy state, the poll for busy returns right away as not busy.
I see.
> So issuing CMD13 after CMD6 with R1, followed by a busy poll, should be working. But it's weird that with a small 1ms delay, or a debug print, before CMD13 it doesn't time out and works every time.
I have digested the information you provided in these emails. Let me summarize it, to see if I have understood correctly.
1. Your controller can't distinguish between R1 and R1B because of a limitation in the HW. So, in both cases you need to wait for the card to stop signaling busy before the controller can raise an IRQ to notify that the R1 response has been received. Correct?
In this context, I am wondering if sdhci_send_command(), really conforms to these requirements. For example, depending on if the CMD6 has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or SDHCI_CMD_RESP_SHORT_BUSY.
Does this work as expected for your case?
2. Assuming my interpretation of the above is somewhat correct, you always need to set a busy timeout for R1/R1B responses in the controller. The maximum timeout seems to be 11s. Obviously, this isn't enough for all cases, such as cache flushing and erase, for example. So, what can we do to support timeouts longer than 11s? Would it be possible to disable the HW timeout when the requested timeout is longer than 11s, and use a SW timeout instead?
Kind regards Uffe
On 3/5/20 5:05 AM, Ulf Hansson wrote:
> Yes. It seems like it will pick the maximum, which is 11s?
yes
> In this context, I am wondering if sdhci_send_command() really conforms to these requirements. For example, depending on whether the CMD6 has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or SDHCI_CMD_RESP_SHORT_BUSY.
>
> Does this work as expected for your case?
The design team re-verified internally: the bug where HW waits for the busy state before raising the IRQ applies only to R1B; R1 is spec compliant.

So, with R1, command complete is generated after the response is received.

With R1B, command complete and transfer complete are both generated after the response is received plus the device busy period (max timeout of 11s). A DATA timeout interrupt is asserted in case the HW busy detection fails.

So with R1B we may see a DATA timeout if the operation takes more than the max busy timeout of 11s.
> Assuming my interpretation of the above is somewhat correct, you always need to set a busy timeout for R1/R1B responses in the controller. The maximum timeout seems to be 11s. Obviously, this isn't enough for all cases, such as cache flushing and erase. So, what can we do to support timeouts longer than 11s? Would it be possible to disable the HW timeout when the requested timeout is longer than 11s, and use a SW timeout instead?
For long erase operations we have a register bit that enables an infinite busy wait mode, where the host controller keeps monitoring while the card is busy.

But so far, for the eMMC devices we have used on our platforms, we haven't seen a cache flush take more than 11s.

Will get back on the possibility of disabling the HW timeout and using a SW timeout.
Thanks
Sowjanya
[...]
> Yes. It seems like it will pick the maximum, which is 11s?
>
> yes
Okay, thanks!
Because of above mentioned issue on our host where CMD interrupt happens after busy state, poll for busy returns right away as not busy.
I see.
So issuing CMD13 after CMD6-R1 followed by busy poll should be working. But weird that with small delay of 1ms or debug print before CMD13 it doesn't timeout and works all the time.
I have digested the information you provided in these emails. Let me summarize it, to see if I have understood correctly.
Your controller can't distinguish between R1 and R1B because of a limitation in the HW. So, in both cases you need to wait for the card to stop signaling busy before the controller can give an IRQ to notify that the R1 response has been received. Correct?
In this context, I am wondering if sdhci_send_command() really conforms to these requirements. For example, depending on whether the CMD6 has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or SDHCI_CMD_RESP_SHORT_BUSY.
Does this work as expected for your case?
The design team re-verified internally: the bug where HW waits for the busy state before the IRQ applies only to R1B, and R1 is spec compliant.
So, with R1, CMD complete is generated after the response is received.
Okay.
So, the issue we see for CMD6 with R1 is a software problem that we should be able to fix.
With R1B, CMD complete and xfer complete are both generated after the response is received and the device stops signaling busy (max timeout of 11s). A DATA timeout interrupt will be asserted in case HW busy detection fails.
With R1B we may see a DATA timeout if the operation takes more than the max busy timeout of 11s.
Okay, I see.
Assuming my interpretation of the above is somewhat correct, you always need to set a busy timeout for R1/R1B responses in the controller. The maximum timeout seems to be 11s. Obviously, this isn't enough for all cases, such as cache flushing and erase, for example. So, what can we do to support timeouts longer than 11s? Would it be possible to disable the HW timeout if the requested timeout is longer than 11s, and use a SW timeout instead?
Kind regards Uffe
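For reference, the 11s ceiling discussed above is consistent with how the SDHCI data timeout counter is encoded: at most 2^(13+14) = 2^27 timeout-clock cycles. The 12 MHz timeout clock in the check below is an assumed figure for illustration, not a value stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* SDHCI encodes the data timeout as 2^(13 + n) timeout-clock
 * cycles with n <= 14, i.e. at most 2^27 cycles. */
static double sdhci_max_busy_timeout_s(double tmclk_hz)
{
    const uint64_t max_cycles = 1ull << 27;  /* 2^(13 + 14) */
    return (double)max_cycles / tmclk_hz;
}
```

At an assumed 12 MHz timeout clock this gives 2^27 / 12e6 ≈ 11.18s, matching the ~11s maximum mentioned in the discussion.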
For long erase operations we have a register bit to enable an infinite busy-wait mode, where the host controller keeps monitoring until the card is no longer busy.
Alright, that sounds great!
But so far, for the eMMC devices we have used on our platforms, we haven't seen a cache flush take more than 11s.
I understand that 11s is probably fine to use for most cases.
However, it's not spec compliant, as for some operations there is simply no timeout specified. BKOPS, cache flush, and sanitize are cases like this - and then 11s is definitely not sufficient.
I will get back to you on the possibility of disabling the HW timeout and using a SW timeout.
Thanks!
I would like to get the regression fixed asap, but I also would like to avoid reverting patches unless really necessary. May I propose the following two options:
1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with an R1 response, doesn't work - and then fix that behaviour.
2. Set mmc->max_busy_timeout to zero for sdhci-tegra, which makes the core always use R1B for CMD6 (and erase). This also means that when cmd->busy_timeout becomes longer than 11s, sdhci-tegra must disable the HW busy timeout and just wait "forever".
If you decide for 2, you can add the software timeout support on top, but that can be considered a next step, an improvement rather than a needed fix. Note that I believe there is already some support for software timeouts in the sdhci core; maybe you need to tweak it a bit for your case, I don't know.
Kind regards Uffe
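Option 2 above could be sketched roughly as follows: if the requested busy timeout fits the HW counter, use HW busy detection; otherwise disable the HW timeout (wait "forever") and fall back to a software timer. The helper names and the 11000ms constant are illustrative stand-ins, not sdhci-tegra's actual code.

```c
#include <assert.h>
#include <stdbool.h>

#define HW_MAX_BUSY_MS 11000u  /* the ~11s HW ceiling discussed above */

/* HW busy detection is only usable when a finite timeout was
 * requested and it fits within the HW counter's maximum. */
static bool hw_busy_timeout_usable(unsigned int busy_timeout_ms)
{
    return busy_timeout_ms && busy_timeout_ms <= HW_MAX_BUSY_MS;
}

/* When HW detection is not usable, wait "forever" in HW and arm a
 * software timeout instead; 0 means no SW limit either. */
static unsigned int sw_timeout_ms(unsigned int busy_timeout_ms)
{
    return hw_busy_timeout_usable(busy_timeout_ms) ? 0 : busy_timeout_ms;
}
```

For example, a 5s CMD6 would be handled entirely in HW, while a 30s cache flush would disable the HW busy timeout and rely on the 30s SW timer.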
Uffe,
On 06/03/20 4:44 pm, Ulf Hansson wrote:
[...]
I was able to reproduce the issue on sdhci-omap with your patch applied.
In our case, the dat0 line never comes back up when trying to send a CMD6 with R1. It needs an R1b to work properly. I also set mmc->max_busy_timeout to 0 and that made it work as well.
I want to try it out with other R1B commands, like erase, to see if it works properly when the delay is actually very long.
Thanks, Faiz
On Mon, 9 Mar 2020 at 15:06, Faiz Abbas faiz_abbas@ti.com wrote:
Uffe,
On 06/03/20 4:44 pm, Ulf Hansson wrote:
[...]
I was able to reproduce the issue on sdhci-omap with your patch applied.
Great, thanks for confirming!
In our case, the dat0 line never comes back up when trying to send a CMD6 with R1. It needs an R1b to work properly. I also set
Do you know if that is because of how the driver manages R1B/R1 or because of a limitation in the HW?
mmc->max_busy_timeout to 0 and that made it work as well.
Okay, good.
I wanna try it out with other R1b commands like erase to see if it works properly if the delay is actually very long.
Right, it probably works for most cases. In any case, this was broken even before my three patches that screwed things up.
BTW, what HW busy timeout will you be using for sdhci-omap if cmd->busy_timeout exceeds it? Is it 2581ms for all sdhci-omap instances, or does it differ depending on platform?
Anyway, at this phase, it sounds like we should move forward with a patch setting max_busy_timeout to zero for sdhci-omap, as that restores the earlier behavior. Would you mind posting a patch, or do you want me to manage it?
Kind regards Uffe
On 3/6/20 3:14 AM, Ulf Hansson wrote:
[...]
Hi Uffe
Will go with 2nd option and will send patches out when ready.
BTW, the Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK, for a data timeout based on the host clock when using finite mode (HW busy detection based on the DATA TIMEOUT count value when the cmd operation timeout is < 11s for the Tegra host).
So, it looks like we can't set the host max_busy_timeout to 0 for the Tegra host to force R1B during SWITCH and SLEEP_AWAKE.
So, I was thinking to introduce a host capability MMC_CAP2_LONG_WAIT_HW_BUSY, which can be used for hosts supporting long or infinite HW busy-wait detection, and to update the mmc and mmc_ops drivers to not allow converting R1B to R1 for hosts with this capability during SLEEP_AWAKE and SWITCH.
Thanks
Sowjanya
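The proposed capability check could look roughly like this on the core side: keep R1B whenever the host supports long/infinite HW busy waits, advertises no busy-timeout limit, or the operation fits within the advertised limit. The cap name follows this mail; the bit value and the helper are made-up illustrations, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Proposed cap from this mail; the bit position here is made up. */
#define MMC_CAP2_LONG_WAIT_HW_BUSY (1u << 0)

/* Decide whether the core may keep R1B for a busy-signaling command,
 * given the host caps, its advertised busy-timeout ceiling (0 = no
 * limit), and the operation's expected busy time. */
static bool use_r1b(unsigned int caps2, unsigned int max_busy_ms,
                    unsigned int timeout_ms)
{
    if (caps2 & MMC_CAP2_LONG_WAIT_HW_BUSY)
        return true;                    /* host can wait long/forever */
    if (!max_busy_ms)
        return true;                    /* no HW ceiling advertised */
    return timeout_ms <= max_busy_ms;   /* fits the HW counter */
}
```

So a Tegra-like host with the new cap would keep R1B even for a 30s operation, while a host with an 11s ceiling and no cap would fall back to R1 plus CMD13 polling for that case.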
On Mon, 9 Mar 2020 at 18:33, Sowjanya Komatineni skomatineni@nvidia.com wrote:
On 3/6/20 3:14 AM, Ulf Hansson wrote:
External email: Use caution opening links or attachments
[...]
Will go with 2nd option and will send patches out when ready.
Okay, good.
BTW, the Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK, for a data timeout based on the host clock when using finite mode (HW busy detection based on the DATA TIMEOUT count value when the cmd operation timeout is < 11s for the Tegra host).
So, it looks like we can't set the host max_busy_timeout to 0 for the Tegra host to force R1B during SWITCH and SLEEP_AWAKE.
So, I was thinking to introduce a host capability MMC_CAP2_LONG_WAIT_HW_BUSY, which can be used for hosts supporting long or infinite HW busy-wait detection, and to update the mmc and mmc_ops drivers to not allow converting R1B to R1 for hosts with this capability during SLEEP_AWAKE and SWITCH.
That seems reasonable; it probably becomes both easier and clearer by adding a new host cap.
In any case, let me help out and cook a patch for this for the core part (I leave the sdhci change to you). It may be a bit tricky, especially since I have currently queued a bunch of new changes for v5.7 that enable more users of mmc_poll_for_busy() in the core. Maybe I need to temporarily drop them, so we can fix these problems first. I will check.
Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that seems to describe the common problem we have for sdhci-omap/tegra.
Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for sdhci-tegra, so while at it, perhaps you can cook a patch for that as well.
Kind regards Uffe
On 3/10/20 2:46 AM, Ulf Hansson wrote:
[...]
OK, I sent v1 yesterday. Please ignore them then.
I will send out patches only for programming the HW busy-wait modes based on the cmd timeout, and for enabling WAIT_WHILE_BUSY.
Thanks
Sowjanya
[...]
I would like to get the regression fixed asap, but I also would like to avoid reverting patches, unless really necessary. May I propose the following two options.
- Find out why polling with ->card_busy() or CMD13, for a CMD6 with
an R1 response doesn't work - and then fix that behaviour.
- Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
the core to always use R1B for CMD6 (and erase). This also means that when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must disable the HW busy timeout and just wait "forever".
If you decide for 2, you can add the software timeout support on top, but make that can be considered as a next step of an improvement, rather than needed as fix. Note that, I believe there are some support for software timeout already in the sdhci core, maybe you need to tweak it a bit for your case, I don't know.
Kind regards Uffe
Hi Uffe
Will go with 2nd option and will send patches out when ready.
Okay, good.
BTW, the Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for a data timeout based on the host clock when using finite mode (HW busy detection based on the DATA TIMEOUT count value when the cmd operation timeout is < 11s for the Tegra host).
So, it looks like we can't set the host's max_busy_timeout to 0 for the Tegra host to force R1B during SWITCH and SLEEP_AWAKE.
So, I was thinking to introduce a host capability, MMC_CAP2_LONG_WAIT_HW_BUSY, which can be used for hosts supporting long or infinite HW busy wait detection, and to update the mmc and mmc_ops drivers to not convert R1B to R1 for hosts with this capability during SLEEP_AWAKE and SWITCH.
That seems reasonable; it probably becomes both easier and clearer by adding a new host cap.
In any case, let me help out and cook a patch for this for the core part (I leave the sdhci change to you). It may be a bit tricky, especially since I have currently queued a bunch of new changes for v5.7 that enable more users of mmc_poll_for_busy() in the core. Maybe I need to temporarily drop them, so we can fix these problems first. I will check.
Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that seems to be describing the common problem we have for sdhci omap/tegra.
Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for sdhci-tegra, so while at it, perhaps you can cook a patch for that as well.
Kind regards Uffe
OK, I sent v1 yesterday. Please ignore them then.
Oh, I haven't seen them. In any case, I am ignoring them.
Will send out patches only for HW busy wait mode programming based on the cmd timeout, with WAIT_WHILE_BUSY enabled.
Great, thanks!
Please help test the series I just posted as well, if you have the time, of course.
Kind regards Uffe
On 3/10/20 10:09 AM, Ulf Hansson wrote:
External email: Use caution opening links or attachments
Sure,
Thanks
Sowjanya
On 3/10/20 10:27 AM, Sowjanya Komatineni wrote:
mmc_sleep() also needs update to force R1B when host sets capability MMC_CAP_NEED_RSP_BUSY
On 3/10/20 2:59 PM, Sowjanya Komatineni wrote:
Tested the patches and they work well.
On 3/10/20 4:10 PM, Sowjanya Komatineni wrote:
Sent sdhci-tegra v2 patches
- includes busy wait mode programming based on cmd busy_timeout
- enables MMC_CAP_WAIT_WHILE_BUSY
Thanks
Sowjanya
On Wed, 11 Mar 2020 at 01:19, Sowjanya Komatineni skomatineni@nvidia.com wrote:
mmc_sleep() also needs update to force R1B when host sets capability MMC_CAP_NEED_RSP_BUSY
Yes, I am on it! Thanks!
Tested the patches and they work well.
Great, I am adding your tested-by tag then.
Sent sdhci-tegra v2 patches
includes busy wait mode programming based on cmd busy_timeout
enables MMC_CAP_WAIT_WHILE_BUSY
Sounds great, but I don't see the patches on the mailing list, nor did they reach the mmc patchtracker.
Seems like you probably need to check your email settings when sending patches.
Kind regards Uffe
FYI,
The arm device x15 running the stable rc 4.19.112-rc1, 5.4.27-rc1 and 5.5.11-rc2 kernels prints the following messages in the console log. Is this a problem?
[ 0.000000] Linux version 4.19.112-rc1 (oe-user@oe-host) (gcc version 7.3.0 (GCC)) #1 SMP Thu Mar 19 12:55:45 UTC 2020
[ 15.737765] mmc1: unspecified timeout for CMD6 - use generic
[ 16.754248] mmc1: unspecified timeout for CMD6 - use generic
[ 16.842071] mmc1: unspecified timeout for CMD6 - use generic
...
[ 20.580541] mmc1: unspecified timeout for CMD6 - use generic
[ 20.588216] mmc1: unspecified timeout for CMD6 - use generic
[ 20.604011] mmc1: unspecified timeout for CMD6 - use generic
ref: https://lkft.validation.linaro.org/scheduler/job/1298207#L4037 https://lkft.validation.linaro.org/scheduler/job/1298945#L4132 https://lkft.validation.linaro.org/scheduler/job/1299973#L4232
- Naresh
On Thu, 19 Mar 2020 at 20:12, Naresh Kamboju naresh.kamboju@linaro.org wrote:
The commit below is the problem; it shouldn't be applied to stable.
commit 533a6cfe08f96a7b5c65e06d20916d552c11b256
Author: Ulf Hansson <ulf.hansson@linaro.org>
Date:   Wed Jan 22 15:27:47 2020 +0100

    mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
Let me sort it out with Greg/Sasha. I will keep you in the loop, thanks for reporting!
Kind regards Uffe
On Fri, Mar 20, 2020 at 10:20:04AM +0100, Ulf Hansson wrote:
Now dropped from the 4.19, 5.4, and 5.5 stable queues, sorry for the noise.
greg k-h
On Mon, 2 Mar 2020 at 17:50, Ulf Hansson ulf.hansson@linaro.org wrote:
On Mon, 2 Mar 2020 at 14:11, Faiz Abbas faiz_abbas@ti.com wrote:
Uffe,
On 26/02/20 8:51 pm, Ulf Hansson wrote:
- Anders, Kishon
On Tue, 25 Feb 2020 at 17:24, Jon Hunter jonathanh@nvidia.com wrote:
On 25/02/2020 14:26, Ulf Hansson wrote:
...
However, from the core point of view, the response is still requested, only that we don't want the driver to wait for the card to stop signaling busy. Instead we want to deal with that via "polling" from the core.
This is a rather worrying behaviour, as it seems like the host driver doesn't really follow these expectations from the core's point of view. And mmc_flush_cache() is not the only case, as we have erase, bkops, sanitize, etc. Are all of these working, or are they not really well tested?
I don't believe that they are well tested. We have a simple test to mount an eMMC partition, create a file, check the contents, remove the file and unmount. The timeouts always occur during unmounting.
Earlier, before my three patches, if the provided timeout_ms parameter to __mmc_switch() was zero, which was the case for mmc_mmc_flush_cache(), __mmc_switch() simply skipped the validation against host->max_busy_timeout, which was wrong. In any case, this also meant that an R1B response was always used for mmc_flush_cache(), as you also indicated above. Perhaps this is the critical part where things can go wrong.
BTW, have you tried erase commands with the sdhci-tegra driver? If those are working fine, do you have any special treatment for them?
That I am not sure, but I will check.
Great, thanks. Looking forward to your report.
So, from my side, Anders Roxell and I have been collaborating on testing the behaviour on a TI Beagleboard x15 (remotely, with limited debug options), which is using the sdhci-omap variant. I am trying to get hold of an Nvidia jetson-TX2, but have not found one yet. These are the conclusions from the observed behaviour on the Beagleboard for the CMD6 cache flush command.
First, the reported host->max_busy_timeout is 2581 (ms) for the sdhci-omap driver in this configuration.
- As we all know by now, the cache flush command (CMD6) currently fails with -110. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 * 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag from the command.
- Changing MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s) means that the MMC_RSP_BUSY flag becomes set by __mmc_switch(), because the timeout_ms parameter is less than max_busy_timeout (2000 < 2581). Then everything works fine.
- Updating the code to again use 30s as MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing MMC_RSP_BUSY to be set, even when timeout_ms becomes greater than max_busy_timeout - this also works fine.
Clearly this indicates a problem that I think needs to be addressed in the sdhci driver. Of course, I can revert the three discussed patches to fix the problem, but that would only hide the issues, and I am sure we would then get back to them, sooner or later.
To fix the problem in the sdhci driver, I would appreciate if someone from TI and Nvidia can step in to help, as I don't have the HW on my desk.
Comments or other ideas of how to move forward?
Sorry I missed this earlier.
I don't have an X15 with me here, but I'm trying to set one up in our remote farm. In the meantime, I tried to reproduce this issue on two platforms (dra72-evm and am57xx-evm) and wasn't able to see it, because those eMMCs don't even have a cache. I will keep you updated when I do get a board with an eMMC that has a cache.
Is there a way to reproduce this CMD6 issue with another operation?
Yes, most definitely.
Let me cook a debug patch for you that should trigger the problem for another CMD6 operation. I will post something later this evening or in the morning (Swedish timezone).
A bit later than promised, I am clearly an optimist. In any case, here's the patch I had in mind to trigger the problem for other CMD6 operations. Please give it a shot and see what happens.
-------
From: Ulf Hansson <ulf.hansson@linaro.org>
Date: Tue, 3 Mar 2020 22:11:05 +0100
Subject: [PATCH] mmc: core: DEBUG: Force a long timeout for all CMD6
This is to test sdhci-omap, for example, to see what happens when using a longer timeout. My guess is that it triggers __mmc_switch() to disable the MMC_RSP_BUSY flag for the command. If so, it is likely to make the host driver fail, in some way or another.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---
 drivers/mmc/core/mmc_ops.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index da425ee2d9bf..f0d2563961f6 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -532,6 +532,9 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,

 	mmc_retune_hold(host);

+	/* Force a long timeout to likely make use_r1b_resp to become false. */
+	timeout_ms = MMC_CACHE_FLUSH_TIMEOUT_MS;
+
 	if (!timeout_ms) {
 		pr_warn("%s: unspecified timeout for CMD6 - use generic\n",
 			mmc_hostname(host));
@@ -544,8 +547,11 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
 	 * the host to avoid HW busy detection, by converting to a R1 response
 	 * instead of a R1B.
 	 */
-	if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout))
+	if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout)) {
+		pr_warn("%s:Disable MMC_RSP_BUSY. timeout_ms(%u) > max_busy_timeout(%u)\n",
+			mmc_hostname(host), timeout_ms, host->max_busy_timeout);
 		use_r1b_resp = false;
+	}

 	cmd.opcode = MMC_SWITCH;
 	cmd.arg = (MMC_SWITCH_MODE_WRITE_BYTE << 24) |