On 3/2/20 8:50 AM, Ulf Hansson wrote:
On Mon, 2 Mar 2020 at 14:11, Faiz Abbas <faiz_abbas@ti.com> wrote:
Uffe,
On 26/02/20 8:51 pm, Ulf Hansson wrote:
+ Anders, Kishon
On Tue, 25 Feb 2020 at 17:24, Jon Hunter <jonathanh@nvidia.com> wrote:
On 25/02/2020 14:26, Ulf Hansson wrote:
...
However, from the core's point of view, the response is still requested; we just don't want the driver to wait for the card to stop signaling busy. Instead we want to deal with that via "polling" from the core.
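For reference, the core-side "polling" amounts to roughly the following. This is a simplified sketch loosely based on mmc_busy_status() in drivers/mmc/core/mmc_ops.c; the helper name here is made up and the error handling is trimmed for illustration:

	/*
	 * Poll for the card to stop signaling busy, either by reading
	 * DAT0 through the host driver or by falling back to CMD13
	 * polling. Simplified sketch, not the exact mainline code.
	 */
	static int busy_poll_sketch(struct mmc_card *card, unsigned int timeout_ms)
	{
		struct mmc_host *host = card->host;
		unsigned long timeout = jiffies + msecs_to_jiffies(timeout_ms) + 1;

		do {
			bool busy;

			if (host->ops->card_busy) {
				/* The host driver can read the DAT0 line directly. */
				busy = host->ops->card_busy(host);
			} else {
				/* Fall back to CMD13 (SEND_STATUS) polling. */
				u32 status;
				int err = mmc_send_status(card, &status);

				if (err)
					return err;
				busy = R1_CURRENT_STATE(status) == R1_STATE_PRG ||
				       !(status & R1_READY_FOR_DATA);
			}

			if (!busy)
				return 0;
		} while (time_before(jiffies, timeout));

		return -ETIMEDOUT;
	}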
This is rather worrying behaviour, as it seems like the host driver doesn't really follow these expectations from the core's point of view. And mmc_flush_cache() is not the only case; we have erase, bkops, sanitize, etc. Are all of these working, or are they not really well tested?
I don't believe that they are well tested. We have a simple test to mount an eMMC partition, create a file, check the contents, remove the file and unmount. The timeouts always occur during unmounting.
Earlier, before my three patches, if the timeout_ms parameter provided to __mmc_switch() was zero, which was the case for mmc_flush_cache(), then __mmc_switch() simply skipped validating it against host->max_busy_timeout, which was wrong. In any case, this also meant that an R1B response was always used for mmc_flush_cache(), as you also indicated above. Perhaps this is the critical part where things can go wrong.
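To make that concrete, the relevant part of __mmc_switch() looks roughly like this (a simplified sketch, not the exact mainline code):

	bool use_r1b_resp = true;

	/*
	 * Old behaviour (simplified): a timeout_ms of 0 skipped the
	 * validation below entirely, so R1B was always used for e.g.
	 * mmc_flush_cache(), no matter what the host could handle:
	 *
	 *	if (timeout_ms && host->max_busy_timeout &&
	 *	    timeout_ms > host->max_busy_timeout)
	 *		use_r1b_resp = false;
	 */
	if (host->max_busy_timeout && timeout_ms > host->max_busy_timeout)
		use_r1b_resp = false;

	cmd.opcode = MMC_SWITCH;
	cmd.flags = MMC_CMD_AC;
	if (use_r1b_resp) {
		/* Let the controller do HW busy detection. */
		cmd.flags |= MMC_RSP_SPI_R1B | MMC_RSP_R1B;
		cmd.busy_timeout = timeout_ms;
	} else {
		/* Plain R1; busy is handled by polling from the core. */
		cmd.flags |= MMC_RSP_SPI_R1 | MMC_RSP_R1;
	}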
BTW, have you tried erase commands with the sdhci-tegra driver? If those are working fine, do you have any special treatment for them?
That I am not sure, but I will check.
Great, thanks. Looking forward to your report.
So, from my side, Anders Roxell and I have been collaborating on testing the behaviour on a TI Beagleboard x15 (remotely, with limited debug options), which uses the sdhci-omap variant. I am trying to get hold of an Nvidia Jetson TX2, but have not found one yet. These are the conclusions from the observed behaviour on the Beagleboard for the CMD6 cache flush command.
First, the reported host->max_busy_timeout is 2581 ms for the sdhci-omap driver in this configuration.
- As we all know by now, the cache flush command (CMD6) currently fails with -110. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 * 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag from the command.
- Changing MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s) means that the MMC_RSP_BUSY flag becomes set by __mmc_switch(), because the timeout_ms parameter is less than max_busy_timeout (2000 < 2581). Then everything works fine.
- Updating the code to again use 30s as MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing MMC_RSP_BUSY to be set even when timeout_ms is greater than max_busy_timeout, also works fine (see the sketch just below this list).
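The third experiment is essentially this one-line hack in __mmc_switch(), shown here as a debug sketch only, not a proposed fix:

	/*
	 * Debug hack: keep the R1B response even when timeout_ms
	 * exceeds host->max_busy_timeout, instead of falling back to
	 * R1 and core polling.
	 */
	if (host->max_busy_timeout && timeout_ms > host->max_busy_timeout)
		use_r1b_resp = true; /* was: false */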
Clearly this indicates a problem that I think needs to be addressed in the sdhci driver. Of course, I could revert the three discussed patches to fix the problem, but that would only hide the issue, and I am sure we would get back to it sooner or later.
To fix the problem in the sdhci driver, I would appreciate if someone from TI and Nvidia can step in to help, as I don't have the HW on my desk.
Comments or other ideas of how to move forward?
Sorry I missed this earlier.
I don't have an X15 with me here, but I'm trying to set one up in our remote farm. In the meantime, I tried to reproduce this issue on two platforms (dra72-evm and am57xx-evm) and wasn't able to, because those eMMCs don't even have a cache. I will keep you updated when I get a board with an eMMC that has a cache.
Is there a way to reproduce this CMD6 issue with another operation?
Yes, most definitely.
Let me cook a debug patch for you that should trigger the problem for another CMD6 operation. I will post something later this evening or in the morning (Swedish timezone).
Kind regards Uffe
Hi Ulf,
I could reproduce this during suspend on Jetson TX1/TX2, since that is when it performs the mmc cache flush.
The timeout I see is for the switch status command (CMD13) issued after CMD6: the device-side CMD6 is still in flight when the host sends CMD13, because with timeout_ms changed to 30s we are now using the R1 response type.
Earlier we used a timeout_ms of 0 for the CMD6 cache flush. With that, the R1B response type is used, so the host waits for the busy state followed by the response from the device for CMD6, and then the data lines go high.
Now, with timeout_ms changed to 30s, we use the R1 response, and software waits out the busy phase by checking for the DAT0 line to go high.
With the R1B type, the host design works like this: after sending the command, at the end of completion after the end bit, it waits two cycles for the data line to go low (the busy state from the device), then waits for the response cycles, after which the data lines go back high, and only then do we issue the switch status CMD13.
With the R1 type, after the host sends the command, at the end of completion after the end bit, the DATA lines go high immediately (since it is R1), and the switch status CMD13 gets issued. By this time, it looks like CMD6 on the device side is still in flight, sending its status and data.
The 30s timeout is the wait time for the DAT0 line to go high, but with the R1 response type mmc_busy_status() returns success right away, and software sends the switch status CMD13 while the device side is apparently still processing CMD6; we simply do not wait long enough when we use the R1 response type.
Actually, CMD6 always uses an R1B response as per the spec.
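To illustrate, the two cases boil down to the response flags on the CMD6 request. A simplified sketch (not the exact mainline structures; the 30s value is just the MMC_CACHE_FLUSH_TIMEOUT_MS discussed in this thread):

	struct mmc_command cmd = {
		.opcode = MMC_SWITCH,
		/*
		 * R1B: the controller itself waits for the DAT0 busy
		 * phase to end before the core moves on to CMD13.
		 */
		.flags = MMC_RSP_SPI_R1B | MMC_RSP_R1B | MMC_CMD_AC,
		.busy_timeout = 30000, /* ms */
	};

	/*
	 * Plain R1: the controller treats the command as done as soon
	 * as the response arrives, so the following CMD13 can race
	 * with the CMD6 that is still in flight on the device side.
	 */
	cmd.flags = MMC_RSP_SPI_R1 | MMC_RSP_R1 | MMC_CMD_AC;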
Thanks
Sowjanya