At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register will only cause the IRQ to latch when the RDY lanes are changing, and not in case they are already asserted.
This means that if the controller finished the command in flight before marvell_nfc_wait_op() is called, that function will wait for a change in the bit that can't ever happen as it is already set.
To address this race, check for the RDY bits after the IRQ was enabled, and complete the completion immediately if the condition is already met.
This fixes a bug that was observed with a NAND chip that holds a UBIFS parition on which file system stress tests were executed. When marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount the filesystem read-only, reporting lots of warnings along the way.
Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver Cc: stable@vger.kernel.org Signed-off-by: Daniel Mack daniel@zonque.org --- v1 → v2:
* Use complete(&nfc->complete) when the condition is met, and do wait_for_completion_timeout() in all cases. Suggested by Boris Brezillon.
drivers/mtd/nand/raw/marvell_nand.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c index 666f34b58dec..4870b5bae296 100644 --- a/drivers/mtd/nand/raw/marvell_nand.c +++ b/drivers/mtd/nand/raw/marvell_nand.c @@ -614,6 +614,7 @@ static int marvell_nfc_wait_op(struct nand_chip *chip, unsigned int timeout_ms) { struct marvell_nfc *nfc = to_marvell_nfc(chip->controller); int ret; + u32 st;
/* Timeout is expressed in ms */ if (!timeout_ms) @@ -622,6 +623,15 @@ static int marvell_nfc_wait_op(struct nand_chip *chip, unsigned int timeout_ms) init_completion(&nfc->complete);
marvell_nfc_enable_int(nfc, NDCR_RDYM); + + /* + * Check if the NDSR_RDY bits have already been set before the + * interrupt was enabled. + */ + st = readl_relaxed(nfc->regs + NDSR); + if (st & (NDSR_RDY(0) | NDSR_RDY(1))) + complete(&nfc->complete); + ret = wait_for_completion_timeout(&nfc->complete, msecs_to_jiffies(timeout_ms)); marvell_nfc_disable_int(nfc, NDCR_RDYM);
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Thu, 27 Sep 2018 09:17:51 +0200:
At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register will only cause the IRQ to latch when the RDY lanes are changing, and not in case they are already asserted.
This means that if the controller finished the command in flight before marvell_nfc_wait_op() is called, that function will wait for a change in the bit that can't ever happen as it is already set.
To address this race, check for the RDY bits after the IRQ was enabled, and complete the completion immediately if the condition is already met.
This fixes a bug that was observed with a NAND chip that holds a UBIFS parition on which file system stress tests were executed. When marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount the filesystem read-only, reporting lots of warnings along the way.
Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver Cc: stable@vger.kernel.org Signed-off-by: Daniel Mack daniel@zonque.org
Sorry I haven't had the time to check on my Armada, but you figured it out, and the fix looks good to me!
Acked-by: Miquel Raynal miquel.raynal@bootlin.com
Boris, do you plan to send another fixes PR of can I take it into the nand/next branch?
Thanks, Miquèl
On Thu, 27 Sep 2018 10:11:45 +0200 Miquel Raynal miquel.raynal@bootlin.com wrote:
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Thu, 27 Sep 2018 09:17:51 +0200:
At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register will only cause the IRQ to latch when the RDY lanes are changing, and not in case they are already asserted.
This means that if the controller finished the command in flight before marvell_nfc_wait_op() is called, that function will wait for a change in the bit that can't ever happen as it is already set.
To address this race, check for the RDY bits after the IRQ was enabled, and complete the completion immediately if the condition is already met.
This fixes a bug that was observed with a NAND chip that holds a UBIFS parition on which file system stress tests were executed. When marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount the filesystem read-only, reporting lots of warnings along the way.
Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver Cc: stable@vger.kernel.org Signed-off-by: Daniel Mack daniel@zonque.org
Sorry I haven't had the time to check on my Armada, but you figured it out, and the fix looks good to me!
Acked-by: Miquel Raynal miquel.raynal@bootlin.com
Boris, do you plan to send another fixes PR of can I take it into the nand/next branch?
Queued to mtd/master.
Thanks,
Boris
Hi All,
On 27/09/18 20:56, Boris Brezillon wrote:
On Thu, 27 Sep 2018 10:11:45 +0200 Miquel Raynal miquel.raynal@bootlin.com wrote:
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Thu, 27 Sep 2018 09:17:51 +0200:
At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register will only cause the IRQ to latch when the RDY lanes are changing, and not in case they are already asserted.
This means that if the controller finished the command in flight before marvell_nfc_wait_op() is called, that function will wait for a change in the bit that can't ever happen as it is already set.
To address this race, check for the RDY bits after the IRQ was enabled, and complete the completion immediately if the condition is already met.
This fixes a bug that was observed with a NAND chip that holds a UBIFS parition on which file system stress tests were executed. When marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount the filesystem read-only, reporting lots of warnings along the way.
Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver Cc: stable@vger.kernel.org Signed-off-by: Daniel Mack daniel@zonque.org
Sorry I haven't had the time to check on my Armada, but you figured it out, and the fix looks good to me!
Acked-by: Miquel Raynal miquel.raynal@bootlin.com
Boris, do you plan to send another fixes PR of can I take it into the nand/next branch?
Queued to mtd/master.
After fixing my R/B configuration I get a new error with this patch when running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.
My board is a custom design using an Armada-385 SoC with Macronix NAND.
# stress_1 marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ/WRDREQ while draining raw data (NDSR: 0x00000000) ubi0 error: ubi_io_write: error -5 while writing 4096 bytes to PEB 1858:110592, written 0 bytes CPU: 1 PID: 1170 Comm: stress_1 Not tainted 4.19.0-rc5-at1+ #8 Hardware name: Marvell Armada 380/385 (Device Tree) [<8011143c>] (unwind_backtrace) from [<8010c17c>] (show_stack+0x10/0x14) [<8010c17c>] (show_stack) from [<805ec28c>] (dump_stack+0x88/0x9c) [<805ec28c>] (dump_stack) from [<80418a28>] (ubi_io_write+0x55c/0x6c0) [<80418a28>] (ubi_io_write) from [<80415b4c>] (ubi_eba_write_leb+0x80/0x780) [<80415b4c>] (ubi_eba_write_leb) from [<80414580>] (ubi_leb_write+0xbc/0xe0) [<80414580>] (ubi_leb_write) from [<802d46b4>] (ubifs_leb_write+0xa0/0x118) [<802d46b4>] (ubifs_leb_write) from [<802d5620>] (ubifs_wbuf_write_nolock+0x184/0x6ac) [<802d5620>] (ubifs_wbuf_write_nolock) from [<802c8a18>] (ubifs_jnl_write_data+0x1c0/0x2bc) [<802c8a18>] (ubifs_jnl_write_data) from [<802caed8>] (do_writepage+0xa4/0x1b0) [<802caed8>] (do_writepage) from [<801aa160>] (__writepage+0x14/0x48) [<801aa160>] (__writepage) from [<801aa900>] (write_cache_pages+0x1d0/0x3e4) [<801aa900>] (write_cache_pages) from [<801aab68>] (generic_writepages+0x54/0x80) [<801aab68>] (generic_writepages) from [<801ac9a0>] (do_writepages+0x68/0x8c) [<801ac9a0>] (do_writepages) from [<801a0ac8>] (__filemap_fdatawrite_range+0x88/0xc0) [<801a0ac8>] (__filemap_fdatawrite_range) from [<801a0cc4>] (file_write_and_wait_range+0x3c/0x98) [<801a0cc4>] (file_write_and_wait_range) from [<802cb600>] (ubifs_fsync+0x3c/0xb0) [<802cb600>] (ubifs_fsync) from [<801a2828>] (generic_file_write_iter+0x198/0x24c) [<801a2828>] (generic_file_write_iter) from [<802ccb84>] (ubifs_write_iter+0xf0/0x158) [<802ccb84>] (ubifs_write_iter) from [<801ef854>] (__vfs_write+0xfc/0x160) [<801ef854>] (__vfs_write) from [<801efa60>] (vfs_write+0xa4/0x1ac) [<801efa60>] (vfs_write) from [<801efcac>] (ksys_write+0x54/0xb8) [<801efcac>] (ksys_write) from [<80101000>] (ret_fast_syscall+0x0/0x54) Exception stack(0xbd789fa8 to 0xbd789ff0) 9fa0: 0ca5d000 00000000 00000003 7e9f2900 00008000 ffffffff 9fc0: 0ca5d000 00000000 00008000 00000004 00000003 00000000 76f24fb4 00000000 9fe0: 00000000 7e9f27fc 00010fd8 76e775ec marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) ttyS ttyS1: tty_port_close_start: tty->count = 1 port count = 2 marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810)
... (RDDREQ messages repeat).
On Thu, 27 Sep 2018 21:55:57 +0000 Chris Packham Chris.Packham@alliedtelesis.co.nz wrote:
Hi All,
On 27/09/18 20:56, Boris Brezillon wrote:
On Thu, 27 Sep 2018 10:11:45 +0200 Miquel Raynal miquel.raynal@bootlin.com wrote:
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Thu, 27 Sep 2018 09:17:51 +0200:
At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register will only cause the IRQ to latch when the RDY lanes are changing, and not in case they are already asserted.
This means that if the controller finished the command in flight before marvell_nfc_wait_op() is called, that function will wait for a change in the bit that can't ever happen as it is already set.
To address this race, check for the RDY bits after the IRQ was enabled, and complete the completion immediately if the condition is already met.
This fixes a bug that was observed with a NAND chip that holds a UBIFS parition on which file system stress tests were executed. When marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount the filesystem read-only, reporting lots of warnings along the way.
Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver Cc: stable@vger.kernel.org Signed-off-by: Daniel Mack daniel@zonque.org
Sorry I haven't had the time to check on my Armada, but you figured it out, and the fix looks good to me!
Acked-by: Miquel Raynal miquel.raynal@bootlin.com
Boris, do you plan to send another fixes PR of can I take it into the nand/next branch?
Queued to mtd/master.
After fixing my R/B configuration I get a new error with this patch when running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.
My board is a custom design using an Armada-385 SoC with Macronix NAND.
# stress_1 marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ/WRDREQ while draining raw data (NDSR: 0x00000000) ubi0 error: ubi_io_write: error -5 while writing 4096 bytes to PEB 1858:110592, written 0 bytes CPU: 1 PID: 1170 Comm: stress_1 Not tainted 4.19.0-rc5-at1+ #8 Hardware name: Marvell Armada 380/385 (Device Tree) [<8011143c>] (unwind_backtrace) from [<8010c17c>] (show_stack+0x10/0x14) [<8010c17c>] (show_stack) from [<805ec28c>] (dump_stack+0x88/0x9c) [<805ec28c>] (dump_stack) from [<80418a28>] (ubi_io_write+0x55c/0x6c0) [<80418a28>] (ubi_io_write) from [<80415b4c>] (ubi_eba_write_leb+0x80/0x780) [<80415b4c>] (ubi_eba_write_leb) from [<80414580>] (ubi_leb_write+0xbc/0xe0) [<80414580>] (ubi_leb_write) from [<802d46b4>] (ubifs_leb_write+0xa0/0x118) [<802d46b4>] (ubifs_leb_write) from [<802d5620>] (ubifs_wbuf_write_nolock+0x184/0x6ac) [<802d5620>] (ubifs_wbuf_write_nolock) from [<802c8a18>] (ubifs_jnl_write_data+0x1c0/0x2bc) [<802c8a18>] (ubifs_jnl_write_data) from [<802caed8>] (do_writepage+0xa4/0x1b0) [<802caed8>] (do_writepage) from [<801aa160>] (__writepage+0x14/0x48) [<801aa160>] (__writepage) from [<801aa900>] (write_cache_pages+0x1d0/0x3e4) [<801aa900>] (write_cache_pages) from [<801aab68>] (generic_writepages+0x54/0x80) [<801aab68>] (generic_writepages) from [<801ac9a0>] (do_writepages+0x68/0x8c) [<801ac9a0>] (do_writepages) from [<801a0ac8>] (__filemap_fdatawrite_range+0x88/0xc0) [<801a0ac8>] (__filemap_fdatawrite_range) from [<801a0cc4>] (file_write_and_wait_range+0x3c/0x98) [<801a0cc4>] (file_write_and_wait_range) from [<802cb600>] (ubifs_fsync+0x3c/0xb0) [<802cb600>] (ubifs_fsync) from [<801a2828>] (generic_file_write_iter+0x198/0x24c) [<801a2828>] (generic_file_write_iter) from [<802ccb84>] (ubifs_write_iter+0xf0/0x158) [<802ccb84>] (ubifs_write_iter) from [<801ef854>] (__vfs_write+0xfc/0x160) [<801ef854>] (__vfs_write) from [<801efa60>] (vfs_write+0xa4/0x1ac) [<801efa60>] (vfs_write) from [<801efcac>] (ksys_write+0x54/0xb8) [<801efcac>] (ksys_write) from [<80101000>] (ret_fast_syscall+0x0/0x54) Exception stack(0xbd789fa8 to 0xbd789ff0) 9fa0: 0ca5d000 00000000 00000003 7e9f2900 00008000 ffffffff 9fc0: 0ca5d000 00000000 00008000 00000004 00000003 00000000 76f24fb4 00000000 9fe0: 00000000 7e9f27fc 00010fd8 76e775ec marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) ttyS ttyS1: tty_port_close_start: tty->count = 1 port count = 2 marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810)
... (RDDREQ messages repeat).
Hm, that's weird, unless RDDREQ is a 'clear-on-read' bit, that shouldn't happen.
On Fri, 28 Sep 2018 08:40:46 +0200 Boris Brezillon boris.brezillon@bootlin.com wrote:
marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) ttyS ttyS1: tty_port_close_start: tty->count = 1 port count = 2 marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810)
... (RDDREQ messages repeat).
Hm, that's weird, unless RDDREQ is a 'clear-on-read' bit, that shouldn't happen.
BTW, I dropped the patch.
Hi Boris,
Boris Brezillon boris.brezillon@bootlin.com wrote on Fri, 28 Sep 2018 08:40:46 +0200:
On Thu, 27 Sep 2018 21:55:57 +0000 Chris Packham Chris.Packham@alliedtelesis.co.nz wrote:
Hi All,
On 27/09/18 20:56, Boris Brezillon wrote:
On Thu, 27 Sep 2018 10:11:45 +0200 Miquel Raynal miquel.raynal@bootlin.com wrote:
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Thu, 27 Sep 2018 09:17:51 +0200:
At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register will only cause the IRQ to latch when the RDY lanes are changing, and not in case they are already asserted.
This means that if the controller finished the command in flight before marvell_nfc_wait_op() is called, that function will wait for a change in the bit that can't ever happen as it is already set.
To address this race, check for the RDY bits after the IRQ was enabled, and complete the completion immediately if the condition is already met.
This fixes a bug that was observed with a NAND chip that holds a UBIFS parition on which file system stress tests were executed. When marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount the filesystem read-only, reporting lots of warnings along the way.
Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver Cc: stable@vger.kernel.org Signed-off-by: Daniel Mack daniel@zonque.org
Sorry I haven't had the time to check on my Armada, but you figured it out, and the fix looks good to me!
Acked-by: Miquel Raynal miquel.raynal@bootlin.com
Boris, do you plan to send another fixes PR of can I take it into the nand/next branch?
Queued to mtd/master.
After fixing my R/B configuration I get a new error with this patch when running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.
My board is a custom design using an Armada-385 SoC with Macronix NAND.
# stress_1 marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ/WRDREQ while draining raw data (NDSR: 0x00000000) ubi0 error: ubi_io_write: error -5 while writing 4096 bytes to PEB 1858:110592, written 0 bytes CPU: 1 PID: 1170 Comm: stress_1 Not tainted 4.19.0-rc5-at1+ #8 Hardware name: Marvell Armada 380/385 (Device Tree) [<8011143c>] (unwind_backtrace) from [<8010c17c>] (show_stack+0x10/0x14) [<8010c17c>] (show_stack) from [<805ec28c>] (dump_stack+0x88/0x9c) [<805ec28c>] (dump_stack) from [<80418a28>] (ubi_io_write+0x55c/0x6c0) [<80418a28>] (ubi_io_write) from [<80415b4c>] (ubi_eba_write_leb+0x80/0x780) [<80415b4c>] (ubi_eba_write_leb) from [<80414580>] (ubi_leb_write+0xbc/0xe0) [<80414580>] (ubi_leb_write) from [<802d46b4>] (ubifs_leb_write+0xa0/0x118) [<802d46b4>] (ubifs_leb_write) from [<802d5620>] (ubifs_wbuf_write_nolock+0x184/0x6ac) [<802d5620>] (ubifs_wbuf_write_nolock) from [<802c8a18>] (ubifs_jnl_write_data+0x1c0/0x2bc) [<802c8a18>] (ubifs_jnl_write_data) from [<802caed8>] (do_writepage+0xa4/0x1b0) [<802caed8>] (do_writepage) from [<801aa160>] (__writepage+0x14/0x48) [<801aa160>] (__writepage) from [<801aa900>] (write_cache_pages+0x1d0/0x3e4) [<801aa900>] (write_cache_pages) from [<801aab68>] (generic_writepages+0x54/0x80) [<801aab68>] (generic_writepages) from [<801ac9a0>] (do_writepages+0x68/0x8c) [<801ac9a0>] (do_writepages) from [<801a0ac8>] (__filemap_fdatawrite_range+0x88/0xc0) [<801a0ac8>] (__filemap_fdatawrite_range) from [<801a0cc4>] (file_write_and_wait_range+0x3c/0x98) [<801a0cc4>] (file_write_and_wait_range) from [<802cb600>] (ubifs_fsync+0x3c/0xb0) [<802cb600>] (ubifs_fsync) from [<801a2828>] (generic_file_write_iter+0x198/0x24c) [<801a2828>] (generic_file_write_iter) from [<802ccb84>] (ubifs_write_iter+0xf0/0x158) [<802ccb84>] (ubifs_write_iter) from [<801ef854>] (__vfs_write+0xfc/0x160) [<801ef854>] (__vfs_write) from [<801efa60>] (vfs_write+0xa4/0x1ac) [<801efa60>] (vfs_write) from [<801efcac>] (ksys_write+0x54/0xb8) [<801efcac>] (ksys_write) from [<80101000>] (ret_fast_syscall+0x0/0x54) Exception stack(0xbd789fa8 to 0xbd789ff0) 9fa0: 0ca5d000 00000000 00000003 7e9f2900 00008000 ffffffff 9fc0: 0ca5d000 00000000 00008000 00000004 00000003 00000000 76f24fb4 00000000 9fe0: 00000000 7e9f27fc 00010fd8 76e775ec marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) ttyS ttyS1: tty_port_close_start: tty->count = 1 port count = 2 marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810) marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining FIFO (data) (NDSR: 0x00000810)
... (RDDREQ messages repeat).
Hm, that's weird, unless RDDREQ is a 'clear-on-read' bit, that shouldn't happen.
The spec clearly states it's not. In the status register (NDSR), you must write a bit to clear it.
The RDDREQ and RDY0/1 bit checks do not share the same path. Would it be possible that in some situations all the bits are not actually cleared, and so reading the NDSR register tells us the event happened while it has not in reality?
IIRC this patch fixes pxa implementation (NFCv1) while the errors happen on an Armada 385 (NFCv2). It would be great to understand (and document) this behavior, otherwise it is still possible to do the manual check only for NFCv1 if this is the only IP suffering about the race.
Thanks, Miquèl
Hi Chris,
On 27/9/2018 11:55 PM, Chris Packham wrote:
On 27/09/18 20:56, Boris Brezillon wrote:
On Thu, 27 Sep 2018 10:11:45 +0200 Miquel Raynal miquel.raynal@bootlin.com wrote:
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Thu, 27 Sep 2018 09:17:51 +0200:
At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register will only cause the IRQ to latch when the RDY lanes are changing, and not in case they are already asserted.
This means that if the controller finished the command in flight before marvell_nfc_wait_op() is called, that function will wait for a change in the bit that can't ever happen as it is already set.
To address this race, check for the RDY bits after the IRQ was enabled, and complete the completion immediately if the condition is already met.
This fixes a bug that was observed with a NAND chip that holds a UBIFS parition on which file system stress tests were executed. When marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount the filesystem read-only, reporting lots of warnings along the way.
Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver Cc: stable@vger.kernel.org Signed-off-by: Daniel Mack daniel@zonque.org
Sorry I haven't had the time to check on my Armada, but you figured it out, and the fix looks good to me!
Acked-by: Miquel Raynal miquel.raynal@bootlin.com
Boris, do you plan to send another fixes PR of can I take it into the nand/next branch?
Queued to mtd/master.
After fixing my R/B configuration I get a new error with this patch when running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.
That's strange. So your controller sets the RDY bits before it is ready? Could you check whether only checking for NDSR_RDY(0) changes anything? Not sure about the handling of NDSR_RDY(1) in this driver anyway ...
Also, does my .EALREADY approach (v1) make any difference?
Thanks, Daniel
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Fri, 28 Sep 2018 09:43:18 +0200:
Hi Chris,
On 27/9/2018 11:55 PM, Chris Packham wrote:
On 27/09/18 20:56, Boris Brezillon wrote:
On Thu, 27 Sep 2018 10:11:45 +0200 Miquel Raynal miquel.raynal@bootlin.com wrote:
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Thu, 27 Sep 2018 09:17:51 +0200:
At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register will only cause the IRQ to latch when the RDY lanes are changing, and not in case they are already asserted.
This means that if the controller finished the command in flight before marvell_nfc_wait_op() is called, that function will wait for a change in the bit that can't ever happen as it is already set.
To address this race, check for the RDY bits after the IRQ was enabled, and complete the completion immediately if the condition is already met.
This fixes a bug that was observed with a NAND chip that holds a UBIFS parition on which file system stress tests were executed. When marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount the filesystem read-only, reporting lots of warnings along the way.
Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver Cc: stable@vger.kernel.org Signed-off-by: Daniel Mack daniel@zonque.org
Sorry I haven't had the time to check on my Armada, but you figured it out, and the fix looks good to me!
Acked-by: Miquel Raynal miquel.raynal@bootlin.com
Boris, do you plan to send another fixes PR of can I take it into the nand/next branch?
Queued to mtd/master. After fixing my R/B configuration I get a new error with this patch when
running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.
That's strange. So your controller sets the RDY bits before it is ready? Could you check whether only checking for NDSR_RDY(0) changes anything? Not sure about the handling of NDSR_RDY(1) in this driver anyway ...
I suppose you mean this portion of code is not clear enough?
u32 st = readl_relaxed(nfc->regs + NDSR); u32 ien = (~readl_relaxed(nfc->regs + NDCR)) & NDCR_ALL_INT;
/* * RDY interrupt mask is one bit in NDCR while there are two status * bit in NDSR (RDY[cs0/cs2] and RDY[cs1/cs3]). */ if (st & NDSR_RDY(1)) st |= NDSR_RDY(0);
if (!(st & ien)) return IRQ_NONE;
-> st is the status in the NDSR register which has two RDY bits, one for each RDY line. -> ien is a view of the NDCR register which commands the interrupts and has one bit to enable both interrupt lines (let's call it NDCR_RDYM).
The trick is that NDSR_RDY(0) is the same bit as NDCR_RDYM. So whenever NDSR_RDY(1) is set, we fake NDSR_RDY(0) to be set (by setting manually the bit in 'st') so that the (st & ien) comparison can be true if NDSR_RDY(1) is valid and RDY interrupts are enabled.
With this in mind, I don't see why this
+ st = readl_relaxed(nfc->regs + NDSR); + if (st & (NDSR_RDY(0) | NDSR_RDY(1))) + complete(&nfc->complete);
would break the driver.
Thanks, Miquèl
On 28/9/2018 10:24 AM, Miquel Raynal wrote:
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Fri, 28 Sep 2018 09:43:18 +0200:
Hi Chris,
On 27/9/2018 11:55 PM, Chris Packham wrote:
On 27/09/18 20:56, Boris Brezillon wrote:
On Thu, 27 Sep 2018 10:11:45 +0200 Miquel Raynal miquel.raynal@bootlin.com wrote:
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Thu, 27 Sep 2018 09:17:51 +0200:
At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register will only cause the IRQ to latch when the RDY lanes are changing, and not in case they are already asserted.
This means that if the controller finished the command in flight before marvell_nfc_wait_op() is called, that function will wait for a change in the bit that can't ever happen as it is already set.
To address this race, check for the RDY bits after the IRQ was enabled, and complete the completion immediately if the condition is already met.
This fixes a bug that was observed with a NAND chip that holds a UBIFS parition on which file system stress tests were executed. When marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount the filesystem read-only, reporting lots of warnings along the way.
Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver Cc: stable@vger.kernel.org Signed-off-by: Daniel Mack daniel@zonque.org
Sorry I haven't had the time to check on my Armada, but you figured it out, and the fix looks good to me!
Acked-by: Miquel Raynal miquel.raynal@bootlin.com
Boris, do you plan to send another fixes PR of can I take it into the nand/next branch?
Queued to mtd/master. After fixing my R/B configuration I get a new error with this patch when
running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.
That's strange. So your controller sets the RDY bits before it is ready? Could you check whether only checking for NDSR_RDY(0) changes anything? Not sure about the handling of NDSR_RDY(1) in this driver anyway ...
I suppose you mean this portion of code is not clear enough?
u32 st = readl_relaxed(nfc->regs + NDSR); u32 ien = (~readl_relaxed(nfc->regs + NDCR)) & NDCR_ALL_INT; /* * RDY interrupt mask is one bit in NDCR while there are two status * bit in NDSR (RDY[cs0/cs2] and RDY[cs1/cs3]). */ if (st & NDSR_RDY(1)) st |= NDSR_RDY(0); if (!(st & ien)) return IRQ_NONE;
-> st is the status in the NDSR register which has two RDY bits, one for each RDY line. -> ien is a view of the NDCR register which commands the interrupts and has one bit to enable both interrupt lines (let's call it NDCR_RDYM).
The trick is that NDSR_RDY(0) is the same bit as NDCR_RDYM. So whenever NDSR_RDY(1) is set, we fake NDSR_RDY(0) to be set (by setting manually the bit in 'st') so that the (st & ien) comparison can be true if NDSR_RDY(1) is valid and RDY interrupts are enabled.
With this in mind, I don't see why this
- st = readl_relaxed(nfc->regs + NDSR);
- if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
complete(&nfc->complete);
Yeah, me neither. Chris, are you absolutely sure this is the reason? I'm asking because it took me several tries sometimes to trigger the bug, so is there a chance that you see an error at all times, regardless of whether my patch is applied?
Thanks, Daniel
On 28/09/18 20:29, Daniel Mack wrote:
On 28/9/2018 10:24 AM, Miquel Raynal wrote:
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Fri, 28 Sep 2018 09:43:18 +0200:
Hi Chris,
On 27/9/2018 11:55 PM, Chris Packham wrote:
On 27/09/18 20:56, Boris Brezillon wrote:
On Thu, 27 Sep 2018 10:11:45 +0200 Miquel Raynal miquel.raynal@bootlin.com wrote:
Hi Daniel,
Daniel Mack daniel@zonque.org wrote on Thu, 27 Sep 2018 09:17:51 +0200: > At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register > will only cause the IRQ to latch when the RDY lanes are changing, and not > in case they are already asserted. > > This means that if the controller finished the command in flight before > marvell_nfc_wait_op() is called, that function will wait for a change in > the bit that can't ever happen as it is already set. > > To address this race, check for the RDY bits after the IRQ was enabled, > and complete the completion immediately if the condition is already met. > > This fixes a bug that was observed with a NAND chip that holds a UBIFS > parition on which file system stress tests were executed. When > marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount > the filesystem read-only, reporting lots of warnings along the way. > > Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver > Cc: stable@vger.kernel.org > Signed-off-by: Daniel Mack daniel@zonque.org > ---
Sorry I haven't had the time to check on my Armada, but you figured it out, and the fix looks good to me!
Acked-by: Miquel Raynal miquel.raynal@bootlin.com
Boris, do you plan to send another fixes PR of can I take it into the nand/next branch?
Queued to mtd/master. After fixing my R/B configuration I get a new error with this patch when
running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.
That's strange. So your controller sets the RDY bits before it is ready? Could you check whether only checking for NDSR_RDY(0) changes anything? Not sure about the handling of NDSR_RDY(1) in this driver anyway ...
I suppose you mean this portion of code is not clear enough?
u32 st = readl_relaxed(nfc->regs + NDSR); u32 ien = (~readl_relaxed(nfc->regs + NDCR)) & NDCR_ALL_INT; /* * RDY interrupt mask is one bit in NDCR while there are two status * bit in NDSR (RDY[cs0/cs2] and RDY[cs1/cs3]). */ if (st & NDSR_RDY(1)) st |= NDSR_RDY(0); if (!(st & ien)) return IRQ_NONE;
-> st is the status in the NDSR register which has two RDY bits, one for each RDY line. -> ien is a view of the NDCR register which commands the interrupts and has one bit to enable both interrupt lines (let's call it NDCR_RDYM).
The trick is that NDSR_RDY(0) is the same bit as NDCR_RDYM. So whenever NDSR_RDY(1) is set, we fake NDSR_RDY(0) to be set (by setting manually the bit in 'st') so that the (st & ien) comparison can be true if NDSR_RDY(1) is valid and RDY interrupts are enabled.
With this in mind, I don't see why this
- st = readl_relaxed(nfc->regs + NDSR);
- if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
complete(&nfc->complete);
Yeah, me neither. Chris, are you absolutely sure this is the reason? I'm asking because it took me several tries sometimes to trigger the bug, so is there a chance that you see an error at all times, regardless of whether my patch is applied?
It seems pretty consistent. Without this patch there seems to be no problem. With this patch it triggers pretty much straight away. I can't discount that there might be something wrong with my dts (the R/B configuration was missing initially).
I've also been able to run this on the DB-88F6820-AMC board with the same result (the dts for this is in the for-next branch of git://git.infradead.org/linux-mvebu.git).
The really odd thing is the following seems to avoid the problem
+ st = readl_relaxed(nfc->regs + NDSR); + udelay(1000); + if (st & (NDSR_RDY(0) | NDSR_RDY(1))) + complete(&nfc->complete);
Which is weird because the st value has already been read so the udelay should have no effect.
On 28/09/18 19:43, Daniel Mack wrote:
Also, does my .EALREADY approach (v1) make any difference?
The v1 of this patch doesn't show the problem.
On 30/9/2018 11:10 PM, Chris Packham wrote:
With this in mind, I don't see why this
- st = readl_relaxed(nfc->regs + NDSR);
- if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
complete(&nfc->complete);
Yeah, me neither. Chris, are you absolutely sure this is the reason? I'm asking because it took me several tries sometimes to trigger the bug, so is there a chance that you see an error at all times, regardless of whether my patch is applied?
It seems pretty consistent. Without this patch there seems to be no problem. With this patch it triggers pretty much straight away. I can't discount that there might be something wrong with my dts (the R/B configuration was missing initially).
I've also been able to run this on the DB-88F6820-AMC board with the same result (the dts for this is in the for-next branch of git://git.infradead.org/linux-mvebu.git).
The really odd thing is the following seems to avoid the problem
st = readl_relaxed(nfc->regs + NDSR);
udelay(1000);
if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
complete(&nfc->complete);
Which is weird because the st value has already been read so the udelay should have no effect.
Erm, yes. That's totally weird. Which gcc are you using for this?
Could you please try and use readl() here instead of readl_relaxed()? That will place a memory barrier after the read to enforce ordering.
But if this is a problem, many other parts of that driver should be equally affected.
On 28/09/18 19:43, Daniel Mack wrote:
Also, does my .EALREADY approach (v1) make any difference?
The v1 of this patch doesn't show the problem.
That's also very strange because the condition it triggers on is exactly the same.
Thanks, Daniel
On 01/10/18 18:31, Daniel Mack wrote:
On 30/9/2018 11:10 PM, Chris Packham wrote:
With this in mind, I don't see why this
- st = readl_relaxed(nfc->regs + NDSR);
- if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
complete(&nfc->complete);
Yeah, me neither. Chris, are you absolutely sure this is the reason? I'm asking because it took me several tries sometimes to trigger the bug, so is there a chance that you see an error at all times, regardless of whether my patch is applied?
It seems pretty consistent. Without this patch there seems to be no problem. With this patch it triggers pretty much straight away. I can't discount that there might be something wrong with my dts (the R/B configuration was missing initially).
I've also been able to run this on the DB-88F6820-AMC board with the same result (the dts for this is in the for-next branch of git://git.infradead.org/linux-mvebu.git).
The really odd thing is the following seems to avoid the problem
st = readl_relaxed(nfc->regs + NDSR);
udelay(1000);
if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
complete(&nfc->complete);
Which is weird because the st value has already been read so the udelay should have no effect.
Erm, yes. That's totally weird. Which gcc are you using for this?
arm-softfloat-linux-gnueabi-gcc (crosstool-NG crosstool-ng-1.22.0) 4.9.3
Could you please try and use readl() here instead of readl_relaxed()? That will place a memory barrier after the read to enforce ordering.
I'd previously tried readl() based on the same hunch. No change.
I think my snippet above might be misleading. While a delay between readl_relaxed() and the if should not change the outcome, this is also a delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() which is probably more significant. Sure enough if I move the delay to just before the marvell_nfc_disable_int() the error is not seen.
But if this is a problem, many other parts of that driver should be equally affected.
On 28/09/18 19:43, Daniel Mack wrote:
Also, does my .EALREADY approach (v1) make any difference?
The v1 of this patch doesn't show the problem.
That's also very strange because the condition it triggers on is exactly the same.
One difference is that by calling complete() interrupts will be disabled in the spinlock.
Thanks, Daniel
On Mon, 1 Oct 2018 19:59:11 +0000 Chris Packham Chris.Packham@alliedtelesis.co.nz wrote:
On 01/10/18 18:31, Daniel Mack wrote:
On 30/9/2018 11:10 PM, Chris Packham wrote:
With this in mind, I don't see why this
- st = readl_relaxed(nfc->regs + NDSR);
- if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
complete(&nfc->complete);
Yeah, me neither. Chris, are you absolutely sure this is the reason? I'm asking because it took me several tries sometimes to trigger the bug, so is there a chance that you see an error at all times, regardless of whether my patch is applied?
It seems pretty consistent. Without this patch there seems to be no problem. With this patch it triggers pretty much straight away. I can't discount that there might be something wrong with my dts (the R/B configuration was missing initially).
I've also been able to run this on the DB-88F6820-AMC board with the same result (the dts for this is in the for-next branch of git://git.infradead.org/linux-mvebu.git).
The really odd thing is the following seems to avoid the problem
st = readl_relaxed(nfc->regs + NDSR);
udelay(1000);
if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
complete(&nfc->complete);
Which is weird because the st value has already been read so the udelay should have no effect.
Erm, yes. That's totally weird. Which gcc are you using for this?
arm-softfloat-linux-gnueabi-gcc (crosstool-NG crosstool-ng-1.22.0) 4.9.3
Could you please try and use readl() here instead of readl_relaxed()? That will place a memory barrier after the read to enforce ordering.
I'd previously tried readl() based on the same hunch. No change.
I think my snippet above might be misleading. While a delay between readl_relaxed() and the if should not change the outcome, this is also a delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() which is probably more significant. Sure enough if I move the delay to just before the marvell_nfc_disable_int() the error is not seen.
AFAICT, your timeout always happens when waiting for RDREQ, not RDYM. So maybe disabling MRDY too early has a side-effect on the RDREQ event.
On Mon, 1 Oct 2018 22:34:38 +0200 Boris Brezillon boris.brezillon@bootlin.com wrote:
I'd previously tried readl() based on the same hunch. No change.
I think my snippet above might be misleading. While a delay between readl_relaxed() and the if should not change the outcome, this is also a delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() which is probably more significant. Sure enough if I move the delay to just before the marvell_nfc_disable_int() the error is not seen.
AFAICT, your timeout always happens when waiting for RDREQ, not RDYM. So maybe disabling MRDY too early has a side-effect on the RDREQ event.
Can you try with this patch [1]? It should ensure that NDSR_RDY bits are cleared before starting an operation.
On 02/10/18 10:41, Boris Brezillon wrote:
On Mon, 1 Oct 2018 22:34:38 +0200 Boris Brezillon boris.brezillon@bootlin.com wrote:
I'd previously tried readl() based on the same hunch. No change.
I think my snippet above might be misleading. While a delay between readl_relaxed() and the if should not change the outcome, this is also a delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() which is probably more significant. Sure enough if I move the delay to just before the marvell_nfc_disable_int() the error is not seen.
AFAICT, your timeout always happens when waiting for RDREQ, not RDYM. So maybe disabling MRDY too early has a side-effect on the RDREQ event.
Can you try with this patch [1]? It should ensure that NDSR_RDY bits are cleared before starting an operation.
No luck. I applied that on top of Daniel's and got the same result.
One thing that does look promising is the following modification of Daniel's patch[1]. Which moves the RDY check to before where the interrupts are enabled.
On Mon, 1 Oct 2018 22:01:27 +0000 Chris Packham Chris.Packham@alliedtelesis.co.nz wrote:
On 02/10/18 10:41, Boris Brezillon wrote:
On Mon, 1 Oct 2018 22:34:38 +0200 Boris Brezillon boris.brezillon@bootlin.com wrote:
I'd previously tried readl() based on the same hunch. No change.
I think my snippet above might be misleading. While a delay between readl_relaxed() and the if should not change the outcome, this is also a delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() which is probably more significant. Sure enough if I move the delay to just before the marvell_nfc_disable_int() the error is not seen.
AFAICT, your timeout always happens when waiting for RDREQ, not RDYM. So maybe disabling MRDY too early has a side-effect on the RDREQ event.
Can you try with this patch [1]? It should ensure that NDSR_RDY bits are cleared before starting an operation.
No luck. I applied that on top of Daniel's and got the same result.
One thing that does look promising is the following modification of Daniel's patch[1]. Which moves the RDY check to before where the interrupts are enabled.
Except we still don't know why this is happening, and I'm not sure I want to take a fix without understanding why it does fix the problem.
Also, it looks like complete() is not called until the RDDREQ, WRDREQ and WRCMDREQ are cleared in the interrupt handler [1], which is weird. Miquel, do you happen to remember why you had to do that?
[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...
On 02/10/18 11:13, Boris Brezillon wrote:
On Mon, 1 Oct 2018 22:01:27 +0000 Chris Packham Chris.Packham@alliedtelesis.co.nz wrote:
On 02/10/18 10:41, Boris Brezillon wrote:
On Mon, 1 Oct 2018 22:34:38 +0200 Boris Brezillon boris.brezillon@bootlin.com wrote:
I'd previously tried readl() based on the same hunch. No change.
I think my snippet above might be misleading. While a delay between readl_relaxed() and the if should not change the outcome, this is also a delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() which is probably more significant. Sure enough if I move the delay to just before the marvell_nfc_disable_int() the error is not seen.
AFAICT, your timeout always happens when waiting for RDREQ, not RDYM. So maybe disabling MRDY too early has a side-effect on the RDREQ event.
Can you try with this patch [1]? It should ensure that NDSR_RDY bits are cleared before starting an operation.
No luck. I applied that on top of Daniel's and got the same result.
One thing that does look promising is the following modification of Daniel's patch[1]. Which moves the RDY check to before where the interrupts are enabled.
Except we still don't know why this is happening, and I'm not sure I want to take a fix without understanding why it does fix the problem.
Agreed. My only guess is that there is some interrupt that is missed in the short period they are disabled when calling complete().
Also, it looks like complete() is not called until the RDDREQ, WRDREQ and WRCMDREQ are cleared in the interrupt handler [1], which is weird. Miquel, do you happen to remember why you had to do that?
[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...
On Mon, 1 Oct 2018 22:15:28 +0000 Chris Packham Chris.Packham@alliedtelesis.co.nz wrote:
On 02/10/18 11:13, Boris Brezillon wrote:
On Mon, 1 Oct 2018 22:01:27 +0000 Chris Packham Chris.Packham@alliedtelesis.co.nz wrote:
On 02/10/18 10:41, Boris Brezillon wrote:
On Mon, 1 Oct 2018 22:34:38 +0200 Boris Brezillon boris.brezillon@bootlin.com wrote:
I'd previously tried readl() based on the same hunch. No change.
I think my snippet above might be misleading. While a delay between readl_relaxed() and the if should not change the outcome, this is also a delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() which is probably more significant. Sure enough if I move the delay to just before the marvell_nfc_disable_int() the error is not seen.
AFAICT, your timeout always happens when waiting for RDREQ, not RDYM. So maybe disabling MRDY too early has a side-effect on the RDREQ event.
Can you try with this patch [1]? It should ensure that NDSR_RDY bits are cleared before starting an operation.
No luck. I applied that on top of Daniel's and got the same result.
One thing that does look promising is the following modification of Daniel's patch[1]. Which moves the RDY check to before where the interrupts are enabled.
Except we still don't know why this is happening, and I'm not sure I want to take a fix without understanding why it does fix the problem.
Agreed. My only guess is that there is some interrupt that is missed in the short period they are disabled when calling complete().
Disabling interrupts when taking a spinlock means masking the IRQ line, but the interrupt still exists and should be there when linux unmasks the IRQ line. I don't think this is the problem we're chasing.
Looks more like something
Also, it looks like complete() is not called until the RDDREQ, WRDREQ and WRCMDREQ are cleared in the interrupt handler [1], which is weird. Miquel, do you happen to remember why you had to do that?
[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...
On Tue, 2 Oct 2018 11:36:10 +0200 Boris Brezillon boris.brezillon@bootlin.com wrote:
On Mon, 1 Oct 2018 22:15:28 +0000 Chris Packham Chris.Packham@alliedtelesis.co.nz wrote:
On 02/10/18 11:13, Boris Brezillon wrote:
On Mon, 1 Oct 2018 22:01:27 +0000 Chris Packham Chris.Packham@alliedtelesis.co.nz wrote:
On 02/10/18 10:41, Boris Brezillon wrote:
On Mon, 1 Oct 2018 22:34:38 +0200 Boris Brezillon boris.brezillon@bootlin.com wrote:
> > I'd previously tried readl() based on the same hunch. No change. > > I think my snippet above might be misleading. While a delay between > readl_relaxed() and the if should not change the outcome, this is also a > delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() > which is probably more significant. Sure enough if I move the delay to > just before the marvell_nfc_disable_int() the error is not seen.
AFAICT, your timeout always happens when waiting for RDREQ, not RDYM. So maybe disabling MRDY too early has a side-effect on the RDREQ event.
Can you try with this patch [1]? It should ensure that NDSR_RDY bits are cleared before starting an operation.
No luck. I applied that on top of Daniel's and got the same result.
One thing that does look promising is the following modification of Daniel's patch[1]. Which moves the RDY check to before where the interrupts are enabled.
Except we still don't know why this is happening, and I'm not sure I want to take a fix without understanding why it does fix the problem.
Agreed. My only guess is that there is some interrupt that is missed in the short period they are disabled when calling complete().
Disabling interrupts when taking a spinlock means masking the IRQ line, but the interrupt still exists and should be there when linux unmasks the IRQ line. I don't think this is the problem we're chasing.
Looks more like something
Please ignore this email, I inadvertently hit the send button on a draft I started yesterday :-)
Hi Boris,
Boris Brezillon boris.brezillon@bootlin.com wrote on Tue, 2 Oct 2018 00:13:28 +0200:
On Mon, 1 Oct 2018 22:01:27 +0000 Chris Packham Chris.Packham@alliedtelesis.co.nz wrote:
On 02/10/18 10:41, Boris Brezillon wrote:
On Mon, 1 Oct 2018 22:34:38 +0200 Boris Brezillon boris.brezillon@bootlin.com wrote:
I'd previously tried readl() based on the same hunch. No change.
I think my snippet above might be misleading. While a delay between readl_relaxed() and the if should not change the outcome, this is also a delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() which is probably more significant. Sure enough if I move the delay to just before the marvell_nfc_disable_int() the error is not seen.
AFAICT, your timeout always happens when waiting for RDREQ, not RDYM. So maybe disabling MRDY too early has a side-effect on the RDREQ event.
Can you try with this patch [1]? It should ensure that NDSR_RDY bits are cleared before starting an operation.
No luck. I applied that on top of Daniel's and got the same result.
One thing that does look promising is the following modification of Daniel's patch[1]. Which moves the RDY check to before where the interrupts are enabled.
Except we still don't know why this is happening, and I'm not sure I want to take a fix without understanding why it does fix the problem.
Also, it looks like complete() is not called until the RDDREQ, WRDREQ and WRCMDREQ are cleared in the interrupt handler [1], which is weird. Miquel, do you happen to remember why you had to do that?
The RDDREQ, WRDREQ and WRCMDREQ events might potentially happen while the interrupts are enabled while we only wait for R/B signalling. This check is to avoid calling complete() on these situations.
[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...
Thanks, Miquèl
Hi Daniel, Chris,
Miquel Raynal miquel.raynal@bootlin.com wrote on Tue, 2 Oct 2018 08:46:01 +0200:
Hi Boris,
Boris Brezillon boris.brezillon@bootlin.com wrote on Tue, 2 Oct 2018 00:13:28 +0200:
On Mon, 1 Oct 2018 22:01:27 +0000 Chris Packham Chris.Packham@alliedtelesis.co.nz wrote:
On 02/10/18 10:41, Boris Brezillon wrote:
On Mon, 1 Oct 2018 22:34:38 +0200 Boris Brezillon boris.brezillon@bootlin.com wrote:
I'd previously tried readl() based on the same hunch. No change.
I think my snippet above might be misleading. While a delay between readl_relaxed() and the if should not change the outcome, this is also a delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() which is probably more significant. Sure enough if I move the delay to just before the marvell_nfc_disable_int() the error is not seen.
AFAICT, your timeout always happens when waiting for RDREQ, not RDYM. So maybe disabling MRDY too early has a side-effect on the RDREQ event.
Can you try with this patch [1]? It should ensure that NDSR_RDY bits are cleared before starting an operation.
No luck. I applied that on top of Daniel's and got the same result.
One thing that does look promising is the following modification of Daniel's patch[1]. Which moves the RDY check to before where the interrupts are enabled.
Except we still don't know why this is happening, and I'm not sure I want to take a fix without understanding why it does fix the problem.
Also, it looks like complete() is not called until the RDDREQ, WRDREQ and WRCMDREQ are cleared in the interrupt handler [1], which is weird. Miquel, do you happen to remember why you had to do that?
The RDDREQ, WRDREQ and WRCMDREQ events might potentially happen while the interrupts are enabled while we only wait for R/B signalling. This check is to avoid calling complete() on these situations.
Actually Boris is right on the fact that while the intention is good, the writing of [1] is not accurate. Daniel, could you please test if the following diff changes something with your setup, without your patch?
diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c index bc2ef5209783..c7573ccdbacd 100644 --- a/drivers/mtd/nand/raw/marvell_nand.c +++ b/drivers/mtd/nand/raw/marvell_nand.c @@ -686,7 +686,7 @@ static irqreturn_t marvell_nfc_isr(int irq, void *dev_id)
marvell_nfc_disable_int(nfc, st & NDCR_ALL_INT);
- if (!(st & (NDSR_RDDREQ | NDSR_WRDREQ | NDSR_WRCMDREQ))) + if (st & (NDSR_RDY(0) | NDSR_RDY(1))) complete(&nfc->complete);
return IRQ_HANDLED;
[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...
Thanks, Miquèl
Hi Miquel,
On 2/10/2018 9:25 AM, Miquel Raynal wrote:
Miquel Raynal miquel.raynal@bootlin.com wrote on Tue, 2 Oct 2018 08:46:01 +0200:
Boris Brezillon boris.brezillon@bootlin.com wrote on Tue, 2 Oct 2018 00:13:28 +0200:
Also, it looks like complete() is not called until the RDDREQ, WRDREQ and WRCMDREQ are cleared in the interrupt handler [1], which is weird. Miquel, do you happen to remember why you had to do that?
The RDDREQ, WRDREQ and WRCMDREQ events might potentially happen while the interrupts are enabled while we only wait for R/B signalling. This check is to avoid calling complete() on these situations.
Actually Boris is right on the fact that while the intention is good, the writing of [1] is not accurate. Daniel, could you please test if the following diff changes something with your setup, without your patch?
diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c index bc2ef5209783..c7573ccdbacd 100644 --- a/drivers/mtd/nand/raw/marvell_nand.c +++ b/drivers/mtd/nand/raw/marvell_nand.c @@ -686,7 +686,7 @@ static irqreturn_t marvell_nfc_isr(int irq, void *dev_id) marvell_nfc_disable_int(nfc, st & NDCR_ALL_INT);
if (!(st & (NDSR_RDDREQ | NDSR_WRDREQ | NDSR_WRCMDREQ)))
if (st & (NDSR_RDY(0) | NDSR_RDY(1))) complete(&nfc->complete);
return IRQ_HANDLED;
Yes! That seems to work nicely as a replacement for my patch.
Chris, how is that going on your board?
Thanks, Daniel
On 02/10/18 21:23, Daniel Mack wrote:
Hi Miquel,
On 2/10/2018 9:25 AM, Miquel Raynal wrote:
Miquel Raynal miquel.raynal@bootlin.com wrote on Tue, 2 Oct 2018 08:46:01 +0200:
Boris Brezillon boris.brezillon@bootlin.com wrote on Tue, 2 Oct 2018 00:13:28 +0200:
Also, it looks like complete() is not called until the RDDREQ, WRDREQ and WRCMDREQ are cleared in the interrupt handler [1], which is weird. Miquel, do you happen to remember why you had to do that?
The RDDREQ, WRDREQ and WRCMDREQ events might potentially happen while the interrupts are enabled while we only wait for R/B signalling. This check is to avoid calling complete() on these situations.
Actually Boris is right on the fact that while the intention is good, the writing of [1] is not accurate. Daniel, could you please test if the following diff changes something with your setup, without your patch?
diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c index bc2ef5209783..c7573ccdbacd 100644 --- a/drivers/mtd/nand/raw/marvell_nand.c +++ b/drivers/mtd/nand/raw/marvell_nand.c @@ -686,7 +686,7 @@ static irqreturn_t marvell_nfc_isr(int irq, void *dev_id) marvell_nfc_disable_int(nfc, st & NDCR_ALL_INT);
if (!(st & (NDSR_RDDREQ | NDSR_WRDREQ | NDSR_WRCMDREQ)))
return IRQ_HANDLED;if (st & (NDSR_RDY(0) | NDSR_RDY(1))) complete(&nfc->complete);
Yes! That seems to work nicely as a replacement for my patch.
Chris, how is that going on your board?
Looks good to me. I've just re-run stress_{1,2,3} on the custom board and DB-88F6820-AMC.
Hi Chris,
Chris Packham Chris.Packham@alliedtelesis.co.nz wrote on Tue, 2 Oct 2018 20:53:14 +0000:
On 02/10/18 21:23, Daniel Mack wrote:
Hi Miquel,
On 2/10/2018 9:25 AM, Miquel Raynal wrote:
Miquel Raynal miquel.raynal@bootlin.com wrote on Tue, 2 Oct 2018 08:46:01 +0200:
Boris Brezillon boris.brezillon@bootlin.com wrote on Tue, 2 Oct 2018 00:13:28 +0200:
Also, it looks like complete() is not called until the RDDREQ, WRDREQ and WRCMDREQ are cleared in the interrupt handler [1], which is weird. Miquel, do you happen to remember why you had to do that?
The RDDREQ, WRDREQ and WRCMDREQ events might potentially happen while the interrupts are enabled while we only wait for R/B signalling. This check is to avoid calling complete() on these situations.
Actually Boris is right on the fact that while the intention is good, the writing of [1] is not accurate. Daniel, could you please test if the following diff changes something with your setup, without your patch?
diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c index bc2ef5209783..c7573ccdbacd 100644 --- a/drivers/mtd/nand/raw/marvell_nand.c +++ b/drivers/mtd/nand/raw/marvell_nand.c @@ -686,7 +686,7 @@ static irqreturn_t marvell_nfc_isr(int irq, void *dev_id) marvell_nfc_disable_int(nfc, st & NDCR_ALL_INT);
if (!(st & (NDSR_RDDREQ | NDSR_WRDREQ | NDSR_WRCMDREQ)))
return IRQ_HANDLED;if (st & (NDSR_RDY(0) | NDSR_RDY(1))) complete(&nfc->complete);
Yes! That seems to work nicely as a replacement for my patch.
Chris, how is that going on your board?
Looks good to me. I've just re-run stress_{1,2,3} on the custom board and DB-88F6820-AMC.
Great! There is still something weird about the complete() though.
Daniel, do you plan to send the above change or do you want me to do it? I would like to integrate the fix in nand/next before sending the PR.
Thanks, Miquèl
Hi Miquel,
On 3/10/2018 9:33 AM, Miquel Raynal wrote:
Yes! That seems to work nicely as a replacement for my patch.
Chris, how is that going on your board?
Looks good to me. I've just re-run stress_{1,2,3} on the custom board and DB-88F6820-AMC.
Great! There is still something weird about the complete() though.
Daniel, do you plan to send the above change or do you want me to do it? I would like to integrate the fix in nand/next before sending the PR.
Nah, your version has barely anything to do with my patch anymore, except that it fixes the same effect, so go ahead please :)
You can add my Reported-and-tested-by though.
Thanks for chasing this, everyone!
Daniel
Hi Daniel,
On Thu, 27 Sep 2018 09:17:51 +0200 Daniel Mack daniel@zonque.org wrote:
At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register will only cause the IRQ to latch when the RDY lanes are changing, and not in case they are already asserted.
This means that if the controller finished the command in flight before marvell_nfc_wait_op() is called, that function will wait for a change in the bit that can't ever happen as it is already set.
To address this race, check for the RDY bits after the IRQ was enabled, and complete the completion immediately if the condition is already met.
This fixes a bug that was observed with a NAND chip that holds a UBIFS parition on which file system stress tests were executed. When marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount the filesystem read-only, reporting lots of warnings along the way.
Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver Cc: stable@vger.kernel.org Signed-off-by: Daniel Mack daniel@zonque.org
Can you try to replace your patch by the following one and let me know if it solves your problem?
Thanks,
Boris
--->8--- diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c index e63f714f7639..295a86a5545f 100644 --- a/drivers/mtd/nand/raw/marvell_nand.c +++ b/drivers/mtd/nand/raw/marvell_nand.c @@ -1123,7 +1123,7 @@ static int marvell_nfc_hw_ecc_hmg_do_write_page(struct nand_chip *chip, memcpy(nfc->dma_buf, data_buf, lt->data_bytes); memcpy(nfc->dma_buf + lt->data_bytes, oob_buf, oob_bytes); marvell_nfc_xfer_data_dma(nfc, DMA_TO_DEVICE, lt->data_bytes + - lt->ecc_bytes + lt->spare_bytes); + oob_bytes); } else { marvell_nfc_xfer_data_out_pio(nfc, data_buf, lt->data_bytes); marvell_nfc_xfer_data_out_pio(nfc, oob_buf, oob_bytes);
Hi Boris,
On 2/10/2018 12:44 AM, Boris Brezillon wrote:
On Thu, 27 Sep 2018 09:17:51 +0200 Daniel Mack daniel@zonque.org wrote:
At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register will only cause the IRQ to latch when the RDY lanes are changing, and not in case they are already asserted.
This means that if the controller finished the command in flight before marvell_nfc_wait_op() is called, that function will wait for a change in the bit that can't ever happen as it is already set.
To address this race, check for the RDY bits after the IRQ was enabled, and complete the completion immediately if the condition is already met.
This fixes a bug that was observed with a NAND chip that holds a UBIFS parition on which file system stress tests were executed. When marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount the filesystem read-only, reporting lots of warnings along the way.
Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver Cc: stable@vger.kernel.org Signed-off-by: Daniel Mack daniel@zonque.org
Can you try to replace your patch by the following one and let me know if it solves your problem?
Nope, this doesn't fix it. Same problem as before.
Thanks, Daniel
--->8--- diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c index e63f714f7639..295a86a5545f 100644 --- a/drivers/mtd/nand/raw/marvell_nand.c +++ b/drivers/mtd/nand/raw/marvell_nand.c @@ -1123,7 +1123,7 @@ static int marvell_nfc_hw_ecc_hmg_do_write_page(struct nand_chip *chip, memcpy(nfc->dma_buf, data_buf, lt->data_bytes); memcpy(nfc->dma_buf + lt->data_bytes, oob_buf, oob_bytes); marvell_nfc_xfer_data_dma(nfc, DMA_TO_DEVICE, lt->data_bytes +
lt->ecc_bytes + lt->spare_bytes);
oob_bytes); } else { marvell_nfc_xfer_data_out_pio(nfc, data_buf, lt->data_bytes); marvell_nfc_xfer_data_out_pio(nfc, oob_buf, oob_bytes);
linux-stable-mirror@lists.linaro.org