On Mon, Aug 11, 2025 at 09:49:57PM +0200, Gabor Juhos wrote:
Under some circumstances I2C recovery fails on Armada 3700. At least on the Methode uDPU board, removing and replugging an SFP module fails often, like this:
[ 36.953127] sfp sfp-eth1: module removed
[ 38.468549] i2c i2c-1: i2c_pxa: timeout waiting for bus free
[ 38.486960] sfp sfp-eth1: module MENTECHOPTO POS22-LDCC-KR rev 1.0 sn MNC208U90009 dc 200828
[ 38.496867] mvneta d0040000.ethernet eth1: unsupported SFP module: no common interface modes
[ 38.521448] hwmon hwmon2: temp1_input not attached to any thermal zone
[ 39.249196] sfp sfp-eth1: module removed
...
[ 292.568799] sfp sfp-eth1: please wait, module slow to respond
...
[ 625.208814] sfp sfp-eth1: failed to read EEPROM: -EREMOTEIO
Note that the 'unsupported SFP module' messages are not relevant. The module is used only for testing the I2C recovery functionality, because the error can be triggered easily with this specific one.
Enabling debug in the i2c-pxa driver reveals the following:
[ 82.034678] sfp sfp-eth1: module removed
[ 90.008654] i2c i2c-1: slave_0x50 error: timeout with active message
[ 90.015112] i2c i2c-1: msg_num: 2 msg_idx: 0 msg_ptr: 0
[ 90.020464] i2c i2c-1: IBMR: 00000003 IDBR: 000000a0 ICR: 000007e0 ISR: 00000802
[ 90.027906] i2c i2c-1: log:
[ 90.030787]
This continues until the retries are exhausted ...
[ 110.192489] i2c i2c-1: slave_0x50 error: exhausted retries
[ 110.198012] i2c i2c-1: msg_num: 2 msg_idx: 0 msg_ptr: 0
[ 110.203323] i2c i2c-1: IBMR: 00000003 IDBR: 000000a0 ICR: 000007e0 ISR: 00000802
[ 110.210810] i2c i2c-1: log:
[ 110.213633]
... then the whole sequence starts again ...
[ 115.368641] i2c i2c-1: slave_0x50 error: timeout with active message
... until finally the SFP core gives up:
[ 671.975258] sfp sfp-eth1: failed to read EEPROM: -EREMOTEIO
Analyzing the log shows that bits 1 and 11 are set in the ISR (Interface Status Register). Bit 1 indicates the ACK/NACK status, but unfortunately the purpose of bit 11 is not documented in the driver code.
However, the 'Functional Specification' document of the Armada 3700 SoC family says that this bit indicates an 'Early Bus Busy' (EBB) condition. The document also notes that whenever this bit is set, it is not possible to initiate a transaction on the I2C bus. The observed behaviour corresponds to this statement.
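To make the register reading above concrete, here is a minimal sketch that decodes the ISR value 0x00000802 from the debug log. The bit positions (1 for ACK/NACK, 11 for 'Early Bus Busy') come from the text above; the macro and function names are made up for illustration and are not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as described in the text; the names are illustrative only. */
#define SKETCH_ISR_ACKNAK_BIT	1u	/* ACK/NACK status */
#define SKETCH_ISR_EBB_BIT	11u	/* 'Early Bus Busy' on Armada 3700 */

/* Return true if the given bit is set in an ISR value. */
static bool isr_bit_set(uint32_t isr, unsigned int bit)
{
	return ((isr >> bit) & 1u) != 0;
}

/* Return the ISR bits that remain after masking out the two bits above. */
static uint32_t isr_other_bits(uint32_t isr)
{
	uint32_t known = (1u << SKETCH_ISR_ACKNAK_BIT) |
			 (1u << SKETCH_ISR_EBB_BIT);

	return isr & ~known;
}
```

For ISR=0x00000802, isr_bit_set() is true for bits 1 and 11 and isr_other_bits() returns 0, so no other status bit is set in the logged value.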
Unfortunately, I2C recovery does not help as it never runs in this special case. The driver checks whether the bus is busy in several places, but since these checks do not consider the A3700-specific bit, it cannot determine the actual status of the bus, which results in the errors above.
In order to fix the problem, add a new member to struct 'i2c_pxa' to store a controller specific bitmask containing the bits indicating the busy status, and use that in the code while checking the actual status of the bus. This ensures that the correct status can be determined on the Armada 3700 based devices without causing functional changes on devices based on other SoCs.
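The idea can be sketched roughly as follows. This is not the actual patch: the struct is a reduced, hypothetical stand-in for 'struct i2c_pxa', the function names are invented, and the positions of the generic busy bits (bits 2 and 3) are assumptions. Only bit 11 is confirmed by the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit definitions; only bit 11 is taken from the text above. */
#define SKETCH_ISR_UB		(1u << 2)	/* unit busy (assumed position) */
#define SKETCH_ISR_IBB		(1u << 3)	/* bus busy (assumed position) */
#define SKETCH_ISR_A3700_EBB	(1u << 11)	/* early bus busy, Armada 3700 */

/* Hypothetical, reduced stand-in for the driver's struct i2c_pxa. */
struct i2c_pxa_sketch {
	uint32_t busy_mask;	/* controller-specific "bus is busy" bits */
};

/* Pick the busy mask once, based on the controller type. */
static void sketch_init(struct i2c_pxa_sketch *i2c, bool is_armada_3700)
{
	i2c->busy_mask = SKETCH_ISR_UB | SKETCH_ISR_IBB;
	if (is_armada_3700)
		i2c->busy_mask |= SKETCH_ISR_A3700_EBB;
}

/* Every busy check tests the ISR against the per-controller mask. */
static bool sketch_bus_busy(const struct i2c_pxa_sketch *i2c, uint32_t isr)
{
	return (isr & i2c->busy_mask) != 0;
}
```

With the logged value ISR=0x00000802, a mask without bit 11 reports the bus as free, while the Armada 3700 mask reports it as busy. Controllers that use the generic mask see no change in behaviour.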
With the change applied, the driver detects the busy condition, and runs the recovery process:
[ 742.617312] i2c i2c-1: state:i2c_pxa_wait_bus_not_busy:449: ISR=00000802, ICR=000007e0, IBMR=03
[ 742.626099] i2c i2c-1: i2c_pxa: timeout waiting for bus free
[ 742.631933] i2c i2c-1: recovery: resetting controller, ISR=0x00000802
[ 742.638421] i2c i2c-1: recovery: IBMR 0x00000003 ISR 0x00000000
This clears the EBB bit in the ISR, which makes it possible to initiate transactions on the I2C bus again.
After this patch, the SFP module used for testing can be removed and replugged numerous times without causing the error described at the beginning. Previously, the error happened after a few such attempts.
The patch has also been tested with the following kernel versions: 5.10.237, 5.15.182, 6.1.138, 6.6.90, 6.12.28, 6.14.6. It improves recoverability on all of them.
...
Note: the patch is included in this series for completeness; however, it can be applied independently of the preceding patches. On kernels 6.3+, it restores I2C functionality on its own because it recovers the controller from the bad state described in the previous patch.
Sounds to me like this one should be applied first, independently of the discussion / conclusion on patch 1.
...
Code-wise it looks reasonable to me, but I haven't reviewed it properly and probably won't have time to, hence no tags.