[+cc Rajat]
On Mon, May 09, 2022 at 06:14:41PM +0000, Mohamed Khalfella wrote:
PCI AER stats counters sysfs attributes need to iterate over stats counters instead of stats names.
Thanks for catching this; it definitely looks like a real issue! I guess you're probably seeing junk in the sysfs files?
It would be helpful to reviewers if this said *why* we need to iterate over the counters instead of the names. I think the problem is that the current code reads past the end of the stats counters.
There are parallel arrays here:
#define AER_MAX_TYPEOF_COR_ERRS 16 #define AER_MAX_TYPEOF_UNCOR_ERRS 27
aer_correctable_error_string[32] # 32 pdev->aer_stats->dev_cor_errs[AER_MAX_TYPEOF_COR_ERRS] # 16 aer_uncorrectable_error_string[32] # 32 pdev->aer_stats->dev_fatal_errs[AER_MAX_TYPEOF_UNCOR_ERRS] # 27 pdev->aer_stats->dev_nonfatal_errs[AER_MAX_TYPEOF_UNCOR_ERRS] # 27
And here's the current use of them:
#define aer_stats_dev_attr(..., stats_array, strings_array, ...) for (i = 0; i < ARRAY_SIZE(strings_array); i++) { if (strings_array[i]) sysfs_emit_at(..., strings_array[i], stats[i]); (1) else if (stats[i]) sysfs_emit_at(..., stats[i]); (2)
aer_stats_dev_attr(..., dev_cor_errs, aer_correctable_error_string, aer_stats_dev_attr(..., dev_fatal_errs, aer_uncorrectable_error_string, aer_stats_dev_attr(..., dev_nonfatal_errs, aer_uncorrectable_error_string,
The current loop iterates over 0..31, which is safe at (1) because the non-NULL strings are at aer_correctable_error_string[0..15] and aer_uncorrectable_error_string[0..26].
But it is unsafe at (2) because it references dev_cor_errs[16..31], dev_fatal_errs[27..31], and dev_nonfatal_errs[27..31], which are past the end of the arrays.
Also, added a build time check to make sure all counters have entries in strings array.
Fixes: 0678e3109a3c ("PCI/AER: Simplify __aer_print_error()")
Yep, I blew it there. Rajat did it correctly when he added this with 81aa5206f9a7 ("PCI/AER: Add sysfs attributes to provide AER stats and breakdown"), and I broke it by extending the string arrays to 32 entries.
Cc: stable@vger.kernel.org Reported-by: Meeta Saggi msaggi@purestorage.com Signed-off-by: Mohamed Khalfella mkhalfella@purestorage.com Reviewed-by: Meeta Saggi msaggi@purestorage.com Reviewed-by: Eric Badger ebadger@purestorage.com
drivers/pci/pcie/aer.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 9fa1f97e5b27..ce99a6d44786 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -533,7 +533,7 @@ static const char *aer_agent_string[] = { u64 *stats = pdev->aer_stats->stats_array; \ size_t len = 0; \ \
- for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
- for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\ if (strings_array[i]) \ len += sysfs_emit_at(buf, len, "%s %llu\n", \ strings_array[i], \
I think maybe we should populate the currently NULL entries in the string[] arrays and simplify the code here, e.g.,
static const char *aer_correctable_error_string[] = { "RxErr", /* Bit Position 0 */ "dev_cor_errs_bit[1]", ...
if (stats[i]) len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);
It's a little more data space, but easier to verify.
@@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev) struct device *device = &dev->device; struct pci_dev *port = dev->port;
- BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
AER_MAX_TYPEOF_COR_ERRS);
- BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
AER_MAX_TYPEOF_UNCOR_ERRS);
And make these check for "!=" instead of "<".