PCI AER stats counters sysfs attributes need to iterate over stats counters instead of stats names. Also, added a build time check to make sure all counters have entries in strings array.
Fixes: 0678e3109a3c ("PCI/AER: Simplify __aer_print_error()") Cc: stable@vger.kernel.org Reported-by: Meeta Saggi msaggi@purestorage.com Signed-off-by: Mohamed Khalfella mkhalfella@purestorage.com Reviewed-by: Meeta Saggi msaggi@purestorage.com Reviewed-by: Eric Badger ebadger@purestorage.com --- drivers/pci/pcie/aer.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 9fa1f97e5b27..ce99a6d44786 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -533,7 +533,7 @@ static const char *aer_agent_string[] = { u64 *stats = pdev->aer_stats->stats_array; \ size_t len = 0; \ \ - for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \ + for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\ if (strings_array[i]) \ len += sysfs_emit_at(buf, len, "%s %llu\n", \ strings_array[i], \ @@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev) struct device *device = &dev->device; struct pci_dev *port = dev->port;
+ BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) < + AER_MAX_TYPEOF_COR_ERRS); + BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) < + AER_MAX_TYPEOF_UNCOR_ERRS); + /* Limit to Root Ports or Root Complex Event Collectors */ if ((pci_pcie_type(port) != PCI_EXP_TYPE_RC_EC) && (pci_pcie_type(port) != PCI_EXP_TYPE_ROOT_PORT))
[+cc Rajat]
On Mon, May 09, 2022 at 06:14:41PM +0000, Mohamed Khalfella wrote:
PCI AER stats counters sysfs attributes need to iterate over stats counters instead of stats names.
Thanks for catching this; it definitely looks like a real issue! I guess you're probably seeing junk in the sysfs files?
It would be helpful to reviewers if this said *why* we need to iterate over the counters instead of the names. I think the problem is that the current code reads past the end of the stats counters.
There are parallel arrays here:
#define AER_MAX_TYPEOF_COR_ERRS 16 #define AER_MAX_TYPEOF_UNCOR_ERRS 27
aer_correctable_error_string[32] # 32 pdev->aer_stats->dev_cor_errs[AER_MAX_TYPEOF_COR_ERRS] # 16 aer_uncorrectable_error_string[32] # 32 pdev->aer_stats->dev_fatal_errs[AER_MAX_TYPEOF_UNCOR_ERRS] # 27 pdev->aer_stats->dev_nonfatal_errs[AER_MAX_TYPEOF_UNCOR_ERRS] # 27
And here's the current use of them:
#define aer_stats_dev_attr(..., stats_array, strings_array, ...) for (i = 0; i < ARRAY_SIZE(strings_array); i++) { if (strings_array[i]) sysfs_emit_at(..., strings_array[i], stats[i]); (1) else if (stats[i]) sysfs_emit_at(..., stats[i]); (2)
aer_stats_dev_attr(..., dev_cor_errs, aer_correctable_error_string, aer_stats_dev_attr(..., dev_fatal_errs, aer_uncorrectable_error_string, aer_stats_dev_attr(..., dev_nonfatal_errs, aer_uncorrectable_error_string,
The current loop iterates over 0..31, which is safe at (1) because the non-NULL strings are at aer_correctable_error_string[0..15] and aer_uncorrectable_error_string[0..26].
But it is unsafe at (2) because it references dev_cor_errs[16..31], dev_fatal_errs[27..31], and dev_nonfatal_errs[27..31], which are past the end of the arrays.
Also, added a build time check to make sure all counters have entries in strings array.
Fixes: 0678e3109a3c ("PCI/AER: Simplify __aer_print_error()")
Yep, I blew it there. Rajat did it correctly when he added this with 81aa5206f9a7 ("PCI/AER: Add sysfs attributes to provide AER stats and breakdown"), and I broke it by extending the string arrays to 32 entries.
Cc: stable@vger.kernel.org Reported-by: Meeta Saggi msaggi@purestorage.com Signed-off-by: Mohamed Khalfella mkhalfella@purestorage.com Reviewed-by: Meeta Saggi msaggi@purestorage.com Reviewed-by: Eric Badger ebadger@purestorage.com
drivers/pci/pcie/aer.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 9fa1f97e5b27..ce99a6d44786 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -533,7 +533,7 @@ static const char *aer_agent_string[] = { u64 *stats = pdev->aer_stats->stats_array; \ size_t len = 0; \ \
- for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
- for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\ if (strings_array[i]) \ len += sysfs_emit_at(buf, len, "%s %llu\n", \ strings_array[i], \
I think maybe we should populate the currently NULL entries in the string[] arrays and simplify the code here, e.g.,
static const char *aer_correctable_error_string[] = { "RxErr", /* Bit Position 0 */ "dev_cor_errs_bit[1]", ...
if (stats[i]) len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);
It's a little more data space, but easier to verify.
@@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev) struct device *device = &dev->device; struct pci_dev *port = dev->port;
- BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
AER_MAX_TYPEOF_COR_ERRS);
- BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
AER_MAX_TYPEOF_UNCOR_ERRS);
And make these check for "!=" instead of "<".
Thanks for catching this; it definitely looks like a real issue! I guess you're probably seeing junk in the sysfs files?
That is correct. The initial report was seeing junk when reading sysfs files. As descibed, this is happening because we reading data past the end of the stats counters array.
I think maybe we should populate the currently NULL entries in the string[] arrays and simplify the code here, e.g.,
static const char *aer_correctable_error_string[] = { "RxErr", /* Bit Position 0 */ "dev_cor_errs_bit[1]", ...
if (stats[i]) len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);
Doing it this way will change the output format. In this case we will show stats only if their value is greater than zero. The current code shows all the stats those have names (regardless of their value) plus those have non-zero values.
@@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev) struct device *device = &dev->device; struct pci_dev *port = dev->port;
- BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
AER_MAX_TYPEOF_COR_ERRS);
- BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
AER_MAX_TYPEOF_UNCOR_ERRS);
And make these check for "!=" instead of "<".
This will require unnecessarily extending stats arrays to have 32 entries in order to match names arrays. If you don't feel strogly about changing "<" to "!=", I prefer to keep the code as it is.
Is there any chance for this to land in 5.19?
On 5/10/22 14:17, Mohamed Khalfella wrote:
Thanks for catching this; it definitely looks like a real issue! I guess you're probably seeing junk in the sysfs files?
That is correct. The initial report was seeing junk when reading sysfs files. As descibed, this is happening because we reading data past the end of the stats counters array.
I think maybe we should populate the currently NULL entries in the string[] arrays and simplify the code here, e.g.,
static const char *aer_correctable_error_string[] = { "RxErr", /* Bit Position 0 */ "dev_cor_errs_bit[1]", ...
if (stats[i]) len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);
Doing it this way will change the output format. In this case we will show stats only if their value is greater than zero. The current code shows all the stats those have names (regardless of their value) plus those have non-zero values.
@@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev) struct device *device = &dev->device; struct pci_dev *port = dev->port;
- BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
AER_MAX_TYPEOF_COR_ERRS);
- BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
AER_MAX_TYPEOF_UNCOR_ERRS);
And make these check for "!=" instead of "<".
I am happy to remove these BUILD_BUG_ON() if you think it is a good idea to do so.
This will require unnecessarily extending stats arrays to have 32 entries in order to match names arrays. If you don't feel strogly about changing "<" to "!=", I prefer to keep the code as it is.
On Fri, Jun 03, 2022 at 10:12:47PM +0000, Mohamed Khalfella wrote:
Is there any chance for this to land in 5.19?
Too late for v5.19, since the merge window will end in a couple days. Remind me again if you don't see it in -next by v5.20-rc5 or so.
On 5/10/22 14:17, Mohamed Khalfella wrote:
Thanks for catching this; it definitely looks like a real issue! I guess you're probably seeing junk in the sysfs files?
That is correct. The initial report was seeing junk when reading sysfs files. As descibed, this is happening because we reading data past the end of the stats counters array.
I think maybe we should populate the currently NULL entries in the string[] arrays and simplify the code here, e.g.,
static const char *aer_correctable_error_string[] = { "RxErr", /* Bit Position 0 */ "dev_cor_errs_bit[1]", ...
if (stats[i]) len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);
Doing it this way will change the output format. In this case we will show stats only if their value is greater than zero. The current code shows all the stats those have names (regardless of their value) plus those have non-zero values.
@@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev) struct device *device = &dev->device; struct pci_dev *port = dev->port;
- BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
AER_MAX_TYPEOF_COR_ERRS);
- BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
AER_MAX_TYPEOF_UNCOR_ERRS);
And make these check for "!=" instead of "<".
I am happy to remove these BUILD_BUG_ON() if you think it is a good idea to do so.
I think it's good to enforce correctness there somehow, so let's leave them there unless somebody has a better idea.
This will require unnecessarily extending stats arrays to have 32 entries in order to match names arrays. If you don't feel strogly about changing "<" to "!=", I prefer to keep the code as it is.
On 6/3/22 16:58, Bjorn Helgaas wrote:
On Fri, Jun 03, 2022 at 10:12:47PM +0000, Mohamed Khalfella wrote:
Is there any chance for this to land in 5.19?
Too late for v5.19, since the merge window will end in a couple days. Remind me again if you don't see it in -next by v5.20-rc5 or so.
Thank you. I will keep an eye on -next.
On 5/10/22 14:17, Mohamed Khalfella wrote:
Thanks for catching this; it definitely looks like a real issue! I guess you're probably seeing junk in the sysfs files?
That is correct. The initial report was seeing junk when reading sysfs files. As descibed, this is happening because we reading data past the end of the stats counters array.
I think maybe we should populate the currently NULL entries in the string[] arrays and simplify the code here, e.g.,
static const char *aer_correctable_error_string[] = { "RxErr", /* Bit Position 0 */ "dev_cor_errs_bit[1]", ...
if (stats[i]) len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);
Doing it this way will change the output format. In this case we will show stats only if their value is greater than zero. The current code shows all the stats those have names (regardless of their value) plus those have non-zero values.
@@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev) struct device *device = &dev->device; struct pci_dev *port = dev->port;
- BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
AER_MAX_TYPEOF_COR_ERRS);
- BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
AER_MAX_TYPEOF_UNCOR_ERRS);
And make these check for "!=" instead of "<".
I am happy to remove these BUILD_BUG_ON() if you think it is a good idea to do so.
I think it's good to enforce correctness there somehow, so let's leave them there unless somebody has a better idea.
This will require unnecessarily extending stats arrays to have 32 entries in order to match names arrays. If you don't feel strogly about changing "<" to "!=", I prefer to keep the code as it is.
On Mon, May 09, 2022 at 06:14:41PM +0000, Mohamed Khalfella wrote:
PCI AER stats counters sysfs attributes need to iterate over stats counters instead of stats names. Also, added a build time check to make sure all counters have entries in strings array.
Fixes: 0678e3109a3c ("PCI/AER: Simplify __aer_print_error()") Cc: stable@vger.kernel.org Reported-by: Meeta Saggi msaggi@purestorage.com Signed-off-by: Mohamed Khalfella mkhalfella@purestorage.com Reviewed-by: Meeta Saggi msaggi@purestorage.com Reviewed-by: Eric Badger ebadger@purestorage.com
I added some info about why we need this to the commit log and applied to pci/err for v5.20. Thank you!
drivers/pci/pcie/aer.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 9fa1f97e5b27..ce99a6d44786 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -533,7 +533,7 @@ static const char *aer_agent_string[] = { u64 *stats = pdev->aer_stats->stats_array; \ size_t len = 0; \ \
- for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
- for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\ if (strings_array[i]) \ len += sysfs_emit_at(buf, len, "%s %llu\n", \ strings_array[i], \
@@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev) struct device *device = &dev->device; struct pci_dev *port = dev->port;
- BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
AER_MAX_TYPEOF_COR_ERRS);
- BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
AER_MAX_TYPEOF_UNCOR_ERRS);
- /* Limit to Root Ports or Root Complex Event Collectors */ if ((pci_pcie_type(port) != PCI_EXP_TYPE_RC_EC) && (pci_pcie_type(port) != PCI_EXP_TYPE_ROOT_PORT))
-- 2.29.0
On 2022-07-11 17:54:37 -0500, Bjorn Helgaas wrote:
On Mon, May 09, 2022 at 06:14:41PM +0000, Mohamed Khalfella wrote:
PCI AER stats counters sysfs attributes need to iterate over stats counters instead of stats names. Also, added a build time check to make sure all counters have entries in strings array.
Fixes: 0678e3109a3c ("PCI/AER: Simplify __aer_print_error()") Cc: stable@vger.kernel.org Reported-by: Meeta Saggi msaggi@purestorage.com Signed-off-by: Mohamed Khalfella mkhalfella@purestorage.com Reviewed-by: Meeta Saggi msaggi@purestorage.com Reviewed-by: Eric Badger ebadger@purestorage.com
I added some info about why we need this to the commit log and applied to pci/err for v5.20. Thank you!
That is good news! Thank you for helping out.
drivers/pci/pcie/aer.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 9fa1f97e5b27..ce99a6d44786 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -533,7 +533,7 @@ static const char *aer_agent_string[] = { u64 *stats = pdev->aer_stats->stats_array; \ size_t len = 0; \ \
- for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
- for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\ if (strings_array[i]) \ len += sysfs_emit_at(buf, len, "%s %llu\n", \ strings_array[i], \
@@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev) struct device *device = &dev->device; struct pci_dev *port = dev->port;
- BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
AER_MAX_TYPEOF_COR_ERRS);
- BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
AER_MAX_TYPEOF_UNCOR_ERRS);
- /* Limit to Root Ports or Root Complex Event Collectors */ if ((pci_pcie_type(port) != PCI_EXP_TYPE_RC_EC) && (pci_pcie_type(port) != PCI_EXP_TYPE_ROOT_PORT))
-- 2.29.0
linux-stable-mirror@lists.linaro.org