This series reuses an existing patch submitted by Huang Ying <ying.huang@intel.com>, with one small modification (see the commits). The initial idea is simply to use printk to report any errors that were not consumed by the OS.
Errors listed in the HEST table should point to the same "Error status block" (ESB) structures as BERT does. An OS that handles an error should clear that error's status bit in the ESB at the end of the error recovery procedure. If the error turns out to be too serious, the OS can reset the machine immediately without clearing the bit in the ESB. During boot, the kernel examines the error status bit of each ESB in the list pointed to by BERT to see whether there are any unhandled errors.
The BERT table was tested together with the HEST and EINJ drivers, in the following way:

1. Fill in the ESB using a hacked EINJ driver and do not clear the error status in the ESB. This simulates an unhandled error, so the BERT table can be used later:

   root@localhost:~# echo 1 > /sys/kernel/debug/apei/einj/error_inject

2. Reboot the machine and check whether the BERT driver notices the injected error:

   ...
   [ 2.518179] [Hardware Error]: Error record from previous boot:
   [ 2.523342] [Hardware Error]: APEI generic hardware error status
   [ 2.548457] [Hardware Error]: severity: 1, fatal
   [ 2.574705] [Hardware Error]: section: 0, severity: 0, recoverable
   [ 2.584010] [Hardware Error]: flags: 0x00
   [ 2.587937] [Hardware Error]: section_type: memory error
   ...

3. The kernel clears the status bit so that the next boot does not print the record again.
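The scan-and-clear behaviour described in the steps above can be tried outside the kernel. What follows is only a minimal userspace sketch, not the driver code: struct esb_hdr, put_record() and scan_region() are hypothetical names invented for this illustration, and the two-field header is a deliberate simplification of the real status block layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an error status block header. */
struct esb_hdr {
	uint32_t block_status;	/* non-zero: an unconsumed record is present */
	uint32_t data_length;	/* payload bytes that follow this header */
};

/* Helper for experiments: write one header at 'off', return the next offset. */
static size_t put_record(unsigned char *region, size_t off,
			 uint32_t status, uint32_t data_length)
{
	struct esb_hdr hdr = { status, data_length };

	memcpy(region + off, &hdr, sizeof(hdr));
	return off + sizeof(hdr) + data_length;
}

/*
 * Walk the region, report every pending record and clear its status so
 * that a later scan does not report it again. Returns the number of
 * records reported.
 */
static int scan_region(unsigned char *region, size_t region_len)
{
	size_t off = 0;
	int reported = 0;

	assert(region != NULL);
	while (region_len - off >= sizeof(struct esb_hdr)) {
		struct esb_hdr hdr;
		size_t len;

		memcpy(&hdr, region + off, sizeof(hdr));
		if (!hdr.block_status)
			break;		/* no more records */

		len = sizeof(hdr) + (size_t)hdr.data_length;
		if (len > region_len - off) {
			fprintf(stderr, "invalid status block length %zu\n", len);
			break;
		}

		printf("record from previous boot: status 0x%x, %u data bytes\n",
		       (unsigned int)hdr.block_status,
		       (unsigned int)hdr.data_length);

		/* Clear the status, as in step 3 above. */
		hdr.block_status = 0;
		memcpy(region + off, &hdr, sizeof(hdr));

		reported++;
		off += len;
	}
	return reported;
}
```

Filling a zeroed buffer with two records via put_record() and calling scan_region() twice mimics two consecutive boots: the first call reports both records, the second finds the first status already cleared and reports nothing.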
From: Huang Ying <ying.huang@intel.com>
Under normal circumstances, when a hardware error occurs, the kernel is notified via NMI, MCE or some other method; it then processes the error condition, reports it, and recovers from it if possible. But sometimes the situation is so bad that firmware may choose to reset directly without notifying the Linux kernel.
The Linux kernel can use the Boot Error Record Table (BERT) to get the un-notified hardware errors that occurred in a previous boot. In this patch, the error information is reported via printk.
For more information about BERT, please refer to ACPI Specification version 5.0, section 18.3.1
Signed-off-by: Huang Ying <ying.huang@intel.com>
---
 Documentation/kernel-parameters.txt |    3 +
 drivers/acpi/apei/Makefile          |    2 +-
 drivers/acpi/apei/bert.c            |  169 +++++++++++++++++++++++++++++++++++
 include/acpi/apei.h                 |    1 +
 4 files changed, 174 insertions(+), 1 deletion(-)
 create mode 100644 drivers/acpi/apei/bert.c
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 15356ac..1149989 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -436,6 +436,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	bootmem_debug	[KNL] Enable bootmem allocator debug messages.
 
+	bert_disable	[ACPI]
+			Disable Boot Error Record Table (BERT) support.
+
 	bttv.card=	[HW,V4L] bttv (bt848 + bt878 based grabber cards)
 	bttv.radio=	Most important insmod options are available as kernel args too.
diff --git a/drivers/acpi/apei/Makefile b/drivers/acpi/apei/Makefile
index d1d1bc0..0eb590b 100644
--- a/drivers/acpi/apei/Makefile
+++ b/drivers/acpi/apei/Makefile
@@ -3,4 +3,4 @@ obj-$(CONFIG_ACPI_APEI_GHES) += ghes.o
 obj-$(CONFIG_ACPI_APEI_EINJ) += einj.o
 obj-$(CONFIG_ACPI_APEI_ERST_DEBUG) += erst-dbg.o
 
-apei-y := apei-base.o hest.o cper.o erst.o
+apei-y := apei-base.o hest.o cper.o erst.o bert.o
diff --git a/drivers/acpi/apei/bert.c b/drivers/acpi/apei/bert.c
new file mode 100644
index 0000000..67e2274
--- /dev/null
+++ b/drivers/acpi/apei/bert.c
@@ -0,0 +1,169 @@
+/*
+ * APEI Boot Error Record Table (BERT) support
+ *
+ * Copyright 2011 Intel Corp.
+ *   Author: Huang Ying ying.huang@xxxxxxxxx
+ *
+ * Under normal circumstances, when a hardware error occurs, kernel
+ * will be notified via NMI, MCE or some other method, then kernel
+ * will process the error condition, report it, and recover it if
+ * possible. But sometime, the situation is so bad, so that firmware
+ * may choose to reset directly without notifying Linux kernel.
+ *
+ * Linux kernel can use the Boot Error Record Table (BERT) to get the
+ * un-notified hardware errors that occurred in a previous boot.
+ *
+ * For more information about BERT, please refer to ACPI Specification
+ * version 4.0, section 17.3.1
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/acpi.h>
+#include <linux/io.h>
+
+#include "apei-internal.h"
+
+#define BERT_PFX "BERT: "
+
+int bert_disable;
+EXPORT_SYMBOL_GPL(bert_disable);
+
+static void __init bert_print_all(struct acpi_hest_generic_status *region,
+				  unsigned int region_len)
+{
+	int remain, first = 1;
+	u32 estatus_len;
+	struct acpi_hest_generic_status *estatus;
+
+	remain = region_len;
+	estatus = region;
+	while (remain > sizeof(struct acpi_hest_generic_status)) {
+		/* No more error record */
+		if (!estatus->block_status)
+			break;
+
+		estatus_len = apei_estatus_len(estatus);
+		if (estatus_len < sizeof(struct acpi_hest_generic_status) ||
+		    remain < estatus_len) {
+			pr_err(FW_BUG BERT_PFX "Invalid error status block with length %u\n",
+			       estatus_len);
+			return;
+		}
+
+		if (apei_estatus_check(estatus)) {
+			pr_err(FW_BUG BERT_PFX "Invalid Error status block\n");
+			goto next;
+		}
+
+		if (first) {
+			pr_info(HW_ERR "Error record from previous boot:\n");
+			first = 0;
+		}
+		apei_estatus_print(KERN_INFO HW_ERR, estatus);
+next:
+		estatus = (void *)estatus + estatus_len;
+		remain -= estatus_len;
+	}
+}
+
+static int __init setup_bert_disable(char *str)
+{
+	bert_disable = 1;
+	return 0;
+}
+__setup("bert_disable", setup_bert_disable);
+
+static int __init bert_check_table(struct acpi_table_bert *bert_tab)
+{
+	if (bert_tab->header.length < sizeof(struct acpi_table_bert))
+		return -EINVAL;
+	if (bert_tab->region_length != 0 &&
+	    bert_tab->region_length < sizeof(struct acpi_bert_region))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int __init bert_init(void)
+{
+	acpi_status status;
+	struct acpi_table_bert *bert_tab;
+	struct resource *r;
+	struct acpi_hest_generic_status *bert_region;
+	unsigned int region_len;
+	int rc = -EINVAL;
+
+	if (acpi_disabled)
+		goto out;
+
+	if (bert_disable) {
+		pr_info(BERT_PFX "Boot Error Record Table (BERT) support is disabled.\n");
+		goto out;
+	}
+
+	status = acpi_get_table(ACPI_SIG_BERT, 0,
+				(struct acpi_table_header **)&bert_tab);
+	if (status == AE_NOT_FOUND) {
+		pr_err(BERT_PFX "Table is not found!\n");
+		goto out;
+	} else if (ACPI_FAILURE(status)) {
+		const char *msg = acpi_format_exception(status);
+		pr_err(BERT_PFX "Failed to get table, %s\n", msg);
+		goto out;
+	}
+
+	rc = bert_check_table(bert_tab);
+	if (rc) {
+		pr_err(FW_BUG BERT_PFX "BERT table is invalid\n");
+		goto out;
+	}
+
+	region_len = bert_tab->region_length;
+	if (!region_len) {
+		rc = 0;
+		goto out;
+	}
+
+	r = request_mem_region(bert_tab->address, region_len, "APEI BERT");
+	if (!r) {
+		pr_err(BERT_PFX "Can not request iomem region <%016llx-%016llx> for BERT.\n",
+		       (unsigned long long)bert_tab->address,
+		       (unsigned long long)bert_tab->address + region_len);
+		rc = -EIO;
+		goto out;
+	}
+
+	bert_region = ioremap_cache(bert_tab->address, region_len);
+	if (!bert_region) {
+		rc = -ENOMEM;
+		goto out_release;
+	}
+
+	bert_print_all(bert_region, region_len);
+
+	iounmap(bert_region);
+
+out_release:
+	release_mem_region(bert_tab->address, region_len);
+out:
+	if (rc)
+		bert_disable = 1;
+
+	return rc;
+}
+late_initcall(bert_init);
diff --git a/include/acpi/apei.h b/include/acpi/apei.h
index 04f349d..b639891 100644
--- a/include/acpi/apei.h
+++ b/include/acpi/apei.h
@@ -23,6 +23,7 @@ extern bool ghes_disable;
 #else
 #define ghes_disable 1
 #endif
+extern int bert_disable;
 
 #ifdef CONFIG_ACPI_APEI
 void __init acpi_hest_init(void);
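For anyone who wants to play with the sanity check done by bert_check_table() without booting a kernel, here is a small userspace sketch of the same two comparisons. Everything in it is an assumption made for illustration: struct bert_table_sketch, bert_check_sketch() and the two minimum-length constants are hypothetical stand-ins for the sizeof() expressions used in the patch, not values taken from the ACPICA headers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the two fields the check looks at. */
struct bert_table_sketch {
	uint32_t header_length;	/* total table length from the table header */
	uint32_t region_length;	/* length of the boot error region */
};

/* Assumed minimum sizes, standing in for the sizeof() checks in the patch. */
#define SKETCH_TABLE_MIN_LEN	48u
#define SKETCH_REGION_MIN_LEN	20u

/*
 * Return 0 if the table looks usable, -EINVAL otherwise. A region
 * length of zero is accepted: it simply means there is nothing to
 * report.
 */
static int bert_check_sketch(const struct bert_table_sketch *t)
{
	assert(t != NULL);

	if (t->header_length < SKETCH_TABLE_MIN_LEN)
		return -EINVAL;
	if (t->region_length != 0 &&
	    t->region_length < SKETCH_REGION_MIN_LEN)
		return -EINVAL;

	return 0;
}
```

The point of the second comparison is that an empty region is valid while a region too short to hold even one status block header is treated as a firmware bug.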
Once the error log has been printed, clear the error status so that it is not printed again during the next boot.
Signed-off-by: Tomasz Nowicki <tomasz.nowicki@linaro.org>
---
 drivers/acpi/apei/bert.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/acpi/apei/bert.c b/drivers/acpi/apei/bert.c
index 67e2274..a06c99f 100644
--- a/drivers/acpi/apei/bert.c
+++ b/drivers/acpi/apei/bert.c
@@ -75,6 +75,9 @@ static void __init bert_print_all(struct acpi_hest_generic_status *region,
 			first = 0;
 		}
 		apei_estatus_print(KERN_INFO HW_ERR, estatus);
+
+		/* Clear error status */
+		estatus->block_status = 0;
 next:
 		estatus = (void *)estatus + estatus_len;
 		remain -= estatus_len;
Hi Tomasz,
Why is there no summary of this patch set, like the one in Al's patch set?

Al Stone (6):
  ACPI: ARM: arndale: remove GPZ GPIO definition from DT so it can be in ACPI
  ACPI: ARM: arndale: whitelist the samsung-pinctrl driver for ACPI
  ACPI: make an error message a little clearer
  ACPI: improve acpi_extract_package() utility
  ACPI: ARM: arndale: enable ACPI in the Samsung pinctrl driver
  ACPI: ARM: arndale: add CONFIG_ACPI ifdef's to pinctrl driver

 arch/arm/boot/dts/exynos5250-pinctrl.dtsi |    2 +
 arch/arm/boot/dts/exynos5250.dtsi         |    6 +-
 drivers/acpi/acpi_platform.c              |    3 +
 drivers/acpi/osl.c                        |    2 +-
 drivers/acpi/utils.c                      |   17 +-
 drivers/pinctrl/pinctrl-samsung.c         |  517 +++++++++++++++++++++++++++++-
 drivers/pinctrl/pinctrl-samsung.h         |    3 +
 7 files changed, 536 insertions(+), 14 deletions(-)

Did you delete them?
Linaro-acpi mailing list Linaro-acpi@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-acpi
On 02.09.2013 08:53, Hanjun Guo wrote:
> Why is there no summary of this patch set, like the one in Al's patch set?
> [...]
> Did you delete them?
Yes. The commit logs are the mail titles, so listing them again would only make the cover letter unnecessarily bigger.
On 09/02/2013 06:59 AM, Tomasz Nowicki wrote:
> Yes. The commit logs are the mail titles, so listing them again would only make the cover letter unnecessarily bigger.
Hey, Tomasz. Did you get sufficient ACKs for these patches? Did they get committed? It would not surprise me if all this was taken care of a long time ago and I just lost the IRQ for it :).
If not, this is my ACK. This looks like a good starting point for this functionality; we'll have to convert it to UEFI calls at some point in the future, and we should note that in a JIRA card or bug report somewhere, but this does put the basic framework in place regardless.
Acked-by: Al Stone <al.stone@linaro.org>