This is the start of the stable review cycle for the 4.4.110 release. There are 37 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri Jan 5 19:50:38 UTC 2018. Anything received after that time might be too late.
The whole patch series can be found in one patch at:
	kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz
or in the git tree and branch at:
	git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:

Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Linux 4.4.110-rc1

Kees Cook <keescook@chromium.org>
    KPTI: Report when enabled

Kees Cook <keescook@chromium.org>
    KPTI: Rename to PAGE_TABLE_ISOLATION

Borislav Petkov <bp@suse.de>
    x86/kaiser: Move feature detection up

Jiri Kosina <jkosina@suse.cz>
    kaiser: disabled on Xen PV

Borislav Petkov <bp@suse.de>
    x86/kaiser: Reenable PARAVIRT

Thomas Gleixner <tglx@linutronix.de>
    x86/paravirt: Dont patch flush_tlb_single

Hugh Dickins <hughd@google.com>
    kaiser: kaiser_flush_tlb_on_return_to_user() check PCID

Hugh Dickins <hughd@google.com>
    kaiser: asm/tlbflush.h handle noPGE at lower level

Hugh Dickins <hughd@google.com>
    kaiser: drop is_atomic arg to kaiser_pagetable_walk()

Hugh Dickins <hughd@google.com>
    kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush

Borislav Petkov <bp@suse.de>
    x86/kaiser: Check boottime cmdline params

Borislav Petkov <bp@suse.de>
    x86/kaiser: Rename and simplify X86_FEATURE_KAISER handling

Hugh Dickins <hughd@google.com>
    kaiser: add "nokaiser" boot option, using ALTERNATIVE

Hugh Dickins <hughd@google.com>
    kaiser: fix unlikely error in alloc_ldt_struct()

Hugh Dickins <hughd@google.com>
    kaiser: _pgd_alloc() without __GFP_REPEAT to avoid stalls

Hugh Dickins <hughd@google.com>
    kaiser: paranoid_entry pass cr3 need to paranoid_exit

Hugh Dickins <hughd@google.com>
    kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user

Hugh Dickins <hughd@google.com>
    kaiser: PCID 0 for kernel and 128 for user

Hugh Dickins <hughd@google.com>
    kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user

Dave Hansen <dave.hansen@linux.intel.com>
    kaiser: enhanced by kernel and user PCIDs

Hugh Dickins <hughd@google.com>
    kaiser: vmstat show NR_KAISERTABLE as nr_overhead

Hugh Dickins <hughd@google.com>
    kaiser: delete KAISER_REAL_SWITCH option

Hugh Dickins <hughd@google.com>
    kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSET

Hugh Dickins <hughd@google.com>
    kaiser: cleanups while trying for gold link

Hugh Dickins <hughd@google.com>
    kaiser: kaiser_remove_mapping() move along the pgd

Hugh Dickins <hughd@google.com>
    kaiser: tidied up kaiser_add/remove_mapping slightly

Hugh Dickins <hughd@google.com>
    kaiser: tidied up asm/kaiser.h somewhat

Hugh Dickins <hughd@google.com>
    kaiser: ENOMEM if kaiser_pagetable_walk() NULL

Hugh Dickins <hughd@google.com>
    kaiser: fix perf crashes

Hugh Dickins <hughd@google.com>
    kaiser: fix regs to do_nmi() ifndef CONFIG_KAISER

Hugh Dickins <hughd@google.com>
    kaiser: KAISER depends on SMP

Hugh Dickins <hughd@google.com>
    kaiser: fix build and FIXME in alloc_ldt_struct()

Hugh Dickins <hughd@google.com>
    kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE

Hugh Dickins <hughd@google.com>
    kaiser: do not set _PAGE_NX on pgd_none

Dave Hansen <dave.hansen@linux.intel.com>
    kaiser: merged update

Richard Fellner <richard.fellner@student.tugraz.at>
    KAISER: Kernel Address Isolation

Tom Lendacky <thomas.lendacky@amd.com>
    x86/boot: Add early cmdline parsing for options with arguments
-------------
Diffstat:
 Documentation/kernel-parameters.txt         |   8 +
 Makefile                                    |   4 +-
 arch/x86/boot/compressed/misc.h             |   1 +
 arch/x86/entry/entry_64.S                   | 164 ++++++++--
 arch/x86/entry/entry_64_compat.S            |   7 +
 arch/x86/include/asm/cmdline.h              |   2 +
 arch/x86/include/asm/cpufeature.h           |   4 +
 arch/x86/include/asm/desc.h                 |   2 +-
 arch/x86/include/asm/hw_irq.h               |   2 +-
 arch/x86/include/asm/kaiser.h               | 141 +++++++++
 arch/x86/include/asm/pgtable.h              |  28 +-
 arch/x86/include/asm/pgtable_64.h           |  25 +-
 arch/x86/include/asm/pgtable_types.h        |  29 +-
 arch/x86/include/asm/processor.h            |   2 +-
 arch/x86/include/asm/tlbflush.h             |  74 ++++-
 arch/x86/include/uapi/asm/processor-flags.h |   3 +-
 arch/x86/kernel/cpu/common.c                |  28 +-
 arch/x86/kernel/cpu/perf_event_intel_ds.c   |  57 +++-
 arch/x86/kernel/espfix_64.c                 |  10 +
 arch/x86/kernel/head_64.S                   |  35 ++-
 arch/x86/kernel/irqinit.c                   |   2 +-
 arch/x86/kernel/ldt.c                       |  25 +-
 arch/x86/kernel/paravirt_patch_64.c         |   2 -
 arch/x86/kernel/process.c                   |   2 +-
 arch/x86/kernel/setup.c                     |   7 +
 arch/x86/kernel/tracepoint.c                |   2 +
 arch/x86/kvm/x86.c                          |   3 +-
 arch/x86/lib/cmdline.c                      | 105 +++++++
 arch/x86/mm/Makefile                        |   1 +
 arch/x86/mm/init.c                          |   2 +-
 arch/x86/mm/init_64.c                       |  10 +
 arch/x86/mm/kaiser.c                        | 455 ++++++++++++++++++++++++++++
 arch/x86/mm/pageattr.c                      |  63 +++-
 arch/x86/mm/pgtable.c                       |  16 +-
 arch/x86/mm/tlb.c                           |  39 ++-
 include/asm-generic/vmlinux.lds.h           |   7 +
 include/linux/kaiser.h                      |  52 ++++
 include/linux/mmzone.h                      |   3 +-
 include/linux/percpu-defs.h                 |  32 +-
 init/main.c                                 |   2 +
 kernel/fork.c                               |   6 +
 mm/vmstat.c                                 |   1 +
 security/Kconfig                            |  10 +
 43 files changed, 1375 insertions(+), 98 deletions(-)
On Wed, Jan 03, 2018 at 09:11:06PM +0100, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 4.4.110 release. There are 37 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri Jan 5 19:50:38 UTC 2018. Anything received after that time might be too late.
The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y and the diffstat can be found below.
thanks,
greg k-h
Not that my feedback will matter much on this release, since the Pixel 2 XL is an arm64 device, but it merged, compiled, and flashed successfully.
The changes to kernel/fork.c had to be slightly adjusted for Google's tree due to their addition of mainline commit b235beea9e99 ("Clarify naming of thread info/stack allocators").
No noticeable issues in general use or dmesg.
Thanks! Nathan
On Wed, Jan 03, 2018 at 03:08:09PM -0700, Nathan Chancellor wrote:
Not that my feedback will matter much on this release, since the Pixel 2 XL is an arm64 device, but it merged, compiled, and flashed successfully.
Hey, it's good to know I didn't break anything :)
The changes to kernel/fork.c had to be slightly adjusted for Google's tree due to their addition of mainline commit b235beea9e99 ("Clarify naming of thread info/stack allocators").
Ah, good to know, I'll watch out for that when I do the android-common tree merges.
Thanks again for testing and letting me know.
greg k-h
On 4 January 2018 at 01:41, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 4.4.110 release. There are 37 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri Jan 5 19:50:38 UTC 2018. Anything received after that time might be too late.
The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y and the diffstat can be found below.
thanks,
greg k-h
Results from Linaro’s test farm. No regressions on arm64, arm and x86_64.
NOTE: Retested 20 iterations on each of two devices, and the test completed successfully all 40 times, which confirms this is an intermittent timing failure. The bug has been reported for internal investigation and record:
LKFT:stable-rc 4.4.110-rc1: x15: LTP poll02 FAIL: poll() slept for too long (intermittent)
https://bugs.linaro.org/show_bug.cgi?id=3566
Summary
------------------------------------------------------------------------

kernel: 4.4.110-rc1
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
git branch: linux-4.4.y
git commit: 99abd6cdd65e984d89c8565508a7a96ea0fce179
git describe: v4.4.109-38-g99abd6cdd65e
Test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-4.4-oe/build/v4.4.109-38-...

No Regressions (compared to build v4.4.109)
------------------------------------------------------------------------

Boards, architectures and test suites:
-------------------------------------

juno-r2 - arm64
* boot - pass: 20,
* kselftest - pass: 32, skip: 29
* libhugetlbfs - pass: 90, skip: 1
* ltp-cap_bounds-tests - pass: 2,
* ltp-containers-tests - pass: 28, skip: 36
* ltp-fcntl-locktests-tests - pass: 2,
* ltp-filecaps-tests - pass: 2,
* ltp-fs-tests - pass: 60,
* ltp-fs_bind-tests - pass: 2,
* ltp-fs_perms_simple-tests - pass: 19,
* ltp-fsx-tests - pass: 2,
* ltp-hugetlb-tests - pass: 22,
* ltp-io-tests - pass: 3,
* ltp-ipc-tests - pass: 9,
* ltp-math-tests - pass: 11,
* ltp-nptl-tests - pass: 2,
* ltp-pty-tests - pass: 4,
* ltp-sched-tests - pass: 14,
* ltp-securebits-tests - pass: 4,
* ltp-syscalls-tests - pass: 984, skip: 124
* ltp-timers-tests - pass: 12,

x15 - arm
* boot - pass: 20,
* kselftest - pass: 31, skip: 29
* libhugetlbfs - pass: 87, skip: 1
* ltp-cap_bounds-tests - pass: 2,
* ltp-containers-tests - pass: 64,
* ltp-fcntl-locktests-tests - pass: 2,
* ltp-filecaps-tests - pass: 2,
* ltp-fs-tests - pass: 60,
* ltp-fs_bind-tests - pass: 2,
* ltp-fs_perms_simple-tests - pass: 19,
* ltp-fsx-tests - pass: 2,
* ltp-hugetlb-tests - pass: 20, skip: 2
* ltp-io-tests - pass: 3,
* ltp-ipc-tests - pass: 9,
* ltp-math-tests - pass: 11,
* ltp-nptl-tests - pass: 2,
* ltp-pty-tests - pass: 4,
* ltp-sched-tests - pass: 13, skip: 1
* ltp-securebits-tests - pass: 4,
* ltp-syscalls-tests - fail: 1, pass: 1034, skip: 67
* ltp-timers-tests - pass: 12,

x86_64
* boot - pass: 20,
* kselftest - pass: 44, skip: 32
* libhugetlbfs - pass: 90, skip: 1
* ltp-cap_bounds-tests - pass: 2,
* ltp-containers-tests - pass: 64,
* ltp-fcntl-locktests-tests - pass: 2,
* ltp-filecaps-tests - pass: 2,
* ltp-fs-tests - pass: 61, skip: 1
* ltp-fs_bind-tests - pass: 2,
* ltp-fs_perms_simple-tests - pass: 19,
* ltp-fsx-tests - pass: 2,
* ltp-hugetlb-tests - pass: 22,
* ltp-io-tests - pass: 3,
* ltp-ipc-tests - pass: 9,
* ltp-math-tests - pass: 11,
* ltp-nptl-tests - pass: 2,
* ltp-pty-tests - pass: 4,
* ltp-sched-tests - pass: 9, skip: 1
* ltp-securebits-tests - pass: 4,
* ltp-syscalls-tests - pass: 1013, skip: 117
* ltp-timers-tests - pass: 12,
Hikey board test results,
Summary
------------------------------------------------------------------------

kernel: 4.4.110-rc1
git repo: https://git.linaro.org/lkft/arm64-stable-rc.git
git tag: 4.4.110-rc1-hikey-20180103-95
git commit: 0769c4b4aafd63e5d73b6d67f6fe93abcff67cdc
git describe: 4.4.110-rc1-hikey-20180103-95
Test details: https://qa-reports.linaro.org/lkft/linaro-hikey-stable-rc-4.4-oe/build/4.4.1...

No regressions (compared to build 4.4.110-rc1-hikey-20180103-94)

Boards, architectures and test suites:
-------------------------------------

hi6220-hikey - arm64
* boot - pass: 20,
* kselftest - pass: 30, skip: 31
* libhugetlbfs - pass: 90, skip: 1
* ltp-cap_bounds-tests - pass: 2,
* ltp-containers-tests - pass: 28, skip: 36
* ltp-fcntl-locktests-tests - pass: 2,
* ltp-filecaps-tests - pass: 2,
* ltp-fs-tests - pass: 60,
* ltp-fs_bind-tests - pass: 2,
* ltp-fs_perms_simple-tests - pass: 19,
* ltp-fsx-tests - pass: 2,
* ltp-hugetlb-tests - pass: 21, skip: 1
* ltp-io-tests - pass: 3,
* ltp-ipc-tests - pass: 9,
* ltp-math-tests - pass: 11,
* ltp-nptl-tests - pass: 2,
* ltp-pty-tests - pass: 4,
* ltp-sched-tests - pass: 14,
* ltp-securebits-tests - pass: 4,
* ltp-syscalls-tests - pass: 980, skip: 124
* ltp-timers-tests - pass: 12,

Documentation - https://collaborate.linaro.org/display/LKFT/Email+Reports

Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
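For readers unfamiliar with the "git describe" strings in the reports above, the following is a minimal sketch of how such a string decomposes, assuming the usual TAG-N-gHASH form. The helper name parse_git_describe is hypothetical and not part of any tool used in this thread. The count of 38 is consistent with the 37 patches plus the "Linux 4.4.110-rc1" commit in the shortlog.

```python
import re


def parse_git_describe(describe):
    """Split a 'TAG-N-gHASH' string into (tag, commits_since_tag, short_hash).

    Hypothetical helper for illustration; it only handles the long form
    that git emits when HEAD is N commits past the nearest tag.
    """
    match = re.fullmatch(
        r"(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)", describe
    )
    if match is None:
        raise ValueError("not in TAG-N-gHASH form: %r" % describe)
    return match.group("tag"), int(match.group("count")), match.group("sha")


# Values copied from the report summary above.
tag, count, sha = parse_git_describe("v4.4.109-38-g99abd6cdd65e")
full_commit = "99abd6cdd65e984d89c8565508a7a96ea0fce179"

# The abbreviated hash should be a prefix of the full commit id.
print(tag, count, full_commit.startswith(sha))
```

A string such as the hikey build's "4.4.110-rc1-hikey-20180103-95" is an exact tag rather than the long form, so this sketch would reject it with a ValueError.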
stable-rc/linux-4.4.y boot: 100 boots: 4 failed, 93 passed with 1 offline, 2 conflicts (v4.4.109-38-g99abd6cdd65e)
Full Boot Summary: https://kernelci.org/boot/all/job/stable-rc/branch/linux-4.4.y/kernel/v4.4.1...
Full Build Summary: https://kernelci.org/build/stable-rc/branch/linux-4.4.y/kernel/v4.4.109-38-g...

Tree: stable-rc
Branch: linux-4.4.y
Git Describe: v4.4.109-38-g99abd6cdd65e
Git Commit: 99abd6cdd65e984d89c8565508a7a96ea0fce179
Git URL: http://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Tested: 53 unique boards, 19 SoC families, 16 builds out of 178

Boot Regressions Detected:

arm:

    exynos_defconfig:
        exynos5422-odroidxu3:
            lab-collabora: failing since 58 days (last pass: v4.4.95-21-g32458fcb7bd6 - first fail: v4.4.96-41-g336421367b9c)

    multi_v7_defconfig:
        armada-xp-linksys-mamba:
            lab-free-electrons: new failure (last pass: v4.4.109-36-g8b381424010c)
        tegra124-nyan-big:
            lab-collabora: failing since 1 day (last pass: v4.4.109 - first fail: v4.4.109-36-g8b381424010c)

    tegra_defconfig:
        tegra124-nyan-big:
            lab-collabora: failing since 1 day (last pass: v4.4.108-65-g57856049c0f8 - first fail: v4.4.109)

Boot Failures Detected:

arm:

    multi_v7_defconfig
        armada-xp-linksys-mamba: 1 failed lab
        tegra124-nyan-big: 1 failed lab

    exynos_defconfig
        exynos5422-odroidxu3_rootfs:nfs: 1 failed lab

    tegra_defconfig
        tegra124-nyan-big: 1 failed lab

Offline Platforms:

arm:

    davinci_all_defconfig:
        dm365evm,legacy: 1 offline lab

Conflicting Boot Failures Detected:
(These likely are not failures as other labs are reporting PASS. Needs review.)

arm:

    multi_v7_defconfig:
        exynos5422-odroidxu3:
            lab-baylibre-seattle: PASS
            lab-collabora: FAIL

    exynos_defconfig:
        exynos5422-odroidxu3:
            lab-baylibre-seattle: PASS
            lab-collabora: FAIL

---
For more info write to <info@kernelci.org>
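As a quick sanity check on the boot report's subject line, the counts can be pulled out and summed. This is only a sketch: the regular expression is tailored to the one summary line quoted in this thread, and the function name parse_boot_summary is hypothetical. Other kernelci.org reports may word the line differently.

```python
import re

# Summary line copied verbatim from the report above.
SUMMARY = (
    "stable-rc/linux-4.4.y boot: 100 boots: 4 failed, 93 passed "
    "with 1 offline, 2 conflicts (v4.4.109-38-g99abd6cdd65e)"
)


def parse_boot_summary(line):
    """Extract the integer counts from a boot summary line (assumed format)."""
    pattern = (
        r"(?P<total>\d+) boots: (?P<failed>\d+) failed, "
        r"(?P<passed>\d+) passed with (?P<offline>\d+) offline, "
        r"(?P<conflicts>\d+) conflicts"
    )
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("unrecognised summary line")
    return {name: int(value) for name, value in match.groupdict().items()}


counts = parse_boot_summary(SUMMARY)
accounted = (
    counts["failed"] + counts["passed"] + counts["offline"] + counts["conflicts"]
)

# Every boot should fall into exactly one of the four buckets.
print(counts["total"], accounted)
```

For this report the four buckets add up to the stated total of 100 boots, and the four failures match the four entries under "Boot Failures Detected".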
kernelci.org bot bot@kernelci.org writes:
stable-rc/linux-4.4.y boot: 100 boots: 4 failed, 93 passed with 1 offline, 2 conflicts (v4.4.109-38-g99abd6cdd65e)
Full Boot Summary: https://kernelci.org/boot/all/job/stable-rc/branch/linux-4.4.y/kernel/v4.4.1... Full Build Summary: https://kernelci.org/build/stable-rc/branch/linux-4.4.y/kernel/v4.4.109-38-g...
Tree: stable-rc Branch: linux-4.4.y Git Describe: v4.4.109-38-g99abd6cdd65e Git Commit: 99abd6cdd65e984d89c8565508a7a96ea0fce179 Git URL: http://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git Tested: 53 unique boards, 19 SoC families, 16 builds out of 178
TL;DR; All is well.
Boot Regressions Detected:
arm:
exynos_defconfig: exynos5422-odroidxu3: lab-collabora: failing since 58 days (last pass: v4.4.95-21-g32458fcb7bd6 - first fail: v4.4.96-41-g336421367b9c)
Long-standing issue in lab-collabora (passing in other labs). Guillaume?
multi_v7_defconfig: armada-xp-linksys-mamba: lab-free-electrons: new failure (last pass: v4.4.109-36-g8b381424010c)
Not a kernel issue; the bootROM fails to start the bootloader. I pinged the lab owners (Free Electrons).
tegra124-nyan-big: lab-collabora: failing since 1 day (last pass: v4.4.109 - first fail: v4.4.109-36-g8b381424010c) tegra_defconfig: tegra124-nyan-big: lab-collabora: failing since 1 day (last pass: v4.4.108-65-g57856049c0f8 - first fail: v4.4.109)
This one is booting fine, but the command to power-off the board is timing out, resulting in a failure report.
Kevin
On 05/01/18 00:06, Kevin Hilman wrote:
kernelci.org bot bot@kernelci.org writes:
stable-rc/linux-4.4.y boot: 100 boots: 4 failed, 93 passed with 1 offline, 2 conflicts (v4.4.109-38-g99abd6cdd65e)
Full Boot Summary: https://kernelci.org/boot/all/job/stable-rc/branch/linux-4.4.y/kernel/v4.4.1... Full Build Summary: https://kernelci.org/build/stable-rc/branch/linux-4.4.y/kernel/v4.4.109-38-g...
Tree: stable-rc Branch: linux-4.4.y Git Describe: v4.4.109-38-g99abd6cdd65e Git Commit: 99abd6cdd65e984d89c8565508a7a96ea0fce179 Git URL: http://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git Tested: 53 unique boards, 19 SoC families, 16 builds out of 178
TL;DR; All is well.
Boot Regressions Detected:
arm:
exynos_defconfig: exynos5422-odroidxu3: lab-collabora: failing since 58 days (last pass: v4.4.95-21-g32458fcb7bd6 - first fail: v4.4.96-41-g336421367b9c)
Long-standing issue in lab-collabora (passing in other labs). Guillaume?
This should be fixed now, with a tweak to the device config to enable relocating the ramdisk and dtb:
https://review.linaro.org/#/c/23238/
multi_v7_defconfig: armada-xp-linksys-mamba: lab-free-electrons: new failure (last pass: v4.4.109-36-g8b381424010c)
Not a kernel issue; the bootROM fails to start the bootloader. I pinged the lab owners (Free Electrons).
tegra124-nyan-big: lab-collabora: failing since 1 day (last pass: v4.4.109 - first fail: v4.4.109-36-g8b381424010c) tegra_defconfig: tegra124-nyan-big: lab-collabora: failing since 1 day (last pass: v4.4.108-65-g57856049c0f8 - first fail: v4.4.109)
This one is booting fine, but the command to power-off the board is timing out, resulting in a failure report.
Indeed, this was due to a crash of the lavapdu daemon - it's back on track now.
(On a side note, the tegra124-nyan-big is still failing to boot in mainline due to a genuine kernel driver issue.)
Guillaume
I am getting the following panic when trying to boot 4.4.110rc1 on Intel(R) Xeon(R) CPU E5-2630:
[    5.923489] BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
[    5.932259] IP: [<ffffffff810e70d2>] dyntick_save_progress_counter+0x12/0x50
[    5.940142] PGD 0
[    5.942400] Oops: 0002 [#1] SMP
[    5.946023] Modules linked in:
[    5.949448] CPU: 5 PID: 8 Comm: rcu_sched Not tainted 4.4.110-rc1_pt_linux-4.4.110rc1 #1
[    5.958484] Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
[    5.969552] task: ffff881ff2f1ab00 ti: ffff881ff2f24000 task.ti: ffff881ff2f24000
[    5.977905] RIP: 0010:[<ffffffff810e70d2>] [<ffffffff810e70d2>] dyntick_save_progress_counter+0x12/0x50
[    5.988505] RSP: 0000:ffff881ff2f27dc0 EFLAGS: 00010046
[    5.994434] RAX: 0000000000000001 RBX: ffffffff81b02140 RCX: ffff883fec768000
[    6.002403] RDX: 0000000000000000 RSI: ffff881ff2f27e5f RDI: ffff88407e958140
[    6.010368] RBP: ffff881ff2f27dc0 R08: ffff881ff2f27e78 R09: 000000016110f359
[    6.018333] R10: 0000000000000b10 R11: 0000000000000000 R12: ffffffff81b02140
[    6.026297] R13: 00000000ffffffdf R14: 0000000000000021 R15: 0000000200000000
[    6.034262] FS: 0000000000000000(0000) GS:ffff881fff940000(0000) knlGS:0000000000000000
[    6.043293] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.049707] CR2: 000000000000000d CR3: 0000000001aa6000 CR4: 0000000000360670
[    6.057672] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    6.065638] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    6.073603] Stack:
[    6.075847] ffff881ff2f27e18 ffffffff810e8fac 0000000000000202 ffff881ff2f27e60
[    6.084158] ffff881ff2f27e5f ffffffff810e70c0 ffffffff81b02140 ffffffff81b127a0
[    6.092465] 0000000000000001 0000000000000000 0000000000000003 ffff881ff2f27eb8
[    6.100768] Call Trace:
[    6.103501] [<ffffffff810e8fac>] force_qs_rnp+0xdc/0x150
[    6.109527] [<ffffffff810e70c0>] ? rcu_start_gp+0x70/0x70
[    6.115654] [<ffffffff810ea118>] rcu_gp_kthread+0x468/0x9b0
[    6.121976] [<ffffffff810c9190>] ? prepare_to_wait_event+0xf0/0xf0
[    6.128973] [<ffffffff810e9cb0>] ? rcu_process_callbacks+0x5f0/0x5f0
[    6.136167] [<ffffffff810a4a25>] kthread+0xe5/0x100
[    6.141710] [<ffffffff810a4940>] ? kthread_park+0x60/0x60
[    6.147840] [<ffffffff81714e8f>] ret_from_fork+0x3f/0x70
[    6.153868] [<ffffffff810a4940>] ? kthread_park+0x60/0x60
I tried to bisect the problem, but when I boot with only "KAISER: Kernel Address Isolation" applied, the machine hangs during boot and reboots without any panic message.
4.4.109 boots fine.
4.9.75rc1 also boots fine.
Thank you, Pavel
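As an aid to reading the "Oops: 0002" line in the panic above, here is a minimal sketch that decodes the low bits of an x86 page-fault error code. It follows the architectural bit layout (present, write, user, reserved-bit, instruction-fetch). The table and function names are made up for this illustration, and the sketch says nothing about the root cause of the crash.

```python
# (mask, meaning when set, meaning when clear); None means "say nothing".
PAGE_FAULT_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in a paging entry", None),
    (0x10, "instruction fetch", None),
]


def decode_page_fault(code_hex):
    """Return a list of plain-text descriptions for a hex error code."""
    code = int(code_hex, 16)
    described = []
    for mask, if_set, if_clear in PAGE_FAULT_BITS:
        text = if_set if code & mask else if_clear
        if text is not None:
            described.append(text)
    return described


# "0002" is the value printed after "Oops:" in the log above.
print(decode_page_fault("0002"))
```

For 0002 this reads as a kernel-mode write to a page that is not present, which is consistent with the reported NULL pointer dereference at address 0xd in CR2.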
On Wed, Jan 3, 2018 at 3:11 PM, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 4.4.110 release. There are 37 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri Jan 5 19:50:38 UTC 2018. Anything received after that time might be too late.
The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y and the diffstat can be found below.
thanks,
greg k-h
On Thu, Jan 04, 2018 at 11:38:25AM -0500, Pavel Tatashin wrote:
I am getting the following panic when trying to boot 4.4.110rc1 on Intel(R) Xeon(R) CPU E5-2630:
[    5.923489] BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
[    5.932259] IP: [<ffffffff810e70d2>] dyntick_save_progress_counter+0x12/0x50
[    5.940142] PGD 0
[    5.942400] Oops: 0002 [#1] SMP
[    5.946023] Modules linked in:
[    5.949448] CPU: 5 PID: 8 Comm: rcu_sched Not tainted 4.4.110-rc1_pt_linux-4.4.110rc1 #1
[    5.958484] Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
[    5.969552] task: ffff881ff2f1ab00 ti: ffff881ff2f24000 task.ti: ffff881ff2f24000
[    5.977905] RIP: 0010:[<ffffffff810e70d2>] [<ffffffff810e70d2>] dyntick_save_progress_counter+0x12/0x50
[    5.988505] RSP: 0000:ffff881ff2f27dc0 EFLAGS: 00010046
[    5.994434] RAX: 0000000000000001 RBX: ffffffff81b02140 RCX: ffff883fec768000
[    6.002403] RDX: 0000000000000000 RSI: ffff881ff2f27e5f RDI: ffff88407e958140
[    6.010368] RBP: ffff881ff2f27dc0 R08: ffff881ff2f27e78 R09: 000000016110f359
[    6.018333] R10: 0000000000000b10 R11: 0000000000000000 R12: ffffffff81b02140
[    6.026297] R13: 00000000ffffffdf R14: 0000000000000021 R15: 0000000200000000
[    6.034262] FS: 0000000000000000(0000) GS:ffff881fff940000(0000) knlGS:0000000000000000
[    6.043293] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.049707] CR2: 000000000000000d CR3: 0000000001aa6000 CR4: 0000000000360670
[    6.057672] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    6.065638] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    6.073603] Stack:
[    6.075847] ffff881ff2f27e18 ffffffff810e8fac 0000000000000202 ffff881ff2f27e60
[    6.084158] ffff881ff2f27e5f ffffffff810e70c0 ffffffff81b02140 ffffffff81b127a0
[    6.092465] 0000000000000001 0000000000000000 0000000000000003 ffff881ff2f27eb8
[    6.100768] Call Trace:
[    6.103501] [<ffffffff810e8fac>] force_qs_rnp+0xdc/0x150
[    6.109527] [<ffffffff810e70c0>] ? rcu_start_gp+0x70/0x70
[    6.115654] [<ffffffff810ea118>] rcu_gp_kthread+0x468/0x9b0
[    6.121976] [<ffffffff810c9190>] ? prepare_to_wait_event+0xf0/0xf0
[    6.128973] [<ffffffff810e9cb0>] ? rcu_process_callbacks+0x5f0/0x5f0
[    6.136167] [<ffffffff810a4a25>] kthread+0xe5/0x100
[    6.141710] [<ffffffff810a4940>] ? kthread_park+0x60/0x60
[    6.147840] [<ffffffff81714e8f>] ret_from_fork+0x3f/0x70
[    6.153868] [<ffffffff810a4940>] ? kthread_park+0x60/0x60
I tried to bisect the problem, but when I boot with only "KAISER: Kernel Address Isolation" applied, the machine hangs during boot and reboots without any panic message.
4.4.109 boots fine.
4.9.75rc1 also boots fine.
Hm, so I'm guessing 4.15-rc6 also works?
Odd that 4.9.75-rc1 fails.
Adding Jiri and Hugh and Dave here to see if they have seen this before...
thanks,
greg k-h
On Thu, Jan 04, 2018 at 05:53:06PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 11:38:25AM -0500, Pavel Tatashin wrote:
I am getting the following panic when trying to boot 4.4.110rc1 on Intel(R) Xeon(R) CPU E5-2630:
[...]
I tried to bisect the problem, but when I try to boot only with: "KAISER: Kernel Address Isolation" machine hangs during boot and reboots without any panic message.
4.4.109 boots fine
4.9.75rc1 also boots fine.
Hm, so I'm guessing 4.15-rc6 also works?
Odd that 4.9.75-rc1 fails.
I thought the above says that it boots fine ?
Guenter
On Thu, Jan 04, 2018 at 09:01:06AM -0800, Guenter Roeck wrote:
On Thu, Jan 04, 2018 at 05:53:06PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 11:38:25AM -0500, Pavel Tatashin wrote:
I am getting the following panic when trying to boot 4.4.110rc1 on Intel(R) Xeon(R) CPU E5-2630:
[...]
I tried to bisect the problem, but when I try to boot only with: "KAISER: Kernel Address Isolation" machine hangs during boot and reboots without any panic message.
4.4.109 boots fine
4.9.75rc1 also boots fine.
Hm, so I'm guessing 4.15-rc6 also works?
Odd that 4.9.75-rc1 fails.
Sorry, it's been a long few days, I meant "odd that the 4.9 -rc works and the 4.4 one fails".
{sigh}
I think I need to ignore email for a while...
greg k-h
On Thu, Jan 04, 2018 at 05:53:06PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 11:38:25AM -0500, Pavel Tatashin wrote:
I am getting the following panic when trying to boot 4.4.110rc1 on Intel(R) Xeon(R) CPU E5-2630:
[...]
I tried to bisect the problem, but when I try to boot only with: "KAISER: Kernel Address Isolation" machine hangs during boot and reboots without any panic message.
4.4.109 boots fine
4.9.75rc1 also boots fine.
Hm, so I'm guessing 4.15-rc6 also works?
Odd that 4.9.75-rc1 fails.
s/4.9.75/4.4.110/ I suppose.
Can't this be because more patches are required in 4.4 to support this patch set ? Or maybe a manual fix for a conflict that went wrong ? Just trying to guess.
Willy
On Thu, Jan 04, 2018 at 06:03:15PM +0100, Willy Tarreau wrote:
On Thu, Jan 04, 2018 at 05:53:06PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 11:38:25AM -0500, Pavel Tatashin wrote:
I am getting the following panic when trying to boot 4.4.110rc1 on Intel(R) Xeon(R) CPU E5-2630:
[...]
I tried to bisect the problem, but when I try to boot only with: "KAISER: Kernel Address Isolation" machine hangs during boot and reboots without any panic message.
4.4.109 boots fine
4.9.75rc1 also boots fine.
Hm, so I'm guessing 4.15-rc6 also works?
Odd that 4.9.75-rc1 fails.
s/4.9.75/4.4.110/ I suppose.
Yes, mistake on my side.
Can't this be because more patches are required in 4.4 to support this patch set ? Or maybe a manual fix for a conflict that went wrong ? Just trying to guess.
Odd thing is, the 4.9 series started from the 4.4 code for most of the patches, so I would expect that one to fail...
greg k-h
On Thu, Jan 04, 2018 at 06:11:02PM +0100, Greg Kroah-Hartman wrote:
Can't this be because more patches are required in 4.4 to support this patch set ? Or maybe a manual fix for a conflict that went wrong ? Just trying to guess.
Odd thing is, the 4.9 series started from the 4.4 code for most of the patches, so I would expect that one to fail...
I see. Then maybe a missing patch somewhere in 4.4 compared to 4.9 :-/ I have no idea what to look for however.
Willy
On Thu, Jan 04, 2018 at 06:11:02PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 06:03:15PM +0100, Willy Tarreau wrote:
On Thu, Jan 04, 2018 at 05:53:06PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 11:38:25AM -0500, Pavel Tatashin wrote:
I am getting the following panic when trying to boot 4.4.110rc1 on Intel(R) Xeon(R) CPU E5-2630:
[...]
I tried to bisect the problem, but when I try to boot only with: "KAISER: Kernel Address Isolation" machine hangs during boot and reboots without any panic message.
4.4.109 boots fine
4.9.75rc1 also boots fine.
Hm, so I'm guessing 4.15-rc6 also works?
Odd that 4.9.75-rc1 fails.
s/4.9.75/4.4.110/ I suppose.
Yes, mistake on my side.
Can't this be because more patches are required in 4.4 to support this patch set ? Or maybe a manual fix for a conflict that went wrong ? Just trying to guess.
Odd thing is, the 4.9 series started from the 4.4 code for most of the patches, so I would expect that one to fail...
Also, the 4.4 patches were supposed to have been better tested, I need to go dig and see what I messed up here...
greg k-h
On Thu, Jan 04, 2018 at 06:14:15PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 06:11:02PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 06:03:15PM +0100, Willy Tarreau wrote:
On Thu, Jan 04, 2018 at 05:53:06PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 11:38:25AM -0500, Pavel Tatashin wrote:
I am getting the following panic when trying to boot 4.4.110rc1 on Intel(R) Xeon(R) CPU E5-2630:
[...]
I tried to bisect the problem, but when I try to boot only with: "KAISER: Kernel Address Isolation" machine hangs during boot and reboots without any panic message.
4.4.109 boots fine
4.9.75rc1 also boots fine.
Hm, so I'm guessing 4.15-rc6 also works?
Odd that 4.9.75-rc1 fails.
s/4.9.75/4.4.110/ I suppose.
Yes, mistake on my side.
Can't this be because more patches are required in 4.4 to support this patch set ? Or maybe a manual fix for a conflict that went wrong ? Just trying to guess.
Odd thing is, the 4.9 series started from the 4.4 code for most of the patches, so I would expect that one to fail...
Also, the 4.4 patches were supposed to have been better tested, I need to go dig and see what I messed up here...
Nope, it matches up with what is in SLES12 exactly, I must be missing something else here as a prerequisite...
On Thu, Jan 04, 2018 at 06:16:04PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 06:14:15PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 06:11:02PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 06:03:15PM +0100, Willy Tarreau wrote:
On Thu, Jan 04, 2018 at 05:53:06PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 11:38:25AM -0500, Pavel Tatashin wrote:
I am getting the following panic when trying to boot 4.4.110rc1 on Intel(R) Xeon(R) CPU E5-2630:
[...]
I tried to bisect the problem, but when I try to boot only with: "KAISER: Kernel Address Isolation" machine hangs during boot and reboots without any panic message.
4.4.109 boots fine
4.9.75rc1 also boots fine.
Hm, so I'm guessing 4.15-rc6 also works?
Odd that 4.9.75-rc1 fails.
s/4.9.75/4.4.110/ I suppose.
Yes, mistake on my side.
Can't this be because more patches are required in 4.4 to support this patch set ? Or maybe a manual fix for a conflict that went wrong ? Just trying to guess.
Odd thing is, the 4.9 series started from the 4.4 code for most of the patches, so I would expect that one to fail...
Also, the 4.4 patches were supposed to have been better tested, I need to go dig and see what I messed up here...
Nope, it matches up with what is in SLES12 exactly, I must be missing something else here as a prerequisite...
FWIW, v4.4.110-rc1 boots fine when merged into chromeos-4.4, on i7-7Y75.
Guenter
On Thu, Jan 04, 2018 at 09:56:47AM -0800, Guenter Roeck wrote:
On Thu, Jan 04, 2018 at 06:16:04PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 06:14:15PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 06:11:02PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 06:03:15PM +0100, Willy Tarreau wrote:
On Thu, Jan 04, 2018 at 05:53:06PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 11:38:25AM -0500, Pavel Tatashin wrote:
I am getting the following panic when trying to boot 4.4.110rc1 on Intel(R) Xeon(R) CPU E5-2630:
[...]
I tried to bisect the problem, but when I try to boot only with: "KAISER: Kernel Address Isolation" machine hangs during boot and reboots without any panic message.
4.4.109 boots fine
4.9.75rc1 also boots fine.
Hm, so I'm guessing 4.15-rc6 also works?
Odd that 4.9.75-rc1 fails.
s/4.9.75/4.4.110/ I suppose.
Yes, mistake on my side.
Can't this be because more patches are required in 4.4 to support this patch set ? Or maybe a manual fix for a conflict that went wrong ? Just trying to guess.
Odd thing is, the 4.9 series started from the 4.4 code for most of the patches, so I would expect that one to fail...
Also, the 4.4 patches were supposed to have been better tested, I need to go dig and see what I messed up here...
Nope, it matches up with what is in SLES12 exactly, I must be missing something else here as a prerequisite...
FWIW, v4.4.110-rc1 boots fine when merged into chromeos-4.4, on i7-7Y75.
That's good to know, hopefully 4.4.110-final also still works for you :)
On Fri, Jan 05, 2018 at 04:00:55PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 09:56:47AM -0800, Guenter Roeck wrote:
FWIW, v4.4.110-rc1 boots fine when merged into chromeos-4.4, on i7-7Y75.
That's good to know, hopefully 4.4.110-final also still works for you :)
It seems to be working. One patch to add for v4.4.111:
063fb3e56f6d ("x86/kasan: Write protect kasan zero shadow")
It is needed to be able to run KASAN enabled images in KVM.
Guenter
On Fri, Jan 05, 2018 at 10:12:38AM -0800, Guenter Roeck wrote:
On Fri, Jan 05, 2018 at 04:00:55PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 09:56:47AM -0800, Guenter Roeck wrote:
FWIW, v4.4.110-rc1 boots fine when merged into chromeos-4.4, on i7-7Y75.
That's good to know, hopefully 4.4.110-final also still works for you :)
It seems to be working. One patch to add for v4.4.111:
063fb3e56f6d ("x86/kasan: Write protect kasan zero shadow")
It is needed to be able to run KASAN enabled images in KVM.
Ugh, thanks for that, it looks like SLES is missing that one too.
thanks,
greg k-h
On Thu, Jan 4, 2018 at 8:38 AM, Pavel Tatashin soleen@gmail.com wrote:
I am getting the following panic when trying to boot 4.4.110rc1 on Intel(R) Xeon(R) CPU E5-2630:
[ 5.923489] BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
[ 5.932259] IP: [<ffffffff810e70d2>] dyntick_save_progress_counter+0x12/0x50
Hmm. You don't have the "Code:" line in this oops anywhere, do you?
[ 5.977905] RIP: dyntick_save_progress_counter+0x12/0x50
[ 5.988505] RSP: 0000:ffff881ff2f27dc0 EFLAGS: 00010046
[ 5.994434] RAX: 0000000000000001 RBX: ffffffff81b02140 RCX: ffff883fec768000
[ 6.002403] RDX: 0000000000000000 RSI: ffff881ff2f27e5f RDI: ffff88407e958140
[ 6.010368] RBP: ffff881ff2f27dc0 R08: ffff881ff2f27e78 R09: 000000016110f359
[ 6.018333] R10: 0000000000000b10 R11: 0000000000000000 R12: ffffffff81b02140
[ 6.026297] R13: 00000000ffffffdf R14: 0000000000000021 R15: 0000000200000000
[ 6.034262] FS: 0000000000000000(0000) GS:ffff881fff940000(0000) knlGS:0000000000000000
[ 6.043293] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.049707] CR2: 000000000000000d CR3: 0000000001aa6000 CR4: 0000000000360670
[ 6.057672] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6.065638] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6.073603] Stack:
[ 6.075847] ffff881ff2f27e18 ffffffff810e8fac 0000000000000202 ffff881ff2f27e60
[ 6.084158] ffff881ff2f27e5f ffffffff810e70c0 ffffffff81b02140 ffffffff81b127a0
[ 6.092465] 0000000000000001 0000000000000000 0000000000000003 ffff881ff2f27eb8
[ 6.100768] Call Trace:
[ 6.103501] [<ffffffff810e8fac>] force_qs_rnp+0xdc/0x150
The oops looks like it *might* be this:
lock xadd %edx,0xc(%rax)
which is from the
int snap = atomic_add_return(0, &rdtp->dynticks);
in rcu_dynticks_snap() because %rax is 1 and that would give you the invalid page fault and the right faulting address.
But that would be complete rcu data structure corruption (that rdtp pointer comes from
per_cpu_ptr(rsp->rda, cpu)
in force_qs_rnp(), afaik).
The PTI patches obviously change percpu stuff, but this looks like an odd place for that to manifest.
Linus
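For anyone following along, the arithmetic behind that guess can be checked with a small sketch. It assumes the faulting instruction really is the `lock xadd %edx,0xc(%rax)` suggested above (the oops has no "Code:" line to confirm it); the two register lines are copied from the oops in this thread, and the helper names `parse_registers` and `effective_address` are made up for illustration.

```python
import re

# Two register lines copied from the oops quoted in this thread.
OOPS_LINES = """\
[ 5.994434] RAX: 0000000000000001 RBX: ffffffff81b02140 RCX: ffff883fec768000
[ 6.049707] CR2: 000000000000000d CR3: 0000000001aa6000 CR4: 0000000000360670
"""


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair in the dump."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def effective_address(base, displacement):
    """Address of a 'disp(%base)' memory operand, wrapped to 64 bits."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF


regs = parse_registers(OOPS_LINES)

# 0xc is the displacement in the guessed instruction (an assumption, see above).
guess = effective_address(regs["RAX"], 0xC)

# CR2 holds the faulting address reported by the CPU.
print(hex(guess), guess == regs["CR2"])
```

If the guess is right, the computed address matches CR2, which is consistent with %rax holding 1 instead of a valid rdtp pointer.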
On Wed, Jan 03, 2018 at 09:11:06PM +0100, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 4.4.110 release. There are 37 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri Jan 5 19:50:38 UTC 2018. Anything received after that time might be too late.
For v4.4.109-38-g99abd6c:
Build results:
total: 145 pass: 145 fail: 0
Qemu test results:
total: 118 pass: 118 fail: 0
Details are available at http://kerneltests.org/builders.
Guenter
When I start 4.4.110-rc1 on a virtual machine (qemu) init throws a segfault and the kernel panics (attempted to kill init). The VM host is a Haswell system.
The same kernel binary boots fine on another Haswell system.
I tried:
4.4.110-rc1 broken
4.4.109 ok
4.9.75-rc1 ok
All systems are OpenSuSE 42.3 64bit.
qemu is started only with:
qemu-system-x86_64 -m 2048 -enable-kvm -drive file=tvsuse,format=raw,if=none,id=virtdisk0 -device virtio-blk-pci,scsi=off,drive=virtdisk0
Am I the only one who sees this? Has anyone booted that kernel on qemu?
Confused,
Thomas
On Thu, Jan 04, 2018 at 08:38:23PM +0100, Thomas Voegtle wrote:
When I start 4.4.110-rc1 on a virtual machine (qemu) init throws a segfault and the kernel panics (attempted to kill init). The VM host is a Haswell system.
The same kernel binary boots fine on another Haswell system.
I tried:
4.4.110-rc1 broken
4.4.109 ok
4.9.75-rc1 ok
Does 4.15-rc6 also work ok?
All systems are OpenSuSE 42.3 64bit.
qemu is started only with:
qemu-system-x86_64 -m 2048 -enable-kvm -drive file=tvsuse,format=raw,if=none,id=virtdisk0 -device virtio-blk-pci,scsi=off,drive=virtdisk0
Am I the only one who sees this? Has anyone booted that kernel on qemu?
Any chance we can see the panic?
There's another error report of this same type of thing on this thread, did you see that?
thanks,
greg k-h
On Thu, 4 Jan 2018, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 08:38:23PM +0100, Thomas Voegtle wrote:
When I start 4.4.110-rc1 on a virtual machine (qemu) init throws a segfault and the kernel panics (attempted to kill init). The VM host is a Haswell system.
The same kernel binary boots fine on a (other) Haswell system.
I tried:
4.4.110-rc1 broken 4.4.109 ok 4.9.75-rc1 ok
Does 4.15-rc6 also work ok?
Yes. Slightly different kernel config, but it boots.
All systems are OpenSuSE 42.3 64bit.
qemu is started only with: qemu-system-x86_64 -m 2048 -enable-kvm -drive file=tvsuse,format=raw,if=none,id=virtdisk0 -device virtio-blk-pci,scsi=off,drive=virtdisk0
Am I the only one who sees this? Has anyone booted that kernel on qemu?
Any chance we can see the panic?
Attached a screenshot. Is that useful? Are there some debug options I can add?
On Thu, Jan 4, 2018 at 12:16 PM, Thomas Voegtle tv@lio96.de wrote:
Attached a screenshot. Is that useful? Are there some debug options I can add?
Not much of an oops, because the SIGSEGV happens in user space. The only reason you get any kernel stack printout at all is because 'init' dying will make the kernel print that out.
The segfault address for init looks like the fixmap area to me (first byte in the last page of the fixmap?). "Error 5" means that it's a user-space read that got a protection fault. So it's not an LDT or GDT update or anything like that, it's a normal access from user space (or a qemu emulation bug, but that sounds unlikely).
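The reading of "Error 5" above follows from the architectural layout of the x86 page-fault error code. A minimal sketch in C, assuming only the three low bits matter here; the struct and function names are hypothetical helpers for illustration, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Architectural x86 page-fault error code bits. */
#define PF_PROT  (1u << 0) /* set: protection violation; clear: page not present */
#define PF_WRITE (1u << 1) /* set: write access; clear: read access */
#define PF_USER  (1u << 2) /* set: fault taken in user mode */

/* Hypothetical helper type for this sketch; not a kernel structure. */
struct pf_info {
    bool protection;
    bool write;
    bool user;
};

/* Split the error code into its three low flag bits. */
struct pf_info decode_pf_error(unsigned int code)
{
    struct pf_info info = {
        .protection = (code & PF_PROT) != 0,
        .write = (code & PF_WRITE) != 0,
        .user = (code & PF_USER) != 0,
    };
    return info;
}

/* Format a one-line description such as "user-mode read, protection violation". */
int describe_pf_error(unsigned int code, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    struct pf_info info = decode_pf_error(code);
    return snprintf(buf, len, "%s-mode %s, %s",
                    info.user ? "user" : "kernel",
                    info.write ? "write" : "read",
                    info.protection ? "protection violation" : "page not present");
}
```

With code 5 (binary 101) the user and protection bits are set and the write bit is clear, which matches "a user-space read that got a protection fault". The "Oops: 0002" reported later in this thread decodes the same way to a kernel-mode write to a page that was not present.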
Is that the vsyscall page?
Adding Luto to the participants. I think he noticed one of the vsyscall patches missing earlier in the 4.9 series. Maybe the 4.4 series had something similar..
Linus
On Jan 4, 2018, at 12:29 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Jan 4, 2018 at 12:16 PM, Thomas Voegtle tv@lio96.de wrote:
Attached a screenshot. Is that useful? Are there some debug options I can add?
Not much of an oops, because the SIGSEGV happens in user space. The only reason you get any kernel stack printout at all is because 'init' dying will make the kernel print that out.
The segfault address for init looks like the fixmap area to me (first byte in the last page of the fixmap?). "Error 5" means that it's a user-space read that got a protection fault. So it's not an LDT or GDT update or anything like that, it's a normal access from user space (or a qemu emulation bug, but that sounds unlikely).
Is that the vsyscall page?
Adding Luto to the participants. I think he noticed one of the vsyscall patches missing earlier in the 4.9 series. Maybe the 4.4 series had something similar..
That's almost certainly it.
I'll try to find some time today or tomorrow to add a proper selftest.
Linus
On Thu, Jan 4, 2018 at 12:43 PM, Andy Lutomirski luto@amacapital.net wrote:
On Jan 4, 2018, at 12:29 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Jan 4, 2018 at 12:16 PM, Thomas Voegtle tv@lio96.de wrote:
Attached a screenshot. Is that useful? Are there some debug options I can add?
Not much of an oops, because the SIGSEGV happens in user space. The only reason you get any kernel stack printout at all is because 'init' dying will make the kernel print that out.
The segfault address for init looks like the fixmap area to me (first byte in the last page of the fixmap?). "Error 5" means that it's a user-space read that got a protection fault. So it's not an LDT or GDT update or anything like that, it's a normal access from user space (or a qemu emulation bug, but that sounds unlikely).
Is that the vsyscall page?
Adding Luto to the participants. I think he noticed one of the vsyscall patches missing earlier in the 4.9 series. Maybe the 4.4 series had something similar..
That's almost certainly it.
I'm hopeless on the FIXMAP arithmetic, but I'm pretty sure that ffffffffff5ff000 is either VSYSCALL page or PVCLOCK page (I think it was VVAR page when init segfaulted on it in my 3.2).
I'll forward Borislav's suggested 4.4 VSYSCALL patch from the kaiser backports ml to Thomas, to see if that sorts his crash (forwarding in the hope that gmail doesn't mess up the patch).
Seems odd that 4.4 should be broken but 4.9 not broken here, I'd expect them to be equally known broken with respect to VSYSCALL; but perhaps it's a matter of userspace trying different fallbacks according to what kernel supports, and only hitting this on 4.4.
Hugh
I'll try to find some time today or tomorrow to add a proper selftest.
Linus
On Jan 4, 2018, at 12:57 PM, Hugh Dickins hughd@google.com wrote:
On Thu, Jan 4, 2018 at 12:43 PM, Andy Lutomirski luto@amacapital.net wrote:
On Jan 4, 2018, at 12:29 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Jan 4, 2018 at 12:16 PM, Thomas Voegtle tv@lio96.de wrote:
Attached a screenshot. Is that useful? Are there some debug options I can add?
Not much of an oops, because the SIGSEGV happens in user space. The only reason you get any kernel stack printout at all is because 'init' dying will make the kernel print that out.
The segfault address for init looks like the fixmap area to me (first byte in the last page of the fixmap?). "Error 5" means that it's a user-space read that got a protection fault. So it's not an LDT or GDT update or anything like that, it's a normal access from user space (or a qemu emulation bug, but that sounds unlikely).
Is that the vsyscall page?
Adding Luto to the participants. I think he noticed one of the vsyscall patches missing earlier in the 4.9 series. Maybe the 4.4 series had something similar..
That's almost certainly it.
I'm hopeless on the FIXMAP arithmetic, but I'm pretty sure that ffffffffff5ff000 is either VSYSCALL page or PVCLOCK page (I think it was VVAR page when init segfaulted on it in my 3.2).
Nah, that's one page below VSYSCALL. Vvar is 0x7fff...
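The "one page below VSYSCALL" remark can be checked with a little address arithmetic. A minimal sketch, assuming 4 KiB pages and the x86-64 definition VSYSCALL_ADDR = (-10UL << 20); the helper name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_4K  UINT64_C(4096)
/* Same value as the kernel's (-10UL << 20): 0xffffffffff600000. */
#define VSYSCALL_ADDR ((uint64_t)-10 << 20)

/*
 * Hypothetical helper: signed distance, in 4 KiB pages, between the page
 * containing addr and the vsyscall page. Negative means below it.
 */
int64_t pages_from_vsyscall(uint64_t addr)
{
    uint64_t page = addr & ~(PAGE_SIZE_4K - 1);  /* round down to page start */
    return (int64_t)(page - VSYSCALL_ADDR) / (int64_t)PAGE_SIZE_4K;
}
```

For the faulting address ffffffffff5ff000 quoted in this thread, the helper returns -1, i.e. the page immediately below the vsyscall page.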
I don't have the actual screenshot, I think.
I'll forward Borislav's suggested 4.4 VSYSCALL patch from the kaiser backports ml to Thomas, to see if that sorts his crash (forwarding in the hope that gmail doesn't mess up the patch).
Seems odd that 4.4 should be broken but 4.9 not broken here, I'd expect them to be equally known broken with respect to VSYSCALL; but perhaps it's a matter of userspace trying different fallbacks according to what kernel supports, and only hitting this on 4.4.
I don't think any current userspace is that dumb. But Go was still using vsyscall fairly recently.
I may be able to look for real tonight.
Hugh
I'll try to find some time today or tomorrow to add a proper selftest.
Linus
I tried cherry picking 435086b36f62 x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy
on top of 4.4.110-rc1, (needed to resolve a small 5level table to 4level page table conflict). Unfortunately, this does not solve the panic/hanging problem I reported. For some reason I do not see the panic message anymore. Machine hangs here:
[ 5.023052] zswap: loaded using pool lzo/zbud
[ 5.023063] page_owner is disabled
[ 5.026492] Key type trusted registered
[ 5.029325] Key type encrypted registered
[ 5.029330] ima: No TPM chip found, activating TPM-bypass!
[ 5.029365] evm: HMAC attrs: 0x1
[ 5.034696] rtc_cmos 00:00: setting system clock to 2018-01-04 21:20:34 UTC (1515100834)
[ 5.216862] Freeing unused kernel memory: 1856K
<hang>
And reboots after about half a minute.
Thank you, Pavel
On Thu, Jan 4, 2018 at 1:23 PM, Pavel Tatashin pasha.tatashin@oracle.com wrote:
I tried cherry picking 435086b36f62 x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy
on top of 4.4.110-rc1, (needed to resolve a small 5level table to 4level page table conflict). Unfortunately, this does not solve the panic/hanging problem I reported. For some reason I do not see the panic message anymore. Machine hangs here:
[ 5.023052] zswap: loaded using pool lzo/zbud [ 5.023063] page_owner is disabled [ 5.026492] Key type trusted registered [ 5.029325] Key type encrypted registered [ 5.029330] ima: No TPM chip found, activating TPM-bypass! [ 5.029365] evm: HMAC attrs: 0x1 [ 5.034696] rtc_cmos 00:00: setting system clock to 2018-01-04 21:20:34 UTC (1515100834) [ 5.216862] Freeing unused kernel memory: 1856K
<hang>
And reboots after about half a minute.
Thanks for trying, but yes, I wouldn't expect a straight cherry-pick of that to work in the context of 4.4.110: it needs to be cherry-picked "in principle". Which Borislav has done, and I'll forward you his (not yet reviewed) patch too, but frankly I've much less hope that it will help your crash than Thomas's.
So please revert that cherry-pick; and if Borislav's patch doesn't help, if you can send us a "Code:" line from the crash, that may still give us more to go on.
As Linus remarked earlier, "The PTI patches obviously change percpu stuff, but this looks like an odd place for that to manifest". Exactly: segfault and panic when starting init is a "normal" symptom when we get something wrong with Kaiser/PTI, but a kthread crashing in dyntick_save_progress_counter is something new to me.
Hugh
[ 6.159992] Code: 89 83 78 06 01 00 b8 01 00 00 00 5b 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 31 d2 48 8b 87 c8 00 00 00 48 89 e5 <f0> 0f c1 50 0c 89 97 d0 00 00 00 83 e2 01 b8 01 00 00 00 74 1d
Also, attached is the full console output.
Thank you, Pavel
On Thu, Jan 4, 2018 at 4:37 PM, Hugh Dickins hughd@google.com wrote:
On Thu, Jan 4, 2018 at 1:23 PM, Pavel Tatashin pasha.tatashin@oracle.com wrote:
I tried cherry picking 435086b36f62 x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy
on top of 4.4.110-rc1, (needed to resolve a small 5level table to 4level page table conflict). Unfortunately, this does not solve the panic/hanging problem I reported. For some reason I do not see the panic message anymore. Machine hangs here:
[ 5.023052] zswap: loaded using pool lzo/zbud [ 5.023063] page_owner is disabled [ 5.026492] Key type trusted registered [ 5.029325] Key type encrypted registered [ 5.029330] ima: No TPM chip found, activating TPM-bypass! [ 5.029365] evm: HMAC attrs: 0x1 [ 5.034696] rtc_cmos 00:00: setting system clock to 2018-01-04 21:20:34 UTC (1515100834) [ 5.216862] Freeing unused kernel memory: 1856K
<hang>
And reboots after about half a minute.
Thanks for trying, but yes, I wouldn't expect a straight cherry-pick of that to work in the context of 4.4.110: it needs to be cherry-picked "in principle". Which Borislav has done, and I'll forward you his (not yet reviewed) patch too, but frankly I've much less hope that it will help your crash than Thomas's.
So please revert that cherry-pick; and if Borislav's patch doesn't help, if you can send us a "Code:" line from the crash, that may still give us more to go on.
As Linus remarked earlier, "The PTI patches obviously change percpu stuff, but this looks like an odd place for that to manifest". Exactly: segfault and panic when starting init is a "normal" symptom when we get something wrong with Kaiser/PTI, but a kthread crashing in dyntick_save_progress_counter is something new to me.
Hugh
On Thu, Jan 4, 2018 at 1:48 PM, Pavel Tatashin pasha.tatashin@oracle.com wrote:
[ 6.159992] Code: 89 83 78 06 01 00 b8 01 00 00 00 5b 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 31 d2 48 8b 87 c8 00 00 00 48 89 e5 <f0> 0f c1 50 0c 89 97 d0 00 00 00 83 e2 01 b8 01 00 00 00 74 1d
Yeah, it's the "lock xadd" as suspected:
   0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   5:   55                      push   %rbp
   6:   31 d2                   xor    %edx,%edx
   8:   48 8b 87 c8 00 00 00    mov    0xc8(%rdi),%rax
   f:   48 89 e5                mov    %rsp,%rbp
  12:*  f0 0f c1 50 0c          lock xadd %edx,0xc(%rax)     <-- trapping instruction
  17:   89 97 d0 00 00 00       mov    %edx,0xd0(%rdi)
  1d:   83 e2 01                and    $0x1,%edx
  20:   b8 01 00 00 00          mov    $0x1,%eax
  25:   74 1d                   je     0x44
(that first "nop" is a 5-byte nop that is used for the function tracing placeholder).
And %rax contains garbage (the value "1", rather than a valid kernel pointer).
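In an oops "Code:" line, the byte wrapped in angle brackets is the first byte of the trapping instruction, which is how the "lock xadd" was singled out. A minimal sketch of locating that byte in C; find_trapping_byte is a hypothetical helper written for this illustration, and the kernel tree's own scripts/decodecode does the full disassembly:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: scan an oops "Code:" line and return the index
 * (counted in bytes from the start of the dump) of the byte marked with
 * <..>, storing its value in *value. Returns -1 if no byte is marked
 * or the line cannot be parsed.
 */
int find_trapping_byte(const char *line, unsigned int *value)
{
    assert(line != NULL);
    const char *p = strstr(line, "Code:");
    p = p ? p + strlen("Code:") : line;  /* tolerate a leading timestamp */

    int index = 0;
    while (*p != '\0') {
        while (isspace((unsigned char)*p))
            p++;
        if (*p == '\0')
            break;

        int marked = (*p == '<');
        if (marked)
            p++;

        char *end;
        unsigned long byte = strtoul(p, &end, 16);
        if (end == p)
            return -1;  /* not a hex byte */
        if (marked) {
            if (value != NULL)
                *value = (unsigned int)byte;
            return index;
        }
        p = end;
        index++;
    }
    return -1;
}
```

Once the instruction is known, the faulting address follows from its operands: with %rax holding 1, the destination of "lock xadd %edx,0xc(%rax)" is 1 + 0xc = 0xd, which is not a valid kernel address.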
Sadly, I have no idea about how that garbage came about.
Linus
On Thu, Jan 04, 2018 at 04:48:48PM -0500, Pavel Tatashin wrote:
[ 6.159992] Code: 89 83 78 06 01 00 b8 01 00 00 00 5b 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 31 d2 48 8b 87 c8 00 00 00 48 89 e5 <f0> 0f c1 50 0c 89 97 d0 00 00 00 83 e2 01 b8 01 00 00 00 74 1d
Also, attached is the full console output.
Ick, like the others, I have no idea what happened here.
But, can you test 4.4.110 now? It has 4 more patches on top of what you were testing with here for 4.4.110-rc1, that hopefully should resolve this type of issue.
And if not, it would be good for us to know :)
thanks so much for testing,
greg k-h
Hi Greg,
Just tested on my machine:
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 4.4.110_pt_linux-v4.4.110 (ptatashi@ca-ostest441) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Fri Jan 5 07:22:34 PST 2018
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.4.110_pt_linux-v4.4.110 root=UUID=fe908085-0117-442b-a57c-ce651cc95b38 ro crashkernel=auto console=ttyS0,115200 LANG=en_US.UTF-8
[ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers'
[ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
<cut>
[ 3.457106] hub 1-0:1.0: USB hub found
[ 3.461298] hub 1-0:1.0: 2 ports detected
[ 3.466173] ehci-pci 0000:00:1d.0: EHCI Host Controller
[ 3.472111] ehci-pci 0000:00:1d.0: new USB bus registered, assigned bus number 2
[ 3.480381] ehci-pci 0000:00:1d.0: debug port 2
[ 3.489571] ehci-pci 0000:00:1d.0: irq 18, io mem 0xc7101000
[ 3.501393] ehci-pci 0000:00:1d.0: USB 2.0 started, EHCI 1.00
[ 3.507855] usb usb2: New USB device found, idVendor=1d6b, idProduct=0002
[ 3.515436] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 3.523500] usb usb2: Product: EHCI Host Controller
[ 3.528947] usb usb2: Manufacturer: Linux 4.4.110_pt_linux-v4.4.110 ehci_hcd
[ 3.536816] usb usb2: SerialNumber: 0000:00:1d.0
[ 3.542107] hub 2-0:1.0: USB hub found
[ 3.546301] hub 2-0:1.0: 2 ports detected
[ 3.550942] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[ 3.557854] ohci-pci: OHCI PCI platform driver
[ 3.562844] uhci_hcd: USB Universal Host Controller Interface driver
[ 3.570032] usbcore: registered new interface driver usbserial
[ 3.576550] usbcore: registered new interface driver usbserial_generic
[ 3.583844] usbserial: USB Serial support registered for generic
[ 3.590570] i8042: PNP: No PS/2 controller found. Probing ports directly.
[ 3.995383] tsc: Refined TSC clocksource calibration: 2195.099 MHz
[ 4.002289] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fa41d170d9, max_idle_ns: 440795288527 ns
[ 4.046414] usb 2-1: new high-speed USB device number 2 using ehci-pci
[ 4.174758] usb 2-1: New USB device found, idVendor=8087, idProduct=8002
[ 4.182245] usb 2-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 4.190382] hub 2-1:1.0: USB hub found
[ 4.194609] hub 2-1:1.0: 8 ports detected
[ 4.637363] i8042: No controller found
[ 4.641646] mousedev: PS/2 mouse device common for all mice
[ 4.648117] rtc_cmos 00:00: RTC can wake from S4
[ 4.653447] rtc_cmos 00:00: rtc core: registered rtc_cmos as rtc0
[ 4.660272] rtc_cmos 00:00: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
[ 4.669050] Intel P-state driver initializing.
[ 4.676630] EFI Variables Facility v0.08 2004-May-17
<hangs here>
Reboots after about 30 seconds.
Boots fine with nopti option.
Thank you, Pavel
On Fri, Jan 05, 2018 at 10:32:49AM -0500, Pavel Tatashin wrote:
Hi Greg,
Just tested on my machine: [ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 4.4.110_pt_linux-v4.4.110 (ptatashi@ca-ostest441) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Fri Jan 5 07:22:34 PST 2018 [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.4.110_pt_linux-v4.4.110 root=UUID=fe908085-0117-442b-a57c-ce651cc95b38 ro crashkernel=auto console=ttyS0,115200 LANG=en_US.UTF-8 [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point registers' [ 0.000000] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers'
[ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
<cut> [ 3.457106] hub 1-0:1.0: USB hub found
[ 3.461298] hub 1-0:1.0: 2 ports detected
[ 3.466173] ehci-pci 0000:00:1d.0: EHCI Host Controller
[ 3.472111] ehci-pci 0000:00:1d.0: new USB bus registered, assigned bus number 2 [ 3.480381] ehci-pci 0000:00:1d.0: debug port 2
[ 3.489571] ehci-pci 0000:00:1d.0: irq 18, io mem 0xc7101000
[ 3.501393] ehci-pci 0000:00:1d.0: USB 2.0 started, EHCI 1.00
[ 3.507855] usb usb2: New USB device found, idVendor=1d6b, idProduct=0002 [ 3.515436] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1 [ 3.523500] usb usb2: Product: EHCI Host Controller
[ 3.528947] usb usb2: Manufacturer: Linux 4.4.110_pt_linux-v4.4.110 ehci_hcd [ 3.536816] usb usb2: SerialNumber: 0000:00:1d.0
[ 3.542107] hub 2-0:1.0: USB hub found
[ 3.546301] hub 2-0:1.0: 2 ports detected
[ 3.550942] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[ 3.557854] ohci-pci: OHCI PCI platform driver
[ 3.562844] uhci_hcd: USB Universal Host Controller Interface driver
[ 3.570032] usbcore: registered new interface driver usbserial
[ 3.576550] usbcore: registered new interface driver usbserial_generic [ 3.583844] usbserial: USB Serial support registered for generic
[ 3.590570] i8042: PNP: No PS/2 controller found. Probing ports directly. [ 3.995383] tsc: Refined TSC clocksource calibration: 2195.099 MHz
[ 4.002289] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fa41d170d9, max_idle_ns: 440795288527 ns
[ 4.046414] usb 2-1: new high-speed USB device number 2 using ehci-pci [ 4.174758] usb 2-1: New USB device found, idVendor=8087, idProduct=8002 [ 4.182245] usb 2-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0 [ 4.190382] hub 2-1:1.0: USB hub found
[ 4.194609] hub 2-1:1.0: 8 ports detected
[ 4.637363] i8042: No controller found
[ 4.641646] mousedev: PS/2 mouse device common for all mice
[ 4.648117] rtc_cmos 00:00: RTC can wake from S4
[ 4.653447] rtc_cmos 00:00: rtc core: registered rtc_cmos as rtc0
[ 4.660272] rtc_cmos 00:00: alarms up to one month, y3k, 114 bytes nvram, hpet irqs [ 4.669050] Intel P-state driver initializing.
[ 4.676630] EFI Variables Facility v0.08 2004-May-17
<hangs here> Reboots after about 30 seconds.
Boots fine with nopti option.
Crap.
And 4.9.75 works for you just fine? Same with 4.15-rc6?
I'm wondering if this is some crazy gcc thing, given the ancient age of what you are using (gcc 4.8.5). I haven't used 4.x in many many years, is this what comes with RHEL6? What is the "base" distro you are building this on, and anything special about the hardware being used here?
Or is this a virtual machine? I've been seeing too many different crashes lately to keep them all straight, sorry...
thanks,
greg k-h
On Fri, Jan 05, 2018 at 04:51:32PM +0100, Greg Kroah-Hartman wrote:
On Fri, Jan 05, 2018 at 10:32:49AM -0500, Pavel Tatashin wrote:
(...)
Reboots after about 30 seconds.
Boots fine with nopti option.
Crap.
And 4.9.75 works for you just fine? Same with 4.15-rc6?
I'm wondering if this is some crazy gcc thing, given the ancient age of what you are using (gcc 4.8.5). I haven't used 4.x in many many years, is this what comes with RHEL6? What is the "base" distro you are building this on, and anything special about the hardware being used here?
I don't think so, I'm personally building with 4.7.4 and am not seeing this with 4.4.110.
Willy
On Fri, Jan 05, 2018 at 04:57:15PM +0100, Willy Tarreau wrote:
On Fri, Jan 05, 2018 at 04:51:32PM +0100, Greg Kroah-Hartman wrote:
On Fri, Jan 05, 2018 at 10:32:49AM -0500, Pavel Tatashin wrote:
(...)
Reboots after about 30 seconds.
Boots fine with nopti option.
Crap.
And 4.9.75 works for you just fine? Same with 4.15-rc6?
I'm wondering if this is some crazy gcc thing, given the ancient age of what you are using (gcc 4.8.5). I haven't used 4.x in many many years, is this what comes with RHEL6? What is the "base" distro you are building this on, and anything special about the hardware being used here?
I don't think so, I'm personally building with 4.7.4 and am not seeing this with 4.4.110.
Ok, looks like an efi issue...
greg k-h
Crap.
And 4.9.75 works for you just fine? Same with 4.15-rc6?
4.15-rc6 -> Rebooted twice, no issues.
4.9.75 -> Rebooted twice, no issues.
4.4.110 -> hangs/reboots on every single reboot.
I'm wondering if this is some crazy gcc thing, given the ancient age of what you are using (gcc 4.8.5). I haven't used 4.x in many many years, is this what comes with RHEL6? What is the "base" distro you are building this on, and anything special about the hardware being used here?
Oracle Linux 7.3
[root@ca-ostest441 ~]# cat /etc/oracle-release
Oracle Linux Server release 7.3
[root@ca-ostest441 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Or is this a virtual machine? I've been seeing too many different crashes lately to keep them all straight, sorry...
This is a physical machine. No special devices attached: http://www.oracle.com/us/products/servers/x6-2datasheet-2900789.pdf
[root@ca-ostest441 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2394.586
BogoMIPS: 4390.22
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
[root@ca-ostest441 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           251G        2.7G        241G        9.1M        7.8G        247G
Swap:          4.0G          0B        4.0G
thanks,
greg k-h
On Jan 5, 2018, at 7:32 AM, Pavel Tatashin pasha.tatashin@oracle.com wrote:
Hi Greg,
Just tested on my machine: [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 4.4.110_pt_linux-v4.4.110 (ptatashi@ca-ostest441) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Fri Jan 5 07:22:34 PST 2018 [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.4.110_pt_linux-v4.4.110 root=UUID=fe908085-0117-442b-a57c-ce651cc95b38 ro crashkernel=auto console=ttyS0,115200 LANG=en_US.UTF-8 [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256 [ 0.000000] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point registers' [ 0.000000] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers' [ 0.000000] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers' [ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
<cut> [ 3.457106] hub 1-0:1.0: USB hub found [ 3.461298] hub 1-0:1.0: 2 ports detected [ 3.466173] ehci-pci 0000:00:1d.0: EHCI Host Controller [ 3.472111] ehci-pci 0000:00:1d.0: new USB bus registered, assigned bus number 2 [ 3.480381] ehci-pci 0000:00:1d.0: debug port 2 [ 3.489571] ehci-pci 0000:00:1d.0: irq 18, io mem 0xc7101000 [ 3.501393] ehci-pci 0000:00:1d.0: USB 2.0 started, EHCI 1.00 [ 3.507855] usb usb2: New USB device found, idVendor=1d6b, idProduct=0002 [ 3.515436] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1 [ 3.523500] usb usb2: Product: EHCI Host Controller [ 3.528947] usb usb2: Manufacturer: Linux 4.4.110_pt_linux-v4.4.110 ehci_hcd [ 3.536816] usb usb2: SerialNumber: 0000:00:1d.0 [ 3.542107] hub 2-0:1.0: USB hub found [ 3.546301] hub 2-0:1.0: 2 ports detected [ 3.550942] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver [ 3.557854] ohci-pci: OHCI PCI platform driver [ 3.562844] uhci_hcd: USB Universal Host Controller Interface driver [ 3.570032] usbcore: registered new interface driver usbserial [ 3.576550] usbcore: registered new interface driver usbserial_generic [ 3.583844] usbserial: USB Serial support registered for generic [ 3.590570] i8042: PNP: No PS/2 controller found. Probing ports directly. 
[ 3.995383] tsc: Refined TSC clocksource calibration: 2195.099 MHz [ 4.002289] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fa41d170d9, max_idle_ns: 440795288527 ns [ 4.046414] usb 2-1: new high-speed USB device number 2 using ehci-pci [ 4.174758] usb 2-1: New USB device found, idVendor=8087, idProduct=8002 [ 4.182245] usb 2-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0 [ 4.190382] hub 2-1:1.0: USB hub found [ 4.194609] hub 2-1:1.0: 8 ports detected [ 4.637363] i8042: No controller found [ 4.641646] mousedev: PS/2 mouse device common for all mice [ 4.648117] rtc_cmos 00:00: RTC can wake from S4 [ 4.653447] rtc_cmos 00:00: rtc core: registered rtc_cmos as rtc0 [ 4.660272] rtc_cmos 00:00: alarms up to one month, y3k, 114 bytes nvram, hpet irqs [ 4.669050] Intel P-state driver initializing. [ 4.676630] EFI Variables Facility v0.08 2004-May-17 <hangs here> Reboots after about 30 seconds.
This looks like the KVM RSM issue. When you manage to run a buggy configuration (KVM + OVMF with secure boot support in the host, PCID (PTI or otherwise) and SMP in the guest), the first EFI call after AP bringup dies.
The actual failure is nasty. When one CPU calls into EFI, all the other CPUs die --they enter SMM and they don't come back out correctly. I think the best the guest could do is to try to generate a useful printk if this happens.
Update your host.
Boots fine with nopti option.
Thank you, Pavel
Hi Andy,
This is bare metal, not a VM; see my other email in this thread about the machine on which I am testing. Sometimes the hang happens a little later:
[ 5.088948] microcode: CPU36 sig=0x406f1, pf=0x1, revision=0xb00001d
[ 5.096076] microcode: CPU37 sig=0x406f1, pf=0x1, revision=0xb00001d
[ 5.103206] microcode: CPU38 sig=0x406f1, pf=0x1, revision=0xb00001d
[ 5.110326] microcode: CPU39 sig=0x406f1, pf=0x1, revision=0xb00001d
[ 5.117467] microcode: Microcode Update Driver: v2.01 tigran@aivazian.fsnet.co.uk, Peter Oruba
[ 5.127476] registered taskstats version 1
[ 5.132058] Loading compiled-in X.509 certificates
[ 5.138206] Loaded X.509 cert 'Build time autogenerated kernel key: 26871d9e2c53359981a91797284d4f630796d8cf'
[ 5.149337] zswap: loaded using pool lzo/zbud
[ 5.154215] page_owner is disabled
[ 5.161468] Key type trusted registered
[ 5.169226] Key type encrypted registered
[ 5.173719] ima: No TPM chip found, activating TPM-bypass!
[ 5.179918] evm: HMAC attrs: 0x1
[ 5.184958] rtc_cmos 00:00: setting system clock to 2018-01-05 15:40:45 UTC (1515166845)
[ 5.196099] Freeing unused kernel memory: 1856K
<hang / reboot here>
Thank you, Pavel
On Jan 5, 2018, at 9:14 AM, Pavel Tatashin pasha.tatashin@oracle.com wrote:
Hi Andy,
This is bare metal, not a VM; see my other email in this thread about the machine on which I am testing. Sometimes the hang happens a little later:
[ 5.088948] microcode: CPU36 sig=0x406f1, pf=0x1, revision=0xb00001d [ 5.096076] microcode: CPU37 sig=0x406f1, pf=0x1, revision=0xb00001d [ 5.103206] microcode: CPU38 sig=0x406f1, pf=0x1, revision=0xb00001d [ 5.110326] microcode: CPU39 sig=0x406f1, pf=0x1, revision=0xb00001d [ 5.117467] microcode: Microcode Update Driver: v2.01 tigran@aivazian.fsnet.co.uk, Peter Oruba [ 5.127476] registered taskstats version 1 [ 5.132058] Loading compiled-in X.509 certificates [ 5.138206] Loaded X.509 cert 'Build time autogenerated kernel key: 26871d9e2c53359981a91797284d4f630796d8cf' [ 5.149337] zswap: loaded using pool lzo/zbud [ 5.154215] page_owner is disabled [ 5.161468] Key type trusted registered [ 5.169226] Key type encrypted registered [ 5.173719] ima: No TPM chip found, activating TPM-bypass! [ 5.179918] evm: HMAC attrs: 0x1 [ 5.184958] rtc_cmos 00:00: setting system clock to 2018-01-05 15:40:45 UTC (1515166845) [ 5.196099] Freeing unused kernel memory: 1856K <hang / reboot here>
Gah, too many emails.
Someone probably just needs to look at the EFI thunk code. Does it boot if you disable EFI support in the kernel (the noefi boot option, I think, or maybe just compile it out)?
Thank you, Pavel
Boots successfully with "noefi" kernel parameter :)
On Fri, Jan 5, 2018 at 12:43 PM, Andy Lutomirski luto@amacapital.net wrote:
On Jan 5, 2018, at 9:14 AM, Pavel Tatashin pasha.tatashin@oracle.com wrote:
Hi Andy,
This is bare metal, not a VM; see my other email in this thread about the machine on which I am testing. Sometimes the hang happens a little later:
[ 5.088948] microcode: CPU36 sig=0x406f1, pf=0x1, revision=0xb00001d [ 5.096076] microcode: CPU37 sig=0x406f1, pf=0x1, revision=0xb00001d [ 5.103206] microcode: CPU38 sig=0x406f1, pf=0x1, revision=0xb00001d [ 5.110326] microcode: CPU39 sig=0x406f1, pf=0x1, revision=0xb00001d [ 5.117467] microcode: Microcode Update Driver: v2.01 tigran@aivazian.fsnet.co.uk, Peter Oruba [ 5.127476] registered taskstats version 1 [ 5.132058] Loading compiled-in X.509 certificates [ 5.138206] Loaded X.509 cert 'Build time autogenerated kernel key: 26871d9e2c53359981a91797284d4f630796d8cf' [ 5.149337] zswap: loaded using pool lzo/zbud [ 5.154215] page_owner is disabled [ 5.161468] Key type trusted registered [ 5.169226] Key type encrypted registered [ 5.173719] ima: No TPM chip found, activating TPM-bypass! [ 5.179918] evm: HMAC attrs: 0x1 [ 5.184958] rtc_cmos 00:00: setting system clock to 2018-01-05 15:40:45 UTC (1515166845) [ 5.196099] Freeing unused kernel memory: 1856K <hang / reboot here>
Gah, too many emails.
Someone probably just needs to look at the EFI thunk code. Does it boot if you disable EFI support in the kernel (the noefi boot option, I think, or maybe just compile it out)?
Thank you, Pavel
On Fri, Jan 5, 2018 at 9:52 AM, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
On Fri, Jan 05, 2018 at 12:48:54PM -0500, Pavel Tatashin wrote:
Boots successfully with "noefi" kernel parameter :)
Thanks, that will help me narrow it down. I'll dig through more patches when I get home tonight...
I wish you luck. The 4.4 series is "KAISER", not "KPTI", and the relevant code is spread all over the place and is generally garbage. See, for example, the turd called kaiser_set_shadow_pgd(). I would not be terribly surprised if that particular turd is biting here.
An alternative theory is that something is screwy in the EFI code. I don't see anything directly wrong, but it's certainly a bit sketchy. The newer kernels carefully avoid using PCID 0 for real work to avoid corruption due to EFI and similar things. The "KAISER" code has no such mitigation. Fortunately, it seems to use PCID=0 for kernel and PCID=nonzero for user, so the obvious problem isn't present, but something could still be wrong.
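The kernel/user PCID split mentioned above can be read straight off a CR3 value. A minimal sketch, assuming CR4.PCIDE is set so that CR3 bits 11:0 hold the PCID and bit 63 acts as the no-flush hint on a CR3 write; the helper names are hypothetical, and the "PCID 0 means kernel" rule is only the convention described in this email:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR3_PCID_MASK UINT64_C(0xfff)      /* bits 11:0 when CR4.PCIDE=1 */
#define CR3_NOFLUSH   (UINT64_C(1) << 63)  /* keep TLB entries for this PCID */

/* Extract the PCID field from a CR3 value. */
uint16_t cr3_pcid(uint64_t cr3)
{
    return (uint16_t)(cr3 & CR3_PCID_MASK);
}

/* True if the no-flush bit is set in a value about to be written to CR3. */
bool cr3_noflush(uint64_t cr3)
{
    return (cr3 & CR3_NOFLUSH) != 0;
}

/* Assumption taken from the email above: PCID 0 is the kernel's, nonzero is user. */
bool cr3_is_kernel_pcid(uint64_t cr3)
{
    return cr3_pcid(cr3) == 0;
}
```

Applied to the CR3 value 0000000001aa2000 printed in a later oops in this thread, cr3_pcid() yields 0, which is consistent with that fault having been taken on the kernel page tables.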
Pavel, can you send your /proc/cpuinfo on a noefi boot? (Just the first CPU worth is fine.)
FWIW, I said before that I have very little desire to help debug "KAISER". I stand by that.
Pavel, can you send your /proc/cpuinfo on a noefi boot? (Just the first CPU worth is fine.)
With noefi option:
[root@ca-ostest441 ~]# more /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
stepping : 1
microcode : 0xb00001d
cpu MHz : 1971.406
cache size : 25600 KB
physical id : 0
siblings : 20
core id : 0
cpu cores : 10
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb invpcid_single pln pts dtherm intel_pt kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc
bugs :
bogomips : 4390.08
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
I hoped this patch would fix the EFI issue: https://lkml.org/lkml/2018/1/5/534
But unfortunately, it does not. I got a partial panic message this time:
[ 4.737578] usb 1-1: new high-speed USB device number 2 using ehci-pci
[ 4.846712] BUG: unable to handle kernel paging request at 0000000000017e10
[ 4.854509] IP: [<ffffffff810ce77e>] native_queued_spin_lock_slowpath+0xfe/0x170
[ 4.862780] PGD 0
[ 4.865034] Oops: 0002 [#1] SMP
[ 4.868657] Modules linked in:
[ 4.872075] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.110_pt_linux-v4.4.110 #3
[ 4.880526] Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
[ 4.891596] task: ffffffff81aab500 ti: ffffffff81a98000 task.ti: ffffffff81a98000
[ 4.899950] RIP: 0010:[<ffffffff810ce77e>] [<ffffffff810ce77e>] native_queued_spin_lock_slowpath+0xfe/0x170
[ 4.910936] RSP: 0000:ffff881fff803c88 EFLAGS: 00010002
[ 4.916865] RAX: 000000000000206b RBX: ffff88407e611900 RCX: ffff881fff817e00
[ 4.924831] RDX: 0000000000017e10 RSI: 0000000000040000 RDI: ffff88407e611a58
[ 4.932797] RBP: ffff881fff803c88 R08: 0000000000000101 R09: 0000000000000000
[ 4.940764] R10: 000000005c96d000 R11: ffff88005c96d0c0 R12: ffff881ff25e52c8
[ 4.948730] R13: ffff88407e6d1900 R14: ffff881fff8118c0 R15: ffff88407e6118c0
[ 4.956696] FS: 0000000000000000(0000) GS:ffff881fff800000(0000) knlGS:0000000000000000
[ 4.965727] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.972140] CR2: 0000000000017e10 CR3: 0000000001aa2000 CR4: 00000000003606
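For anyone reading along, the number after "Oops:" is the x86 page-fault error code. A small sketch of how its low bits decode (the function and key names are illustrative):

```python
# Illustrative decoder for the x86 page-fault error code printed after "Oops:".
def decode_pf_error(code):
    """Decode the low five bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 0x01),       # 0 means the page was not present
        "write": bool(code & 0x02),              # 0 means the access was a read
        "user_mode": bool(code & 0x04),          # 0 means the fault came from kernel mode
        "reserved_bit": bool(code & 0x08),       # reserved bit set in a paging entry
        "instruction_fetch": bool(code & 0x10),  # fault on an instruction fetch
    }

print(decode_pf_error(0x0002))
```

Read this way, "Oops: 0002" is a kernel-mode write to a page that is not present, which fits the faulting address in CR2 being the small value also held in RDX.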
Actually, it helps: before, 4.4.110 never booted on my machine; now I was able to boot on the second try.
On Fri, Jan 05, 2018 at 02:18:32PM -0500, Pavel Tatashin wrote:
Actually, it helps: before, 4.4.110 never booted on my machine; now I was able to boot on the second try.
Wait, what? This has never booted on 4.4.x before? Did 4.4.108 work? 109? Are you sure this hardware even works? :)
thanks,
greg k-h
The hardware works :) I meant that before the patch linked in https://lkml.org/lkml/2018/1/5/534, I was never able to boot 4.4.110. But with that patch applied, I was able to boot it at least once, but it could be accidental. The hang/panic does not happen at the same time on every boot.
Pasha
On Fri, Jan 5, 2018 at 1:03 PM, Pavel Tatashin pasha.tatashin@oracle.com wrote:
The hardware works :) I meant that before the patch linked in https://lkml.org/lkml/2018/1/5/534, I was never able to boot 4.4.110. But with that patch applied, I was able to boot it at least once, but it could be accidental. The hang/panic does not happen at the same time on every boot.
I get the feeling that it was accidental: it seems to me that you have a memory corruption problem, that gets shifted around by the different patches (or "noefi" or "nopti").
Because yesterday your boots were able to get way beyond the "EFI Variables Facility" message, and I can't imagine why the EFI issue would not have been equally debilitating on yesterday's 110-rc, if it were in play.
I did intend to ask you to send your System.map, for us to scan through: maybe some variable is marked __init and should not be, then the "Freeing unused kernel memory" frees it for random reuse.
But today you didn't get anywhere near the "Freeing unused kernel memory", so that can't be it - or do you sometimes get that far today?
You mention that the hang/panic does not happen at the same time on every boot: I think all I can ask is for you to keep supplying us with different examples (console messages) of where it occurs, in the hope that one of them will point us in the right direction.
And it even seems possible that this has nothing to do with the 4.4.110 changes - that 4.4.109 plus some other random patches would unleash similar corruption. Though on balance that does seem unlikely.
Hugh
Hi Hugh,
Thank you very much for your very thoughtful input.
I am quite positive this problem is a PTI regression, because I see exactly the same problem with kernel 4.1, to which I back-ported all the necessary PTI patches from 4.4.110. I will provide this thread with more information as I collect it. I will also try to root-cause the problem.
The bug behaves like memory corruption, but with both the 4.1 and 4.4 kernels the problem goes away when I boot with the noefi parameter. So EFI + PTI is the culprit for this memory corruption.
Thank you, Pavel
On Fri, Jan 05, 2018 at 04:03:54PM -0500, Pavel Tatashin wrote:
The hardware works :) I meant that before the patch linked in https://lkml.org/lkml/2018/1/5/534, I was never able to boot 4.4.110. But with that patch applied, I was able to boot it at least once, but it could be accidental. The hang/panic does not happen at the same time on every boot.
Any chance you can grab the latest SLES 12 kernel and run it with pti and efi enabled to see if that works properly for you or not? I trust SUSE's testing of their kernel, and odds are I'm just missing one of their many other patches they have in their tree for other issues that they have seen in the past.
If you want, I can just send you the full patch that they run on top of the latest 4.4 stable tree, so you don't have to dig it out of their git repo if you can't find the binary image.
thanks,
greg k-h
Hi Greg,
I cloned and built suse12, and it does not have issues with EFI + PTI (kaiser) on my machine.
BTW, I have also reproduced this problem on another machine with the same configuration, so it is not specific to one box. Also, as I mentioned earlier, I am seeing the same issue with 4.1 + kaiser patches taken from 4.4.110.
Thank you, Pavel
Hi Greg,
I reverted suse12 back to: 13dae54cb229d078635f159dd8afe16ae683980b x86/kaiser: Move feature detection up (bsc#1068032).
And, still do not see the problem. So, whatever fixes the issue comes before kaiser.
Pavel
On Sun, Jan 07, 2018 at 10:06:59AM -0500, Pavel Tatashin wrote:
Hi Greg,
I reverted suse12 back to: 13dae54cb229d078635f159dd8afe16ae683980b x86/kaiser: Move feature detection up (bsc#1068032).
And, still do not see the problem. So, whatever fixes the issue comes before kaiser.
Ok, thanks for the hint.
As I can't duplicate this here at all, any specifics as to what hardware/processor type this is?
I can punt and say just "use 4.9 on this hardware if you have it", right? :)
I'll try to dig through the sles kernel some more, but given it is 20000 patches, and I can't actually test the problem myself, it's not exactly easy going...
greg k-h
Hi Greg,
On Mon, Jan 8, 2018 at 2:46 AM, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
On Sun, Jan 07, 2018 at 10:06:59AM -0500, Pavel Tatashin wrote:
Hi Greg,
I reverted suse12 back to: 13dae54cb229d078635f159dd8afe16ae683980b x86/kaiser: Move feature detection up (bsc#1068032).
And, still do not see the problem. So, whatever fixes the issue comes before kaiser.
Ok, thanks for the hint.
As I can't duplicate this here at all, any specifics as to what hardware/processor type this is?
BIOS: Version 2.17.1249. Copyright (C) 2016 American Megatrends, Inc. BIOS Date: 08/30/2016 10:35:36 Ver: 38050100
ca-ostest442:linux-stable$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 1738.601
BogoMIPS: 4396.18
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Note: if I boot with nr_cpus=1, the hang never happens; with nr_cpus=4 it happens, but seldom; and with all 40 CPUs it happens on almost every reboot.
As Hugh Dickins suggested, I am going to post panic outputs as I get them. Here is one more panic (note that the output is incomplete because the machine reboots):
[ 6.276456] EFI Variables Facility v0.08 2004-May-17
[ 6.384665] BUG: unable to handle kernel paging request at ffff901fff5a6000
[ 6.392461] IP: [<ffffffff8106bb08>] vmalloc_fault+0x1f8/0x340
[ 6.398987] PGD 0
[ 6.401242] Oops: 0000 [#1] SMP
[ 6.404866] Modules linked in:
[ 6.408287] CPU: 10 PID: 0 Comm: swapper/10 Not tainted 4.4.110_pt_stable #2
[ 6.416156] Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
[ 6.427226] task: ffff883ff1e28000 ti: ffff883ff1e24000 task.ti: ffff883ff1e24000
[ 6.435580] RIP: 0010:[<ffffffff8106bb08>] [<ffffffff8106bb08>] vmalloc_fault+0x1f8/0x340
[ 6.444819] RSP: 0000:ffff883ff1e27cc0 EFLAGS: 00010086
[ 6.450749] RAX: ffff881fff5a6058 RBX: 00003ffffffff000 RCX: 0000081fff5a6000
[ 6.458714] RDX: ffff880000000000 RSI: ffff901fff5a6000 RDI: 0000000000000000
[ 6.466681] RBP: ffff883ff1e27cf0 R08: 0000000000000018 R09: 000000000002d2de
[ 6.474647] R10: 0000000000032ef3 R11: 0000000000002e04 R12: ffffc900000000f0
[ 6.482615] R13: ffff880000000000 R14: ffff901fff5a6000 R15: ffff881fff5a6000
[ 6.490574] FS: 0000000000000000(0000) GS:ffff88407e600000(0000) knlGS:0000000000000000
[ 6.499607] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.506022] CR2: ffff901fff5a6000 CR3: 0000000001aa2000 CR4: 0000000000360670
[ 6.513989] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6.521956] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6.529923] Stack:
[ 6.532169] ffff881fff5a6000[ 6.532405] ------------[ cut here ]------------
[ 6.532414] WARNING: CPU: 22 PID: 162
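A small arithmetic observation on the registers in this dump, offered only as a reading aid and not as a diagnosis: the faulting address in CR2 equals RDX + RCX, and RCX differs from the offset implied by R15 in a single bit. RDX here appears to be the base of the kernel direct mapping. The check below uses only values copied from the dump.

```python
# Values copied from the vmalloc_fault oops above.
CR2 = 0xFFFF901FFF5A6000
RCX = 0x0000081FFF5A6000
RDX = 0xFFFF880000000000
R15 = 0xFFFF881FFF5A6000

# The faulting address is RDX plus RCX.
print(hex(RDX + RCX), RDX + RCX == CR2)

# Compare RCX with the offset that would yield R15 instead.
extra = RCX ^ (R15 - RDX)
print(hex(extra), "bit", extra.bit_length() - 1)
```

This only restates what the numbers show; it does not establish where the extra bit came from.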
Here is one more:
[ 6.284763] EFI Variables Facility v0.08 2004-May-17
[ 6.555990] ------------[ cut here ]------------
[ 6.561145] kernel BUG at /scratch/ptatashi/linux-stable/mm/slub.c:3627!
[ 6.568625] invalid opcode: 0000 [#1] SMP
[ 6.573219] Modules linked in:
[ 6.576639] CPU: 1 PID: 364 Comm: kworker/1:1 Not tainted 4.4.110_pt_stable #3
[ 6.584692] Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
[ 6.595766] Workqueue: events clocksource_watchdog_work
[ 6.601611] task: ffff881fecd82b00 ti: ffff881fecda4000 task.ti: ffff881fecda4000
[ 6.609963] RIP: 0010:[<ffffffff811e704a>] [<ffffffff811e704a>] kfree+0x14a/0x150
[ 6.618419] RSP: 0000:ffff881fecda7d40 EFLAGS: 00010246
[ 6.624348] RAX: ffffffff8106c280 RBX: ffff883ff114bfc0 RCX: 00000000ffffffd8
[ 6.632314] RDX: 000077ff80000000 RSI: 0000000000000246 RDI: ffff883ff114bfc0
[ 6.640280] RBP: ffff881fecda7d58 R08: 0000000000000000 R09: ffff881fff917300
[ 6.648244] R10: 0000000000000000 R11: ffffea00ffc452c0 R12: ffff883fec2f4080
[ 6.656208] R13: ffffffff810a5bee R14: 00000000ffffffff R15: 0000000000000000
[ 6.664175] FS: 0000000000000000(0000) GS:ffff881fff840000(0000) knlGS:0000000000000000
[ 6.673208] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.679623] CR2: 0000000000000000 CR3: 0000000001aa2000 CR4: 0000000000360670
[ 6.687587] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6.695553] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6.703516] Stack:
[ 6.705759] ffff883ff114bfc0 ffff883fec2f4080 ffffffff819a26e8 ffff881fecda7e00
[ 6.714061] ffffffff810a5bee ffff881f00000020 ffff881fecda7e10 ffff881fecda7da8
[ 6.722363] ffffffff00000000 ffff881f00000000 ffff881fecda7d90 ffff881fecda7d90
[ 6.730666] Call Trace:
[ 6.733400] [<ffffffff810a5bee>] kthread_create_on_node+0x14e/0x1a0
[ 6.740495] [<ffffffff810f9dd5>] clocksource_watchdog_work+0x25/0x40
[ 6.747679] [<ffffffff8109ef6f>] process_one_work+0x14f/0x400
[ 6.754181] [<ffffffff8109fbc4>] worker_thread+0x114/0x480
[ 6.760402] [<ffffffff8109fab0>] ? rescuer_thread+0x310/0x310
[ 6.766913] [<ffffffff810a56b5>] kthread+0xe5/0x100
[ 6.772456] [<ffffffff810a55d0>] ? kthread_park+0x60/0x60
[ 6.778580] [<ffffffff8170fa0f>] ret_from_fork+0x3f/0x70
[ 6.784608] [<ffffffff810a55d0>] ? kthread_park+0x60/0x60
[ 6.790721] Code: 8b 03 31 f6 f6 c4 40 74 04 41 8b 73 6c 4c 89 df e8 1c a8 fa ff e9 73 ff ff ff 4c 8d 58 ff e9 20 ff ff ff 49 8b 43 20 a8 01 75 d4 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
[ 6.812429] RIP [<ffffffff811e704a>] kfree+0x14a/0x150
[ 6.818273] RSP <ffff881fecda7d40>
[ 6.822177] ---[ end trace 4ce44d21c6d68eed ]---
I have root-caused the memory corruption panics/hangs that I've been experiencing during boot with the latest 4.4.110 kernel. The problem, as Andy Lutomirski suspected, is in the interaction between PTI and EFI. It may affect any system that has an EFI BIOS. I have not verified whether it can affect any kernel besides 4.4.110.
Attached is the fix for this issue with explanations that Steve Sistare and I developed.
If it is better to resubmit this patch via git send-email, please let me know.
Thank you, Pavel
[ Patch to make sure the EFI trampoline_pgd is properly aligned and has the double pgd that KPTI requires ]
On Thu, Jan 11, 2018 at 10:40 AM, Pavel Tatashin pasha.tatashin@oracle.com wrote:
If it is better to resubmit this patch via git send-email, please let me know.
It would be better, because that way the patch can be more easily quoted and discussed.
That said, I do not see why this isn't an issue upstream too.
As far as I can tell, it's not just 4.4.110. Our current entry code does that ADJUST_KERNEL_CR3 dance too, which clears the PTI_SWITCH_MASK bit from cr3.
And that realmode trampoline pgd seems all to be just aligned to PAGE_SIZE.
Now, in the modern world, we generate new page tables for EFI, but we still have that EFI_OLD_MEMMAP code that disables that. And afaik, EFI_OLD_MEMMAP has the exact same problem that your patch fixes in 4.4 (where it's always on).
So I think this patch should go into the development kernel too.
Or maybe it already is, and I just haven't gotten it yet.
Or - even more likely - I'm missing something entirely, and even EFI_OLD_MEMMAP solved this some other way upstream.
Adding Thomas Gleixner explicitly to the participants so that he can tell me I'm a moron and point me to the right thing.
Linus
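To make the alignment concern in the message above concrete, here is a minimal sketch of what clearing the page-table switch bit does to a CR3 value. It assumes the switch bit is bit 12 (one page) and models only that bit; the constant and function names are illustrative, and the second address is hypothetical.

```python
# Sketch of the alignment hazard, assuming the page-table switch bit is
# bit 12 (PAGE_SIZE). Only that bit is modelled; names are illustrative.
PAGE_SIZE = 0x1000
PGTABLE_SWITCH_BIT = PAGE_SIZE

def kernel_cr3(cr3):
    """Clear the switch bit, as the entry code does on the way into the kernel."""
    return cr3 & ~PGTABLE_SWITCH_BIT

def survives_switch(pgd_phys):
    """True if clearing the switch bit leaves the pgd address unchanged."""
    return kernel_cr3(pgd_phys) == pgd_phys

# 0x1AA2000 is 8K-aligned; 0x9D000 is a hypothetical pgd that is only 4K-aligned.
for pgd in (0x1AA2000, 0x9D000):
    print(hex(pgd), "->", hex(kernel_cr3(pgd)), survives_switch(pgd))
```

A pgd that is only PAGE_SIZE-aligned and happens to sit on an odd page ends up pointing at the preceding page once the bit is cleared, whereas a pgd aligned to two pages is unaffected. Whether a given allocation lands on an odd page depends on the allocator, so such a problem would not show up on every boot.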
On Thu, 11 Jan 2018, Linus Torvalds wrote:
[ Patch to make sure the EFI trampoline_pgd is properly aligned and has the double pgd that KPTI requires ]
On Thu, Jan 11, 2018 at 10:40 AM, Pavel Tatashin pasha.tatashin@oracle.com wrote:
If it is better to resubmit this patch via git send-email, please let me know.
It would be better, because that way the patch can be more easily quoted and discussed.
That said, I do not see why this isn't an issue upstream too.
As far as I can tell, it's not just 4.4.110. Our current entry code does that ADJUST_KERNEL_CR3 dance too, which clears the PTI_SWITCH_MASK bit from cr3.
And that realmode trampoline pgd seems all to be just aligned to PAGE_SIZE.
Right, but see below.
Now, in the modern world, we generate new page tables for EFI, but we still have that EFI_OLD_MEMMAP code that disables that. And afaik, EFI_OLD_MEMMAP has the exact same problem that your patch fixes in 4.4 (where it's always on).
So I think this patch should go into the development kernel too.
Or maybe it already is, and I just haven't gotten it yet.
It's not. There is an efi oldmap fix pending, but that's a different story.
Or - even more likely - I'm missing something entirely, and even EFI_OLD_MEMMAP solved this some other way upstream.
67a9108ed431 ("x86/efi: Build our own page table structures")
got rid of EFI depending on real_mode_header->trampoline_pgd
So I don't see how upstream needs the fix as the trampoline_pgd seems only to be used when coming out of the boot loader.
Adding Matt. He stepped back from EFI, but he might still know.
Adding Thomas Gleixner explicitly to the participants so that he can tell me I'm a moron and point me to the right thing.
Your wish is my command, but I need to stare some more before doing so.
Thanks,
tglx
On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
67a9108ed431 ("x86/efi: Build our own page table structures")
got rid of EFI depending on real_mode_header->trampoline_pgd
So I think it only got rid of it by default - the codepath is still there, the allocation is still there, it's just that it's not actually used unless somebody does that "efi=old_map" thing.
Looking around, there's at least one quirk for the SGI UV1 system that enables EFI_OLD_MEMMAP automatically. There might be others that I missed, but I think that's it.
So it *can* trigger without "efi=old_map", but not on any normal machines.
And as Pavel points out, even when the bug is active, it's pretty hard to actually trigger.
But yeah, there may be other EFI patches that I didn't notice that changed things in other ways too.
Linus
On Thu, 11 Jan 2018, Linus Torvalds wrote:
On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
67a9108ed431 ("x86/efi: Build our own page table structures")
got rid of EFI depending on real_mode_header->trampoline_pgd
So I think it only got rid of by default - the codepath is still there, the allocation is still there, it's just that it's not actually used unless somebody does that "efi=old_mmap" thing.
Yes, the trampoline_pgd is still around, but I can't figure out how it would be used after boot. Confused, digging more.
Thanks,
tglx
On Thu, 11 Jan 2018, Thomas Gleixner wrote:
On Thu, 11 Jan 2018, Linus Torvalds wrote:
On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
67a9108ed431 ("x86/efi: Build our own page table structures")
got rid of EFI depending on real_mode_header->trampoline_pgd
So I think it only got rid of by default - the codepath is still there, the allocation is still there, it's just that it's not actually used unless somebody does that "efi=old_mmap" thing.
Yes, the trampoline_pgd is still around, but I can't figure out how it would be used after boot. Confused, digging more.
So coming back to the same commit. From the changelog:
This is caused by mapping EFI regions with RWX permissions. There isn't much we can do to restrict the permissions for these regions due to the way the firmware toolchains mix code and data, but we can at least isolate these mappings so that they do not appear in the regular kernel page tables.
In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping") we started using 'trampoline_pgd' to map the EFI regions because there was an existing identity mapping there which we use during the SetVirtualAddressMap() call and for broken firmware that accesses those addresses.
So this very commit gets rid of the (ab)use of trampoline_pgd and allocates efi_pgd, which we made use the proper size.
trampoline_pgd is since then only used to get into long mode in realmode/rm/trampoline_64.S and for reboot in machine_real_restart().
The runtime services stuff does not use it in kernel versions >= 4.6
Thanks,
tglx
Yes, and addressing Linus' concern about EFI_OLD_MEMMAP, those paths are independent of it. When EFI_OLD_MEMMAP is enabled, the efi pgd is not used, and the bug will not bite.
- Steve
On Thu, 11 Jan 2018, Steven Sistare wrote:
On 1/11/2018 5:30 PM, Thomas Gleixner wrote:
On Thu, 11 Jan 2018, Thomas Gleixner wrote:
On Thu, 11 Jan 2018, Linus Torvalds wrote:
On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
67a9108ed431 ("x86/efi: Build our own page table structures")
got rid of EFI depending on real_mode_header->trampoline_pgd
So I think it only got rid of by default - the codepath is still there, the allocation is still there, it's just that it's not actually used unless somebody does that "efi=old_mmap" thing.
Yes, the trampoline_pgd is still around, but I can't figure out how it would be used after boot. Confused, digging more.
So coming back to the same commit. From the changelog:
This is caused by mapping EFI regions with RWX permissions. There isn't much we can do to restrict the permissions for these regions due to the way the firmware toolchains mix code and data, but we can at least isolate these mappings so that they do not appear in the regular kernel page tables.
In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping") we started using 'trampoline_pgd' to map the EFI regions because there was an existing identity mapping there which we use during the SetVirtualAddressMap() call and for broken firmware that accesses those addresses.
So this very commit gets rid of the (ab)use of trampoline_pgd and allocates efi_pgd, which we made use the proper size.
trampoline_pgd is since then only used to get into long mode in realmode/rm/trampoline_64.S and for reboot in machine_real_restart().
The runtime services stuff does not use it in kernel versions >= 4.6
Thanks,
tglx
Yes, and addressing Linus' concern about EFI_OLD_MEMMAP, those paths are independent of it. When EFI_OLD_MEMMAP is enabled, the efi pgd is not used, and the bug will not bite.
We have a fix queued in tip/x86/pti which addresses a missing NX clear, but that's a different story.
Thanks,
tglx
On Thu, Jan 11, 2018 at 11:47:23PM +0100, Thomas Gleixner wrote:
On Thu, 11 Jan 2018, Steven Sistare wrote:
On 1/11/2018 5:30 PM, Thomas Gleixner wrote:
On Thu, 11 Jan 2018, Thomas Gleixner wrote:
On Thu, 11 Jan 2018, Linus Torvalds wrote:
On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
67a9108ed431 ("x86/efi: Build our own page table structures")
got rid of EFI depending on real_mode_header->trampoline_pgd
So I think it only got rid of by default - the codepath is still there, the allocation is still there, it's just that it's not actually used unless somebody does that "efi=old_mmap" thing.
Yes, the trampoline_pgd is still around, but I can't figure out how it would be used after boot. Confused, digging more.
So coming back to the same commit. From the changelog:
This is caused by mapping EFI regions with RWX permissions. There isn't much we can do to restrict the permissions for these regions due to the way the firmware toolchains mix code and data, but we can at least isolate these mappings so that they do not appear in the regular kernel page tables.
In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping") we started using 'trampoline_pgd' to map the EFI regions because there was an existing identity mapping there which we use during the SetVirtualAddressMap() call and for broken firmware that accesses those addresses.
So this very commit gets rid of the (ab)use of trampoline_pgd and allocates efi_pgd, which we made use the proper size.
trampoline_pgd is since then only used to get into long mode in realmode/rm/trampoline_64.S and for reboot in machine_real_restart().
The runtime services stuff does not use it in kernel versions >= 4.6
Thanks,
tglx
Yes, and addressing Linus' concern about EFI_OLD_MEMMAP, those paths are independent of it. When EFI_OLD_MEMMAP is enabled, the efi pgd is not used, and the bug will not bite.
We have a fix queued in tip/x86/pti which addresses a missing NX clear, but that's a different story.
Since you are talking about NX, I see this in last night's -next:
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle kernel paging request at fffffe0000007000
IP: 0xfffffe0000006e9d
PGD ffd6067 P4D ffd6067 PUD ffd5067 PMD ff73067 PTE 800000000fc09063
Oops: 0011 [#1] PREEMPT SMP PTI
Modules linked in:
CPU: 0 PID: 1 Comm: init Tainted: G W 4.15.0-rc7-next-20180111-yocto-standard #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:0xfffffe0000006e9d
RSP: 0018:ffffaee28000ffd0 EFLAGS: 00000006
RAX: 000000000000000c RBX: 0000000000400040 RCX: 00007f2c4186ad6a
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffb6a00000
RBP: 0000000000000008 R08: 000000000000037f R09: 0000000000000064
R10: 00000000078bfbfd R11: 0000000000000246 R12: 00007f2c41856a60
R13: 0000000000000000 R14: 0000000000402368 R15: 0000000000001000
FS: 0000000000000000(0000) GS:ffff95fecfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffe0000007000 CR3: 000000000d88a000 CR4: 00000000003406f0
Call Trace:
Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <90> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
RIP: 0xfffffe0000006e9d RSP: ffffaee28000ffd0
CR2: fffffe0000007000
---[ end trace a82b8742114c1785 ]---
Is this the issue you are talking about, or is the fix triggering the crash ?
Guenter
On Thu, Jan 11, 2018 at 2:42 PM, Steven Sistare steven.sistare@oracle.com wrote:
Yes, and addressing Linus' concern about EFI_OLD_MEMMAP, those paths are independent of it. When EFI_OLD_MEMMAP is enabled, the efi pgd is not used, and the bug will not bite.
Ok, good. Thanks for checking.
Linus
On Thu, 11 Jan 2018, Thomas Gleixner wrote:
On Thu, 11 Jan 2018, Thomas Gleixner wrote:
On Thu, 11 Jan 2018, Linus Torvalds wrote:
On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
67a9108ed431 ("x86/efi: Build our own page table structures")
got rid of EFI depending on real_mode_header->trampoline_pgd
So I think it only got rid of by default - the codepath is still there, the allocation is still there, it's just that it's not actually used unless somebody does that "efi=old_mmap" thing.
Yes, the trampoline_pgd is still around, but I can't figure out how it would be used after boot. Confused, digging more.
So coming back to the same commit. From the changelog:
This is caused by mapping EFI regions with RWX permissions. There isn't much we can do to restrict the permissions for these regions due to the way the firmware toolchains mix code and data, but we can at least isolate these mappings so that they do not appear in the regular kernel page tables.
In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping") we started using 'trampoline_pgd' to map the EFI regions because there was an existing identity mapping there which we use during the SetVirtualAddressMap() call and for broken firmware that accesses those addresses.
So this very commit gets rid of the (ab)use of trampoline_pgd and allocates efi_pgd, which we made use the proper size.
trampoline_pgd is since then only used to get into long mode in realmode/rm/trampoline_64.S and for reboot in machine_real_restart().
The runtime services stuff does not use it in kernel versions >= 4.6
But there is one very well hidden user for it after boot:
It's used for booting secondary CPUs from real mode
So the transition to long mode for secondaries uses the trampoline pgd for long mode transition and then jumping to secondary_startup_64 where CR3 is set to the real kernel page tables.
Thanks,
tglx
On Fri, Jan 12, 2018 at 12:03:10AM +0100, Thomas Gleixner wrote:
On Thu, 11 Jan 2018, Thomas Gleixner wrote:
On Thu, 11 Jan 2018, Thomas Gleixner wrote:
On Thu, 11 Jan 2018, Linus Torvalds wrote:
On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
67a9108ed431 ("x86/efi: Build our own page table structures")
got rid of EFI depending on real_mode_header->trampoline_pgd
So I think it only got rid of by default - the codepath is still there, the allocation is still there, it's just that it's not actually used unless somebody does that "efi=old_mmap" thing.
Yes, the trampoline_pgd is still around, but I can't figure out how it would be used after boot. Confused, digging more.
So coming back to the same commit. From the changelog:
This is caused by mapping EFI regions with RWX permissions. There isn't much we can do to restrict the permissions for these regions due to the way the firmware toolchains mix code and data, but we can at least isolate these mappings so that they do not appear in the regular kernel page tables.
In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping") we started using 'trampoline_pgd' to map the EFI regions because there was an existing identity mapping there which we use during the SetVirtualAddressMap() call and for broken firmware that accesses those addresses.
So this very commit gets rid of the (ab)use of trampoline_pgd and allocates efi_pgd, which we made use the proper size.
trampoline_pgd is since then only used to get into long mode in realmode/rm/trampoline_64.S and for reboot in machine_real_restart().
The runtime services stuff does not use it in kernel versions >= 4.6
But there is one very well hidden user for it after boot:
It's used for booting secondary CPUs from real mode
So the transition to long mode for secondaries uses the trampoline pgd for long mode transition and then jumping to secondary_startup_64 where CR3 is set to the real kernel page tables.
Ok, so the summary is that this patch is only needed for the 4.4 and 4.9 kernels, and _NOT_ for Linus's tree and 4.14, right?
thanks,
greg k-h
On Fri, 12 Jan 2018, Greg Kroah-Hartman wrote:
On Fri, Jan 12, 2018 at 12:03:10AM +0100, Thomas Gleixner wrote:
So the transition to long mode for secondaries uses the trampoline pgd for long mode transition and then jumping to secondary_startup_64 where CR3 is set to the real kernel page tables.
Ok, so the summary is that this patch is only needed for the 4.4 and 4.9 kernels, and _NOT_ for Linus's tree and 4.14, right?
Correct.
On 1/11/2018 3:46 PM, Linus Torvalds wrote:
On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
67a9108ed431 ("x86/efi: Build our own page table structures")
got rid of EFI depending on real_mode_header->trampoline_pgd
So I think it only got rid of by default - the codepath is still there, the allocation is still there, it's just that it's not actually used unless somebody does that "efi=old_mmap" thing.
Looking around, there's at least one quirk for the SGI UV1 system that enables EFI_OLD_MMAP automatically. There might be others that I missed, but I think that's it.
So it *can* trigger without "efi=old_mmap", but not on any normal machines.
And as Pavel points out, even when the bug is active, it's pretty hard to actually trigger.
But yeah, there may be other EFI patches that I didn't notice that changed things in other ways too.
Linus
The bug is not present in the latest upstream kernel because the efi_pgd is correctly aligned:
arch/x86/platform/efi/efi_64.c
    int __init efi_alloc_page_tables(void)
        efi_pgd = (pgd_t *)__get_free_pages(gfp_mask, PGD_ALLOCATION_ORDER);

arch/x86/include/asm/pgalloc.h
    +#ifdef CONFIG_PAGE_TABLE_ISOLATION
    +#define PGD_ALLOCATION_ORDER 1
    +#else
    +#define PGD_ALLOCATION_ORDER 0
    +#endif
Pavel's patch fixes kernels prior to 67a9108ed431 ("x86/efi: Build our own page table structures")
where the efi pgd allocation looks like:
arch/x86/realmode/init.c
    void __init reserve_real_mode(void)
        mem = memblock_find_in_range(0, 1<<20, size, PAGE_SIZE);
        base = __va(mem);
        real_mode_header = (struct real_mode_header *) base;

    void __init setup_real_mode(void)
        trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
Kernel versions between 67a9108ed431 and the latest also have the bug and need a similar fix:
arch/x86/platform/efi/efi_64.c
    int __init efi_alloc_page_tables(void)
        efi_pgd = (pgd_t *)__get_free_page(gfp_mask);

    int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
        pgd = efi_pgd;
        efi_scratch.efi_pgt = (pgd_t *)__pa(efi_pgd);
All of the code paths above are taken when *not* EFI_OLD_MMAP.
- Steve
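To make the alignment point above concrete, here is a minimal sketch of what the allocation order implies. It assumes 4 KiB pages and relies on the buddy allocator handing out blocks that are naturally aligned to their own size; the helper names and the ORDER_SKETCH_PAGE_SIZE macro are hypothetical and not taken from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages, as on x86-64. */
#define ORDER_SKETCH_PAGE_SIZE 4096UL

/* An order-N allocation covers 2^N contiguous pages. */
static size_t order_to_bytes(unsigned int order)
{
    return (size_t)ORDER_SKETCH_PAGE_SIZE << order;
}

/* True if addr sits on an order-N block boundary. */
static bool aligned_to_order(uint64_t addr, unsigned int order)
{
    return (addr & (order_to_bytes(order) - 1)) == 0;
}
```

With order 1, order_to_bytes() gives 8192, so a pgd obtained that way starts on an 8 KiB boundary. A single page from __get_free_page(), or a memblock range requested with PAGE_SIZE alignment, is only guaranteed to start on a 4 KiB boundary.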
On Thu, 11 Jan 2018, Steven Sistare wrote:
On 1/11/2018 3:46 PM, Linus Torvalds wrote:
On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
67a9108ed431 ("x86/efi: Build our own page table structures")
got rid of EFI depending on real_mode_header->trampoline_pgd
So I think it only got rid of by default - the codepath is still there, the allocation is still there, it's just that it's not actually used unless somebody does that "efi=old_mmap" thing.
Looking around, there's at least one quirk for the SGI UV1 system that enables EFI_OLD_MMAP automatically. There might be others that I missed, but I think that's it.
So it *can* trigger without "efi=old_mmap", but not on any normal machines.
And as Pavel points out, even when the bug is active, it's pretty hard to actually trigger.
But yeah, there may be other EFI patches that I didn't notice that changed things in other ways too.
Linus
The bug is not present in the latest upstream kernel because the efi_pgd is correctly aligned:
arch/x86/platform/efi/efi_64.c
    int __init efi_alloc_page_tables(void)
        efi_pgd = (pgd_t *)__get_free_pages(gfp_mask, PGD_ALLOCATION_ORDER);
Yes, I came to exactly the same conclusion, but I didn't want to call Linus a moron before I triple checked that trampoline_pgd is still there, but only ever used to get out of the realmode swamp at boot.
Thanks,
tglx
On Jan 11, 2018 13:35, "Steven Sistare" steven.sistare@oracle.com wrote:
All of the code paths above are taken when *not* EFI_OLD_MMAP.
But it is exactly the EFI_OLD_MMAP case I worry about.
Nobody should hopefully use it, but as mentioned, at least the SGI UV1 case enables it automatically.
And who knows how many users ended up adding it manually due to the problems we had for a while with EFI page tables (due to non-linear addresses when laying out the EFI data, but also due to bad EFI memory information from the BIOS)
Linus
On Thu, Jan 11, 2018 at 01:36:50PM -0500, Pavel Tatashin wrote:
I have root caused the memory corruption panics/hangs that I've been experiencing during boot with the latest 4.4.110 kernel. The problem, as was suspected by Andy Lutomirski, is with the interaction between PTI and EFI. It may affect any system that has an EFI BIOS. I have not verified if it can affect any other kernel besides 4.4.110.
Attached is the fix for this issue with explanations that Steve Sistare and I developed.
Nice, but why does this not show up in 4.9 and 4.14 and Linus's tree as well on this hardware? Nor on the SLES12 SP3 kernel?
What is different there that 4.4 requires? That worries me more than your fix (which looks good to me, fwiw.)
thanks,
greg k-h
On Thu, Jan 11, 2018 at 12:10 PM, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
Nice, but why does this not show up in 4.9 and 4.14 and Linus's tree as well on this hardware? Nor on the SLES12 SP3 kernel?
What is different there that 4.4 requires? That worries me more than your fix (which looks good to me, fwiw.)
I really think it's simply that since v4.6, we've had commit 67a9108ed431 ("x86/efi: Build our own page table structures"), so no normal EFI use actually uses the old legacy mapping unless you passed in "efi=old_map" on the kernel command line.
So the bug is there in all versions, it's just that it's normally only noticeable in 4.4.
But I might be missing some other difference, so take that with a pinch of salt.
Linus
On 01/11/2018 03:10 PM, Greg Kroah-Hartman wrote:
On Thu, Jan 11, 2018 at 01:36:50PM -0500, Pavel Tatashin wrote:
I have root caused the memory corruption panics/hangs that I've been experiencing during boot with the latest 4.4.110 kernel. The problem, as was suspected by Andy Lutomirski, is with the interaction between PTI and EFI. It may affect any system that has an EFI BIOS. I have not verified if it can affect any other kernel besides 4.4.110.
Attached is the fix for this issue with explanations that Steve Sistare and I developed.
Nice, but why does this not show up in 4.9 and 4.14 and Linus's tree as well on this hardware? Nor on the SLES12 SP3 kernel?
What is different there that 4.4 requires? That worries me more than your fix (which looks good to me, fwiw.)
Hi Greg,
I have not studied other kernel versions; EFI has changed substantially since 4.4. But even on 4.4.110, several things have to happen for this bug to show up:
1. During boot, memblock must allocate an address that is not 2 * PAGE_SIZE aligned.
2. An NMI must arrive exactly when EFI has replaced the page table.
While I was debugging this problem, I tried to enable KASAN and vm_debug, add more printfs, etc., but every little change would cause this problem to disappear or appear less frequently.
Thank you, Pavel
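The first condition above can be sketched as follows. This assumes, as I understand the 4.4 KAISER code, that the user ("shadow") page table root is located by setting bit 12 (PAGE_SIZE) in the address of the kernel pgd; the function names and the KAISER_SKETCH_PAGE_SIZE macro are hypothetical and only illustrate the arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, shadow pgd found by setting bit 12. */
#define KAISER_SKETCH_PAGE_SIZE 4096ULL

/* Address the shadow pgd is assumed to live at for a given kernel pgd. */
static uint64_t sketch_shadow_pgd(uint64_t kernel_pgd)
{
    return kernel_pgd | KAISER_SKETCH_PAGE_SIZE;
}

/*
 * The "set bit 12" convention only reaches a distinct second page
 * when the kernel pgd starts on a 2 * PAGE_SIZE boundary.
 */
static bool sketch_pgd_pair_ok(uint64_t kernel_pgd)
{
    return (kernel_pgd & (2 * KAISER_SKETCH_PAGE_SIZE - 1)) == 0;
}
```

For a pgd at 0x8000 the derived shadow address is 0x9000, the second page of the pair. For a pgd at 0x7000, bit 12 is already set, so the derived address is 0x7000 again and no separate shadow page is reached.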
On Fri, Jan 05, 2018 at 10:15:00AM -0800, Andy Lutomirski wrote:
On Fri, Jan 5, 2018 at 9:52 AM, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
On Fri, Jan 05, 2018 at 12:48:54PM -0500, Pavel Tatashin wrote:
Boots successfully with "noefi" kernel parameter :)
Thanks, that will help me narrow it down. I'll dig through more patches when I get home tonight...
I wish you luck. The 4.4 series is "KAISER", not "KPTI", and the relevant code is spread all over the place and is generally garbage. See, for example, the turd called kaiser_set_shadow_pgd(). I would not be terribly surprised if that particular turd is biting here.
An alternative theory is that something is screwy in the EFI code. I don't see anything directly wrong, but it's certainly a bit sketchy. The newer kernels carefully avoid using PCID 0 for real work to avoid corruption due to EFI and similar things. The "KAISER" code has no such mitigation. Fortunately, it seems to use PCID=0 for kernel and PCID=nonzero for user, so the obvious problem isn't present, but something could still be wrong.
Pavel, can you send your /proc/cpuinfo on a noefi boot? (Just the first CPU worth is fine.)
FWIW, I said before that I have very little desire to help debug "KAISER". I stand by that.
I totally understand, and do not expect your help at all.
Worst case, I point people at 4.14 and tell them to upgrade; I'm not going to waste a ton of time on this for the exact same reasons you list here.
And yeah, kaiser_set_shadow_pgd() is horrid, I've already gotten sucked into it for long enough...
greg k-h
On Thu, Jan 4, 2018 at 12:43 PM, Andy Lutomirski luto@amacapital.net wrote:
On Jan 4, 2018, at 12:29 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Jan 4, 2018 at 12:16 PM, Thomas Voegtle tv@lio96.de wrote:
Attached a screenshot. Is that useful? Are there some debug options I can add?
Not much of an oops, because the SIGSEGV happens in user space. The only reason you get any kernel stack printout at all is because 'init' dying will make the kernel print that out.
The segfault address for init looks like the fixmap area to me (first byte in the last page of the fixmap?). "Error 5" means that it's a user-space read that got a protection fault. So it's not an LDT or GDT update or anything like that, it's a normal access from user space (or a qemu emulation bug, but that sounds unlikely).
Is that the vsyscall page?
Adding Luto to the participants. I think he noticed one of the vsyscall patches missing earlier in the 4.9 series. Maybe the 4.4 series had something similar..
That's almost certainly it.
I'll try to find some time today or tomorrow to add a proper selftest.
Give this a shot:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86...
Boot with each of vsyscall=none, vsyscall=native, and vsyscall=emulate and run both the 32-bit and 64-bit variants of that test. All six combinations should pass. But I bet they don't on 4.4.
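For reference, the "Error 5" reading quoted above follows from the x86 page-fault error code layout. Below is a minimal decoder sketch; the bit positions are architectural, but the SKETCH_PF_* names, the struct, and the function are made up for illustration and are not the kernel's own definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (names here are illustrative). */
#define SKETCH_PF_PROT  (1u << 0) /* 1 = protection fault, 0 = page not present */
#define SKETCH_PF_WRITE (1u << 1) /* 1 = write access, 0 = read */
#define SKETCH_PF_USER  (1u << 2) /* 1 = access came from user mode */
#define SKETCH_PF_RSVD  (1u << 3) /* 1 = reserved bit set in a paging entry */
#define SKETCH_PF_INSTR (1u << 4) /* 1 = instruction fetch */

struct pf_info {
    bool protection;
    bool write;
    bool user;
    bool reserved_bit;
    bool instr_fetch;
};

/* Split a raw error code into its individual flags. */
static struct pf_info decode_pf_error(uint32_t code)
{
    struct pf_info info = {
        .protection   = (code & SKETCH_PF_PROT) != 0,
        .write        = (code & SKETCH_PF_WRITE) != 0,
        .user         = (code & SKETCH_PF_USER) != 0,
        .reserved_bit = (code & SKETCH_PF_RSVD) != 0,
        .instr_fetch  = (code & SKETCH_PF_INSTR) != 0,
    };
    return info;
}
```

Error code 5 is binary 101: the protection and user bits are set and the write bit is clear, which is a user-space read hitting a protection fault.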
On Thu, Jan 4, 2018 at 9:33 PM, Andy Lutomirski luto@amacapital.net wrote:
On Thu, Jan 4, 2018 at 12:43 PM, Andy Lutomirski luto@amacapital.net wrote:
On Jan 4, 2018, at 12:29 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Jan 4, 2018 at 12:16 PM, Thomas Voegtle tv@lio96.de wrote:
Attached a screenshot. Is that useful? Are there some debug options I can add?
Not much of an oops, because the SIGSEGV happens in user space. The only reason you get any kernel stack printout at all is because 'init' dying will make the kernel print that out.
The segfault address for init looks like the fixmap area to me (first byte in the last page of the fixmap?). "Error 5" means that it's a user-space read that got a protection fault. So it's not an LDT or GDT update or anything like that, it's a normal access from user space (or a qemu emulation bug, but that sounds unlikely).
Is that the vsyscall page?
Adding Luto to the participants. I think he noticed one of the vsyscall patches missing earlier in the 4.9 series. Maybe the 4.4 series had something similar..
That's almost certainly it.
I'll try to find some time today or tomorrow to add a proper selftest.
Give this a shot:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86...
Boot with each of vsyscall=none, vsyscall=native, and vsyscall=emulate and run both the 32-bit and 64-bit variants of that test. All six combinations should pass. But I bet they don't on 4.4.
With my 4.4.110-rc1 under QEMU -cpu=host (Xeon E5-2690 v3)
vsyscall=emulate:
# ./test_vsyscall_64
...
[RUN] Checking read access to the vsyscall page
[FAIL] We don't have read access, but we should
vsyscall=native:
# ./test_vsyscall_64
...
[RUN] Checking read access to the vsyscall page
[FAIL] We don't have read access, but we should
Everything else passes.
Note that test_vsyscall_32 warns:
# ./test_vsyscall_32
Warning: failed to find getcpu in vDSO
...
-Kees
On Fri, Jan 05, 2018 at 02:12:33AM -0800, Kees Cook wrote:
On Thu, Jan 4, 2018 at 9:33 PM, Andy Lutomirski luto@amacapital.net wrote:
On Thu, Jan 4, 2018 at 12:43 PM, Andy Lutomirski luto@amacapital.net wrote:
On Jan 4, 2018, at 12:29 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Jan 4, 2018 at 12:16 PM, Thomas Voegtle tv@lio96.de wrote:
Attached a screenshot. Is that useful? Are there some debug options I can add?
Not much of an oops, because the SIGSEGV happens in user space. The only reason you get any kernel stack printout at all is because 'init' dying will make the kernel print that out.
The segfault address for init looks like the fixmap area to me (first byte in the last page of the fixmap?). "Error 5" means that it's a user-space read that got a protection fault. So it's not an LDT or GDT update or anything like that, it's a normal access from user space (or a qemu emulation bug, but that sounds unlikely).
Is that the vsyscall page?
Adding Luto to the participants. I think he noticed one of the vsyscall patches missing earlier in the 4.9 series. Maybe the 4.4 series had something similar..
That's almost certainly it.
I'll try to find some time today or tomorrow to add a proper selftest.
Give this a shot:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86...
Boot with each of vsyscall=none, vsyscall=native, and vsyscall=emulate and run both the 32-bit and 64-bit variants of that test. All six combinations should pass. But I bet they don't on 4.4.
With my 4.4.110-rc1 under QEMU -cpu=host (Xeon E5-2690 v3)
vsyscall=emulate:
# ./test_vsyscall_64
...
[RUN] Checking read access to the vsyscall page
[FAIL] We don't have read access, but we should
vsyscall=native:
# ./test_vsyscall_64
...
[RUN] Checking read access to the vsyscall page
[FAIL] We don't have read access, but we should
Everything else passes.
I get this same error with the latest 4.9-rc tree as well, but it works just fine on 4.15-rc6.
I'll look at the proposed patches now for this...
thanks so much for the test tool.
greg k-h
On Thu, Jan 04, 2018 at 08:38:23PM +0100, Thomas Voegtle wrote:
When I start 4.4.110-rc1 on a virtual machine (qemu) init throws a segfault and the kernel panics (attempted to kill init). The VM host is a Haswell system.
The same kernel binary boots fine on a (other) Haswell system.
I tried:
4.4.110-rc1 broken
4.4.109 ok
4.9.75-rc1 ok
All systems are OpenSuSE 42.3 64bit.
qemu is started only with: qemu-system-x86_64 -m 2048 -enable-kvm -drive file=tvsuse,format=raw,if=none,id=virtdisk0 -device virtio-blk-pci,scsi=off,drive=virtdisk0
Am I the only one who sees this? Has anyone booted that kernel on qemu?
I did, but not on Haswell, and not with the same root file system. It boots fine in qemu on E5-2690 v3.
Do you have a traceback ?
Guenter
Confused,
Thomas
On Thu, Jan 04, 2018 at 08:38:23PM +0100, Thomas Voegtle wrote:
When I start 4.4.110-rc1 on a virtual machine (qemu) init throws a segfault and the kernel panics (attempted to kill init). The VM host is a Haswell system.
The same kernel binary boots fine on a (other) Haswell system.
I tried:
4.4.110-rc1 broken
4.4.109 ok
4.9.75-rc1 ok
All systems are OpenSuSE 42.3 64bit.
qemu is started only with: qemu-system-x86_64 -m 2048 -enable-kvm -drive file=tvsuse,format=raw,if=none,id=virtdisk0 -device virtio-blk-pci,scsi=off,drive=virtdisk0
Am I the only one who sees this? Has anyone booted that kernel on qemu?
I've now released 4.4.110, which had 4 more patches on top of what 4.4.110-rc1 had in it, that should hopefully resolve these issues.
Can you test that and let me know if you still have problems?
thanks,
greg k-h
On Fri, 5 Jan 2018, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 08:38:23PM +0100, Thomas Voegtle wrote:
When I start 4.4.110-rc1 on a virtual machine (qemu) init throws a segfault and the kernel panics (attempted to kill init). The VM host is a Haswell system.
The same kernel binary boots fine on a (other) Haswell system.
I tried:
4.4.110-rc1 broken
4.4.109 ok
4.9.75-rc1 ok
All systems are OpenSuSE 42.3 64bit.
qemu is started only with: qemu-system-x86_64 -m 2048 -enable-kvm -drive file=tvsuse,format=raw,if=none,id=virtdisk0 -device virtio-blk-pci,scsi=off,drive=virtdisk0
Am I the only one who sees this? Has anyone booted that kernel on qemu?
I've now released 4.4.110, which had 4 more patches on top of what 4.4.110-rc1 had in it, that should hopefully resolve these issues.
Can you test that and let me know if you still have problems?
It's fixed. I can boot 4.4.110 on qemu without problems so far.
./test_vsyscall_64 still fails though, like Kees wrote about 4.4.110-rc1 https://lkml.org/lkml/2018/1/5/123
That's another issue?
Thank you very much.
Thomas
On Fri, Jan 05, 2018 at 04:25:33PM +0100, Thomas Voegtle wrote:
On Fri, 5 Jan 2018, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 08:38:23PM +0100, Thomas Voegtle wrote:
When I start 4.4.110-rc1 on a virtual machine (qemu) init throws a segfault and the kernel panics (attempted to kill init). The VM host is a Haswell system.
The same kernel binary boots fine on a (other) Haswell system.
I tried:
4.4.110-rc1 broken
4.4.109 ok
4.9.75-rc1 ok
All systems are OpenSuSE 42.3 64bit.
qemu is started only with: qemu-system-x86_64 -m 2048 -enable-kvm -drive file=tvsuse,format=raw,if=none,id=virtdisk0 -device virtio-blk-pci,scsi=off,drive=virtdisk0
Am I the only one who sees this? Has anyone booted that kernel on qemu?
I've now released 4.4.110, which had 4 more patches on top of what 4.4.110-rc1 had in it, that should hopefully resolve these issues.
Can you test that and let me know if you still have problems?
It's fixed. I can boot 4.4.110 on qemu without problems so far.
Yeah!!!
./test_vsyscall_64 still fails though, like Kees wrote about 4.4.110-rc1 https://lkml.org/lkml/2018/1/5/123
That's another issue?
Yes it is, that's next up to get resolved.
thanks for testing and letting me know,
greg k-h
On 01/03/2018 01:11 PM, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 4.4.110 release. There are 37 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri Jan 5 19:50:38 UTC 2018. Anything received after that time might be too late.
The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y and the diffstat can be found below.
thanks,
greg k-h
Based on the email threads, I expected to see issues, however, compiled and booted on my test system. No dmesg regressions.
thanks, -- Shuah
On Thu, Jan 04, 2018 at 03:00:29PM -0700, Shuah Khan wrote:
On 01/03/2018 01:11 PM, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 4.4.110 release. There are 37 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri Jan 5 19:50:38 UTC 2018. Anything received after that time might be too late.
The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y and the diffstat can be found below.
thanks,
greg k-h
Based on the email threads, I expected to see issues, however, compiled and booted on my test system. No dmesg regressions.
Hey, you got lucky :)
Thanks for testing all of these and letting me know.
greg k-h
On Wed, Jan 03, 2018 at 09:11:06PM +0100, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 4.4.110 release. There are 37 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri Jan 5 19:50:38 UTC 2018. Anything received after that time might be too late.
This is also reported to crash if loaded under qemu + haxm under windows. See https://www.spinics.net/lists/kernel/msg2689835.html for details. Here is a boot log (the log is from chromeos-4.4, but Tao Wu says that the same log is also seen with vanilla v4.4.110-rc1).
[ 0.712750] Freeing unused kernel memory: 552K
[ 0.721821] init: Corrupted page table at address 57b029b332e0
[ 0.722761] PGD 80000000bb238067 PUD bc36a067 PMD bc369067 PTE 45d2067
[ 0.722761] Bad pagetable: 000b [#1] PREEMPT SMP
[ 0.722761] Modules linked in:
[ 0.722761] CPU: 1 PID: 1 Comm: init Not tainted 4.4.96 #31
[ 0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[ 0.722761] task: ffff8800bc290000 ti: ffff8800bc28c000 task.ti: ffff8800bc28c000
[ 0.722761] RIP: 0010:[<ffffffff83f4129e>] [<ffffffff83f4129e>] __clear_user+0x42/0x67
[ 0.722761] RSP: 0000:ffff8800bc28fcf8 EFLAGS: 00010202
[ 0.722761] RAX: 0000000000000000 RBX: 00000000000001a4 RCX: 00000000000001a4
[ 0.722761] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000057b029b332e0
[ 0.722761] RBP: ffff8800bc28fd08 R08: ffff8800bc290000 R09: ffff8800bb2f4000
[ 0.722761] R10: ffff8800bc290000 R11: ffff8800bb2f4000 R12: 000057b029b332e0
[ 0.722761] R13: 0000000000000000 R14: 000057b029b33340 R15: ffff8800bb1e2a00
[ 0.722761] FS: 0000000000000000(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
[ 0.722761] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.722761] CR2: 000057b029b332e0 CR3: 00000000bb2f8000 CR4: 00000000000006e0
[ 0.722761] Stack:
[ 0.722761] 000057b029b332e0 ffff8800bb95fa80 ffff8800bc28fd18 ffffffff83f4120c
[ 0.722761] ffff8800bc28fe18 ffffffff83e9e7a1 ffff8800bc28fd68 0000000000000000
[ 0.722761] ffff8800bc290000 ffff8800bc290000 ffff8800bc290000 ffff8800bc290000
[ 0.722761] Call Trace:
[ 0.722761] [<ffffffff83f4120c>] clear_user+0x2e/0x30
[ 0.722761] [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7
[ 0.722761] [<ffffffff83de2088>] search_binary_handler+0x86/0x19c
[ 0.722761] [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98
[ 0.722761] [<ffffffff844febe0>] ? rest_init+0x87/0x87
[ 0.722761] [<ffffffff83de40be>] do_execve+0x23/0x25
[ 0.722761] [<ffffffff83c002e3>] run_init_process+0x2b/0x2d
[ 0.722761] [<ffffffff844fec4d>] kernel_init+0x6d/0xda
[ 0.722761] [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70
[ 0.722761] [<ffffffff844febe0>] ? rest_init+0x87/0x87
[ 0.722761] Code: 86 84 be 12 00 00 00 e8 87 0d e8 ff 66 66 90 48 89 d8 48 c1 eb 03 4c 89 e7 83 e0 07 48 89 d9 be 08 00 00 00 31 d2 48 85 c9 74 0a <48> 89 17 48 01 f7 ff c9 75 f6 48 89 c1 85 c9 74 09 88 17 48 ff
[ 0.722761] RIP [<ffffffff83f4129e>] __clear_user+0x42/0x67
[ 0.722761] RSP <ffff8800bc28fcf8>
[ 0.722761] ---[ end trace def703879b4ff090 ]---
[ 0.722761] BUG: sleeping function called from invalid context at /mnt/host/source/src/third_party/kernel/v4.4/kernel/locking/rwsem.c:21
[ 0.722761] in_atomic(): 0, irqs_disabled(): 1, pid: 1, name: init
[ 0.722761] CPU: 1 PID: 1 Comm: init Tainted: G D 4.4.96 #31
[ 0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[ 0.722761] 0000000000000086 dcb5d76098c89836 ffff8800bc28fa30 ffffffff83f34004
[ 0.722761] ffffffff84839dc2 0000000000000015 ffff8800bc28fa40 ffffffff83d57dc9
[ 0.722761] ffff8800bc28fa68 ffffffff83d57e6a ffffffff84a53640 0000000000000000
[ 0.722761] Call Trace:
[ 0.722761] [<ffffffff83f34004>] dump_stack+0x4d/0x63
[ 0.722761] [<ffffffff83d57dc9>] ___might_sleep+0x13a/0x13c
[ 0.722761] [<ffffffff83d57e6a>] __might_sleep+0x9f/0xa6
[ 0.722761] [<ffffffff84502788>] down_read+0x20/0x31
[ 0.722761] [<ffffffff83cc5d9b>] __blocking_notifier_call_chain+0x35/0x63
[ 0.722761] [<ffffffff83cc5ddd>] blocking_notifier_call_chain+0x14/0x16
[ 0.800374] usb 1-1: new full-speed USB device number 2 using uhci_hcd
[ 0.722761] [<ffffffff83cefe97>] profile_task_exit+0x1a/0x1c
[ 0.802309] [<ffffffff83cac84e>] do_exit+0x39/0xe7f
[ 0.802309] [<ffffffff83ce5938>] ? vprintk_default+0x1d/0x1f
[ 0.802309] [<ffffffff83d7bb95>] ? printk+0x57/0x73
[ 0.802309] [<ffffffff83c46e25>] oops_end+0x80/0x85
[ 0.802309] [<ffffffff83c7b747>] pgtable_bad+0x8a/0x95
[ 0.802309] [<ffffffff83ca7f4a>] __do_page_fault+0x8c/0x352
[ 0.802309] [<ffffffff83eefba5>] ? file_has_perm+0xc4/0xe5
[ 0.802309] [<ffffffff83ca821c>] do_page_fault+0xc/0xe
[ 0.802309] [<ffffffff84507682>] page_fault+0x22/0x30
[ 0.802309] [<ffffffff83f4129e>] ? __clear_user+0x42/0x67
[ 0.802309] [<ffffffff83f4127f>] ? __clear_user+0x23/0x67
[ 0.802309] [<ffffffff83f4120c>] clear_user+0x2e/0x30
[ 0.802309] [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7
[ 0.802309] [<ffffffff83de2088>] search_binary_handler+0x86/0x19c
[ 0.802309] [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98
[ 0.802309] [<ffffffff844febe0>] ? rest_init+0x87/0x87
[ 0.802309] [<ffffffff83de40be>] do_execve+0x23/0x25
[ 0.802309] [<ffffffff83c002e3>] run_init_process+0x2b/0x2d
[ 0.802309] [<ffffffff844fec4d>] kernel_init+0x6d/0xda
[ 0.802309] [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70
[ 0.802309] [<ffffffff844febe0>] ? rest_init+0x87/0x87
[ 0.830559] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[ 0.830559]
[ 0.831305] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 0.831305] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
The crash part of this problem may be solved with the following patch (thanks to Hugh for the hint). There is still another problem, though - with this patch applied, the qemu session aborts with "VCPU Shutdown request", whatever that means.
Guenter
---
From: Guenter Roeck groeck@chromium.org
Date: Thu, 4 Jan 2018 13:41:55 -0800
Subject: [PATCH 2/2] WIP: kaiser: Set _PAGE_NX only if supported

Change-Id: Ie6ab566c1d725b24c4b3aa80a47c3ff3a5feddb9
Signed-off-by: Guenter Roeck groeck@chromium.org
---
 arch/x86/mm/kaiser.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 7d2f7eb6857f..e4706273d4a1 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -421,7 +421,8 @@ pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
 			 * get out to userspace running on the kernel CR3,
 			 * userspace will crash instead of running.
 			 */
-			pgd.pgd |= _PAGE_NX;
+			if (__supported_pte_mask & _PAGE_NX)
+				pgd.pgd |= _PAGE_NX;
 		}
 	} else if (!pgd.pgd) {
 		/*
On Thu, Jan 4, 2018 at 3:45 PM, Guenter Roeck linux@roeck-us.net wrote:
[ 0.721821] init: Corrupted page table at address 57b029b332e0 [ 0.722761] PGD 80000000bb238067 PUD bc36a067 PMD bc369067 PTE 45d2067 [ 0.722761] Bad pagetable: 000b [#1] PREEMPT SMP
Ok, it's unhappy because the RSVD bit is set in the error code.
And yeah, that seems to be due to NX in the pgd (nothing else is certainly set), with presumably a virtual machine that doesn't support it.
So I suspect your patch is indeed the right thing.
The crash part of this problem may be solved with the following patch (thanks to Hugh for the hint). There is still another problem, though - with this patch applied, the qemu session aborts with "VCPU Shutdown request", whatever that means.
Presumably that is a triple fault.
That causes a reboot traditionally, and in a virtual environment that would be approximated with a VCPU shutdown.
Linus
On Thu, 2018-01-04 at 15:45 -0800, Guenter Roeck wrote:
The crash part of this problem may be solved with the following patch (thanks to Hugh for the hint). There is still another problem, though - with this patch applied, the qemu session aborts with "VCPU Shutdown request", whatever that means.
The crash part is not fixed by your patch here, w/wo I get this, and it is PTI, as virgin 109 boots/works with identical everything else. My shiny new PTI equipped enterprise 4.4 RT kernels also boot/work fine, which seems a bit odd.. and not particularly comforting.
[ 1.244354] Freeing unused kernel memory: 1192K
[ 1.245278] Write protecting the kernel read-only data: 10240k
[ 1.247626] Freeing unused kernel memory: 1152K
[ 1.251318] Freeing unused kernel memory: 1476K
[ 1.253393] init[1]: segfault at ffffffffff5ff100 ip 00007fffb7ffac6e sp 00007fffb7fa07d8 error 5
[ 1.254629] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 1.254629]
[ 1.256202] CPU: 4 PID: 1 Comm: init Not tainted 4.4.110-rc1-smp #4
[ 1.257169] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[ 1.258563] 0000000000000000 ffffffff8125a9c0 ffffffff817de7c8 ffff880197e83cf0
[ 1.260850] ffffffff8112bb2d ffffffff00000010 ffff880197e83d00 ffff880197e83ca0
[ 1.263091] ffffffff81c3cf30 000000000000000b ffff880197e90010 0000000000000000
[ 1.264580] Call Trace:
[ 1.265617] [<ffffffff8125a9c0>] ? dump_stack+0x5c/0x7c
[ 1.266671] [<ffffffff8112bb2d>] ? panic+0xc8/0x20f
[ 1.267799] [<ffffffff81060af0>] ? do_exit+0xa50/0xa50
[ 1.268971] [<ffffffff810618e9>] ? do_group_exit+0x39/0xa0
[ 1.270281] [<ffffffff8106c8a0>] ? get_signal+0x1d0/0x600
[ 1.271347] [<ffffffff810041e3>] ? do_signal+0x23/0x5b0
[ 1.272259] [<ffffffff8106ade9>] ? __send_signal+0x179/0x460
[ 1.273235] [<ffffffff8104b88f>] ? force_sig_info_fault+0x5f/0x70
[ 1.274258] [<ffffffff8104bf6c>] ? __bad_area_nosemaphore+0x1cc/0x200
[ 1.275268] [<ffffffff8105a052>] ? exit_to_usermode_loop+0x54/0x95
[ 1.276262] [<ffffffff81001961>] ? prepare_exit_to_usermode+0x31/0x40
[ 1.277266] [<ffffffff814d9dbe>] ? retint_user+0x8/0x2c
[ 1.278274] Dumping ftrace buffer:
[ 1.279011] (ftrace buffer empty)
[ 1.279728] Kernel Offset: disabled
[ 1.280432] Rebooting in 60 seconds..
virsh # exit
On Fri, Jan 05, 2018 at 05:37:54AM +0100, Mike Galbraith wrote:
The crash part is not fixed by your patch here, w/wo I get this, and it is PTI, as virgin 109 boots/works with identical everything else. My shiny new PTI equipped enterprise 4.4 RT kernels also boot/work fine, which seems a bit odd.. and not particularly comforting.
Might I ask _what_ enterprise 4.4 kernels you are trying here? This should be the identical set to what is in the SLES12 tree, which worries me a lot...
thanks,
greg k-h
On Fri, 2018-01-05 at 13:17 +0100, Greg Kroah-Hartman wrote:
Might I ask _what_ enterprise 4.4 kernels you are trying here? This should be the identical set to what is in the SLES12 tree, which worries me a lot...
SLE12-SP[23]-RT, currently 4.4.104 based. Parent trees boot fine in the vm too. I thought perhaps it was a config difference, but seems not, 4.4.110-rc1 built with as close to enterprise config as you can get blows up the same as my light config.
-Mike
On Fri, Jan 05, 2018 at 02:03:17PM +0100, Mike Galbraith wrote:
SLE12-SP[23]-RT, currently 4.4.104 based. Parent trees boot fine in the vm too. I thought perhaps it was a config difference, but seems not, 4.4.110-rc1 built with as close to enterprise config as you can get blows up the same as my light config.
Ok, we found two patches that were missing in 4.4-stable that were in the SLES12 tree (thanks to Jamie Iles), now I only have 19k more to sift through :)
I should probably do an "interm" release to get people to be able to sync up to a common place easier for testing, dealing with patch sets and random emails saying different git ids is not easy for anyone.
thanks,
greg k-h
On Fri, 2018-01-05 at 14:34 +0100, Greg Kroah-Hartman wrote:
Ok, we found two patches that were missing in 4.4-stable that were in the SLES12 tree (thanks to Jamie Iles), now I only have 19k more to sift through :)
As you know, in enterprise, uname -r means you might find something this old in your kernel if you look hard enough :)
-Mike
On Fri, Jan 5, 2018 at 6:03 AM, Mike Galbraith efault@gmx.de wrote:
As you know, in enterprise, uname -r means you might find something this old in your kernel if you look hard enough :)
Mike, I think there's a good chance that Greg's 4.4.110 final will fix your "segfault at ffffffffff5ff100" crashes: please give it a try when you can, and let us know - thanks.
Hugh
On Fri, 2018-01-05 at 15:28 -0800, Hugh Dickins wrote:
Mike, I think there's a good chance that Greg's 4.4.110 final will fix your "segfault at ffffffffff5ff100" crashes: please give it a try when you can, and let us know - thanks.
Already done, and yes, it did.
-Mike
On Thu, Jan 04, 2018 at 03:45:55PM -0800, Guenter Roeck wrote:
This is also reported to crash if loaded under qemu + haxm under windows. See https://www.spinics.net/lists/kernel/msg2689835.html for details. Here is a boot log (the log is from chromeos-4.4, but Tao Wu says that the same log is also seen with vanilla v4.4.110-rc1).
[ ... ]
The crash part of this problem may be solved with the following patch (thanks to Hugh for the hint). There is still another problem, though - with this patch applied, the qemu session aborts with "VCPU Shutdown request", whatever that means.
Guenter
---
From: Guenter Roeck groeck@chromium.org
Date: Thu, 4 Jan 2018 13:41:55 -0800
Subject: [PATCH 2/2] WIP: kaiser: Set _PAGE_NX only if supported

Change-Id: Ie6ab566c1d725b24c4b3aa80a47c3ff3a5feddb9
Signed-off-by: Guenter Roeck groeck@chromium.org
---
 arch/x86/mm/kaiser.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 7d2f7eb6857f..e4706273d4a1 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -421,7 +421,8 @@ pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
 			 * get out to userspace running on the kernel CR3,
 			 * userspace will crash instead of running.
 			 */
-			pgd.pgd |= _PAGE_NX;
+			if (__supported_pte_mask & _PAGE_NX)
+				pgd.pgd |= _PAGE_NX;
 		}
 	} else if (!pgd.pgd) {
 		/*
Very good catch, this mirrors almost what is in 4.14 in this area. I'll go queue this up for 4.9 and 4.4 stable trees.
thanks,
greg k-h
On Fri, Jan 05, 2018 at 02:41:04PM +0100, Greg Kroah-Hartman wrote:
On Thu, Jan 04, 2018 at 03:45:55PM -0800, Guenter Roeck wrote:
This is also reported to crash if loaded under qemu + haxm under windows.
[ ... ]
The crash part of this problem may be solved with the following patch (thanks to Hugh for the hint). There is still another problem, though - with this patch applied, the qemu session aborts with "VCPU Shutdown request", whatever that means.
v4.4.110 still suffers from "VCPU Shutdown request" with qemu+haxm. Unfortunately I don't have any other information about the problem at this time.
Guenter
On Thu, Jan 4, 2018 at 5:11 AM, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 4.4.110 release. There are 37 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri Jan 5 19:50:38 UTC 2018. Anything received after that time might be too late.
The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y and the diffstat can be found below.
This patchset merges correctly with Gentoo patches and GCC version 6.4.0. The kernel boots up correctly. Logs: http://kernel1.amd64.dev.gentoo.org:8010/#/builders/5/builds/44
On Sat, Jan 06, 2018 at 02:20:16AM +0900, Alice Ferrazzi wrote:
This patchset merges correctly with Gentoo patches and GCC version 6.4.0 The kernel boot up correctly. Logs: http://kernel1.amd64.dev.gentoo.org:8010/#/builders/5/builds/44
Great, but Gentoo really should be moving to 4.9 and 4.14 here, I hope no one running Gentoo is relying on 4.4 :)
thanks,
greg k-h
Quoting Greg Kroah-Hartman (gregkh@linuxfoundation.org):
Great, but Gentoo really should be moving to 4.9 and 4.14 here, I hope no one running Gentoo is relying on 4.4 :)
Wait what?
According to https://www.kernel.org/category/releases.html 4.4 should be the best bet for longest support, right? Does that page need to be updated? If 4.4 is not going to be supported, is there anything else with a possible 5-6 years of support?
On Tue, Jan 09, 2018 at 01:49:48PM -0600, Serge E. Hallyn wrote:
Wait what?
According to https://www.kernel.org/category/releases.html 4.4 should be the best bet for longest support, right? Does that page need to be updated? If 4.4 is not going to be supported, is there anything else with a possible 5-6 years of support?
4.4 is going to be supported, yes, but really, for a desktop/server system, why would you ever want to stick with it for anything longer than a year? No new hardware support is added, and no new features that you would want are in there.
The LTS kernels are for the crazy embedded people that don't change their hardware systems, and have the insane huge number of out-of-tree patches. No one else should be using those kernels, they should always be using newer ones, as there are always more issues fixed in newer kernels than older ones.
So again, I hope no one running Gentoo, which is a rolling, constantly updated distro, is using the old and crusty 4.4 kernel release. To do so is to defeat the purpose of relying on Gentoo in the first place...
thanks,
greg k-h
Quoting Greg Kroah-Hartman (gregkh@linuxfoundation.org):
4.4 is going to be supported, yes, but really, for a desktop/server system, why would you ever want to stick with it for anything longer than a year? No new hardware support is added, and no new features that you would want are in there.
The LTS kernels are for the crazy embedded people that don't change their hardware systems, and have the insane huge number of out-of-tree patches. No one else should be using those kernels, they should always be using newer ones, as there are always more issues fixed in newer kernels than older ones.
So again, I hope no one running Gentoo, which is a rolling, constantly updated distro, is using the old and crusty 4.4 kernel release. To do so is to defeat the purpose of relying on Gentoo in the first place...
Ah, I see, yeah that makes sense :)
thanks, -serge
On Wed, Jan 03, 2018 at 09:11:06PM +0100, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 4.4.110 release. There are 37 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri Jan 5 19:50:38 UTC 2018. Anything received after that time might be too late.
Update: v4.4.110 final nosmp builds fail as follows:
------------
Error log:
arch/x86/entry/vdso/vma.c: In function ‘map_vdso’:
arch/x86/entry/vdso/vma.c:173:9: error: implicit declaration of function ‘pvclock_pvti_cpu0_va’
Guenter
On Fri, Jan 05, 2018 at 09:56:16AM -0800, Guenter Roeck wrote:
Update: v4.4.110 final nosmp builds fail as follows:
Error log: arch/x86/entry/vdso/vma.c: In function ‘map_vdso’: arch/x86/entry/vdso/vma.c:173:9: error: implicit declaration of function ‘pvclock_pvti_cpu0_va’
x86-64 or i386? That should be a CONFIG_PARAVIRT_CLOCK issue, not a smp build issue, have a .config I can try?
thanks,
greg k-h
On Fri, Jan 05, 2018 at 09:54:45PM +0100, Greg Kroah-Hartman wrote:
On Fri, Jan 05, 2018 at 09:56:16AM -0800, Guenter Roeck wrote:
Update: v4.4.110 final nosmp builds fail as follows:
Error log: arch/x86/entry/vdso/vma.c: In function ‘map_vdso’: arch/x86/entry/vdso/vma.c:173:9: error: implicit declaration of function ‘pvclock_pvti_cpu0_va’
x86-64 or i386?
x86-64
That should be a CONFIG_PARAVIRT_CLOCK issue, not a smp build issue, have a .config I can try?
https://github.com/groeck/linux-build-test/blob/master/rootfs/x86_64/qemu_x8...
However, https://github.com/groeck/linux-build-test/blob/master/rootfs/x86_64/qemu_x8... does build, and the only differences are:
30a31
> CONFIG_SMP=y
32a34,35
> CONFIG_NR_CPUS=24
> CONFIG_SCHED_SMT=y
44d46
< CONFIG_ACPI_CONTAINER=y
Both configurations have CONFIG_PARAVIRT_CLOCK disabled.
Guenter
On 01/05/2018 12:54 PM, Greg Kroah-Hartman wrote:
On Fri, Jan 05, 2018 at 09:56:16AM -0800, Guenter Roeck wrote:
Update: v4.4.110 final nosmp builds fail as follows:
Error log: arch/x86/entry/vdso/vma.c: In function ‘map_vdso’: arch/x86/entry/vdso/vma.c:173:9: error: implicit declaration of function ‘pvclock_pvti_cpu0_va’
x86-64 or i386? That should be a CONFIG_PARAVIRT_CLOCK issue, not a smp build issue, have a .config I can try?
Here is an easier way to reproduce the problem: make allnoconfig ; make
Guenter