This is the start of the stable review cycle for the 3.16.59 release. There are 131 patches in this series, which will be posted as responses to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Mon Oct 01 21:43:06 UTC 2018. Anything received after that time might be too late.
All the patches have also been committed to the linux-3.16.y-rc branch of https://git.kernel.org/pub/scm/linux/kernel/git/bwh/linux-stable-rc.git . A shortlog and diffstat can be found below.
Ben.
-------------
Andi Kleen (10): x86/mm/kmmio: Make the tracer robust against L1TF [1063711b57393c1999248cccb57bebfaf16739e7] x86/mm/pat: Make set_memory_np() L1TF safe [958f79b9ee55dfaf00c8106ed1c22a2919e0028b] x86/speculation/l1tf: Add sysfs reporting for l1tf [17dbca119312b4e8173d4e25ff64262119fcef38] x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings [42e4089c7890725fcd329999252dc489b72f2921] x86/speculation/l1tf: Increase 32bit PAE __PHYSICAL_PAGE_SHIFT [50896e180c6aa3a9c61a26ced99e15d602666a4c] x86/speculation/l1tf: Invert all not present mappings [f22cc87f6c1f771b57c407555cfefd811cdd9507] x86/speculation/l1tf: Limit swap file size to MAX_PA/2 [377eeaa8e11fe815b1d07c81c4a0e2843a8c15eb] x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert [0768f91530ff46683e0b372df14fd79fe8d156e5] x86/speculation/l1tf: Make sure the first page is always reserved [10a70416e1f067f6c4efda6ffd8ea96002ac4223] x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation [6b28baca9b1f0d4a42b865da7a05b1c81424bd5c]
Andy Lutomirski (2): mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP [5dd0b16cdaff9b94da06074d5888b03235c0bf17] mm: Add vm_insert_pfn_prot() [1745cbc5d0dee0749a6bc0ea8e872c5db0074061]
Andy Whitcroft (1): floppy: Do not copy a kernel pointer to user memory in FDGETPRM ioctl [65eea8edc315589d6c993cf12dbb5d0e9ef1fe4e]
Ben Hutchings (3): x86/cpufeatures: Show KAISER in cpuinfo [not upstream; upstream shows the related PTI feature] x86/speculation/l1tf: Protect NUMA-balance entries against L1TF [not upstream; these functions were removed in 4.0] x86: mm: Add PUD functions [a00cc7d9dd93d66a3fb83fc52aa57a4bec51c517]
Borislav Petkov (3): Documentation/spec_ctrl: Do some minor cleanups [dd0792699c4058e63c0715d9a7c2d40226fcdddc] x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host} [cc69b34989210f067b2c51d5539b5f96ebcc3a01] x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP [e7c587da125291db39ddf1f49b18e5970adbac17]
Dan Williams (1): mm: fix cache mode tracking in vm_insert_mixed() [87744ab3832b83ba71b931f86f9cfdb000d07da5]
Daniel Rosenberg (1): HID: debug: check length before copy_to_user() [717adfdaf14704fd3ec7fa2c04520c0723247eac]
Dave Airlie (2): drm/drivers: add support for using the arch wc mapping API. [7cf321d118a825c1541b43ca45294126fd474efa] x86/io: add interface to reserve io memtype for a resource range. (v1.1) [8ef4227615e158faa4ee85a1d6466782f7e22f2f]
Dave Hansen (1): x86/mm: Move swap offset/type up in PTE to work around erratum [00839ee3b299303c6a5e26a0a2485427a3afcbbf]
Finn Thain (1): via-cuda: Use spinlock_irq_save/restore instead of enable/disable_irq [ac39452e942af6a212e8f89e8a36b71354323845]
Jim Mattson (1): x86/cpu: Make alternative_msr_write work for 32-bit code [5f2b745f5e1304f438f9b2cd03ebc8120b6e0d3b]
Jiri Kosina (3): x86/bugs: Fix __ssb_select_mitigation() return type [d66d8ff3d21667b41eddbe86b35ab411e40d8c5f] x86/bugs: Make cpu_show_common() static [7bb4d366cba992904bffa4820d24e70a3de93e76] x86/speculation/l1tf: Unbreak !__HAVE_ARCH_PFN_MODIFY_ALLOWED architectures [6c26fcd2abfe0a56bbd95271fce02df2896cfd24]
Juergen Gross (1): x86/xen: Add call of speculative_store_bypass_ht_init() to PV paths [74899d92e66663dc7671a8017b3146dcd4735f3b]
Kees Cook (6): exec: Limit arg stack to at most 75% of _STK_LIM [da029c11e6b12f321f36dac8771e833b65cec962] nospec: Allow getting/setting on non-current task [7bbf1373e228840bb0295a2ca26d548ef37f448e] proc: Provide details on speculation flaw mitigations [fae1fa0fc6cca8beee3ab8ed71d54f9a78fa3f64] seccomp: Add filter flag to opt-out of SSB mitigation [00a02d0c502a06d15e07b857f8ff921e3e402675] seccomp: Enable speculation flaw mitigations [5c3070890d06ff82eecb808d02d2ca39169533ef] x86/speculation: Make "seccomp" the default mode for Speculative Store Bypass [f21b53b20c754021935ea43364dbf53778eeba32]
Kirill A. Shutemov (39): alpha: drop _PAGE_FILE and pte_file()-related helpers [b816157a5366550c5ee29a6431ba1abb88721266] arc: drop _PAGE_FILE and pte_file()-related helpers [18747151308f9e0fb63766057957617ec4afa190] arm64: drop PTE_FILE and pte_file()-related helpers [9b3e661e58b90b0c2d5c2168c23408f1e59e9e35] arm: drop L_PTE_FILE and pte_file()-related helpers [b007ea798f5c568d3f464d37288220ef570f062c] asm-generic: drop unused pte_file* helpers [5064c8e19dc215afae8ffae95570e7f22062d49c] avr32: drop _PAGE_FILE and pte_file()-related helpers [7a7d2db4b8b3505a3195178619ffcc80985c4be1] blackfin: drop pte_file() [2bc6ff14d46745a7728ed4ed90c5e0edca91f52e] c6x: drop pte_file() [f5b45de9b00eb53d11ada85c61e4ea1c31ab8218] cris: drop _PAGE_FILE and pte_file()-related helpers [103f3d9a26df944f4c29de190d72dfbf913c71af] frv: drop _PAGE_FILE and pte_file()-related helpers [ca5bfa7b390017f053d7581bc701518b87bc3d43] hexagon: drop _PAGE_FILE and pte_file()-related helpers [d99f95e6522db22192c331c75de182023a49fbcc] ia64: drop _PAGE_FILE and pte_file()-related helpers [636a002b704e0a36cefb5f4cf0293fab858fc46c] m32r: drop _PAGE_FILE and pte_file()-related helpers [406b16e26d0996516c8d1641008a7d326bf282d6] m68k: drop _PAGE_FILE and pte_file()-related helpers [1eeda0abf4425c91e7ce3ca32f1908c3a51bf84e] metag: drop _PAGE_FILE and pte_file()-related helpers [22f9bf3950f20d24198791685f2dccac2c4ef38a] microblaze: drop _PAGE_FILE and pte_file()-related helpers [937fa39fb22fea1c1d8ca9e5f31c452b91ac7239] mips: drop _PAGE_FILE and pte_file()-related helpers [b32da82e28ce90bff4e371fc15d2816fa3175bb0] mm: drop support of non-linear mapping from fault codepath [9b4bdd2ffab9557ac43af7dff02e7dab1c8c58bd] mm: drop support of non-linear mapping from unmap/zap codepath [8a5f14a23177061ec11daeaa3d09d0765d785c47] mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub [d83a08db5ba6072caa658745881f4baa9bad6a08] mm: fix regression in remap_file_pages() emulation [48f7df329474b49d83d0dffec1b6186647f11976] mm: remove rest usage of VM_NONLINEAR and pte_file() [0661a33611fca12570cba48d9344ce68834ee86c] mm: replace remap_file_pages() syscall with emulation [c8d78c1823f46519473949d33f0d1d33fe21ea16] mm: replace vma->sharead.linear with vma->shared [ac51b934f3912582d3c897c6c4d09b32ea57b2c7] mn10300: drop _PAGE_FILE and pte_file()-related helpers [6bf63a8ccb1dccd6ab81bc8bc46863493629cdb8] openrisc: drop _PAGE_FILE and pte_file()-related helpers [3824e3cf7e865b2ff0b71de23b16e332fe6a853a] parisc: drop _PAGE_FILE and pte_file()-related helpers [8d55da810f1fabcf1d4c0bbc46205e5f2c0fa84b] powerpc: drop _PAGE_FILE and pte_file()-related helpers [780fc5642f59b6c6e2b05794de60b2d2ad5f040e] proc: drop handling non-linear mappings [1da4b35b001481df99a6dcab12d5d39a876f7056] rmap: drop support of non-linear mappings [27ba0644ea9dfe6e7693abc85837b60e40583b96] s390: drop pte_file()-related helpers [6e76d4b20bf6b514408ab5bd07f4a76723259b64] score: drop _PAGE_FILE and pte_file()-related helpers [917e401ea75478d4f4575bc8b0ef3d14ecf9ef69] sh: drop _PAGE_FILE and pte_file()-related helpers [8b70beac99466b6d164de9fe647b3567e6f17e3a] sparc: drop pte_file()-related helpers [6a8c4820895cf1dd2a128aef67ce079ba6eded80] tile: drop pte_file()-related helpers [eb12f4872a3845a8803f689646dea5b92a30aff7] um: drop _PAGE_FILE and pte_file()-related helpers [3513006a5691ae3629eef9ddef0b71a47c40dfbc] unicore32: drop pte_file()-related helpers [40171798fe11a6dc1d963058b097b2c4c9d34a9c] x86: drop _PAGE_FILE and pte_file()-related helpers 
[0a191362058391878cc2a4d4ccddcd8223eb4f79] xtensa: drop _PAGE_FILE and pte_file()-related helpers [d9ecee281b8f89da6d3203be62802eda991e37cc]
Konrad Rzeszutek Wilk (17): KVM/VMX: Expose SSBD properly to guests [0aa48468d00959c8a37cd3ac727284f4f7359151] proc: Use underscores for SSBD in 'status' [e96f46ee8587607a828f783daa6eb5b44d25004d] x86/KVM/VMX: Expose SPEC_CTRL Bit(2) to the guest [da39556f66f5cfe8f9c989206974f1cb16ca5d7c] x86/bugs, KVM: Support the combination of guest and host IBRS [5cf687548705412da47c9cec342fd952d71ed3d5] x86/bugs/AMD: Add support to disable RDS on Fam[15,16,17]h if requested [764f3c21588a059cd783c6ba0734d4db2d72822d] x86/bugs/intel: Set proper CPU features and setup RDS [772439717dbf703b39990be58d8d4e3e4ad0598a] x86/bugs: Concentrate bug detection into a separate function [4a28bfe3267b68e22c663ac26185aa16c9b879ef] x86/bugs: Concentrate bug reporting into a separate function [d1059518b4789cabe34bb4b714d07e6089c82ca1] x86/bugs: Expose /sys/../spec_store_bypass [c456442cd3a59eeb1d60293c26cbe2ff2c4e42cf] x86/bugs: Fix the parameters alignment and missing void [ffed645e3be0e32f8e9ab068d257aee8d0fe8eec] x86/bugs: Move the l1tf function and define pr_fmt properly [56563f53d3066afa9e63d6c997bf67e76a8b05c0] x86/bugs: Provide boot parameters for the spec_store_bypass_disable mitigation [24f7fc83b9204d20f878c57cb77d261ae825e033] x86/bugs: Read SPEC_CTRL MSR during boot and re-use reserved bits [1b86883ccb8d5d9506529d42dbe1a5257cb30b18] x86/bugs: Rename SSBD_NO to SSB_NO [240da953fcc6a9008c92fae5b1f727ee5ed167ab] x86/bugs: Rename _RDS to _SSBD [9f65fb29374ee37856dbad847b4e121aab72b510] x86/bugs: Whitelist allowed SPEC_CTRL MSR values [1115a859f33276fe8afb31c60cf9d8e657872558] x86/cpufeatures: Add X86_FEATURE_RDS [0cc5fa00b0a88dad140b4e5c2cead9951ad36822]
Linus Torvalds (3): x86/nospec: Simplify alternative_msr_write() [1aa7a5735a41418d8e01fa7c9565eb2657e2ea3f] x86/speculation/l1tf: Change order of offset/type in swap entry [bcd11afa7adad8d720e7ba5ef58bdcd9775cf45f] x86/speculation/l1tf: Protect swap entries against L1TF [2f22b4cd45b67b3496f4aa4c7180a1271c6452f6]
Markus Trippelsdorf (1): x86/tools: Fix gcc-7 warning in relocs.c [7ebb916782949621ff6819acf373a06902df7679]
Michal Hocko (1): x86/speculation/l1tf: Fix up pte->pfn conversion for PAE [e14d7dfb41f5807a0c1c26a13f2b8ef16af24935]
Naoya Horiguchi (3): mm/pagewalk: remove pgd_entry() and pud_entry() [0b1fbfe50006c41014cc25660c0e735d21c34939] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1 [eee4818baac0f2b37848fdf90e4b16430dc536ac] pagewalk: improve vma handling [fafaa4264eba49fd10695c193a82760558d093f4]
Sean Christopherson (1): x86/speculation/l1tf: Exempt zeroed PTEs from inversion [f19f5c49bbc3ffcc9126cc245fc1b24cc29f4a37]
Thomas Gleixner (19): KVM: SVM: Move spec control call after restore of GS [15e6c22fd8e5a42c5ed6d487b7c9fe44c2517765] KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled [024d83cadc6b2af027e473720f3c3da97496c318] prctl: Add force disable speculation [356e4bfff2c5489e016fdb925adbf12a1e3950ee] prctl: Add speculation control prctls [b617cfc858161140d69cc0b5cc211996b557a1c7] seccomp: Move speculation migitation control to arch code [8bf37d8c067bb7eb8e7c381bdadf9bd89182b6bc] seccomp: Use PR_SPEC_FORCE_DISABLE [b849a812f7eb92e96d1c8239b06581b2cfd8b275] x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL [ccbcd2674472a978b48c91c1fbfb66c0ff959f24] x86/bugs: Expose x86_spec_ctrl_base directly [fa8ac4988249c38476f6ad678a4848a736373403] x86/bugs: Remove x86_spec_ctrl_set() [4b59bdb569453a60b752b274ca61f009e37f4dae] x86/bugs: Rework spec_ctrl base and mask logic [be6fcb5478e95bb1c91f489121238deb3abca46a] x86/cpufeatures: Add FEATURE_ZEN [d1035d971829dcf80e8686ccde26f94b0a069472] x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS [7eb8956a7fec3c1f0abc2a5517dada99ccc8a961] x86/cpufeatures: Disentangle SSBD enumeration [52817587e706686fcdb27f14c1b000c92f266c96] x86/process: Allow runtime control of Speculative Store Bypass [885f82bfbc6fefb6664ea27965c3ab9ac4194b8c] x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG [47c61b3955cf712cadfc25635bf9bc174af030ea] x86/speculation: Add prctl for Speculative Store Bypass mitigation [a73ec77ee17ec556fe7f165d00314cb7c047b1ac] x86/speculation: Create spec-ctrl.h to avoid include hell [28a2775217b17208811fa43a9e96bd1fdf417b86] x86/speculation: Handle HT correctly on AMD [1f50ddb4f4189243c05926b842dc1a0332195f31] x86/speculation: Rework speculative_store_bypass_update() [0270be3e34efb05a88bc4c422572ece038ef3608]
Tom Lendacky (2): KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD [bc226f07dcd3c9ef0b7f6236fe356ea4a9cb4769] x86/speculation: Add virtualized speculative store bypass disable support [11fb0683493b2da112cd64c9dada221b52463bf7]
Tyler Hicks (2): irda: Fix memory leak caused by repeated binds of irda socket [not upstream; irda has been removed] irda: Only insert new objects into the global database via setsockopt [not upstream; irda has been removed]
Vincent Pelletier (1): scsi: target: iscsi: Use hex2bin instead of a re-implementation [1816494330a83f2a064499d8ed2797045641f92c]
Vlastimil Babka (6): x86/init: fix build with CONFIG_SWAP=n [792adb90fa724ce07c0171cbc96b9215af4b1045] x86/speculation/l1tf: Extend 64bit swap file size limit [1a7ed1ba4bba6c075d5ad61bb75e3fbc870840d6] x86/speculation/l1tf: Fix off-by-one error when warning that system has too much RAM [b0a182f875689647b014bc01d36b340217792852] x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit [9df9516940a61d29aedf4d91b483ca6597e7d480] x86/speculation/l1tf: Protect PAE swap entries against L1TF [0d0f6249058834ffe1ceaad0bb31464af66f6e7a] x86/speculation/l1tf: Suggest what to do on systems with too much RAM [6a012288d6906fee1dbc244050ade1dafe4a9c8d]
Documentation/ABI/testing/sysfs-devices-system-cpu | 1 + Documentation/cachetlb.txt | 8 +- Documentation/kernel-parameters.txt | 45 +++ Documentation/spec_ctrl.rst | 94 +++++ Documentation/vm/remap_file_pages.txt | 7 +- Makefile | 4 +- arch/alpha/include/asm/pgtable.h | 7 - arch/arc/include/asm/pgtable.h | 13 +- arch/arm/include/asm/pgtable-2level.h | 1 - arch/arm/include/asm/pgtable-3level.h | 1 - arch/arm/include/asm/pgtable-nommu.h | 2 - arch/arm/include/asm/pgtable.h | 20 +- arch/arm/mm/proc-macros.S | 2 +- arch/arm64/include/asm/pgtable.h | 22 +- arch/avr32/include/asm/pgtable.h | 25 -- arch/blackfin/include/asm/pgtable.h | 5 - arch/c6x/include/asm/pgtable.h | 5 - arch/cris/include/arch-v10/arch/mmu.h | 3 - arch/cris/include/arch-v32/arch/mmu.h | 3 - arch/cris/include/asm/pgtable.h | 4 - arch/frv/include/asm/pgtable.h | 27 +- arch/hexagon/include/asm/pgtable.h | 60 +-- arch/ia64/include/asm/pgtable.h | 25 +- arch/m32r/include/asm/pgtable-2level.h | 4 - arch/m32r/include/asm/pgtable.h | 11 - arch/m68k/include/asm/mcf_pgtable.h | 23 +- arch/m68k/include/asm/motorola_pgtable.h | 15 - arch/m68k/include/asm/pgtable_no.h | 2 - arch/m68k/include/asm/sun3_pgtable.h | 15 - arch/metag/include/asm/pgtable.h | 6 - arch/microblaze/include/asm/pgtable.h | 11 - arch/mips/include/asm/pgtable-32.h | 39 -- arch/mips/include/asm/pgtable-64.h | 9 - arch/mips/include/asm/pgtable-bits.h | 10 - arch/mips/include/asm/pgtable.h | 2 - arch/mn10300/include/asm/pgtable.h | 17 +- arch/openrisc/include/asm/pgtable.h | 8 - arch/openrisc/kernel/head.S | 5 - arch/parisc/include/asm/pgtable.h | 10 - arch/powerpc/include/asm/pgtable-ppc32.h | 9 +- arch/powerpc/include/asm/pgtable-ppc64.h | 5 +- arch/powerpc/include/asm/pgtable.h | 1 - arch/powerpc/include/asm/pte-40x.h | 1 - arch/powerpc/include/asm/pte-44x.h | 5 - arch/powerpc/include/asm/pte-8xx.h | 1 - arch/powerpc/include/asm/pte-book3e.h | 1 - arch/powerpc/include/asm/pte-fsl-booke.h | 3 - arch/powerpc/include/asm/pte-hash32.h | 1 - arch/powerpc/include/asm/pte-hash64.h | 1 - arch/powerpc/mm/pgtable_64.c | 2 +- arch/s390/include/asm/pgtable.h | 29 +- arch/score/include/asm/pgtable-bits.h | 1 - arch/score/include/asm/pgtable.h | 18 +- arch/sh/include/asm/pgtable_32.h | 30 +- arch/sh/include/asm/pgtable_64.h | 9 +- arch/sparc/include/asm/pgtable_32.h | 24 -- arch/sparc/include/asm/pgtable_64.h | 40 -- arch/sparc/include/asm/pgtsrmmu.h | 14 +- arch/tile/include/asm/pgtable.h | 11 - arch/tile/mm/homecache.c | 4 - arch/um/include/asm/pgtable-2level.h | 9 - arch/um/include/asm/pgtable-3level.h | 20 - arch/um/include/asm/pgtable.h | 9 - arch/unicore32/include/asm/pgtable-hwdef.h | 1 - arch/unicore32/include/asm/pgtable.h | 14 - arch/x86/include/asm/cpufeature.h | 21 +- arch/x86/include/asm/io.h | 6 + arch/x86/include/asm/kvm_host.h | 1 + arch/x86/include/asm/nospec-branch.h | 43 +- arch/x86/include/asm/page_32_types.h | 9 +- arch/x86/include/asm/pgtable-2level.h | 55 +-- arch/x86/include/asm/pgtable-3level.h | 49 ++- arch/x86/include/asm/pgtable-invert.h | 41 ++ arch/x86/include/asm/pgtable.h | 144 +++++-- arch/x86/include/asm/pgtable_64.h | 60 ++- arch/x86/include/asm/pgtable_types.h | 13 +- arch/x86/include/asm/processor.h | 5 + arch/x86/include/asm/spec-ctrl.h | 80 ++++ arch/x86/include/asm/thread_info.h | 10 +- arch/x86/include/uapi/asm/msr-index.h | 9 + arch/x86/kernel/cpu/amd.c | 22 + arch/x86/kernel/cpu/bugs.c | 441 ++++++++++++++++++++- arch/x86/kernel/cpu/common.c | 97 ++++- arch/x86/kernel/cpu/cpu.h | 3 + arch/x86/kernel/cpu/intel.c | 3 + 
arch/x86/kernel/process.c | 146 +++++++ arch/x86/kernel/setup.c | 6 + arch/x86/kernel/smpboot.c | 5 + arch/x86/kvm/cpuid.c | 21 +- arch/x86/kvm/cpuid.h | 16 +- arch/x86/kvm/svm.c | 72 +++- arch/x86/kvm/vmx.c | 27 +- arch/x86/kvm/x86.c | 7 +- arch/x86/mm/init.c | 24 ++ arch/x86/mm/kmmio.c | 25 +- arch/x86/mm/mmap.c | 21 + arch/x86/mm/pageattr.c | 6 +- arch/x86/mm/pat.c | 14 + arch/x86/tools/relocs.c | 5 +- arch/x86/xen/smp.c | 5 + arch/xtensa/include/asm/pgtable.h | 10 - drivers/base/cpu.c | 16 + drivers/block/floppy.c | 3 + drivers/gpu/drm/ast/ast_ttm.c | 6 + drivers/gpu/drm/cirrus/cirrus_ttm.c | 7 + drivers/gpu/drm/drm_vma_manager.c | 3 +- drivers/gpu/drm/mgag200/mgag200_ttm.c | 7 + drivers/gpu/drm/nouveau/nouveau_ttm.c | 8 + drivers/gpu/drm/radeon/radeon_object.c | 5 + drivers/hid/hid-debug.c | 8 +- drivers/macintosh/via-cuda.c | 16 +- drivers/target/iscsi/iscsi_target_auth.c | 30 +- fs/9p/vfs_file.c | 2 - fs/btrfs/file.c | 1 - fs/ceph/addr.c | 1 - fs/cifs/file.c | 1 - fs/exec.c | 11 +- fs/ext4/file.c | 1 - fs/f2fs/file.c | 1 - fs/fuse/file.c | 1 - fs/gfs2/file.c | 1 - fs/inode.c | 1 - fs/nfs/file.c | 1 - fs/nilfs2/file.c | 1 - fs/ocfs2/mmap.c | 1 - fs/proc/array.c | 26 ++ fs/proc/task_mmu.c | 15 - fs/ubifs/file.c | 1 - fs/xfs/xfs_file.c | 1 - include/asm-generic/pgtable.h | 27 +- include/linux/cpu.h | 4 + include/linux/fs.h | 6 +- include/linux/io.h | 22 + include/linux/mm.h | 50 +-- include/linux/mm_types.h | 12 +- include/linux/nospec.h | 10 + include/linux/rmap.h | 2 - include/linux/sched.h | 10 +- include/linux/seccomp.h | 2 + include/linux/swapfile.h | 2 + include/linux/swapops.h | 4 +- include/linux/vm_event_item.h | 2 - include/uapi/linux/prctl.h | 12 + include/uapi/linux/seccomp.h | 3 + kernel/fork.c | 8 +- kernel/seccomp.c | 16 +- kernel/sys.c | 23 ++ mm/Makefile | 2 +- mm/filemap.c | 1 - mm/filemap_xip.c | 1 - mm/fremap.c | 283 ------------- mm/gup.c | 2 +- mm/interval_tree.c | 34 +- mm/ksm.c | 2 +- mm/madvise.c | 13 +- mm/memcontrol.c | 7 +- mm/memory.c | 285 ++++++------- mm/migrate.c | 32 -- mm/mincore.c | 9 +- mm/mmap.c | 117 +++++- mm/mprotect.c | 51 ++- mm/mremap.c | 2 - mm/msync.c | 5 +- mm/nommu.c | 8 - mm/pagewalk.c | 213 +++++----- mm/rmap.c | 222 +---------- mm/shmem.c | 1 - mm/swap.c | 4 +- mm/swapfile.c | 46 ++- net/irda/af_irda.c | 13 +- 170 files changed, 2215 insertions(+), 1885 deletions(-)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner <tglx@linutronix.de>
commit 1f50ddb4f4189243c05926b842dc1a0332195f31 upstream.
The AMD64_LS_CFG MSR is a per core MSR on Family 17H CPUs. That means when hyperthreading is enabled the SSBD bit toggle needs to take both cores into account. Otherwise the following situation can happen:
  CPU0                 CPU1

  disable SSB
                       disable SSB
                       enable SSB  <- Enables it for the Core, i.e. for CPU0 as well
So after the SSB enable on CPU1 the task on CPU0 runs with SSB enabled again.
On Intel the SSBD control is per core as well, but the synchronization logic is implemented behind the per thread SPEC_CTRL MSR. It works like this:
CORE_SPEC_CTRL = THREAD0_SPEC_CTRL | THREAD1_SPEC_CTRL
i.e. if one of the threads enables a mitigation then this affects both and the mitigation is only disabled in the core when both threads disabled it.
Add the necessary synchronization logic for AMD family 17H. Unfortunately that requires a spinlock to serialize the access to the MSR, but the locks are only shared between siblings.
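For readers following the logic, a toy userspace model of the per-core accounting (purely illustrative names, not the kernel code; the real implementation is in the diff below) behaves as described above:

#include <stdio.h>

/*
 * Toy model: the core-wide SSBD setting stays on as long as at least
 * one sibling still requests it, mirroring CORE = THREAD0 | THREAD1.
 */
struct core_state {
        int disable_count;      /* siblings currently requesting SSBD */
        int core_ssbd;          /* what the shared LS_CFG bit would be */
};

static void sibling_request_ssbd(struct core_state *core, int want)
{
        if (want) {
                if (!core->disable_count++)
                        core->core_ssbd = 1;    /* first sibling turns it on */
        } else {
                if (!--core->disable_count)
                        core->core_ssbd = 0;    /* last sibling turns it off */
        }
}

int main(void)
{
        struct core_state core = { 0, 0 };

        sibling_request_ssbd(&core, 1);         /* CPU0 disables SSB */
        sibling_request_ssbd(&core, 1);         /* CPU1 disables SSB */
        sibling_request_ssbd(&core, 0);         /* CPU1 enables SSB again */
        printf("core SSBD still set: %d\n", core.core_ssbd);   /* prints 1 */
        return 0;
}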
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16:
 - s/topology_sibling_cpumask/topology_thread_cpumask/
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 arch/x86/include/asm/spec-ctrl.h |   6 ++
 arch/x86/kernel/process.c        | 125 +++++++++++++++++++++++++++++--
 arch/x86/kernel/smpboot.c        |   5 ++
 3 files changed, 130 insertions(+), 6 deletions(-)
--- a/arch/x86/include/asm/spec-ctrl.h +++ b/arch/x86/include/asm/spec-ctrl.h @@ -33,6 +33,12 @@ static inline u64 ssbd_tif_to_amd_ls_cfg return (tifn & _TIF_SSBD) ? x86_amd_ls_cfg_ssbd_mask : 0ULL; }
+#ifdef CONFIG_SMP +extern void speculative_store_bypass_ht_init(void); +#else +static inline void speculative_store_bypass_ht_init(void) { } +#endif + extern void speculative_store_bypass_update(void);
#endif --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -217,22 +217,135 @@ static inline void switch_to_bitmap(stru } }
-static __always_inline void __speculative_store_bypass_update(unsigned long tifn) +#ifdef CONFIG_SMP + +struct ssb_state { + struct ssb_state *shared_state; + raw_spinlock_t lock; + unsigned int disable_state; + unsigned long local_state; +}; + +#define LSTATE_SSB 0 + +static DEFINE_PER_CPU(struct ssb_state, ssb_state); + +void speculative_store_bypass_ht_init(void) +{ + struct ssb_state *st = this_cpu_ptr(&ssb_state); + unsigned int this_cpu = smp_processor_id(); + unsigned int cpu; + + st->local_state = 0; + + /* + * Shared state setup happens once on the first bringup + * of the CPU. It's not destroyed on CPU hotunplug. + */ + if (st->shared_state) + return; + + raw_spin_lock_init(&st->lock); + + /* + * Go over HT siblings and check whether one of them has set up the + * shared state pointer already. + */ + for_each_cpu(cpu, topology_thread_cpumask(this_cpu)) { + if (cpu == this_cpu) + continue; + + if (!per_cpu(ssb_state, cpu).shared_state) + continue; + + /* Link it to the state of the sibling: */ + st->shared_state = per_cpu(ssb_state, cpu).shared_state; + return; + } + + /* + * First HT sibling to come up on the core. Link shared state of + * the first HT sibling to itself. The siblings on the same core + * which come up later will see the shared state pointer and link + * themself to the state of this CPU. + */ + st->shared_state = st; +} + +/* + * Logic is: First HT sibling enables SSBD for both siblings in the core + * and last sibling to disable it, disables it for the whole core. This how + * MSR_SPEC_CTRL works in "hardware": + * + * CORE_SPEC_CTRL = THREAD0_SPEC_CTRL | THREAD1_SPEC_CTRL + */ +static __always_inline void amd_set_core_ssb_state(unsigned long tifn) { - u64 msr; + struct ssb_state *st = this_cpu_ptr(&ssb_state); + u64 msr = x86_amd_ls_cfg_base;
- if (static_cpu_has(X86_FEATURE_LS_CFG_SSBD)) { - msr = x86_amd_ls_cfg_base | ssbd_tif_to_amd_ls_cfg(tifn); + if (!static_cpu_has(X86_FEATURE_ZEN)) { + msr |= ssbd_tif_to_amd_ls_cfg(tifn); wrmsrl(MSR_AMD64_LS_CFG, msr); + return; + } + + if (tifn & _TIF_SSBD) { + /* + * Since this can race with prctl(), block reentry on the + * same CPU. + */ + if (__test_and_set_bit(LSTATE_SSB, &st->local_state)) + return; + + msr |= x86_amd_ls_cfg_ssbd_mask; + + raw_spin_lock(&st->shared_state->lock); + /* First sibling enables SSBD: */ + if (!st->shared_state->disable_state) + wrmsrl(MSR_AMD64_LS_CFG, msr); + st->shared_state->disable_state++; + raw_spin_unlock(&st->shared_state->lock); } else { - msr = x86_spec_ctrl_base | ssbd_tif_to_spec_ctrl(tifn); - wrmsrl(MSR_IA32_SPEC_CTRL, msr); + if (!__test_and_clear_bit(LSTATE_SSB, &st->local_state)) + return; + + raw_spin_lock(&st->shared_state->lock); + st->shared_state->disable_state--; + if (!st->shared_state->disable_state) + wrmsrl(MSR_AMD64_LS_CFG, msr); + raw_spin_unlock(&st->shared_state->lock); } } +#else +static __always_inline void amd_set_core_ssb_state(unsigned long tifn) +{ + u64 msr = x86_amd_ls_cfg_base | ssbd_tif_to_amd_ls_cfg(tifn); + + wrmsrl(MSR_AMD64_LS_CFG, msr); +} +#endif + +static __always_inline void intel_set_ssb_state(unsigned long tifn) +{ + u64 msr = x86_spec_ctrl_base | ssbd_tif_to_spec_ctrl(tifn); + + wrmsrl(MSR_IA32_SPEC_CTRL, msr); +} + +static __always_inline void __speculative_store_bypass_update(unsigned long tifn) +{ + if (static_cpu_has(X86_FEATURE_LS_CFG_SSBD)) + amd_set_core_ssb_state(tifn); + else + intel_set_ssb_state(tifn); +}
void speculative_store_bypass_update(void) { + preempt_disable(); __speculative_store_bypass_update(current_thread_info()->flags); + preempt_enable(); }
void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p, --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -77,6 +77,7 @@ #include <asm/i8259.h> #include <asm/realmode.h> #include <asm/misc.h> +#include <asm/spec-ctrl.h>
/* State of each CPU */ DEFINE_PER_CPU(int, cpu_state) = { 0 }; @@ -243,6 +244,8 @@ static void notrace start_secondary(void */ check_tsc_sync_target();
+ speculative_store_bypass_ht_init(); + /* * Enable the espfix hack for this CPU */ @@ -1162,6 +1165,8 @@ void __init native_smp_prepare_cpus(unsi set_mtrr_aps_delayed_init(); out: preempt_enable(); + + speculative_store_bypass_ht_init(); }
void arch_enable_nonboot_cpus_begin(void)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit d99f95e6522db22192c331c75de182023a49fbcc upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
This patch also increases the number of bits available for the swap offset.
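For illustration, the new bit layout round-trips like this; a small userspace sketch that reuses the shifts and masks from the hunk below (the helper names are local to the sketch, not the kernel macros):

#include <assert.h>
#include <stdio.h>

/*
 * New hexagon swap PTE layout: bit 0 present (zero), bits 1-5 type,
 * bits 6-9 offset[3:0], bits 10-12 zero, bits 13-31 offset[22:4].
 */
static unsigned long swp_entry(unsigned long type, unsigned long offset)
{
        return (type << 1) | ((offset & 0x7ffff0) << 9) | ((offset & 0xf) << 6);
}

static unsigned long swp_type(unsigned long val)
{
        return (val >> 1) & 0x1f;
}

static unsigned long swp_offset(unsigned long val)
{
        return ((val >> 6) & 0xf) | ((val >> 9) & 0x7ffff0);
}

int main(void)
{
        unsigned long val = swp_entry(9, 0x12345);

        assert(swp_type(val) == 9);             /* type survives the round trip */
        assert(swp_offset(val) == 0x12345);     /* so does the 23-bit offset */
        printf("entry=%#lx type=%lu offset=%#lx\n",
               val, swp_type(val), swp_offset(val));
        return 0;
}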
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 arch/hexagon/include/asm/pgtable.h | 60 ++++++++----------------------
 1 file changed, 16 insertions(+), 44 deletions(-)
--- a/arch/hexagon/include/asm/pgtable.h +++ b/arch/hexagon/include/asm/pgtable.h @@ -62,13 +62,6 @@ extern unsigned long zero_page_mask; #define _PAGE_ACCESSED (1<<2)
/* - * _PAGE_FILE is only meaningful if _PAGE_PRESENT is false, while - * _PAGE_DIRTY is only meaningful if _PAGE_PRESENT is true. - * So we can overload the bit... - */ -#define _PAGE_FILE _PAGE_DIRTY /* set: pagecache, unset = swap */ - -/* * For now, let's say that Valid and Present are the same thing. * Alternatively, we could say that it's the "or" of R, W, and X * permissions. @@ -456,57 +449,36 @@ static inline int pte_exec(pte_t pte) #define pgtable_cache_init() do { } while (0)
/* - * Swap/file PTE definitions. If _PAGE_PRESENT is zero, the rest of the - * PTE is interpreted as swap information. Depending on the _PAGE_FILE - * bit, the remaining free bits are eitehr interpreted as a file offset - * or a swap type/offset tuple. Rather than have the TLB fill handler - * test _PAGE_PRESENT, we're going to reserve the permissions bits - * and set them to all zeros for swap entries, which speeds up the - * miss handler at the cost of 3 bits of offset. That trade-off can - * be revisited if necessary, but Hexagon processor architecture and - * target applications suggest a lot of TLB misses and not much swap space. + * Swap/file PTE definitions. If _PAGE_PRESENT is zero, the rest of the PTE is + * interpreted as swap information. The remaining free bits are interpreted as + * swap type/offset tuple. Rather than have the TLB fill handler test + * _PAGE_PRESENT, we're going to reserve the permissions bits and set them to + * all zeros for swap entries, which speeds up the miss handler at the cost of + * 3 bits of offset. That trade-off can be revisited if necessary, but Hexagon + * processor architecture and target applications suggest a lot of TLB misses + * and not much swap space. * * Format of swap PTE: * bit 0: Present (zero) - * bit 1: _PAGE_FILE (zero) - * bits 2-6: swap type (arch independent layer uses 5 bits max) - * bits 7-9: bits 2:0 of offset - * bits 10-12: effectively _PAGE_PROTNONE (all zero) - * bits 13-31: bits 21:3 of swap offset - * - * Format of file PTE: - * bit 0: Present (zero) - * bit 1: _PAGE_FILE (zero) - * bits 2-9: bits 7:0 of offset - * bits 10-12: effectively _PAGE_PROTNONE (all zero) - * bits 13-31: bits 26:8 of swap offset + * bits 1-5: swap type (arch independent layer uses 5 bits max) + * bits 6-9: bits 3:0 of offset + * bits 10-12: effectively _PAGE_PROTNONE (all zero) + * bits 13-31: bits 22:4 of swap offset * * The split offset makes some of the following macros a little gnarly, * but there's plenty of precedent for this sort of thing. */ -#define PTE_FILE_MAX_BITS 27
/* Used for swap PTEs */ -#define __swp_type(swp_pte) (((swp_pte).val >> 2) & 0x1f) +#define __swp_type(swp_pte) (((swp_pte).val >> 1) & 0x1f)
#define __swp_offset(swp_pte) \ - ((((swp_pte).val >> 7) & 0x7) | (((swp_pte).val >> 10) & 0x003ffff8)) + ((((swp_pte).val >> 6) & 0xf) | (((swp_pte).val >> 9) & 0x7ffff0))
#define __swp_entry(type, offset) \ ((swp_entry_t) { \ - ((type << 2) | \ - ((offset & 0x3ffff8) << 10) | ((offset & 0x7) << 7)) }) - -/* Used for file PTEs */ -#define pte_file(pte) \ - ((pte_val(pte) & (_PAGE_FILE | _PAGE_PRESENT)) == _PAGE_FILE) - -#define pte_to_pgoff(pte) \ - (((pte_val(pte) >> 2) & 0xff) | ((pte_val(pte) >> 5) & 0x07ffff00)) - -#define pgoff_to_pte(off) \ - ((pte_t) { ((((off) & 0x7ffff00) << 5) | (((off) & 0xff) << 2)\ - | _PAGE_FILE) }) + ((type << 1) | \ + ((offset & 0x7ffff0) << 9) | ((offset & 0xf) << 6)) })
/* Oh boy. There are a lot of possible arch overrides found in this file. */ #include <asm-generic/pgtable.h>
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Sean Christopherson <sean.j.christopherson@intel.com>
commit f19f5c49bbc3ffcc9126cc245fc1b24cc29f4a37 upstream.
It turns out that we should *not* invert all not-present mappings, because the all zeroes case is obviously special.
clear_page() does not undergo the XOR logic to invert the address bits, i.e. PTE, PMD and PUD entries that have not been individually written will have val=0 and so will trigger __pte_needs_invert(). As a result, {pte,pmd,pud}_pfn() will return the wrong PFN value, i.e. all ones (adjusted by the max PFN mask) instead of zero. A zeroed entry is ok because the page at physical address 0 is reserved early in boot specifically to mitigate L1TF, so explicitly exempt them from the inversion when reading the PFN.
Manifested as an unexpected mprotect(..., PROT_NONE) failure when called on a VMA that has VM_PFNMAP and was mmap'd as something other than PROT_NONE but never used. mprotect() sends the PROT_NONE request down prot_none_walk(), which walks the PTEs to check the PFNs. prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns -EACCES because it thinks mprotect() is trying to adjust a high MMIO address.
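To see the effect on a cleared entry, here is a minimal userspace sketch of the pte_pfn() behaviour described above (the PTE_PFN_MASK value is illustrative, not the kernel's exact macro):

#include <stdio.h>
#include <stdint.h>

#define _PAGE_PRESENT   0x001ULL
#define PAGE_SHIFT      12
#define PTE_PFN_MASK    0x000ffffffffff000ULL   /* illustrative PFN field */

/* Before this patch: every !PRESENT value is inverted, including 0. */
static int pte_needs_invert_old(uint64_t val)
{
        return !(val & _PAGE_PRESENT);
}

/* After this patch: a cleared entry is exempt from the inversion. */
static int pte_needs_invert_new(uint64_t val)
{
        return val && !(val & _PAGE_PRESENT);
}

static uint64_t pte_pfn(uint64_t val, int (*needs_invert)(uint64_t))
{
        if (needs_invert(val))
                val ^= ~0ULL;                   /* undo the PROT_NONE inversion */
        return (val & PTE_PFN_MASK) >> PAGE_SHIFT;
}

int main(void)
{
        /* Old: a zeroed PTE yields an all-ones PFN; new: it stays zero. */
        printf("cleared PTE, old: pfn=%#llx\n",
               (unsigned long long)pte_pfn(0, pte_needs_invert_old));
        printf("cleared PTE, new: pfn=%#llx\n",
               (unsigned long long)pte_pfn(0, pte_needs_invert_new));
        return 0;
}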
[ This is a very modified version of Sean's original patch, but all credit goes to Sean for doing this and also pointing out that sometimes the __pte_needs_invert() function only gets the protection bits, not the full eventual pte. But zero remains special even in just protection bits, so that's ok. - Linus ]
Fixes: f22cc87f6c1f ("x86/speculation/l1tf: Invert all not present mappings")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 arch/x86/include/asm/pgtable-invert.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
--- a/arch/x86/include/asm/pgtable-invert.h +++ b/arch/x86/include/asm/pgtable-invert.h @@ -4,9 +4,18 @@
#ifndef __ASSEMBLY__
+/* + * A clear pte value is special, and doesn't get inverted. + * + * Note that even users that only pass a pgprot_t (rather + * than a full pte) won't trigger the special zero case, + * because even PAGE_NONE has _PAGE_PROTNONE | _PAGE_ACCESSED + * set. So the all zero case really is limited to just the + * cleared page table entry case. + */ static inline bool __pte_needs_invert(u64 val) { - return !(val & _PAGE_PRESENT); + return val && !(val & _PAGE_PRESENT); }
/* Get a mask to xor with the page table entry to get the correct pfn. */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner <tglx@linutronix.de>
commit a73ec77ee17ec556fe7f165d00314cb7c047b1ac upstream.
Add prctl based control for Speculative Store Bypass mitigation and make it the default mitigation for Intel and AMD.
Andi Kleen provided the following rationale (slightly redacted):
There are multiple levels of impact of Speculative Store Bypass:
1) JITed sandbox. It cannot invoke system calls, but can do PRIME+PROBE and may have call interfaces to other code
2) Native code process. No protection inside the process at this level.
3) Kernel.
4) Between processes.
The prctl tries to protect against attacks from case (1).
If the untrusted code can do random system calls then control is already lost in a much worse way. So there needs to be system call protection in some way (using a JIT not allowing them or seccomp). Or rather if the process can subvert its environment somehow to do the prctl it can already execute arbitrary code, which is much worse than SSB.
To put it differently, the point of the prctl is to not allow JITed code to read data it shouldn't read from its JITed sandbox. If it already has escaped its sandbox then it can already read everything it wants in its address space, and do much worse.
The ability to control Speculative Store Bypass allows the protection to be enabled selectively without affecting overall system performance.
Based on an initial patch from Tim Chen. Completely rewritten.
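For reference, a task opts in from userspace roughly like this; a minimal sketch assuming the installed headers already carry the PR_SPEC_* constants added earlier in this series:

#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
        /* Bitmask of PR_SPEC_PRCTL / PR_SPEC_ENABLE / PR_SPEC_DISABLE. */
        int state = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);

        printf("ssb state before: %d\n", state);

        /* Ask for the mitigation; fails with ENXIO unless the mode is "prctl". */
        if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_DISABLE, 0, 0))
                perror("PR_SET_SPECULATION_CTRL");

        printf("ssb state after: %d\n",
               prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0));
        return 0;
}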
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[bwh: Backported to 3.16: adjust filename]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 Documentation/kernel-parameters.txt  |  6 +-
 arch/x86/include/asm/nospec-branch.h |  1 +
 arch/x86/kernel/cpu/bugs.c           | 83 ++++++++++++++++++++++++----
 3 files changed, 79 insertions(+), 11 deletions(-)
--- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -3222,7 +3222,11 @@ bytes respectively. Such letter suffixes off - Unconditionally enable Speculative Store Bypass auto - Kernel detects whether the CPU model contains an implementation of Speculative Store Bypass and - picks the most appropriate mitigation + picks the most appropriate mitigation. + prctl - Control Speculative Store Bypass per thread + via prctl. Speculative Store Bypass is enabled + for a process by default. The state of the control + is inherited on fork.
Not specifying this option is equivalent to spec_store_bypass_disable=auto. --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -187,6 +187,7 @@ extern u64 x86_spec_ctrl_get_default(voi enum ssb_mitigation { SPEC_STORE_BYPASS_NONE, SPEC_STORE_BYPASS_DISABLE, + SPEC_STORE_BYPASS_PRCTL, };
extern char __indirect_thunk_start[]; --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -11,6 +11,8 @@ #include <linux/utsname.h> #include <linux/cpu.h> #include <linux/module.h> +#include <linux/nospec.h> +#include <linux/prctl.h>
#include <asm/spec-ctrl.h> #include <asm/cmdline.h> @@ -450,20 +452,23 @@ enum ssb_mitigation_cmd { SPEC_STORE_BYPASS_CMD_NONE, SPEC_STORE_BYPASS_CMD_AUTO, SPEC_STORE_BYPASS_CMD_ON, + SPEC_STORE_BYPASS_CMD_PRCTL, };
static const char *ssb_strings[] = { [SPEC_STORE_BYPASS_NONE] = "Vulnerable", - [SPEC_STORE_BYPASS_DISABLE] = "Mitigation: Speculative Store Bypass disabled" + [SPEC_STORE_BYPASS_DISABLE] = "Mitigation: Speculative Store Bypass disabled", + [SPEC_STORE_BYPASS_PRCTL] = "Mitigation: Speculative Store Bypass disabled via prctl" };
static const struct { const char *option; enum ssb_mitigation_cmd cmd; } ssb_mitigation_options[] = { - { "auto", SPEC_STORE_BYPASS_CMD_AUTO }, /* Platform decides */ - { "on", SPEC_STORE_BYPASS_CMD_ON }, /* Disable Speculative Store Bypass */ - { "off", SPEC_STORE_BYPASS_CMD_NONE }, /* Don't touch Speculative Store Bypass */ + { "auto", SPEC_STORE_BYPASS_CMD_AUTO }, /* Platform decides */ + { "on", SPEC_STORE_BYPASS_CMD_ON }, /* Disable Speculative Store Bypass */ + { "off", SPEC_STORE_BYPASS_CMD_NONE }, /* Don't touch Speculative Store Bypass */ + { "prctl", SPEC_STORE_BYPASS_CMD_PRCTL }, /* Disable Speculative Store Bypass via prctl */ };
static enum ssb_mitigation_cmd __init ssb_parse_cmdline(void) @@ -513,14 +518,15 @@ static enum ssb_mitigation_cmd __init __
switch (cmd) { case SPEC_STORE_BYPASS_CMD_AUTO: - /* - * AMD platforms by default don't need SSB mitigation. - */ - if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) - break; + /* Choose prctl as the default mode */ + mode = SPEC_STORE_BYPASS_PRCTL; + break; case SPEC_STORE_BYPASS_CMD_ON: mode = SPEC_STORE_BYPASS_DISABLE; break; + case SPEC_STORE_BYPASS_CMD_PRCTL: + mode = SPEC_STORE_BYPASS_PRCTL; + break; case SPEC_STORE_BYPASS_CMD_NONE: break; } @@ -531,7 +537,7 @@ static enum ssb_mitigation_cmd __init __ * - X86_FEATURE_RDS - CPU is able to turn off speculative store bypass * - X86_FEATURE_SPEC_STORE_BYPASS_DISABLE - engage the mitigation */ - if (mode != SPEC_STORE_BYPASS_NONE) { + if (mode == SPEC_STORE_BYPASS_DISABLE) { setup_force_cpu_cap(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE); /* * Intel uses the SPEC CTRL MSR Bit(2) for this, while AMD uses @@ -562,6 +568,63 @@ static void ssb_select_mitigation()
#undef pr_fmt
+static int ssb_prctl_set(unsigned long ctrl) +{ + bool rds = !!test_tsk_thread_flag(current, TIF_RDS); + + if (ssb_mode != SPEC_STORE_BYPASS_PRCTL) + return -ENXIO; + + if (ctrl == PR_SPEC_ENABLE) + clear_tsk_thread_flag(current, TIF_RDS); + else + set_tsk_thread_flag(current, TIF_RDS); + + if (rds != !!test_tsk_thread_flag(current, TIF_RDS)) + speculative_store_bypass_update(); + + return 0; +} + +static int ssb_prctl_get(void) +{ + switch (ssb_mode) { + case SPEC_STORE_BYPASS_DISABLE: + return PR_SPEC_DISABLE; + case SPEC_STORE_BYPASS_PRCTL: + if (test_tsk_thread_flag(current, TIF_RDS)) + return PR_SPEC_PRCTL | PR_SPEC_DISABLE; + return PR_SPEC_PRCTL | PR_SPEC_ENABLE; + default: + if (boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS)) + return PR_SPEC_ENABLE; + return PR_SPEC_NOT_AFFECTED; + } +} + +int arch_prctl_spec_ctrl_set(unsigned long which, unsigned long ctrl) +{ + if (ctrl != PR_SPEC_ENABLE && ctrl != PR_SPEC_DISABLE) + return -ERANGE; + + switch (which) { + case PR_SPEC_STORE_BYPASS: + return ssb_prctl_set(ctrl); + default: + return -ENODEV; + } +} + +int arch_prctl_spec_ctrl_get(unsigned long which) +{ + switch (which) { + case PR_SPEC_STORE_BYPASS: + return ssb_prctl_get(); + default: + return -ENODEV; + } +} + void x86_spec_ctrl_setup_ap(void) { if (boot_cpu_has(X86_FEATURE_IBRS))
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
commit da39556f66f5cfe8f9c989206974f1cb16ca5d7c upstream.
Expose the CPUID.7.EDX[31] bit to the guest, and also guard against various combinations of SPEC_CTRL MSR values.
The handling of the MSR (to take into account the host value of SPEC_CTRL Bit(2)) is taken care of in patch:
KVM/SVM/VMX/x86/spectre_v2: Support the combination of guest and host IBRS
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
[dwmw2: Handle 4.9 guest CPUID differences, rename
        guest_cpu_has_ibrs() → guest_cpu_has_spec_ctrl()]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 arch/x86/kvm/cpuid.c | 2 +-
 arch/x86/kvm/cpuid.h | 4 ++--
 arch/x86/kvm/svm.c   | 4 ++--
 arch/x86/kvm/vmx.c   | 6 +++---
 4 files changed, 8 insertions(+), 8 deletions(-)
--- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -318,7 +318,7 @@ static inline int __do_cpuid_ent(struct
/* cpuid 7.0.edx*/ const u32 kvm_cpuid_7_0_edx_x86_features = - F(SPEC_CTRL) | F(ARCH_CAPABILITIES); + F(SPEC_CTRL) | F(RDS) | F(ARCH_CAPABILITIES);
/* all calls to cpuid_count() should be made on the same cpu */ get_cpu(); --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -115,7 +115,7 @@ static inline bool guest_cpuid_has_ibpb( return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL)); }
-static inline bool guest_cpuid_has_ibrs(struct kvm_vcpu *vcpu) +static inline bool guest_cpuid_has_spec_ctrl(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best;
@@ -123,7 +123,7 @@ static inline bool guest_cpuid_has_ibrs( if (best && (best->ebx & bit(X86_FEATURE_IBRS))) return true; best = kvm_find_cpuid_entry(vcpu, 7, 0); - return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL)); + return best && (best->edx & (bit(X86_FEATURE_SPEC_CTRL) | bit(X86_FEATURE_RDS))); }
static inline bool guest_cpuid_has_arch_capabilities(struct kvm_vcpu *vcpu) --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3139,7 +3139,7 @@ static int svm_get_msr(struct kvm_vcpu * break; case MSR_IA32_SPEC_CTRL: if (!msr_info->host_initiated && - !guest_cpuid_has_ibrs(vcpu)) + !guest_cpuid_has_spec_ctrl(vcpu)) return 1;
msr_info->data = svm->spec_ctrl; @@ -3218,7 +3218,7 @@ static int svm_set_msr(struct kvm_vcpu * break; case MSR_IA32_SPEC_CTRL: if (!msr->host_initiated && - !guest_cpuid_has_ibrs(vcpu)) + !guest_cpuid_has_spec_ctrl(vcpu)) return 1;
/* The STIBP bit doesn't fault even if it's not advertised */ --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2528,7 +2528,7 @@ static int vmx_get_msr(struct kvm_vcpu * break; case MSR_IA32_SPEC_CTRL: if (!msr_info->host_initiated && - !guest_cpuid_has_ibrs(vcpu)) + !guest_cpuid_has_spec_ctrl(vcpu)) return 1;
msr_info->data = to_vmx(vcpu)->spec_ctrl; @@ -2633,11 +2633,11 @@ static int vmx_set_msr(struct kvm_vcpu * break; case MSR_IA32_SPEC_CTRL: if (!msr_info->host_initiated && - !guest_cpuid_has_ibrs(vcpu)) + !guest_cpuid_has_spec_ctrl(vcpu)) return 1;
/* The STIBP bit doesn't fault even if it's not advertised */ - if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP)) + if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP | SPEC_CTRL_RDS)) return 1;
vmx->spec_ctrl = data;
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
commit c456442cd3a59eeb1d60293c26cbe2ff2c4e42cf upstream.
Add the sysfs file for the new vulnerability. It does not do much except show the word 'Vulnerable' for recent x86 cores.
Intel cores prior to family 6 are known not to be vulnerable, and so are some Atoms and some Xeon Phi.
It assumes that older Cyrix, Centaur, etc. cores are immune.
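For reference, the new file reads like any other sysfs attribute; a minimal sketch (assumes sysfs is mounted at /sys and the running kernel carries this patch):

#include <stdio.h>

int main(void)
{
        char buf[64] = "";
        FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/spec_store_bypass", "r");

        if (!f) {
                perror("fopen");        /* older kernel or sysfs not mounted */
                return 1;
        }
        if (fgets(buf, sizeof(buf), f))
                printf("spec_store_bypass: %s", buf);
        fclose(f);
        return 0;
}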
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16:
 - Renumber X86_BUG_SPEC_STORE_BYPASS
 - Adjust filename, context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 .../ABI/testing/sysfs-devices-system-cpu |  1 +
 arch/x86/include/asm/cpufeature.h        |  1 +
 arch/x86/kernel/cpu/bugs.c               |  5 ++++
 arch/x86/kernel/cpu/common.c             | 23 +++++++++++++++++++
 drivers/base/cpu.c                       |  8 +++++++
 include/linux/cpu.h                      |  2 ++
 6 files changed, 40 insertions(+)
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -229,6 +229,7 @@ What: /sys/devices/system/cpu/vulnerabi /sys/devices/system/cpu/vulnerabilities/meltdown /sys/devices/system/cpu/vulnerabilities/spectre_v1 /sys/devices/system/cpu/vulnerabilities/spectre_v2 + /sys/devices/system/cpu/vulnerabilities/spec_store_bypass Date: January 2018 Contact: Linux kernel mailing list linux-kernel@vger.kernel.org Description: Information about CPU vulnerabilities --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -260,6 +260,7 @@ #define X86_BUG_CPU_MELTDOWN X86_BUG(5) /* CPU is affected by meltdown attack and needs kernel page table isolation */ #define X86_BUG_SPECTRE_V1 X86_BUG(6) /* CPU is affected by Spectre variant 1 attack with conditional branches */ #define X86_BUG_SPECTRE_V2 X86_BUG(7) /* CPU is affected by Spectre variant 2 attack with indirect branches */ +#define X86_BUG_SPEC_STORE_BYPASS X86_BUG(8) /* CPU is affected by speculative store bypass attack */
#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -442,4 +442,9 @@ ssize_t cpu_show_spectre_v2(struct devic { return cpu_show_common(dev, attr, buf, X86_BUG_SPECTRE_V2); } + +ssize_t cpu_show_spec_store_bypass(struct device *dev, struct device_attribute *attr, char *buf) +{ + return cpu_show_common(dev, attr, buf, X86_BUG_SPEC_STORE_BYPASS); +} #endif --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -803,10 +803,33 @@ static const __initconst struct x86_cpu_ {} };
+static const __initconst struct x86_cpu_id cpu_no_spec_store_bypass[] = { + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_PINEVIEW }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_LINCROFT }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_PENWELL }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CLOVERVIEW }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CEDARVIEW }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT2 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MERRIFIELD }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_CORE_YONAH }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM }, + { X86_VENDOR_CENTAUR, 5, }, + { X86_VENDOR_INTEL, 5, }, + { X86_VENDOR_NSC, 5, }, + { X86_VENDOR_ANY, 4, }, + {} +}; + static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c) { u64 ia32_cap = 0;
+ if (!x86_match_cpu(cpu_no_spec_store_bypass)) + setup_force_cpu_bug(X86_BUG_SPEC_STORE_BYPASS); + if (x86_match_cpu(cpu_no_speculation)) return;
--- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -438,14 +438,22 @@ ssize_t __weak cpu_show_spectre_v2(struc return sprintf(buf, "Not affected\n"); }
+ssize_t __weak cpu_show_spec_store_bypass(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return sprintf(buf, "Not affected\n"); +} + static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL); static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL); static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL); +static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
static struct attribute *cpu_root_vulnerabilities_attrs[] = { &dev_attr_meltdown.attr, &dev_attr_spectre_v1.attr, &dev_attr_spectre_v2.attr, + &dev_attr_spec_store_bypass.attr, NULL };
--- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -45,6 +45,8 @@ extern ssize_t cpu_show_spectre_v1(struc struct device_attribute *attr, char *buf); extern ssize_t cpu_show_spectre_v2(struct device *dev, struct device_attribute *attr, char *buf); +extern ssize_t cpu_show_spec_store_bypass(struct device *dev, + struct device_attribute *attr, char *buf);
#ifdef CONFIG_HOTPLUG_CPU extern void unregister_cpu(struct cpu *cpu);
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
commit 1b86883ccb8d5d9506529d42dbe1a5257cb30b18 upstream.
The 336996-Speculative-Execution-Side-Channel-Mitigations.pdf refers to all the other bits as reserved. The Intel SDM glossary defines reserved as implementation specific - aka unknown.
As such, this must be taken into account at bootup, and proper masking applied for the bits in use.
A copy of this document is available at https://bugzilla.kernel.org/show_bug.cgi?id=199511
[ tglx: Made x86_spec_ctrl_base __ro_after_init ]
Suggested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16:
 - We don't have __ro_after_init
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 arch/x86/include/asm/nospec-branch.h | 24 ++++++++++++++++++++----
 arch/x86/kernel/cpu/bugs.c           | 28 ++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+), 4 deletions(-)
--- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -172,6 +172,17 @@ enum spectre_v2_mitigation { SPECTRE_V2_IBRS, };
+/* + * The Intel specification for the SPEC_CTRL MSR requires that we + * preserve any already set reserved bits at boot time (e.g. for + * future additions that this kernel is not currently aware of). + * We then set any additional mitigation bits that we want + * ourselves and always use this as the base for SPEC_CTRL. + * We also use this when handling guest entry/exit as below. + */ +extern void x86_spec_ctrl_set(u64); +extern u64 x86_spec_ctrl_get_default(void); + extern char __indirect_thunk_start[]; extern char __indirect_thunk_end[];
@@ -208,8 +219,9 @@ void alternative_msr_write(unsigned int
static inline void indirect_branch_prediction_barrier(void) { - alternative_msr_write(MSR_IA32_PRED_CMD, PRED_CMD_IBPB, - X86_FEATURE_USE_IBPB); + u64 val = PRED_CMD_IBPB; + + alternative_msr_write(MSR_IA32_PRED_CMD, val, X86_FEATURE_USE_IBPB); }
/* @@ -220,14 +232,18 @@ static inline void indirect_branch_predi */ #define firmware_restrict_branch_speculation_start() \ do { \ + u64 val = x86_spec_ctrl_get_default() | SPEC_CTRL_IBRS; \ + \ preempt_disable(); \ - alternative_msr_write(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS, \ + alternative_msr_write(MSR_IA32_SPEC_CTRL, val, \ X86_FEATURE_USE_IBRS_FW); \ } while (0)
#define firmware_restrict_branch_speculation_end() \ do { \ - alternative_msr_write(MSR_IA32_SPEC_CTRL, 0, \ + u64 val = x86_spec_ctrl_get_default(); \ + \ + alternative_msr_write(MSR_IA32_SPEC_CTRL, val, \ X86_FEATURE_USE_IBRS_FW); \ preempt_enable(); \ } while (0) --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -27,6 +27,12 @@
static void __init spectre_v2_select_mitigation(void);
+/* + * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any + * writes to SPEC_CTRL contain whatever reserved bits have been set. + */ +static u64 x86_spec_ctrl_base; + #ifdef CONFIG_X86_32
static double __initdata x = 4195835.0; @@ -94,6 +100,13 @@ void __init check_bugs(void) print_cpu_info(&boot_cpu_data); }
+ /* + * Read the SPEC_CTRL MSR to account for reserved bits which may + * have unknown values. + */ + if (boot_cpu_has(X86_FEATURE_IBRS)) + rdmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base); + /* Select the proper spectre mitigation before patching alternatives */ spectre_v2_select_mitigation();
@@ -157,6 +170,21 @@ static const char *spectre_v2_strings[]
static enum spectre_v2_mitigation spectre_v2_enabled = SPECTRE_V2_NONE;
+void x86_spec_ctrl_set(u64 val) +{ + if (val & ~SPEC_CTRL_IBRS) + WARN_ONCE(1, "SPEC_CTRL MSR value 0x%16llx is unknown.\n", val); + else + wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base | val); +} +EXPORT_SYMBOL_GPL(x86_spec_ctrl_set); + +u64 x86_spec_ctrl_get_default(void) +{ + return x86_spec_ctrl_base; +} +EXPORT_SYMBOL_GPL(x86_spec_ctrl_get_default); + #ifdef RETPOLINE static bool spectre_v2_bad_module;
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
commit 772439717dbf703b39990be58d8d4e3e4ad0598a upstream.
Intel CPUs expose methods to:
- Detect whether RDS capability is available via CPUID.7.0.EDX[31],
- The SPEC_CTRL MSR(0x48), bit 2 set to enable RDS.
- MSR_IA32_ARCH_CAPABILITIES, Bit(4): no need to enable RDS.
With that in mind, if spec_store_bypass_disable=[auto,on] is selected, set the SPEC_CTRL MSR at boot time to enable RDS if the platform requires it.
Note that this does not fix the KVM case where SPEC_CTRL is exposed to guests which can muck with it; see the patch titled: KVM/SVM/VMX/x86/spectre_v2: Support the combination of guest and host IBRS.
And for the firmware (IBRS to be set), see patch titled: x86/spectre_v2: Read SPEC_CTRL MSR during boot and re-use reserved bits
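For reference, the CPUID detection in the first bullet above can be checked from userspace with a minimal sketch like this (assumes GCC's <cpuid.h> and that basic leaf 7 exists on the CPU; not part of the patch):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* CPUID.(EAX=7, ECX=0):EDX[31] advertises the RDS/SSBD control. */
        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        printf("RDS (SSBD) supported: %s\n", (edx & (1u << 31)) ? "yes" : "no");
        return 0;
}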
[ tglx: Disentangled it from the Intel implementation and kept the call order ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16: adjust filename, context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 arch/x86/include/uapi/asm/msr-index.h |  6 ++++++
 arch/x86/kernel/cpu/bugs.c            | 30 +++++++++++++++++++++++++--
 arch/x86/kernel/cpu/common.c          | 10 +++++----
 arch/x86/kernel/cpu/cpu.h             |  3 +++
 arch/x86/kernel/cpu/intel.c           |  1 +
 5 files changed, 44 insertions(+), 6 deletions(-)
--- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -35,6 +35,7 @@ #define MSR_IA32_SPEC_CTRL 0x00000048 /* Speculation Control */ #define SPEC_CTRL_IBRS (1 << 0) /* Indirect Branch Restricted Speculation */ #define SPEC_CTRL_STIBP (1 << 1) /* Single Thread Indirect Branch Predictors */ +#define SPEC_CTRL_RDS (1 << 2) /* Reduced Data Speculation */
#define MSR_IA32_PRED_CMD 0x00000049 /* Prediction Command */ #define PRED_CMD_IBPB (1 << 0) /* Indirect Branch Prediction Barrier */ @@ -57,6 +58,11 @@ #define MSR_IA32_ARCH_CAPABILITIES 0x0000010a #define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */ #define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */ +#define ARCH_CAP_RDS_NO (1 << 4) /* + * Not susceptible to Speculative Store Bypass + * attack, so no Reduced Data Speculation control + * required. + */
#define MSR_IA32_BBL_CR_CTL 0x00000119 #define MSR_IA32_BBL_CR_CTL3 0x0000011e --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -179,7 +179,7 @@ static enum spectre_v2_mitigation spectr
void x86_spec_ctrl_set(u64 val) { - if (val & ~SPEC_CTRL_IBRS) + if (val & ~(SPEC_CTRL_IBRS | SPEC_CTRL_RDS)) WARN_ONCE(1, "SPEC_CTRL MSR value 0x%16llx is unknown.\n", val); else wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base | val); @@ -482,8 +482,28 @@ static enum ssb_mitigation_cmd __init __ break; }
- if (mode != SPEC_STORE_BYPASS_NONE) + /* + * We have three CPU feature flags that are in play here: + * - X86_BUG_SPEC_STORE_BYPASS - CPU is susceptible. + * - X86_FEATURE_RDS - CPU is able to turn off speculative store bypass + * - X86_FEATURE_SPEC_STORE_BYPASS_DISABLE - engage the mitigation + */ + if (mode != SPEC_STORE_BYPASS_NONE) { setup_force_cpu_cap(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE); + /* + * Intel uses the SPEC CTRL MSR Bit(2) for this, while AMD uses + * a completely different MSR and bit dependent on family. + */ + switch (boot_cpu_data.x86_vendor) { + case X86_VENDOR_INTEL: + x86_spec_ctrl_base |= SPEC_CTRL_RDS; + x86_spec_ctrl_set(SPEC_CTRL_RDS); + break; + case X86_VENDOR_AMD: + break; + } + } + return mode; }
@@ -497,6 +517,12 @@ static void ssb_select_mitigation()
#undef pr_fmt
+void x86_spec_ctrl_setup_ap(void) +{ + if (boot_cpu_has(X86_FEATURE_IBRS)) + x86_spec_ctrl_set(x86_spec_ctrl_base & (SPEC_CTRL_IBRS | SPEC_CTRL_RDS)); +} + #ifdef CONFIG_SYSFS
ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr, --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -827,7 +827,11 @@ static void __init cpu_set_bug_bits(stru { u64 ia32_cap = 0;
- if (!x86_match_cpu(cpu_no_spec_store_bypass)) + if (cpu_has(c, X86_FEATURE_ARCH_CAPABILITIES)) + rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap); + + if (!x86_match_cpu(cpu_no_spec_store_bypass) && + !(ia32_cap & ARCH_CAP_RDS_NO)) setup_force_cpu_bug(X86_BUG_SPEC_STORE_BYPASS);
if (x86_match_cpu(cpu_no_speculation)) @@ -839,9 +843,6 @@ static void __init cpu_set_bug_bits(stru if (x86_match_cpu(cpu_no_meltdown)) return;
- if (cpu_has(c, X86_FEATURE_ARCH_CAPABILITIES)) - rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap); - /* Rogue Data Cache Load? No! */ if (ia32_cap & ARCH_CAP_RDCL_NO) return; @@ -1175,6 +1176,7 @@ void identify_secondary_cpu(struct cpuin enable_sep_cpu(); #endif mtrr_ap_init(); + x86_spec_ctrl_setup_ap(); }
struct msr_range { --- a/arch/x86/kernel/cpu/cpu.h +++ b/arch/x86/kernel/cpu/cpu.h @@ -45,4 +45,7 @@ extern const struct cpu_dev *const __x86
extern void get_cpu_cap(struct cpuinfo_x86 *c); extern void cpu_detect_cache_sizes(struct cpuinfo_x86 *c); + +extern void x86_spec_ctrl_setup_ap(void); + #endif /* ARCH_X86_CPU_H */ --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -120,6 +120,7 @@ static void early_init_intel(struct cpui setup_clear_cpu_cap(X86_FEATURE_STIBP); setup_clear_cpu_cap(X86_FEATURE_SPEC_CTRL); setup_clear_cpu_cap(X86_FEATURE_INTEL_STIBP); + setup_clear_cpu_cap(X86_FEATURE_RDS); }
/*
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andi Kleen ak@linux.intel.com
commit 50896e180c6aa3a9c61a26ced99e15d602666a4c upstream.
L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU speculates on PTE entries which do not have the PRESENT bit set, if the content of the resulting physical address is available in the L1D cache.
The OS side mitigation makes sure that a !PRESENT PTE entry points to a physical address outside the actually existing and cacheable memory space. This is achieved by inverting the upper bits of the PTE. Due to the address space limitations this only works for 64bit and 32bit PAE kernels, but not for 32bit non-PAE.
This mitigation applies to both host and guest kernels, but in case of a 64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of the PAE address space (44bit) is not enough if the host has more than 43 bits of populated memory address space, because the speculation treats the PTE content as a physical host address bypassing EPT.
The host (hypervisor) protects itself against the guest by flushing L1D as needed, but pages inside the guest are not protected against attacks from other processes inside the same guest.
For the guest the inverted PTE mask has to match the host to provide the full protection for all pages the host could possibly map into the guest. The host's populated address space is not known to the guest, so the mask must cover the possible maximal host address space, i.e. 52 bit.
On 32bit PAE the maximum PTE mask is currently set to 44 bit because that is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits the mask to be below what the host could possibly use for physical pages.
The L1TF PROT_NONE protection code uses the PTE masks to determine which bits to invert to make sure the higher bits are set for unmapped entries to prevent L1TF speculation attacks against EPT inside guests.
In order to invert all bits that could be used by the host, increase __PHYSICAL_PAGE_SHIFT to 52 to match 64bit.
The real limit for a 32bit PAE kernel is still 44 bits because all Linux PTEs are created from unsigned long PFNs, so they cannot be higher than 44 bits on a 32bit kernel. So these extra PFN bits should never be set. The only users of this macro are using it to look at PTEs, so it's safe.
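As a simplified illustration only (this is not the kernel code, and the macro names below are made up for the sketch): the PROT_NONE protection flips the PFN bits of a not-present PTE within the physical mask, so with __PHYSICAL_MASK_SHIFT raised to 52 the inverted address has all high bits set and cannot hit memory the host could have populated.

#include <stdint.h>

#define PHYS_MASK_SHIFT	52
#define PAGE_SHIFT	12
/* PFN bits of a PTE value: everything between the page offset and the
 * physical mask width. */
#define PFN_MASK	((((uint64_t)1 << PHYS_MASK_SHIFT) - 1) & \
			 ~(((uint64_t)1 << PAGE_SHIFT) - 1))

static uint64_t invert_protnone(uint64_t pteval)
{
	/* Flip only the PFN bits; the flag bits are left intact. */
	return pteval ^ PFN_MASK;
}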
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Josh Poimboeuf jpoimboe@redhat.com Acked-by: Michal Hocko mhocko@suse.com Acked-by: Dave Hansen dave.hansen@intel.com Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/page_32_types.h | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
--- a/arch/x86/include/asm/page_32_types.h +++ b/arch/x86/include/asm/page_32_types.h @@ -27,8 +27,13 @@ #define N_EXCEPTION_STACKS 1
#ifdef CONFIG_X86_PAE -/* 44=32+12, the limit we can fit into an unsigned long pfn */ -#define __PHYSICAL_MASK_SHIFT 44 +/* + * This is beyond the 44 bit limit imposed by the 32bit long pfns, + * but we need the full mask to make sure inverted PROT_NONE + * entries have all the host bits set in a guest. + * The real limit is still 44 bits. + */ +#define __PHYSICAL_MASK_SHIFT 52 #define __VIRTUAL_MASK_SHIFT 32
#else /* !CONFIG_X86_PAE */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit 764f3c21588a059cd783c6ba0734d4db2d72822d upstream.
AMD does not need the Speculative Store Bypass mitigation to be enabled.
The parameters for this are already available, and the mitigation can be applied via MSR C001_1020. Each family uses a different bit in that MSR for this.
[ tglx: Expose the bit mask via a variable and move the actual MSR fiddling into the bugs code as that's the right thing to do and also required to prepare for dynamic enable/disable ]
Suggested-by: Borislav Petkov bp@suse.de Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org [bwh: Backported to 3.16: - Renumber the feature bit - We don't have __ro_after_init - Adjust filename, context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/asm/nospec-branch.h | 4 ++++ arch/x86/kernel/cpu/amd.c | 26 ++++++++++++++++++++++++++ arch/x86/kernel/cpu/bugs.c | 27 ++++++++++++++++++++++++++- arch/x86/kernel/cpu/common.c | 4 ++++ 5 files changed, 61 insertions(+), 1 deletion(-)
--- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -192,6 +192,7 @@ #define X86_FEATURE_USE_IBPB (7*32+12) /* "" Indirect Branch Prediction Barrier enabled */ #define X86_FEATURE_USE_IBRS_FW (7*32+13) /* "" Use IBRS during runtime firmware calls */ #define X86_FEATURE_SPEC_STORE_BYPASS_DISABLE (7*32+14) /* "" Disable Speculative Store Bypass. */ +#define X86_FEATURE_AMD_RDS (7*32+15) /* "" AMD RDS implementation */
#define X86_FEATURE_RETPOLINE (7*32+29) /* "" Generic Retpoline mitigation for Spectre variant 2 */ #define X86_FEATURE_RETPOLINE_AMD (7*32+30) /* "" AMD Retpoline mitigation for Spectre variant 2 */ --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -199,6 +199,10 @@ enum ssb_mitigation { SPEC_STORE_BYPASS_DISABLE, };
+/* AMD specific Speculative Store Bypass MSR data */ +extern u64 x86_amd_ls_cfg_base; +extern u64 x86_amd_ls_cfg_rds_mask; + extern char __indirect_thunk_start[]; extern char __indirect_thunk_end[];
--- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -8,6 +8,7 @@ #include <asm/processor.h> #include <asm/apic.h> #include <asm/cpu.h> +#include <asm/nospec-branch.h> #include <asm/pci-direct.h>
#ifdef CONFIG_X86_64 @@ -470,6 +471,26 @@ static void bsp_init_amd(struct cpuinfo_ va_align.mask = (upperbit - 1) & PAGE_MASK; va_align.flags = ALIGN_VA_32 | ALIGN_VA_64; } + + if (c->x86 >= 0x15 && c->x86 <= 0x17) { + unsigned int bit; + + switch (c->x86) { + case 0x15: bit = 54; break; + case 0x16: bit = 33; break; + case 0x17: bit = 10; break; + default: return; + } + /* + * Try to cache the base value so further operations can + * avoid RMW. If that faults, do not enable RDS. + */ + if (!rdmsrl_safe(MSR_AMD64_LS_CFG, &x86_amd_ls_cfg_base)) { + setup_force_cpu_cap(X86_FEATURE_RDS); + setup_force_cpu_cap(X86_FEATURE_AMD_RDS); + x86_amd_ls_cfg_rds_mask = 1ULL << bit; + } + } }
static void early_init_amd(struct cpuinfo_x86 *c) @@ -780,6 +801,11 @@ static void init_amd(struct cpuinfo_x86 set_cpu_bug(c, X86_BUG_AMD_APIC_C1E);
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy); + + if (boot_cpu_has(X86_FEATURE_AMD_RDS)) { + set_cpu_cap(c, X86_FEATURE_RDS); + set_cpu_cap(c, X86_FEATURE_AMD_RDS); + } }
#ifdef CONFIG_X86_32 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -40,6 +40,13 @@ static u64 x86_spec_ctrl_base; */ static u64 x86_spec_ctrl_mask = ~SPEC_CTRL_IBRS;
+/* + * AMD specific MSR info for Speculative Store Bypass control. + * x86_amd_ls_cfg_rds_mask is initialized in identify_boot_cpu(). + */ +u64 x86_amd_ls_cfg_base; +u64 x86_amd_ls_cfg_rds_mask; + #ifdef CONFIG_X86_32
static double __initdata x = 4195835.0; @@ -109,7 +116,8 @@ void __init check_bugs(void)
/* * Read the SPEC_CTRL MSR to account for reserved bits which may - * have unknown values. + * have unknown values. AMD64_LS_CFG MSR is cached in the early AMD + * init code as it is not enumerated and depends on the family. */ if (boot_cpu_has(X86_FEATURE_IBRS)) rdmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base); @@ -216,6 +224,14 @@ void x86_spec_ctrl_restore_host(u64 gues } EXPORT_SYMBOL_GPL(x86_spec_ctrl_restore_host);
+static void x86_amd_rds_enable(void) +{ + u64 msrval = x86_amd_ls_cfg_base | x86_amd_ls_cfg_rds_mask; + + if (boot_cpu_has(X86_FEATURE_AMD_RDS)) + wrmsrl(MSR_AMD64_LS_CFG, msrval); +} + #ifdef RETPOLINE static bool spectre_v2_bad_module;
@@ -481,6 +497,11 @@ static enum ssb_mitigation_cmd __init __
switch (cmd) { case SPEC_STORE_BYPASS_CMD_AUTO: + /* + * AMD platforms by default don't need SSB mitigation. + */ + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) + break; case SPEC_STORE_BYPASS_CMD_ON: mode = SPEC_STORE_BYPASS_DISABLE; break; @@ -507,6 +528,7 @@ static enum ssb_mitigation_cmd __init __ x86_spec_ctrl_set(SPEC_CTRL_RDS); break; case X86_VENDOR_AMD: + x86_amd_rds_enable(); break; } } @@ -528,6 +550,9 @@ void x86_spec_ctrl_setup_ap(void) { if (boot_cpu_has(X86_FEATURE_IBRS)) x86_spec_ctrl_set(x86_spec_ctrl_base & ~x86_spec_ctrl_mask); + + if (ssb_mode == SPEC_STORE_BYPASS_DISABLE) + x86_amd_rds_enable(); }
#ifdef CONFIG_SYSFS --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -819,6 +819,10 @@ static const __initconst struct x86_cpu_ { X86_VENDOR_CENTAUR, 5, }, { X86_VENDOR_INTEL, 5, }, { X86_VENDOR_NSC, 5, }, + { X86_VENDOR_AMD, 0x12, }, + { X86_VENDOR_AMD, 0x11, }, + { X86_VENDOR_AMD, 0x10, }, + { X86_VENDOR_AMD, 0xf, }, { X86_VENDOR_ANY, 4, }, {} };
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit b617cfc858161140d69cc0b5cc211996b557a1c7 upstream.
Add two new prctls to control aspects of speculation related vulnerabilities and their mitigations, to provide finer grained control over performance impacting mitigations.
PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature which is selected with arg2 of prctl(2). The return value uses bits 0-2 with the following meaning:
Bit  Define           Description
0    PR_SPEC_PRCTL    Mitigation can be controlled per task by PR_SET_SPECULATION_CTRL
1    PR_SPEC_ENABLE   The speculation feature is enabled, mitigation is disabled
2    PR_SPEC_DISABLE  The speculation feature is disabled, mitigation is enabled
If all bits are 0 the CPU is not affected by the speculation misfeature.
If PR_SPEC_PRCTL is set, then the per task control of the mitigation is available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation misfeature will fail.
PR_SET_SPECULATION_CTRL allows controlling the speculation misfeature, which is selected by arg2 of prctl(2), per task. arg3 is used to hand in the control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.
The common return values are:
EINVAL  prctl is not implemented by the architecture or the unused prctl() arguments are not 0
ENODEV  arg2 is selecting a not supported speculation misfeature
PR_SET_SPECULATION_CTRL has these additional return values:
ERANGE  arg3 is incorrect, i.e. it is neither PR_SPEC_ENABLE nor PR_SPEC_DISABLE
ENXIO   prctl control of the selected speculation misfeature is disabled
The first supported controllable speculation misfeature is PR_SPEC_STORE_BYPASS. Add the define so this can be shared between architectures.
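For illustration only (not part of this patch), a minimal userspace sketch of the interface described above, assuming headers that already carry the new PR_* defines:

#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
	/* Query the Speculative Store Bypass control state for this task. */
	int state = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);

	if (state < 0) {
		perror("PR_GET_SPECULATION_CTRL");
		return 1;
	}

	if (state & PR_SPEC_PRCTL) {
		/* Per-task control is available: disable the store bypass,
		 * i.e. engage the mitigation, for this task only. */
		if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
			  PR_SPEC_DISABLE, 0, 0))
			perror("PR_SET_SPECULATION_CTRL");
	} else {
		printf("no per-task SSB control (state 0x%x)\n", state);
	}
	return 0;
}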
Based on an initial patch from Tim Chen and mostly rewritten.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com [bwh: Backported to 3.16: - Add the documentation directly under Documentation/ since there is no userspace-api subdirectory or reST index - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- Documentation/spec_ctrl.rst | 86 +++++++++++++++++++++++++++++++++++++ include/linux/nospec.h | 5 +++ include/uapi/linux/prctl.h | 11 +++++ kernel/sys.c | 22 ++++++++++ 4 files changed, 124 insertions(+) create mode 100644 Documentation/spec_ctrl.rst
--- /dev/null +++ b/Documentation/spec_ctrl.rst @@ -0,0 +1,86 @@ +=================== +Speculation Control +=================== + +Quite some CPUs have speculation related misfeatures which are in fact +vulnerabilites causing data leaks in various forms even accross privilege +domains. + +The kernel provides mitigation for such vulnerabilities in various +forms. Some of these mitigations are compile time configurable and some on +the kernel command line. + +There is also a class of mitigations which are very expensive, but they can +be restricted to a certain set of processes or tasks in controlled +environments. The mechanism to control these mitigations is via +:manpage:`prctl(2)`. + +There are two prctl options which are related to this: + + * PR_GET_SPECULATION_CTRL + + * PR_SET_SPECULATION_CTRL + +PR_GET_SPECULATION_CTRL +----------------------- + +PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature +which is selected with arg2 of prctl(2). The return value uses bits 0-2 with +the following meaning: + +==== ================ =================================================== +Bit Define Description +==== ================ =================================================== +0 PR_SPEC_PRCTL Mitigation can be controlled per task by + PR_SET_SPECULATION_CTRL +1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is + disabled +2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is + enabled +==== ================ =================================================== + +If all bits are 0 the CPU is not affected by the speculation misfeature. + +If PR_SPEC_PRCTL is set, then the per task control of the mitigation is +available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation +misfeature will fail. + +PR_SET_SPECULATION_CTRL +----------------------- +PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which +is selected by arg2 of :manpage:`prctl(2)` per task. arg3 is used to hand +in the control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE. + +Common error codes +------------------ +======= ================================================================= +Value Meaning +======= ================================================================= +EINVAL The prctl is not implemented by the architecture or unused + prctl(2) arguments are not 0 + +ENODEV arg2 is selecting a not supported speculation misfeature +======= ================================================================= + +PR_SET_SPECULATION_CTRL error codes +----------------------------------- +======= ================================================================= +Value Meaning +======= ================================================================= +0 Success + +ERANGE arg3 is incorrect, i.e. it's neither PR_SPEC_ENABLE nor + PR_SPEC_DISABLE + +ENXIO Control of the selected speculation misfeature is not possible. + See PR_GET_SPECULATION_CTRL. 
+======= ================================================================= + +Speculation misfeature controls +------------------------------- +- PR_SPEC_STORE_BYPASS: Speculative Store Bypass + + Invocations: + * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_ENABLE, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_DISABLE, 0, 0); --- a/include/linux/nospec.h +++ b/include/linux/nospec.h @@ -55,4 +55,9 @@ static inline unsigned long array_index_ \ (typeof(_i)) (_i & _mask); \ }) + +/* Speculation control prctl */ +int arch_prctl_spec_ctrl_get(unsigned long which); +int arch_prctl_spec_ctrl_set(unsigned long which, unsigned long ctrl); + #endif /* _LINUX_NOSPEC_H */ --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -152,4 +152,15 @@ #define PR_SET_THP_DISABLE 41 #define PR_GET_THP_DISABLE 42
+/* Per task speculation control */ +#define PR_GET_SPECULATION_CTRL 52 +#define PR_SET_SPECULATION_CTRL 53 +/* Speculation control variants */ +# define PR_SPEC_STORE_BYPASS 0 +/* Return and control values for PR_SET/GET_SPECULATION_CTRL */ +# define PR_SPEC_NOT_AFFECTED 0 +# define PR_SPEC_PRCTL (1UL << 0) +# define PR_SPEC_ENABLE (1UL << 1) +# define PR_SPEC_DISABLE (1UL << 2) + #endif /* _LINUX_PRCTL_H */ --- a/kernel/sys.c +++ b/kernel/sys.c @@ -53,6 +53,8 @@ #include <linux/uidgid.h> #include <linux/cred.h>
+#include <linux/nospec.h> + #include <linux/kmsg_dump.h> /* Move somewhere else to avoid recompiling? */ #include <generated/utsrelease.h> @@ -1832,6 +1834,16 @@ static int prctl_get_tid_address(struct } #endif
+int __weak arch_prctl_spec_ctrl_get(unsigned long which) +{ + return -EINVAL; +} + +int __weak arch_prctl_spec_ctrl_set(unsigned long which, unsigned long ctrl) +{ + return -EINVAL; +} + SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) { @@ -2010,6 +2022,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsi me->mm->def_flags &= ~VM_NOHUGEPAGE; up_write(&me->mm->mmap_sem); break; + case PR_GET_SPECULATION_CTRL: + if (arg3 || arg4 || arg5) + return -EINVAL; + error = arch_prctl_spec_ctrl_get(arg2); + break; + case PR_SET_SPECULATION_CTRL: + if (arg4 || arg5) + return -EINVAL; + error = arch_prctl_spec_ctrl_set(arg2, arg3); + break; default: error = -EINVAL; break;
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit 885f82bfbc6fefb6664ea27965c3ab9ac4194b8c upstream.
The Speculative Store Bypass vulnerability can be mitigated with the Reduced Data Speculation (RDS) feature. To allow finer grained control of this eventually expensive mitigation a per task mitigation control is required.
Add a new TIF_RDS flag and put it into the group of TIF flags which are evaluated for mismatch in switch_to(). If these bits differ in the previous and the next task, then the slow path function __switch_to_xtra() is invoked. Implement the TIF_RDS dependent mitigation control in the slow path.
If the prctl for controlling Speculative Store Bypass is disabled or no task uses the prctl then there is no overhead in the switch_to() fast path.
Update the KVM related speculation control functions to take TIF_RDS into account as well.
Based on a patch from Tim Chen. Completely rewritten.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com [bwh: Backported to 3.16: - Exclude _TIF_RDS from _TIF_WORK_MASK and _TIF_ALLWORK_MASK - Adjust filename, context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- --- a/arch/x86/include/asm/spec-ctrl.h +++ b/arch/x86/include/asm/spec-ctrl.h @@ -2,6 +2,7 @@ #ifndef _ASM_X86_SPECCTRL_H_ #define _ASM_X86_SPECCTRL_H_
+#include <linux/thread_info.h> #include <asm/nospec-branch.h>
/* @@ -18,4 +19,20 @@ extern void x86_spec_ctrl_restore_host(u extern u64 x86_amd_ls_cfg_base; extern u64 x86_amd_ls_cfg_rds_mask;
+/* The Intel SPEC CTRL MSR base value cache */ +extern u64 x86_spec_ctrl_base; + +static inline u64 rds_tif_to_spec_ctrl(u64 tifn) +{ + BUILD_BUG_ON(TIF_RDS < SPEC_CTRL_RDS_SHIFT); + return (tifn & _TIF_RDS) >> (TIF_RDS - SPEC_CTRL_RDS_SHIFT); +} + +static inline u64 rds_tif_to_amd_ls_cfg(u64 tifn) +{ + return (tifn & _TIF_RDS) ? x86_amd_ls_cfg_rds_mask : 0ULL; +} + +extern void speculative_store_bypass_update(void); + #endif --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -72,6 +72,7 @@ struct thread_info { #define TIF_SIGPENDING 2 /* signal pending */ #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ #define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ +#define TIF_RDS 5 /* Reduced data speculation */ #define TIF_SYSCALL_EMU 6 /* syscall emulation active */ #define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */ #define TIF_SECCOMP 8 /* secure computing */ @@ -97,6 +98,7 @@ struct thread_info { #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) +#define _TIF_RDS (1 << TIF_RDS) #define _TIF_SYSCALL_EMU (1 << TIF_SYSCALL_EMU) #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) #define _TIF_SECCOMP (1 << TIF_SECCOMP) @@ -131,12 +133,12 @@ struct thread_info { #define _TIF_WORK_MASK \ (0x0000FFFF & \ ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT| \ - _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU)) + _TIF_SINGLESTEP|_TIF_RDS|_TIF_SECCOMP|_TIF_SYSCALL_EMU))
/* work to do on any return to user space */ #define _TIF_ALLWORK_MASK \ - ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT | \ - _TIF_NOHZ) + ((0x0000FFFF & ~(_TIF_RDS | _TIF_SECCOMP)) | \ + _TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ)
/* Only used for 64 bit */ #define _TIF_DO_NOTIFY_MASK \ @@ -145,7 +147,7 @@ struct thread_info {
/* flags to check in __switch_to() */ #define _TIF_WORK_CTXSW \ - (_TIF_IO_BITMAP|_TIF_NOTSC|_TIF_BLOCKSTEP) + (_TIF_IO_BITMAP|_TIF_NOTSC|_TIF_BLOCKSTEP|_TIF_RDS)
#define _TIF_WORK_CTXSW_PREV (_TIF_WORK_CTXSW|_TIF_USER_RETURN_NOTIFY) #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW) --- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -35,7 +35,8 @@ #define MSR_IA32_SPEC_CTRL 0x00000048 /* Speculation Control */ #define SPEC_CTRL_IBRS (1 << 0) /* Indirect Branch Restricted Speculation */ #define SPEC_CTRL_STIBP (1 << 1) /* Single Thread Indirect Branch Predictors */ -#define SPEC_CTRL_RDS (1 << 2) /* Reduced Data Speculation */ +#define SPEC_CTRL_RDS_SHIFT 2 /* Reduced Data Speculation bit */ +#define SPEC_CTRL_RDS (1 << SPEC_CTRL_RDS_SHIFT) /* Reduced Data Speculation */
#define MSR_IA32_PRED_CMD 0x00000049 /* Prediction Command */ #define PRED_CMD_IBPB (1 << 0) /* Indirect Branch Prediction Barrier */ --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -32,7 +32,7 @@ static void __init ssb_select_mitigation * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any * writes to SPEC_CTRL contain whatever reserved bits have been set. */ -static u64 x86_spec_ctrl_base; +u64 x86_spec_ctrl_base;
/* * The vendor and possibly platform specific bits which can be modified in @@ -202,25 +202,41 @@ EXPORT_SYMBOL_GPL(x86_spec_ctrl_set);
u64 x86_spec_ctrl_get_default(void) { - return x86_spec_ctrl_base; + u64 msrval = x86_spec_ctrl_base; + + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + msrval |= rds_tif_to_spec_ctrl(current_thread_info()->flags); + return msrval; } EXPORT_SYMBOL_GPL(x86_spec_ctrl_get_default);
void x86_spec_ctrl_set_guest(u64 guest_spec_ctrl) { + u64 host = x86_spec_ctrl_base; + if (!boot_cpu_has(X86_FEATURE_IBRS)) return; - if (x86_spec_ctrl_base != guest_spec_ctrl) + + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + host |= rds_tif_to_spec_ctrl(current_thread_info()->flags); + + if (host != guest_spec_ctrl) wrmsrl(MSR_IA32_SPEC_CTRL, guest_spec_ctrl); } EXPORT_SYMBOL_GPL(x86_spec_ctrl_set_guest);
void x86_spec_ctrl_restore_host(u64 guest_spec_ctrl) { + u64 host = x86_spec_ctrl_base; + if (!boot_cpu_has(X86_FEATURE_IBRS)) return; - if (x86_spec_ctrl_base != guest_spec_ctrl) - wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base); + + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + host |= rds_tif_to_spec_ctrl(current_thread_info()->flags); + + if (host != guest_spec_ctrl) + wrmsrl(MSR_IA32_SPEC_CTRL, host); } EXPORT_SYMBOL_GPL(x86_spec_ctrl_restore_host);
--- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -30,6 +30,7 @@ #include <asm/debugreg.h> #include <asm/nmi.h> #include <asm/tlbflush.h> +#include <asm/spec-ctrl.h>
/* * per-CPU TSS segments. Threads are completely 'soft' on Linux, @@ -216,6 +217,24 @@ static inline void switch_to_bitmap(stru } }
+static __always_inline void __speculative_store_bypass_update(unsigned long tifn) +{ + u64 msr; + + if (static_cpu_has(X86_FEATURE_AMD_RDS)) { + msr = x86_amd_ls_cfg_base | rds_tif_to_amd_ls_cfg(tifn); + wrmsrl(MSR_AMD64_LS_CFG, msr); + } else { + msr = x86_spec_ctrl_base | rds_tif_to_spec_ctrl(tifn); + wrmsrl(MSR_IA32_SPEC_CTRL, msr); + } +} + +void speculative_store_bypass_update(void) +{ + __speculative_store_bypass_update(current_thread_info()->flags); +} + void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p, struct tss_struct *tss) { @@ -248,6 +267,9 @@ void __switch_to_xtra(struct task_struct else hard_enable_TSC(); } + + if ((tifp ^ tifn) & _TIF_RDS) + __speculative_store_bypass_update(tifn); }
/*
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit 5cf687548705412da47c9cec342fd952d71ed3d5 upstream.
A guest may modify the SPEC_CTRL MSR from the value used by the kernel. Since the kernel doesn't use IBRS, this means a value of zero is what is needed in the host.
But the 336996-Speculative-Execution-Side-Channel-Mitigations.pdf refers to the other bits as reserved so the kernel should respect the boot time SPEC_CTRL value and use that.
This allows dealing with future extensions to the SPEC_CTRL interface, if there are any at all.
Note: This uses wrmsrl() instead of native_wrmsrl(). It does not make any difference as paravirt will overwrite the callq *0xfff.. with the wrmsrl assembler code.
Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Ingo Molnar mingo@kernel.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/nospec-branch.h | 10 ++++++++++ arch/x86/kernel/cpu/bugs.c | 18 ++++++++++++++++++ arch/x86/kvm/svm.c | 6 ++---- arch/x86/kvm/vmx.c | 6 ++---- 4 files changed, 32 insertions(+), 8 deletions(-)
--- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -183,6 +183,16 @@ enum spectre_v2_mitigation { extern void x86_spec_ctrl_set(u64); extern u64 x86_spec_ctrl_get_default(void);
+/* + * On VMENTER we must preserve whatever view of the SPEC_CTRL MSR + * the guest has, while on VMEXIT we restore the host view. This + * would be easier if SPEC_CTRL were architecturally maskable or + * shadowable for guests but this is not (currently) the case. + * Takes the guest view of SPEC_CTRL MSR as a parameter. + */ +extern void x86_spec_ctrl_set_guest(u64); +extern void x86_spec_ctrl_restore_host(u64); + extern char __indirect_thunk_start[]; extern char __indirect_thunk_end[];
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -185,6 +185,24 @@ u64 x86_spec_ctrl_get_default(void) } EXPORT_SYMBOL_GPL(x86_spec_ctrl_get_default);
+void x86_spec_ctrl_set_guest(u64 guest_spec_ctrl) +{ + if (!boot_cpu_has(X86_FEATURE_IBRS)) + return; + if (x86_spec_ctrl_base != guest_spec_ctrl) + wrmsrl(MSR_IA32_SPEC_CTRL, guest_spec_ctrl); +} +EXPORT_SYMBOL_GPL(x86_spec_ctrl_set_guest); + +void x86_spec_ctrl_restore_host(u64 guest_spec_ctrl) +{ + if (!boot_cpu_has(X86_FEATURE_IBRS)) + return; + if (x86_spec_ctrl_base != guest_spec_ctrl) + wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base); +} +EXPORT_SYMBOL_GPL(x86_spec_ctrl_restore_host); + #ifdef RETPOLINE static bool spectre_v2_bad_module;
--- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3967,8 +3967,7 @@ static void svm_vcpu_run(struct kvm_vcpu * is no need to worry about the conditional branch over the wrmsr * being speculatively taken. */ - if (svm->spec_ctrl) - native_wrmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl); + x86_spec_ctrl_set_guest(svm->spec_ctrl);
asm volatile ( "push %%" _ASM_BP "; \n\t" @@ -4080,8 +4079,7 @@ static void svm_vcpu_run(struct kvm_vcpu if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))) svm->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
- if (svm->spec_ctrl) - native_wrmsrl(MSR_IA32_SPEC_CTRL, 0); + x86_spec_ctrl_restore_host(svm->spec_ctrl);
/* Eliminate branch target predictions from guest mode */ vmexit_fill_RSB(); --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7540,8 +7540,7 @@ static void __noclone vmx_vcpu_run(struc * is no need to worry about the conditional branch over the wrmsr * being speculatively taken. */ - if (vmx->spec_ctrl) - native_wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl); + x86_spec_ctrl_set_guest(vmx->spec_ctrl);
vmx->__launched = vmx->loaded_vmcs->launched; asm( @@ -7674,8 +7673,7 @@ static void __noclone vmx_vcpu_run(struc if (unlikely(!msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL))) vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
- if (vmx->spec_ctrl) - native_wrmsrl(MSR_IA32_SPEC_CTRL, 0); + x86_spec_ctrl_restore_host(vmx->spec_ctrl);
/* Eliminate branch target predictions from guest mode */ vmexit_fill_RSB();
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit 9f65fb29374ee37856dbad847b4e121aab72b510 upstream.
Intel collateral will reference the SSB mitigation bit in IA32_SPEC_CTRL[2] as SSBD (Speculative Store Bypass Disable).
Hence changing it.
It is unclear yet what the MSR_IA32_ARCH_CAPABILITIES (0x10a) Bit(4) name is going to be. Following the rename it would be SSBD_NO but that rolls out to Speculative Store Bypass Disable No.
Also fixed the missing space in X86_FEATURE_AMD_SSBD.
[ tglx: Fixup x86_amd_rds_enable() and rds_tif_to_amd_ls_cfg() as well ]
Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de [bwh: Backported to 3.16: - Update guest_cpuid_has_spec_ctrl() rather than vmx_{get,set}_msr() - Update _TIF_WORK_MASK and _TIF_ALLWORK_MASK - Adjust filenames, context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/cpufeature.h | 4 +-- arch/x86/include/asm/spec-ctrl.h | 12 ++++----- arch/x86/include/asm/thread_info.h | 6 ++--- arch/x86/include/uapi/asm/msr-index.h | 10 ++++---- arch/x86/kernel/cpu/amd.c | 14 +++++------ arch/x86/kernel/cpu/bugs.c | 36 +++++++++++++-------------- arch/x86/kernel/cpu/common.c | 2 +- arch/x86/kernel/cpu/intel.c | 2 +- arch/x86/kernel/process.c | 8 +++--- arch/x86/kvm/cpuid.c | 2 +- arch/x86/kvm/cpuid.h | 2 +- arch/x86/kvm/vmx.c | 2 +- 12 files changed, 50 insertions(+), 50 deletions(-)
--- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -192,7 +192,7 @@ #define X86_FEATURE_USE_IBPB (7*32+12) /* "" Indirect Branch Prediction Barrier enabled */ #define X86_FEATURE_USE_IBRS_FW (7*32+13) /* "" Use IBRS during runtime firmware calls */ #define X86_FEATURE_SPEC_STORE_BYPASS_DISABLE (7*32+14) /* "" Disable Speculative Store Bypass. */ -#define X86_FEATURE_AMD_RDS (7*32+15) /* "" AMD RDS implementation */ +#define X86_FEATURE_AMD_SSBD (7*32+15) /* "" AMD SSBD implementation */
#define X86_FEATURE_RETPOLINE (7*32+29) /* "" Generic Retpoline mitigation for Spectre variant 2 */ #define X86_FEATURE_RETPOLINE_AMD (7*32+30) /* "" AMD Retpoline mitigation for Spectre variant 2 */ @@ -243,7 +243,7 @@ #define X86_FEATURE_SPEC_CTRL (10*32+26) /* "" Speculation Control (IBRS + IBPB) */ #define X86_FEATURE_INTEL_STIBP (10*32+27) /* "" Single Thread Indirect Branch Predictors */ #define X86_FEATURE_ARCH_CAPABILITIES (10*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */ -#define X86_FEATURE_RDS (10*32+31) /* Reduced Data Speculation */ +#define X86_FEATURE_SSBD (10*32+31) /* Speculative Store Bypass Disable */
/* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 11 */ #define X86_FEATURE_IBPB (11*32+12) /* Indirect Branch Prediction Barrier */ --- a/arch/x86/include/asm/spec-ctrl.h +++ b/arch/x86/include/asm/spec-ctrl.h @@ -17,20 +17,20 @@ extern void x86_spec_ctrl_restore_host(u
/* AMD specific Speculative Store Bypass MSR data */ extern u64 x86_amd_ls_cfg_base; -extern u64 x86_amd_ls_cfg_rds_mask; +extern u64 x86_amd_ls_cfg_ssbd_mask;
/* The Intel SPEC CTRL MSR base value cache */ extern u64 x86_spec_ctrl_base;
-static inline u64 rds_tif_to_spec_ctrl(u64 tifn) +static inline u64 ssbd_tif_to_spec_ctrl(u64 tifn) { - BUILD_BUG_ON(TIF_RDS < SPEC_CTRL_RDS_SHIFT); - return (tifn & _TIF_RDS) >> (TIF_RDS - SPEC_CTRL_RDS_SHIFT); + BUILD_BUG_ON(TIF_SSBD < SPEC_CTRL_SSBD_SHIFT); + return (tifn & _TIF_SSBD) >> (TIF_SSBD - SPEC_CTRL_SSBD_SHIFT); }
-static inline u64 rds_tif_to_amd_ls_cfg(u64 tifn) +static inline u64 ssbd_tif_to_amd_ls_cfg(u64 tifn) { - return (tifn & _TIF_RDS) ? x86_amd_ls_cfg_rds_mask : 0ULL; + return (tifn & _TIF_SSBD) ? x86_amd_ls_cfg_ssbd_mask : 0ULL; }
extern void speculative_store_bypass_update(void); --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -72,7 +72,7 @@ struct thread_info { #define TIF_SIGPENDING 2 /* signal pending */ #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ #define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ -#define TIF_RDS 5 /* Reduced data speculation */ +#define TIF_SSBD 5 /* Reduced data speculation */ #define TIF_SYSCALL_EMU 6 /* syscall emulation active */ #define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */ #define TIF_SECCOMP 8 /* secure computing */ @@ -98,7 +98,7 @@ struct thread_info { #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) -#define _TIF_RDS (1 << TIF_RDS) +#define _TIF_SSBD (1 << TIF_SSBD) #define _TIF_SYSCALL_EMU (1 << TIF_SYSCALL_EMU) #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) #define _TIF_SECCOMP (1 << TIF_SECCOMP) @@ -133,11 +133,11 @@ struct thread_info { #define _TIF_WORK_MASK \ (0x0000FFFF & \ ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT| \ - _TIF_SINGLESTEP|_TIF_RDS|_TIF_SECCOMP|_TIF_SYSCALL_EMU)) + _TIF_SINGLESTEP|_TIF_SSBD|_TIF_SECCOMP|_TIF_SYSCALL_EMU))
/* work to do on any return to user space */ #define _TIF_ALLWORK_MASK \ - ((0x0000FFFF & ~(_TIF_RDS | _TIF_SECCOMP)) | \ + ((0x0000FFFF & ~(_TIF_SSBD | _TIF_SECCOMP)) | \ _TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ)
/* Only used for 64 bit */ @@ -147,7 +147,7 @@ struct thread_info {
/* flags to check in __switch_to() */ #define _TIF_WORK_CTXSW \ - (_TIF_IO_BITMAP|_TIF_NOTSC|_TIF_BLOCKSTEP|_TIF_RDS) + (_TIF_IO_BITMAP|_TIF_NOTSC|_TIF_BLOCKSTEP|_TIF_SSBD)
#define _TIF_WORK_CTXSW_PREV (_TIF_WORK_CTXSW|_TIF_USER_RETURN_NOTIFY) #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW) --- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -35,8 +35,8 @@ #define MSR_IA32_SPEC_CTRL 0x00000048 /* Speculation Control */ #define SPEC_CTRL_IBRS (1 << 0) /* Indirect Branch Restricted Speculation */ #define SPEC_CTRL_STIBP (1 << 1) /* Single Thread Indirect Branch Predictors */ -#define SPEC_CTRL_RDS_SHIFT 2 /* Reduced Data Speculation bit */ -#define SPEC_CTRL_RDS (1 << SPEC_CTRL_RDS_SHIFT) /* Reduced Data Speculation */ +#define SPEC_CTRL_SSBD_SHIFT 2 /* Speculative Store Bypass Disable bit */ +#define SPEC_CTRL_SSBD (1 << SPEC_CTRL_SSBD_SHIFT) /* Speculative Store Bypass Disable */
#define MSR_IA32_PRED_CMD 0x00000049 /* Prediction Command */ #define PRED_CMD_IBPB (1 << 0) /* Indirect Branch Prediction Barrier */ @@ -59,10 +59,10 @@ #define MSR_IA32_ARCH_CAPABILITIES 0x0000010a #define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */ #define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */ -#define ARCH_CAP_RDS_NO (1 << 4) /* +#define ARCH_CAP_SSBD_NO (1 << 4) /* * Not susceptible to Speculative Store Bypass - * attack, so no Reduced Data Speculation control - * required. + * attack, so no Speculative Store Bypass + * control required. */
#define MSR_IA32_BBL_CR_CTL 0x00000119 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -483,12 +483,12 @@ static void bsp_init_amd(struct cpuinfo_ } /* * Try to cache the base value so further operations can - * avoid RMW. If that faults, do not enable RDS. + * avoid RMW. If that faults, do not enable SSBD. */ if (!rdmsrl_safe(MSR_AMD64_LS_CFG, &x86_amd_ls_cfg_base)) { - setup_force_cpu_cap(X86_FEATURE_RDS); - setup_force_cpu_cap(X86_FEATURE_AMD_RDS); - x86_amd_ls_cfg_rds_mask = 1ULL << bit; + setup_force_cpu_cap(X86_FEATURE_SSBD); + setup_force_cpu_cap(X86_FEATURE_AMD_SSBD); + x86_amd_ls_cfg_ssbd_mask = 1ULL << bit; } } } @@ -802,9 +802,9 @@ static void init_amd(struct cpuinfo_x86
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
- if (boot_cpu_has(X86_FEATURE_AMD_RDS)) { - set_cpu_cap(c, X86_FEATURE_RDS); - set_cpu_cap(c, X86_FEATURE_AMD_RDS); + if (boot_cpu_has(X86_FEATURE_AMD_SSBD)) { + set_cpu_cap(c, X86_FEATURE_SSBD); + set_cpu_cap(c, X86_FEATURE_AMD_SSBD); } }
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -44,10 +44,10 @@ static u64 x86_spec_ctrl_mask = ~SPEC_CT
/* * AMD specific MSR info for Speculative Store Bypass control. - * x86_amd_ls_cfg_rds_mask is initialized in identify_boot_cpu(). + * x86_amd_ls_cfg_ssbd_mask is initialized in identify_boot_cpu(). */ u64 x86_amd_ls_cfg_base; -u64 x86_amd_ls_cfg_rds_mask; +u64 x86_amd_ls_cfg_ssbd_mask;
#ifdef CONFIG_X86_32
@@ -207,7 +207,7 @@ u64 x86_spec_ctrl_get_default(void) u64 msrval = x86_spec_ctrl_base;
if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) - msrval |= rds_tif_to_spec_ctrl(current_thread_info()->flags); + msrval |= ssbd_tif_to_spec_ctrl(current_thread_info()->flags); return msrval; } EXPORT_SYMBOL_GPL(x86_spec_ctrl_get_default); @@ -220,7 +220,7 @@ void x86_spec_ctrl_set_guest(u64 guest_s return;
if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) - host |= rds_tif_to_spec_ctrl(current_thread_info()->flags); + host |= ssbd_tif_to_spec_ctrl(current_thread_info()->flags);
if (host != guest_spec_ctrl) wrmsrl(MSR_IA32_SPEC_CTRL, guest_spec_ctrl); @@ -235,18 +235,18 @@ void x86_spec_ctrl_restore_host(u64 gues return;
if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) - host |= rds_tif_to_spec_ctrl(current_thread_info()->flags); + host |= ssbd_tif_to_spec_ctrl(current_thread_info()->flags);
if (host != guest_spec_ctrl) wrmsrl(MSR_IA32_SPEC_CTRL, host); } EXPORT_SYMBOL_GPL(x86_spec_ctrl_restore_host);
-static void x86_amd_rds_enable(void) +static void x86_amd_ssb_disable(void) { - u64 msrval = x86_amd_ls_cfg_base | x86_amd_ls_cfg_rds_mask; + u64 msrval = x86_amd_ls_cfg_base | x86_amd_ls_cfg_ssbd_mask;
- if (boot_cpu_has(X86_FEATURE_AMD_RDS)) + if (boot_cpu_has(X86_FEATURE_AMD_SSBD)) wrmsrl(MSR_AMD64_LS_CFG, msrval); }
@@ -510,7 +510,7 @@ static enum ssb_mitigation_cmd __init __ enum ssb_mitigation mode = SPEC_STORE_BYPASS_NONE; enum ssb_mitigation_cmd cmd;
- if (!boot_cpu_has(X86_FEATURE_RDS)) + if (!boot_cpu_has(X86_FEATURE_SSBD)) return mode;
cmd = ssb_parse_cmdline(); @@ -544,7 +544,7 @@ static enum ssb_mitigation_cmd __init __ /* * We have three CPU feature flags that are in play here: * - X86_BUG_SPEC_STORE_BYPASS - CPU is susceptible. - * - X86_FEATURE_RDS - CPU is able to turn off speculative store bypass + * - X86_FEATURE_SSBD - CPU is able to turn off speculative store bypass * - X86_FEATURE_SPEC_STORE_BYPASS_DISABLE - engage the mitigation */ if (mode == SPEC_STORE_BYPASS_DISABLE) { @@ -555,12 +555,12 @@ static enum ssb_mitigation_cmd __init __ */ switch (boot_cpu_data.x86_vendor) { case X86_VENDOR_INTEL: - x86_spec_ctrl_base |= SPEC_CTRL_RDS; - x86_spec_ctrl_mask &= ~SPEC_CTRL_RDS; - x86_spec_ctrl_set(SPEC_CTRL_RDS); + x86_spec_ctrl_base |= SPEC_CTRL_SSBD; + x86_spec_ctrl_mask &= ~SPEC_CTRL_SSBD; + x86_spec_ctrl_set(SPEC_CTRL_SSBD); break; case X86_VENDOR_AMD: - x86_amd_rds_enable(); + x86_amd_ssb_disable(); break; } } @@ -593,16 +593,16 @@ static int ssb_prctl_set(struct task_str if (task_spec_ssb_force_disable(task)) return -EPERM; task_clear_spec_ssb_disable(task); - update = test_and_clear_tsk_thread_flag(task, TIF_RDS); + update = test_and_clear_tsk_thread_flag(task, TIF_SSBD); break; case PR_SPEC_DISABLE: task_set_spec_ssb_disable(task); - update = !test_and_set_tsk_thread_flag(task, TIF_RDS); + update = !test_and_set_tsk_thread_flag(task, TIF_SSBD); break; case PR_SPEC_FORCE_DISABLE: task_set_spec_ssb_disable(task); task_set_spec_ssb_force_disable(task); - update = !test_and_set_tsk_thread_flag(task, TIF_RDS); + update = !test_and_set_tsk_thread_flag(task, TIF_SSBD); break; default: return -ERANGE; @@ -672,7 +672,7 @@ void x86_spec_ctrl_setup_ap(void) x86_spec_ctrl_set(x86_spec_ctrl_base & ~x86_spec_ctrl_mask);
if (ssb_mode == SPEC_STORE_BYPASS_DISABLE) - x86_amd_rds_enable(); + x86_amd_ssb_disable(); }
#ifdef CONFIG_SYSFS --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -835,7 +835,7 @@ static void __init cpu_set_bug_bits(stru rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap);
if (!x86_match_cpu(cpu_no_spec_store_bypass) && - !(ia32_cap & ARCH_CAP_RDS_NO)) + !(ia32_cap & ARCH_CAP_SSBD_NO)) setup_force_cpu_bug(X86_BUG_SPEC_STORE_BYPASS);
if (x86_match_cpu(cpu_no_speculation)) --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -120,7 +120,7 @@ static void early_init_intel(struct cpui setup_clear_cpu_cap(X86_FEATURE_STIBP); setup_clear_cpu_cap(X86_FEATURE_SPEC_CTRL); setup_clear_cpu_cap(X86_FEATURE_INTEL_STIBP); - setup_clear_cpu_cap(X86_FEATURE_RDS); + setup_clear_cpu_cap(X86_FEATURE_SSBD); }
/* --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -221,11 +221,11 @@ static __always_inline void __speculativ { u64 msr;
- if (static_cpu_has(X86_FEATURE_AMD_RDS)) { - msr = x86_amd_ls_cfg_base | rds_tif_to_amd_ls_cfg(tifn); + if (static_cpu_has(X86_FEATURE_AMD_SSBD)) { + msr = x86_amd_ls_cfg_base | ssbd_tif_to_amd_ls_cfg(tifn); wrmsrl(MSR_AMD64_LS_CFG, msr); } else { - msr = x86_spec_ctrl_base | rds_tif_to_spec_ctrl(tifn); + msr = x86_spec_ctrl_base | ssbd_tif_to_spec_ctrl(tifn); wrmsrl(MSR_IA32_SPEC_CTRL, msr); } } @@ -268,7 +268,7 @@ void __switch_to_xtra(struct task_struct hard_enable_TSC(); }
- if ((tifp ^ tifn) & _TIF_RDS) + if ((tifp ^ tifn) & _TIF_SSBD) __speculative_store_bypass_update(tifn); }
--- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -318,7 +318,7 @@ static inline int __do_cpuid_ent(struct
/* cpuid 7.0.edx*/ const u32 kvm_cpuid_7_0_edx_x86_features = - F(SPEC_CTRL) | F(RDS) | F(ARCH_CAPABILITIES); + F(SPEC_CTRL) | F(SSBD) | F(ARCH_CAPABILITIES);
/* all calls to cpuid_count() should be made on the same cpu */ get_cpu(); --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -123,7 +123,7 @@ static inline bool guest_cpuid_has_spec_ if (best && (best->ebx & bit(X86_FEATURE_IBRS))) return true; best = kvm_find_cpuid_entry(vcpu, 7, 0); - return best && (best->edx & (bit(X86_FEATURE_SPEC_CTRL) | bit(X86_FEATURE_RDS))); + return best && (best->edx & (bit(X86_FEATURE_SPEC_CTRL) | bit(X86_FEATURE_SSBD))); }
static inline bool guest_cpuid_has_arch_capabilities(struct kvm_vcpu *vcpu) --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2637,7 +2637,7 @@ static int vmx_set_msr(struct kvm_vcpu * return 1;
/* The STIBP bit doesn't fault even if it's not advertised */ - if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP | SPEC_CTRL_RDS)) + if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP | SPEC_CTRL_SSBD)) return 1;
vmx->spec_ctrl = data;
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 0661a33611fca12570cba48d9344ce68834ee86c upstream.
One bit in ->vm_flags is unused now!
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Dan Carpenter dan.carpenter@oracle.com Cc: Michal Hocko mhocko@suse.cz Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [bwh: Backported to 3.16: Drop changes in mm/debug.c] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- --- a/drivers/gpu/drm/drm_vma_manager.c +++ b/drivers/gpu/drm/drm_vma_manager.c @@ -50,8 +50,7 @@ * * You must not use multiple offset managers on a single address_space. * Otherwise, mm-core will be unable to tear down memory mappings as the VM will - * no longer be linear. Please use VM_NONLINEAR in that case and implement your - * own offset managers. + * no longer be linear. * * This offset manager works on page-based addresses. That is, every argument * and return code (with the exception of drm_vma_node_offset_addr()) is given --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -125,7 +125,6 @@ extern unsigned int kobjsize(const void #define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */ #define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */ #define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */ -#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */ #define VM_ARCH_1 0x01000000 /* Architecture-specific flag */ #define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */
--- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -54,7 +54,7 @@ static inline pgoff_t swp_offset(swp_ent /* check whether a pte points to a swap entry */ static inline int is_swap_pte(pte_t pte) { - return !pte_none(pte) && !pte_present_nonuma(pte) && !pte_file(pte); + return !pte_none(pte) && !pte_present_nonuma(pte); } #endif
@@ -66,7 +66,6 @@ static inline swp_entry_t pte_to_swp_ent { swp_entry_t arch_entry;
- BUG_ON(pte_file(pte)); if (pte_swp_soft_dirty(pte)) pte = pte_swp_clear_soft_dirty(pte); arch_entry = __pte_to_swp_entry(pte); @@ -82,7 +81,6 @@ static inline pte_t swp_entry_to_pte(swp swp_entry_t arch_entry;
arch_entry = __swp_entry(swp_type(entry), swp_offset(entry)); - BUG_ON(pte_file(__swp_entry_to_pte(arch_entry))); return __swp_entry_to_pte(arch_entry); }
--- a/mm/gup.c +++ b/mm/gup.c @@ -61,7 +61,7 @@ retry: */ if (likely(!(flags & FOLL_MIGRATION))) goto no_page; - if (pte_none(pte) || pte_file(pte)) + if (pte_none(pte)) goto no_page; entry = pte_to_swp_entry(pte); if (!is_migration_entry(entry)) --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1749,7 +1749,7 @@ int ksm_madvise(struct vm_area_struct *v */ if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE | VM_PFNMAP | VM_IO | VM_DONTEXPAND | - VM_HUGETLB | VM_NONLINEAR | VM_MIXEDMAP)) + VM_HUGETLB | VM_MIXEDMAP)) return 0; /* just ignore the advice */
#ifdef VM_SAO --- a/mm/madvise.c +++ b/mm/madvise.c @@ -155,7 +155,7 @@ static int swapin_walk_pmd_entry(pmd_t * pte = *(orig_pte + ((index - start) / PAGE_SIZE)); pte_unmap_unlock(orig_pte, ptl);
- if (pte_present(pte) || pte_none(pte) || pte_file(pte)) + if (pte_present(pte) || pte_none(pte)) continue; entry = pte_to_swp_entry(pte); if (unlikely(non_swap_entry(entry))) @@ -298,7 +298,7 @@ static long madvise_remove(struct vm_are
*prev = NULL; /* tell sys_madvise we drop mmap_sem */
- if (vma->vm_flags & (VM_LOCKED|VM_NONLINEAR|VM_HUGETLB)) + if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB)) return -EINVAL;
f = vma->vm_file; --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6580,10 +6580,7 @@ static struct page *mc_handle_file_pte(s return NULL;
mapping = vma->vm_file->f_mapping; - if (pte_none(ptent)) - pgoff = linear_page_index(vma, addr); - else /* pte_file(ptent) is true */ - pgoff = pte_to_pgoff(ptent); + pgoff = linear_page_index(vma, addr);
/* page is moved even if it's not RSS of this task(page-faulted). */ #ifdef CONFIG_SWAP @@ -6616,7 +6613,7 @@ static enum mc_target_type get_mctgt_typ page = mc_handle_present_pte(vma, addr, ptent); else if (is_swap_pte(ptent)) page = mc_handle_swap_pte(vma, addr, ptent, &ent); - else if (pte_none(ptent) || pte_file(ptent)) + else if (pte_none(ptent)) page = mc_handle_file_pte(vma, addr, ptent, &ent);
if (!page && !ent.val) --- a/mm/memory.c +++ b/mm/memory.c @@ -810,42 +810,40 @@ copy_one_pte(struct mm_struct *dst_mm, s
/* pte contains position in swap or file, so copy. */ if (unlikely(!pte_present(pte))) { - if (!pte_file(pte)) { - swp_entry_t entry = pte_to_swp_entry(pte); + swp_entry_t entry = pte_to_swp_entry(pte);
- if (likely(!non_swap_entry(entry))) { - if (swap_duplicate(entry) < 0) - return entry.val; - - /* make sure dst_mm is on swapoff's mmlist. */ - if (unlikely(list_empty(&dst_mm->mmlist))) { - spin_lock(&mmlist_lock); - if (list_empty(&dst_mm->mmlist)) - list_add(&dst_mm->mmlist, - &src_mm->mmlist); - spin_unlock(&mmlist_lock); - } - rss[MM_SWAPENTS]++; - } else if (is_migration_entry(entry)) { - page = migration_entry_to_page(entry); - - if (PageAnon(page)) - rss[MM_ANONPAGES]++; - else - rss[MM_FILEPAGES]++; - - if (is_write_migration_entry(entry) && - is_cow_mapping(vm_flags)) { - /* - * COW mappings require pages in both - * parent and child to be set to read. - */ - make_migration_entry_read(&entry); - pte = swp_entry_to_pte(entry); - if (pte_swp_soft_dirty(*src_pte)) - pte = pte_swp_mksoft_dirty(pte); - set_pte_at(src_mm, addr, src_pte, pte); - } + if (likely(!non_swap_entry(entry))) { + if (swap_duplicate(entry) < 0) + return entry.val; + + /* make sure dst_mm is on swapoff's mmlist. */ + if (unlikely(list_empty(&dst_mm->mmlist))) { + spin_lock(&mmlist_lock); + if (list_empty(&dst_mm->mmlist)) + list_add(&dst_mm->mmlist, + &src_mm->mmlist); + spin_unlock(&mmlist_lock); + } + rss[MM_SWAPENTS]++; + } else if (is_migration_entry(entry)) { + page = migration_entry_to_page(entry); + + if (PageAnon(page)) + rss[MM_ANONPAGES]++; + else + rss[MM_FILEPAGES]++; + + if (is_write_migration_entry(entry) && + is_cow_mapping(vm_flags)) { + /* + * COW mappings require pages in both + * parent and child to be set to read. + */ + make_migration_entry_read(&entry); + pte = swp_entry_to_pte(entry); + if (pte_swp_soft_dirty(*src_pte)) + pte = pte_swp_mksoft_dirty(pte); + set_pte_at(src_mm, addr, src_pte, pte); } } goto out_set_pte; @@ -1019,11 +1017,9 @@ int copy_page_range(struct mm_struct *ds * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. */ - if (!(vma->vm_flags & (VM_HUGETLB | VM_NONLINEAR | - VM_PFNMAP | VM_MIXEDMAP))) { - if (!vma->anon_vma) - return 0; - } + if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) && + !vma->anon_vma) + return 0;
if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); --- a/mm/mincore.c +++ b/mm/mincore.c @@ -124,17 +124,13 @@ static void mincore_pte_range(struct vm_ ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); do { pte_t pte = *ptep; - pgoff_t pgoff;
next = addr + PAGE_SIZE; if (pte_none(pte)) mincore_unmapped_range(vma, addr, next, vec); else if (pte_present(pte)) *vec = 1; - else if (pte_file(pte)) { - pgoff = pte_to_pgoff(pte); - *vec = mincore_page(vma->vm_file->f_mapping, pgoff); - } else { /* pte is a swap entry */ + else { /* pte is a swap entry */ swp_entry_t entry = pte_to_swp_entry(pte);
if (is_migration_entry(entry)) { @@ -142,9 +138,8 @@ static void mincore_pte_range(struct vm_ *vec = 1; } else { #ifdef CONFIG_SWAP - pgoff = entry.val; *vec = mincore_page(swap_address_space(entry), - pgoff); + entry.val); #else WARN_ON(1); *vec = 1; --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -110,7 +110,7 @@ static unsigned long change_pte_range(st } if (updated) pages++; - } else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) { + } else if (IS_ENABLED(CONFIG_MIGRATION)) { swp_entry_t entry = pte_to_swp_entry(oldpte);
if (is_write_migration_entry(entry)) { --- a/mm/mremap.c +++ b/mm/mremap.c @@ -81,8 +81,6 @@ static pte_t move_soft_dirty_pte(pte_t p pte = pte_mksoft_dirty(pte); else if (is_swap_pte(pte)) pte = pte_swp_mksoft_dirty(pte); - else if (pte_file(pte)) - pte = pte_file_mksoft_dirty(pte); #endif return pte; } --- a/mm/msync.c +++ b/mm/msync.c @@ -86,10 +86,7 @@ SYSCALL_DEFINE3(msync, unsigned long, st (vma->vm_flags & VM_SHARED)) { get_file(file); up_read(&mm->mmap_sem); - if (vma->vm_flags & VM_NONLINEAR) - error = vfs_fsync(file, 1); - else - error = vfs_fsync_range(file, fstart, fend, 1); + error = vfs_fsync_range(file, fstart, fend, 1); fput(file); if (error || start >= end) goto out;
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit c8d78c1823f46519473949d33f0d1d33fe21ea16 upstream.
remap_file_pages(2) was invented to be able efficiently map parts of huge file into limited 32-bit virtual address space such as in database workloads.
Nonlinear mappings are a pain to support and it seems there are no legitimate use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all this special-cased code.
The patch replaces the syscall with an emulation which creates a new VMA on each remap_file_pages() call, unless it can be merged with an adjacent one. A rough userspace-equivalent sketch of the emulation follows below.
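The sketch below is for illustration only and is not the mm/mmap.c code; the helper name is made up. Conceptually, each nonlinear remap of a shared mapping behaves like a fixed mmap() of the same file at the requested offset, which yields a new linear VMA unless it can be merged with a neighbour.

#include <sys/mman.h>

static void *emulated_remap(void *addr, size_t size, int prot,
			    int fd, long pgoff, long page_size)
{
	/* Replace the pages at addr with a linear mapping of the file
	 * starting at pgoff, keeping the mapping shared. */
	return mmap(addr, size, prot, MAP_SHARED | MAP_FIXED,
		    fd, (off_t)pgoff * page_size);
}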
I didn't find *any* real code that uses remap_file_pages(2) to test the emulation's impact on. I've checked Debian code search and the source of all packages in ALT Linux. No real users: only libc wrappers, mentions in strace, gdb, valgrind and that kind of thing.
There are a few basic tests in LTP for the syscall. They work just fine with the emulation.
To test the performance impact, I've written a small test case which demonstrates pretty much the worst-case scenario: map a 4G shmfs file, write the page's pgoff to the beginning of every page, remap the pages in reverse order, then read every page.
The test creates 1 million VMAs if the emulation is in use, so I had to set vm.max_map_count to 1100000 to avoid -ENOMEM.
Before:   23.3 ( +- 4.31% ) seconds
After:    43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>

#define MB	(1024UL * 1024)
#define SIZE	(4096 * MB)

int main(int argc, char **argv)
{
	unsigned long *p;
	long i, pass;

	for (pass = 0; pass < 10; pass++) {
		p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return -1;
		}

		for (i = 0; i < SIZE / 4096; i++)
			p[i * 4096 / sizeof(*p)] = i;

		for (i = 0; i < SIZE / 4096; i++) {
			if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
					0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
				perror("remap_file_pages");
				return -1;
			}
		}

		for (i = SIZE / 4096 - 1; i >= 0; i--)
			assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

		munmap(p, SIZE);
	}

	return 0;
}
[akpm@linux-foundation.org: fix spello] [sasha.levin@oracle.com: initialize populate before usage] [sasha.levin@oracle.com: grab file ref to prevent race while mmaping] Signed-off-by: "Kirill A. Shutemov" kirill@shutemov.name Cc: Peter Zijlstra peterz@infradead.org Cc: Ingo Molnar mingo@kernel.org Cc: Dave Jones davej@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Armin Rigo arigo@tunes.org Signed-off-by: Sasha Levin sasha.levin@oracle.com Cc: Hugh Dickins hughd@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [bwh: Backported to 3.16: - Deleted code is slightly different - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- Documentation/vm/remap_file_pages.txt | 7 +- include/linux/fs.h | 8 +- mm/Makefile | 2 +- mm/fremap.c | 283 -------------------------- mm/mmap.c | 69 +++++++ mm/nommu.c | 8 - 6 files changed, 79 insertions(+), 298 deletions(-) delete mode 100644 mm/fremap.c
--- a/Documentation/vm/remap_file_pages.txt +++ b/Documentation/vm/remap_file_pages.txt @@ -18,10 +18,9 @@ on 32-bit systems to map files bigger th virtual address space. This use-case is not critical anymore since 64-bit systems are widely available.
-The plan is to deprecate the syscall and replace it with an emulation. -The emulation will create new VMAs instead of nonlinear mappings. It's -going to work slower for rare users of remap_file_pages() but ABI is -preserved. +The syscall is deprecated and replaced it with an emulation now. The +emulation creates new VMAs instead of nonlinear mappings. It's going to +work slower for rare users of remap_file_pages() but ABI is preserved.
One side effect of emulation (apart from performance) is that user can hit vm.max_map_count limit more easily due to additional VMAs. See comment for --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2430,8 +2430,12 @@ extern int sb_min_blocksize(struct super
extern int generic_file_mmap(struct file *, struct vm_area_struct *); extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); -extern int generic_file_remap_pages(struct vm_area_struct *, unsigned long addr, - unsigned long size, pgoff_t pgoff); +static inline int generic_file_remap_pages(struct vm_area_struct *vma, + unsigned long addr, unsigned long size, pgoff_t pgoff) +{ + BUG(); + return 0; +} int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk); extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *); extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *); --- a/mm/Makefile +++ b/mm/Makefile @@ -3,7 +3,7 @@ #
mmu-y := nommu.o -mmu-$(CONFIG_MMU) := fremap.o gup.o highmem.o madvise.o memory.o mincore.o \ +mmu-$(CONFIG_MMU) := gup.o highmem.o madvise.o memory.o mincore.o \ mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \ vmalloc.o pagewalk.o pgtable-generic.o
--- a/mm/fremap.c +++ /dev/null @@ -1,283 +0,0 @@ -/* - * linux/mm/fremap.c - * - * Explicit pagetable population and nonlinear (random) mappings support. - * - * started by Ingo Molnar, Copyright (C) 2002, 2003 - */ -#include <linux/export.h> -#include <linux/backing-dev.h> -#include <linux/mm.h> -#include <linux/swap.h> -#include <linux/file.h> -#include <linux/mman.h> -#include <linux/pagemap.h> -#include <linux/swapops.h> -#include <linux/rmap.h> -#include <linux/syscalls.h> -#include <linux/mmu_notifier.h> - -#include <asm/mmu_context.h> -#include <asm/cacheflush.h> -#include <asm/tlbflush.h> - -#include "internal.h" - -static int mm_counter(struct page *page) -{ - return PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES; -} - -static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, pte_t *ptep) -{ - pte_t pte = *ptep; - struct page *page; - swp_entry_t entry; - - if (pte_present(pte)) { - flush_cache_page(vma, addr, pte_pfn(pte)); - pte = ptep_clear_flush(vma, addr, ptep); - page = vm_normal_page(vma, addr, pte); - if (page) { - if (pte_dirty(pte)) - set_page_dirty(page); - update_hiwater_rss(mm); - dec_mm_counter(mm, mm_counter(page)); - page_remove_rmap(page); - page_cache_release(page); - } - } else { /* zap_pte() is not called when pte_none() */ - if (!pte_file(pte)) { - update_hiwater_rss(mm); - entry = pte_to_swp_entry(pte); - if (non_swap_entry(entry)) { - if (is_migration_entry(entry)) { - page = migration_entry_to_page(entry); - dec_mm_counter(mm, mm_counter(page)); - } - } else { - free_swap_and_cache(entry); - dec_mm_counter(mm, MM_SWAPENTS); - } - } - pte_clear_not_present_full(mm, addr, ptep, 0); - } -} - -/* - * Install a file pte to a given virtual memory address, release any - * previously existing mapping. - */ -static int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, unsigned long pgoff, pgprot_t prot) -{ - int err = -ENOMEM; - pte_t *pte, ptfile; - spinlock_t *ptl; - - pte = get_locked_pte(mm, addr, &ptl); - if (!pte) - goto out; - - ptfile = pgoff_to_pte(pgoff); - - if (!pte_none(*pte)) - zap_pte(mm, vma, addr, pte); - - set_pte_at(mm, addr, pte, pte_file_mksoft_dirty(ptfile)); - /* - * We don't need to run update_mmu_cache() here because the "file pte" - * being installed by install_file_pte() is not a real pte - it's a - * non-present entry (like a swap entry), noting what file offset should - * be mapped there when there's a fault (in a non-linear vma where - * that's not obvious). - */ - pte_unmap_unlock(pte, ptl); - err = 0; -out: - return err; -} - -int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr, - unsigned long size, pgoff_t pgoff) -{ - struct mm_struct *mm = vma->vm_mm; - int err; - - do { - err = install_file_pte(mm, vma, addr, pgoff, vma->vm_page_prot); - if (err) - return err; - - size -= PAGE_SIZE; - addr += PAGE_SIZE; - pgoff++; - } while (size); - - return 0; -} -EXPORT_SYMBOL(generic_file_remap_pages); - -/** - * sys_remap_file_pages - remap arbitrary pages of an existing VM_SHARED vma - * @start: start of the remapped virtual memory range - * @size: size of the remapped virtual memory range - * @prot: new protection bits of the range (see NOTE) - * @pgoff: to-be-mapped page of the backing store file - * @flags: 0 or MAP_NONBLOCKED - the later will cause no IO. - * - * sys_remap_file_pages remaps arbitrary pages of an existing VM_SHARED vma - * (shared backing store file). 
- * - * This syscall works purely via pagetables, so it's the most efficient - * way to map the same (large) file into a given virtual window. Unlike - * mmap()/mremap() it does not create any new vmas. The new mappings are - * also safe across swapout. - * - * NOTE: the @prot parameter right now is ignored (but must be zero), - * and the vma's default protection is used. Arbitrary protections - * might be implemented in the future. - */ -SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, - unsigned long, prot, unsigned long, pgoff, unsigned long, flags) -{ - struct mm_struct *mm = current->mm; - struct address_space *mapping; - struct vm_area_struct *vma; - int err = -EINVAL; - int has_write_lock = 0; - vm_flags_t vm_flags = 0; - - pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. " - "See Documentation/vm/remap_file_pages.txt.\n", - current->comm, current->pid); - - if (prot) - return err; - /* - * Sanitize the syscall parameters: - */ - start = start & PAGE_MASK; - size = size & PAGE_MASK; - - /* Does the address range wrap, or is the span zero-sized? */ - if (start + size <= start) - return err; - - /* Does pgoff wrap? */ - if (pgoff + (size >> PAGE_SHIFT) < pgoff) - return err; - - /* Can we represent this offset inside this architecture's pte's? */ -#if PTE_FILE_MAX_BITS < BITS_PER_LONG - if (pgoff + (size >> PAGE_SHIFT) >= (1UL << PTE_FILE_MAX_BITS)) - return err; -#endif - - /* We need down_write() to change vma->vm_flags. */ - down_read(&mm->mmap_sem); - retry: - vma = find_vma(mm, start); - - /* - * Make sure the vma is shared, that it supports prefaulting, - * and that the remapped range is valid and fully within - * the single existing vma. - */ - if (!vma || !(vma->vm_flags & VM_SHARED)) - goto out; - - if (!vma->vm_ops || !vma->vm_ops->remap_pages) - goto out; - - if (start < vma->vm_start || start + size > vma->vm_end) - goto out; - - /* Must set VM_NONLINEAR before any pages are populated. */ - if (!(vma->vm_flags & VM_NONLINEAR)) { - /* - * vm_private_data is used as a swapout cursor - * in a VM_NONLINEAR vma. - */ - if (vma->vm_private_data) - goto out; - - /* Don't need a nonlinear mapping, exit success */ - if (pgoff == linear_page_index(vma, start)) { - err = 0; - goto out; - } - - if (!has_write_lock) { -get_write_lock: - up_read(&mm->mmap_sem); - down_write(&mm->mmap_sem); - has_write_lock = 1; - goto retry; - } - mapping = vma->vm_file->f_mapping; - /* - * page_mkclean doesn't work on nonlinear vmas, so if - * dirty pages need to be accounted, emulate with linear - * vmas. 
- */ - if (mapping_cap_account_dirty(mapping)) { - unsigned long addr; - struct file *file = get_file(vma->vm_file); - /* mmap_region may free vma; grab the info now */ - vm_flags = vma->vm_flags; - - addr = mmap_region(file, start, size, vm_flags, pgoff); - fput(file); - if (IS_ERR_VALUE(addr)) { - err = addr; - } else { - BUG_ON(addr != start); - err = 0; - } - goto out_freed; - } - mutex_lock(&mapping->i_mmap_mutex); - flush_dcache_mmap_lock(mapping); - vma->vm_flags |= VM_NONLINEAR; - vma_interval_tree_remove(vma, &mapping->i_mmap); - vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); - flush_dcache_mmap_unlock(mapping); - mutex_unlock(&mapping->i_mmap_mutex); - } - - if (vma->vm_flags & VM_LOCKED) { - /* - * drop PG_Mlocked flag for over-mapped range - */ - if (!has_write_lock) - goto get_write_lock; - vm_flags = vma->vm_flags; - munlock_vma_pages_range(vma, start, start + size); - vma->vm_flags = vm_flags; - } - - mmu_notifier_invalidate_range_start(mm, start, start + size); - err = vma->vm_ops->remap_pages(vma, start, size, pgoff); - mmu_notifier_invalidate_range_end(mm, start, start + size); - - /* - * We can't clear VM_NONLINEAR because we'd have to do - * it after ->populate completes, and that would prevent - * downgrading the lock. (Locks can't be upgraded). - */ - -out: - if (vma) - vm_flags = vma->vm_flags; -out_freed: - if (likely(!has_write_lock)) - up_read(&mm->mmap_sem); - else - up_write(&mm->mmap_sem); - if (!err && ((vm_flags & VM_LOCKED) || !(flags & MAP_NONBLOCK))) - mm_populate(start, size); - - return err; -} --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2620,6 +2620,75 @@ SYSCALL_DEFINE2(munmap, unsigned long, a return vm_munmap(addr, len); }
+ +/* + * Emulation of deprecated remap_file_pages() syscall. + */ +SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, + unsigned long, prot, unsigned long, pgoff, unsigned long, flags) +{ + + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; + unsigned long populate = 0; + unsigned long ret = -EINVAL; + struct file *file; + + pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. " + "See Documentation/vm/remap_file_pages.txt.\n", + current->comm, current->pid); + + if (prot) + return ret; + start = start & PAGE_MASK; + size = size & PAGE_MASK; + + if (start + size <= start) + return ret; + + /* Does pgoff wrap? */ + if (pgoff + (size >> PAGE_SHIFT) < pgoff) + return ret; + + down_write(&mm->mmap_sem); + vma = find_vma(mm, start); + + if (!vma || !(vma->vm_flags & VM_SHARED)) + goto out; + + if (start < vma->vm_start || start + size > vma->vm_end) + goto out; + + if (pgoff == linear_page_index(vma, start)) { + ret = 0; + goto out; + } + + prot |= vma->vm_flags & VM_READ ? PROT_READ : 0; + prot |= vma->vm_flags & VM_WRITE ? PROT_WRITE : 0; + prot |= vma->vm_flags & VM_EXEC ? PROT_EXEC : 0; + + flags &= MAP_NONBLOCK; + flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE; + if (vma->vm_flags & VM_LOCKED) { + flags |= MAP_LOCKED; + /* drop PG_Mlocked flag for over-mapped range */ + munlock_vma_pages_range(vma, start, start + size); + } + + file = get_file(vma->vm_file); + ret = do_mmap_pgoff(vma->vm_file, start, size, + prot, flags, pgoff, &populate); + fput(file); +out: + up_write(&mm->mmap_sem); + if (populate) + mm_populate(ret, populate); + if (!IS_ERR_VALUE(ret)) + ret = 0; + return ret; +} + static inline void verify_mm_writelocked(struct mm_struct *mm) { #ifdef CONFIG_DEBUG_VM --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1999,14 +1999,6 @@ void filemap_map_pages(struct vm_area_st } EXPORT_SYMBOL(filemap_map_pages);
-int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr, - unsigned long size, pgoff_t pgoff) -{ - BUG(); - return 0; -} -EXPORT_SYMBOL(generic_file_remap_pages); - int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, unsigned long addr, void *buf, int len, int write) {
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 8a5f14a23177061ec11daeaa3d09d0765d785c47 upstream.
We have had the remap_file_pages(2) emulation in the -mm tree for a few release cycles and we plan to have it in mainline in v3.20. This patchset removes the rest of the VM_NONLINEAR infrastructure.
Patches 1-8 take care of the generic code. They are pretty straightforward and can be applied without the other patches.
The remaining patches remove pte_file()-related stuff from architecture-specific code. This usually frees up one bit in the non-present pte. I've tried to reuse that bit for the swap offset where I was able to figure out how to do so.
For obvious reasons I cannot test all that arch-specific code and would like to see acks from maintainers.
In total, remap_file_pages(2) required about 1.4K lines of not-so-trivial kernel code. That's too much for functionality nobody uses.
Tested-by: Felipe Balbi balbi@ti.com
This patch (of 38):
We don't create non-linear mappings anymore. Let's drop code which handles them on unmap/zap.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings ben@decadent.org.uk
---
 include/linux/mm.h | 1 -
 mm/madvise.c | 9 +----
 mm/memory.c | 82 ++++++++++++----------------------------------
 3 files changed, 22 insertions(+), 70 deletions(-)
--- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1103,7 +1103,6 @@ extern void user_shm_unlock(size_t, stru * Parameter block passed down to zap_pte_range in exceptional cases. */ struct zap_details { - struct vm_area_struct *nonlinear_vma; /* Check page->index if set */ struct address_space *check_mapping; /* Check page->mapping if set */ pgoff_t first_index; /* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ --- a/mm/madvise.c +++ b/mm/madvise.c @@ -277,14 +277,7 @@ static long madvise_dontneed(struct vm_a if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)) return -EINVAL;
- if (unlikely(vma->vm_flags & VM_NONLINEAR)) { - struct zap_details details = { - .nonlinear_vma = vma, - .last_index = ULONG_MAX, - }; - zap_page_range(vma, start, end - start, &details); - } else - zap_page_range(vma, start, end - start, NULL); + zap_page_range(vma, start, end - start, NULL); return 0; }
--- a/mm/memory.c +++ b/mm/memory.c @@ -1081,6 +1081,7 @@ static unsigned long zap_pte_range(struc spinlock_t *ptl; pte_t *start_pte; pte_t *pte; + swp_entry_t entry;
again: init_rss_vec(rss); @@ -1106,28 +1107,12 @@ again: if (details->check_mapping && details->check_mapping != page->mapping) continue; - /* - * Each page->index must be checked when - * invalidating or truncating nonlinear. - */ - if (details->nonlinear_vma && - (page->index < details->first_index || - page->index > details->last_index)) - continue; } ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); tlb_remove_tlb_entry(tlb, pte, addr); if (unlikely(!page)) continue; - if (unlikely(details) && details->nonlinear_vma - && linear_page_index(details->nonlinear_vma, - addr) != page->index) { - pte_t ptfile = pgoff_to_pte(page->index); - if (pte_soft_dirty(ptent)) - ptfile = pte_file_mksoft_dirty(ptfile); - set_pte_at(mm, addr, pte, ptfile); - } if (PageAnon(page)) rss[MM_ANONPAGES]--; else { @@ -1150,33 +1135,25 @@ again: } continue; } - /* - * If details->check_mapping, we leave swap entries; - * if details->nonlinear_vma, we leave file entries. - */ + /* If details->check_mapping, we leave swap entries. */ if (unlikely(details)) continue; - if (pte_file(ptent)) { - if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) - print_bad_pte(vma, addr, ptent, NULL); - } else { - swp_entry_t entry = pte_to_swp_entry(ptent); - - if (!non_swap_entry(entry)) - rss[MM_SWAPENTS]--; - else if (is_migration_entry(entry)) { - struct page *page; - - page = migration_entry_to_page(entry); - - if (PageAnon(page)) - rss[MM_ANONPAGES]--; - else - rss[MM_FILEPAGES]--; - } - if (unlikely(!free_swap_and_cache(entry))) - print_bad_pte(vma, addr, ptent, NULL); + + entry = pte_to_swp_entry(ptent); + if (!non_swap_entry(entry)) + rss[MM_SWAPENTS]--; + else if (is_migration_entry(entry)) { + struct page *page; + + page = migration_entry_to_page(entry); + + if (PageAnon(page)) + rss[MM_ANONPAGES]--; + else + rss[MM_FILEPAGES]--; } + if (unlikely(!free_swap_and_cache(entry))) + print_bad_pte(vma, addr, ptent, NULL); pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); } while (pte++, addr += PAGE_SIZE, addr != end);
@@ -1288,7 +1265,7 @@ static void unmap_page_range(struct mmu_ pgd_t *pgd; unsigned long next;
- if (details && !details->check_mapping && !details->nonlinear_vma) + if (details && !details->check_mapping) details = NULL;
BUG_ON(addr >= end); @@ -1384,7 +1361,7 @@ void unmap_vmas(struct mmu_gather *tlb, * @vma: vm_area_struct holding the applicable pages * @start: starting address of pages to zap * @size: number of bytes to zap - * @details: details of nonlinear truncation or shared cache invalidation + * @details: details of shared cache invalidation * * Caller must protect the VMA list */ @@ -1410,7 +1387,7 @@ void zap_page_range(struct vm_area_struc * @vma: vm_area_struct holding the applicable pages * @address: starting address of pages to zap * @size: number of bytes to zap - * @details: details of nonlinear truncation or shared cache invalidation + * @details: details of shared cache invalidation * * The range must fit into one VMA. */ @@ -2340,25 +2317,11 @@ static inline void unmap_mapping_range_t } }
-static inline void unmap_mapping_range_list(struct list_head *head, - struct zap_details *details) -{ - struct vm_area_struct *vma; - - /* - * In nonlinear VMAs there is no correspondence between virtual address - * offset and file offset. So we must perform an exhaustive search - * across *all* the pages in each nonlinear VMA, not just the pages - * whose virtual address lies outside the file truncation point. - */ - list_for_each_entry(vma, head, shared.nonlinear) { - details->nonlinear_vma = vma; - unmap_mapping_range_vma(vma, vma->vm_start, vma->vm_end, details); - } -} - /** - * unmap_mapping_range - unmap the portion of all mmaps in the specified address_space corresponding to the specified page range in the underlying file. + * unmap_mapping_range - unmap the portion of all mmaps in the specified + * address_space corresponding to the specified page range in the underlying + * file. + * * @mapping: the address space containing mmaps to be unmapped. * @holebegin: byte in first page to unmap, relative to the start of * the underlying file. This will be rounded down to a PAGE_SIZE @@ -2387,7 +2350,6 @@ void unmap_mapping_range(struct address_ }
details.check_mapping = even_cows? NULL: mapping; - details.nonlinear_vma = NULL; details.first_index = hba; details.last_index = hba + hlen - 1; if (details.last_index < details.first_index) @@ -2397,8 +2359,6 @@ void unmap_mapping_range(struct address_ mutex_lock(&mapping->i_mmap_mutex); if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap))) unmap_mapping_range_tree(&mapping->i_mmap, &details); - if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) - unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details); mutex_unlock(&mapping->i_mmap_mutex); } EXPORT_SYMBOL(unmap_mapping_range);
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 6a8c4820895cf1dd2a128aef67ce079ba6eded80 upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
This patch also increases the number of bits available for the swap offset.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Acked-by: David S. Miller davem@davemloft.net
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Ben Hutchings ben@decadent.org.uk
---
 arch/sparc/include/asm/pgtable_32.h | 24 -----------------
 arch/sparc/include/asm/pgtable_64.h | 40 -----------------------------
 arch/sparc/include/asm/pgtsrmmu.h | 14 ++++------
 3 files changed, 5 insertions(+), 73 deletions(-)
--- a/arch/sparc/include/asm/pgtable_32.h +++ b/arch/sparc/include/asm/pgtable_32.h @@ -221,14 +221,6 @@ static inline int pte_young(pte_t pte) return pte_val(pte) & SRMMU_REF; }
-/* - * The following only work if pte_present() is not true. - */ -static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & SRMMU_FILE; -} - static inline int pte_special(pte_t pte) { return 0; @@ -375,22 +367,6 @@ static inline swp_entry_t __swp_entry(un #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-/* file-offset-in-pte helpers */ -static inline unsigned long pte_to_pgoff(pte_t pte) -{ - return pte_val(pte) >> SRMMU_PTE_FILE_SHIFT; -} - -static inline pte_t pgoff_to_pte(unsigned long pgoff) -{ - return __pte((pgoff << SRMMU_PTE_FILE_SHIFT) | SRMMU_FILE); -} - -/* - * This is made a constant because mm/fremap.c required a constant. - */ -#define PTE_FILE_MAX_BITS 24 - static inline unsigned long __get_phys (unsigned long addr) { --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -137,7 +137,6 @@ bool kern_addr_valid(unsigned long addr) #define _PAGE_SOFT_4U _AC(0x0000000000001F80,UL) /* Software bits: */ #define _PAGE_EXEC_4U _AC(0x0000000000001000,UL) /* Executable SW bit */ #define _PAGE_MODIFIED_4U _AC(0x0000000000000800,UL) /* Modified (dirty) */ -#define _PAGE_FILE_4U _AC(0x0000000000000800,UL) /* Pagecache page */ #define _PAGE_ACCESSED_4U _AC(0x0000000000000400,UL) /* Accessed (ref'd) */ #define _PAGE_READ_4U _AC(0x0000000000000200,UL) /* Readable SW Bit */ #define _PAGE_WRITE_4U _AC(0x0000000000000100,UL) /* Writable SW Bit */ @@ -167,7 +166,6 @@ bool kern_addr_valid(unsigned long addr) #define _PAGE_EXEC_4V _AC(0x0000000000000080,UL) /* Executable Page */ #define _PAGE_W_4V _AC(0x0000000000000040,UL) /* Writable */ #define _PAGE_SOFT_4V _AC(0x0000000000000030,UL) /* Software bits */ -#define _PAGE_FILE_4V _AC(0x0000000000000020,UL) /* Pagecache page */ #define _PAGE_PRESENT_4V _AC(0x0000000000000010,UL) /* Present */ #define _PAGE_RESV_4V _AC(0x0000000000000008,UL) /* Reserved */ #define _PAGE_SZ16GB_4V _AC(0x0000000000000007,UL) /* 16GB Page */ @@ -332,22 +330,6 @@ static inline pmd_t pmd_modify(pmd_t pmd } #endif
-static inline pte_t pgoff_to_pte(unsigned long off) -{ - off <<= PAGE_SHIFT; - - __asm__ __volatile__( - "\n661: or %0, %2, %0\n" - " .section .sun4v_1insn_patch, "ax"\n" - " .word 661b\n" - " or %0, %3, %0\n" - " .previous\n" - : "=r" (off) - : "0" (off), "i" (_PAGE_FILE_4U), "i" (_PAGE_FILE_4V)); - - return __pte(off); -} - static inline pgprot_t pgprot_noncached(pgprot_t prot) { unsigned long val = pgprot_val(prot); @@ -609,22 +591,6 @@ static inline unsigned long pte_exec(pte return (pte_val(pte) & mask); }
-static inline unsigned long pte_file(pte_t pte) -{ - unsigned long val = pte_val(pte); - - __asm__ __volatile__( - "\n661: and %0, %2, %0\n" - " .section .sun4v_1insn_patch, "ax"\n" - " .word 661b\n" - " and %0, %3, %0\n" - " .previous\n" - : "=r" (val) - : "0" (val), "i" (_PAGE_FILE_4U), "i" (_PAGE_FILE_4V)); - - return val; -} - static inline unsigned long pte_present(pte_t pte) { unsigned long val = pte_val(pte); @@ -964,12 +930,6 @@ pgtable_t pgtable_trans_huge_withdraw(st #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-/* File offset in PTE support. */ -unsigned long pte_file(pte_t); -#define pte_to_pgoff(pte) (pte_val(pte) >> PAGE_SHIFT) -pte_t pgoff_to_pte(unsigned long); -#define PTE_FILE_MAX_BITS (64UL - PAGE_SHIFT - 1UL) - int page_in_phys_avail(unsigned long paddr);
/* --- a/arch/sparc/include/asm/pgtsrmmu.h +++ b/arch/sparc/include/asm/pgtsrmmu.h @@ -80,10 +80,6 @@ #define SRMMU_PRIV 0x1c #define SRMMU_PRIV_RDONLY 0x18
-#define SRMMU_FILE 0x40 /* Implemented in software */ - -#define SRMMU_PTE_FILE_SHIFT 8 /* == 32-PTE_FILE_MAX_BITS */ - #define SRMMU_CHG_MASK (0xffffff00 | SRMMU_REF | SRMMU_DIRTY)
/* SRMMU swap entry encoding @@ -94,13 +90,13 @@ * oooooooooooooooooootttttRRRRRRRR * fedcba9876543210fedcba9876543210 * - * The bottom 8 bits are reserved for protection and status bits, especially - * FILE and PRESENT. + * The bottom 7 bits are reserved for protection and status bits, especially + * PRESENT. */ #define SRMMU_SWP_TYPE_MASK 0x1f -#define SRMMU_SWP_TYPE_SHIFT SRMMU_PTE_FILE_SHIFT -#define SRMMU_SWP_OFF_MASK 0x7ffff -#define SRMMU_SWP_OFF_SHIFT (SRMMU_PTE_FILE_SHIFT + 5) +#define SRMMU_SWP_TYPE_SHIFT 7 +#define SRMMU_SWP_OFF_MASK 0xfffff +#define SRMMU_SWP_OFF_SHIFT (SRMMU_SWP_TYPE_SHIFT + 5)
/* Some day I will implement true fine grained access bits for * user pages because the SRMMU gives us the capabilities to
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 9b4bdd2ffab9557ac43af7dff02e7dab1c8c58bd upstream.
We don't create non-linear mappings anymore. Let's drop code which handles them on page fault.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
[bwh: Backported to 3.16:
 - Deleted code is slightly different
 - Adjust context]
Signed-off-by: Ben Hutchings ben@decadent.org.uk
---
 include/linux/mm.h | 16 +++++-------
 mm/memory.c | 65 +++++++---------------------------------
 2 files changed, 16 insertions(+), 65 deletions(-)
--- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -187,21 +187,19 @@ extern unsigned int kobjsize(const void extern pgprot_t protection_map[16];
#define FAULT_FLAG_WRITE 0x01 /* Fault was a write access */ -#define FAULT_FLAG_NONLINEAR 0x02 /* Fault was via a nonlinear mapping */ -#define FAULT_FLAG_MKWRITE 0x04 /* Fault was mkwrite of existing pte */ -#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ -#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ -#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ -#define FAULT_FLAG_TRIED 0x40 /* second try */ -#define FAULT_FLAG_USER 0x80 /* The fault originated in userspace */ +#define FAULT_FLAG_MKWRITE 0x02 /* Fault was mkwrite of existing pte */ +#define FAULT_FLAG_ALLOW_RETRY 0x04 /* Retry fault if blocking */ +#define FAULT_FLAG_RETRY_NOWAIT 0x08 /* Don't drop mmap_sem and wait when retrying */ +#define FAULT_FLAG_KILLABLE 0x10 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_TRIED 0x20 /* Second try */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
/* * vm_fault is filled by the the pagefault handler and passed to the vma's * ->fault function. The vma's ->fault is responsible for returning a bitmask * of VM_FAULT_xxx flags that give details about how the fault was handled. * - * pgoff should be used in favour of virtual_address, if possible. If pgoff - * is used, one may implement ->remap_pages to get nonlinear mapping support. + * pgoff should be used in favour of virtual_address, if possible. */ struct vm_fault { unsigned int flags; /* FAULT_FLAG_xxx flags */ --- a/mm/memory.c +++ b/mm/memory.c @@ -1912,12 +1912,11 @@ int apply_to_page_range(struct mm_struct EXPORT_SYMBOL_GPL(apply_to_page_range);
/* - * handle_pte_fault chooses page fault handler according to an entry - * which was read non-atomically. Before making any commitment, on - * those architectures or configurations (e.g. i386 with PAE) which - * might give a mix of unmatched parts, do_swap_page and do_nonlinear_fault - * must check under lock before unmapping the pte and proceeding - * (but do_wp_page is only called after already making such a check; + * handle_pte_fault chooses page fault handler according to an entry which was + * read non-atomically. Before making any commitment, on those architectures + * or configurations (e.g. i386 with PAE) which might give a mix of unmatched + * parts, do_swap_page must check under lock before unmapping the pte and + * proceeding (but do_wp_page is only called after already making such a check; * and do_anonymous_page can safely check later on). */ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, @@ -2676,8 +2675,6 @@ void do_set_pte(struct vm_area_struct *v entry = mk_pte(page, vma->vm_page_prot); if (write) entry = maybe_mkwrite(pte_mkdirty(entry), vma); - else if (pte_file(*pte) && pte_file_soft_dirty(*pte)) - pte_mksoft_dirty(entry); if (anon) { inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, address); @@ -2818,8 +2815,7 @@ static int do_read_fault(struct mm_struc * if page by the offset is not ready to be mapped (cold cache or * something). */ - if (vma->vm_ops->map_pages && !(flags & FAULT_FLAG_NONLINEAR) && - fault_around_pages() > 1) { + if (vma->vm_ops->map_pages && fault_around_pages() > 1) { pte = pte_offset_map_lock(mm, pmd, address, &ptl); do_fault_around(vma, address, pte, pgoff, flags); if (!pte_same(*pte, orig_pte)) @@ -2949,7 +2945,7 @@ static int do_shared_fault(struct mm_str return ret; }
-static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma, +static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *page_table, pmd_t *pmd, unsigned int flags, pte_t orig_pte) { @@ -2969,44 +2965,6 @@ static int do_linear_fault(struct mm_str return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte); }
-/* - * Fault of a previously existing named mapping. Repopulate the pte - * from the encoded file_pte if possible. This enables swappable - * nonlinear vmas. - * - * We enter with non-exclusive mmap_sem (to exclude vma changes, - * but allow concurrent faults), and pte mapped but not yet locked. - * We return with mmap_sem still held, but pte unmapped and unlocked. - */ -static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, pte_t *page_table, pmd_t *pmd, - unsigned int flags, pte_t orig_pte) -{ - pgoff_t pgoff; - - flags |= FAULT_FLAG_NONLINEAR; - - if (!pte_unmap_same(mm, pmd, page_table, orig_pte)) - return 0; - - if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) { - /* - * Page table corrupted: show pte and kill process. - */ - print_bad_pte(vma, address, orig_pte, NULL); - return VM_FAULT_SIGBUS; - } - - pgoff = pte_to_pgoff(orig_pte); - if (!(flags & FAULT_FLAG_WRITE)) - return do_read_fault(mm, vma, address, pmd, pgoff, flags, - orig_pte); - if (!(vma->vm_flags & VM_SHARED)) - return do_cow_fault(mm, vma, address, pmd, pgoff, flags, - orig_pte); - return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte); -} - static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, unsigned long addr, int page_nid, int *flags) @@ -3121,15 +3079,12 @@ static int handle_pte_fault(struct mm_st if (!pte_present(entry)) { if (pte_none(entry)) { if (vma->vm_ops) - return do_linear_fault(mm, vma, address, - pte, pmd, flags, entry); + return do_fault(mm, vma, address, pte, + pmd, flags, entry);
return do_anonymous_page(mm, vma, address, pte, pmd, flags); } - if (pte_file(entry)) - return do_nonlinear_fault(mm, vma, address, - pte, pmd, flags, entry); return do_swap_page(mm, vma, address, pte, pmd, flags, entry); }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Vlastimil Babka vbabka@suse.cz
commit 0d0f6249058834ffe1ceaad0bb31464af66f6e7a upstream.
The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for offset. The lower word is zeroed, thus systems with less than 4GB memory are safe. With 4GB to 128GB the swap type selects the memory locations vulnerable to L1TF; with even more memory, the swap offset also influences the address. This might be a problem with 32bit PAE guests running on large 64bit hosts.
By continuing to keep the whole swap entry in either high or low 32bit word of PTE we would limit the swap size too much. Thus this patch uses the whole PAE PTE with the same layout as the 64bit version does. The macros just become a bit tricky since they assume the arch-dependent swp_entry_t to be 32bit.
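To make the new layout concrete, here is a small userspace model of the encode/decode macros added below (illustration only, not part of the patch; it assumes _PAGE_BIT_PROTNONE is 8 on x86, so SWP_OFFSET_FIRST_BIT is 9 and SWP_OFFSET_SHIFT is 14):

/* Userspace sketch of the new PAE swap entry layout; not kernel code. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SWP_TYPE_BITS		5
#define SWP_OFFSET_FIRST_BIT	9	/* _PAGE_BIT_PROTNONE + 1 (assumed) */
#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

/* Encode: offset is stored inverted, type lands in the top 5 bits. */
static uint64_t swp_pteval_entry(unsigned type, uint64_t offset)
{
        return (~offset << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) |
               ((uint64_t)type << (64 - SWP_TYPE_BITS));
}

static unsigned pteval_swp_type(uint64_t pte)
{
        return pte >> (64 - SWP_TYPE_BITS);
}

static uint64_t pteval_swp_offset(uint64_t pte)
{
        return ~pte << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT;
}

int main(void)
{
        uint64_t pte = swp_pteval_entry(3, 0x12345);

        /* The stored offset bits are inverted, but the decode recovers them. */
        assert(pteval_swp_type(pte) == 3);
        assert(pteval_swp_offset(pte) == 0x12345);
        printf("pte=%#llx type=%u offset=%#llx\n",
               (unsigned long long)pte, pteval_swp_type(pte),
               (unsigned long long)pteval_swp_offset(pte));
        return 0;
}

The round trip through the decode helpers recovers the original type and offset, while the raw PTE bits no longer hold the offset in plain form.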
Signed-off-by: Vlastimil Babka vbabka@suse.cz
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Acked-by: Michal Hocko mhocko@suse.com
[bwh: Backported to 3.16: CONFIG_PGTABLE_LEVELS is not defined; use other
 config symbols in the condition.]
Signed-off-by: Ben Hutchings ben@decadent.org.uk
---
 arch/x86/include/asm/pgtable-3level.h | 35 +++++++++++++++++++++++++--
 arch/x86/mm/init.c | 2 +-
 2 files changed, 34 insertions(+), 3 deletions(-)
--- a/arch/x86/include/asm/pgtable-3level.h +++ b/arch/x86/include/asm/pgtable-3level.h @@ -177,12 +177,43 @@ static inline pmd_t native_pmdp_get_and_ #endif
/* Encode and de-code a swap entry */ +#define SWP_TYPE_BITS 5 + +#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1) + +/* We always extract/encode the offset by shifting it all the way up, and then down again */ +#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS) + #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5) #define __swp_type(x) (((x).val) & 0x1f) #define __swp_offset(x) ((x).val >> 5) #define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5}) -#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high }) -#define __swp_entry_to_pte(x) ((pte_t){ { .pte_high = (x).val } }) + +/* + * Normally, __swp_entry() converts from arch-independent swp_entry_t to + * arch-dependent swp_entry_t, and __swp_entry_to_pte() just stores the result + * to pte. But here we have 32bit swp_entry_t and 64bit pte, and need to use the + * whole 64 bits. Thus, we shift the "real" arch-dependent conversion to + * __swp_entry_to_pte() through the following helper macro based on 64bit + * __swp_entry(). + */ +#define __swp_pteval_entry(type, offset) ((pteval_t) { \ + (~(pteval_t)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \ + | ((pteval_t)(type) << (64 - SWP_TYPE_BITS)) }) + +#define __swp_entry_to_pte(x) ((pte_t){ .pte = \ + __swp_pteval_entry(__swp_type(x), __swp_offset(x)) }) +/* + * Analogically, __pte_to_swp_entry() doesn't just extract the arch-dependent + * swp_entry_t, but also has to convert it from 64bit to the 32bit + * intermediate representation, using the following macros based on 64bit + * __swp_type() and __swp_offset(). + */ +#define __pteval_swp_type(x) ((unsigned long)((x).pte >> (64 - SWP_TYPE_BITS))) +#define __pteval_swp_offset(x) ((unsigned long)(~((x).pte) << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)) + +#define __pte_to_swp_entry(pte) (__swp_entry(__pteval_swp_type(pte), \ + __pteval_swp_offset(pte)))
#include <asm/pgtable-invert.h>
--- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -715,7 +715,7 @@ unsigned long max_swapfile_size(void) * We encode swap offsets also with 3 bits below those for pfn * which makes the usable limit higher. */ -#ifdef CONFIG_X86_64 +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT; #endif pages = min_t(unsigned long, l1tf_limit, pages);
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill@shutemov.name
commit b32da82e28ce90bff4e371fc15d2816fa3175bb0 upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Cc: Ralf Baechle ralf@linux-mips.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
[bwh: Backported to 3.16: Deleted definitions are slightly different]
Signed-off-by: Ben Hutchings ben@decadent.org.uk
---
--- a/arch/mips/include/asm/pgtable-32.h +++ b/arch/mips/include/asm/pgtable-32.h @@ -148,20 +148,6 @@ pfn_pte(unsigned long pfn, pgprot_t prot #define __swp_entry(type,offset) \ ((swp_entry_t) { ((type) << 10) | ((offset) << 15) })
-/* - * Bits 0, 4, 8, and 9 are taken, split up 28 bits of offset into this range: - */ -#define PTE_FILE_MAX_BITS 28 - -#define pte_to_pgoff(_pte) ((((_pte).pte >> 1 ) & 0x07) | \ - (((_pte).pte >> 2 ) & 0x38) | \ - (((_pte).pte >> 10) << 6 )) - -#define pgoff_to_pte(off) ((pte_t) { (((off) & 0x07) << 1 ) | \ - (((off) & 0x38) << 2 ) | \ - (((off) >> 6 ) << 10) | \ - _PAGE_FILE }) - #else
/* Swap entries must have VALID and GLOBAL bits cleared. */ @@ -177,31 +163,6 @@ pfn_pte(unsigned long pfn, pgprot_t prot ((swp_entry_t) { ((type) << 8) | ((offset) << 13) }) #endif /* defined(CONFIG_64BIT_PHYS_ADDR) && defined(CONFIG_CPU_MIPS32) */
-#if defined(CONFIG_64BIT_PHYS_ADDR) && defined(CONFIG_CPU_MIPS32) -/* - * Bits 0 and 1 of pte_high are taken, use the rest for the page offset... - */ -#define PTE_FILE_MAX_BITS 30 - -#define pte_to_pgoff(_pte) ((_pte).pte_high >> 2) -#define pgoff_to_pte(off) ((pte_t) { _PAGE_FILE, (off) << 2 }) - -#else -/* - * Bits 0, 4, 6, and 7 are taken, split up 28 bits of offset into this range: - */ -#define PTE_FILE_MAX_BITS 28 - -#define pte_to_pgoff(_pte) ((((_pte).pte >> 1) & 0x7) | \ - (((_pte).pte >> 2) & 0x8) | \ - (((_pte).pte >> 8) << 4)) - -#define pgoff_to_pte(off) ((pte_t) { (((off) & 0x7) << 1) | \ - (((off) & 0x8) << 2) | \ - (((off) >> 4) << 8) | \ - _PAGE_FILE }) -#endif - #endif
#if defined(CONFIG_64BIT_PHYS_ADDR) && defined(CONFIG_CPU_MIPS32) --- a/arch/mips/include/asm/pgtable-64.h +++ b/arch/mips/include/asm/pgtable-64.h @@ -291,13 +291,4 @@ static inline pte_t mk_swap_pte(unsigned #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-/* - * Bits 0, 4, 6, and 7 are taken. Let's leave bits 1, 2, 3, and 5 alone to - * make things easier, and only use the upper 56 bits for the page offset... - */ -#define PTE_FILE_MAX_BITS 56 - -#define pte_to_pgoff(_pte) ((_pte).pte >> 8) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 8) | _PAGE_FILE }) - #endif /* _ASM_PGTABLE_64_H */ --- a/arch/mips/include/asm/pgtable-bits.h +++ b/arch/mips/include/asm/pgtable-bits.h @@ -50,8 +50,6 @@
/* * The following bits are implemented in software - * - * _PAGE_FILE semantics: set:pagecache unset:swap */ #define _PAGE_PRESENT_SHIFT 6 #define _PAGE_PRESENT (1 << _PAGE_PRESENT_SHIFT) @@ -64,14 +62,10 @@ #define _PAGE_MODIFIED_SHIFT 10 #define _PAGE_MODIFIED (1 << _PAGE_MODIFIED_SHIFT)
-#define _PAGE_FILE (1 << 10) - #elif defined(CONFIG_CPU_R3000) || defined(CONFIG_CPU_TX39XX)
/* * The following are implemented by software - * - * _PAGE_FILE semantics: set:pagecache unset:swap */ #define _PAGE_PRESENT_SHIFT 0 #define _PAGE_PRESENT (1 << _PAGE_PRESENT_SHIFT) @@ -83,8 +77,6 @@ #define _PAGE_ACCESSED (1 << _PAGE_ACCESSED_SHIFT) #define _PAGE_MODIFIED_SHIFT 4 #define _PAGE_MODIFIED (1 << _PAGE_MODIFIED_SHIFT) -#define _PAGE_FILE_SHIFT 4 -#define _PAGE_FILE (1 << _PAGE_FILE_SHIFT)
/* * And these are the hardware TLB bits @@ -114,7 +106,6 @@ * The following bits are implemented in software * * _PAGE_READ / _PAGE_READ_SHIFT should be unused if cpu_has_rixi. - * _PAGE_FILE semantics: set:pagecache unset:swap */ #define _PAGE_PRESENT_SHIFT (0) #define _PAGE_PRESENT (1 << _PAGE_PRESENT_SHIFT) @@ -126,7 +117,6 @@ #define _PAGE_ACCESSED (1 << _PAGE_ACCESSED_SHIFT) #define _PAGE_MODIFIED_SHIFT (_PAGE_ACCESSED_SHIFT + 1) #define _PAGE_MODIFIED (1 << _PAGE_MODIFIED_SHIFT) -#define _PAGE_FILE (_PAGE_MODIFIED)
#ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT /* huge tlb page */ --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -253,7 +253,6 @@ extern pgd_t swapper_pg_dir[]; static inline int pte_write(pte_t pte) { return pte.pte_low & _PAGE_WRITE; } static inline int pte_dirty(pte_t pte) { return pte.pte_low & _PAGE_MODIFIED; } static inline int pte_young(pte_t pte) { return pte.pte_low & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte.pte_low & _PAGE_FILE; }
static inline pte_t pte_wrprotect(pte_t pte) { @@ -309,7 +308,6 @@ static inline pte_t pte_mkyoung(pte_t pt static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_MODIFIED; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
static inline pte_t pte_wrprotect(pte_t pte) {
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit 28a2775217b17208811fa43a9e96bd1fdf417b86 upstream.
Having everything in nospec-branch.h creates a hell of dependencies when adding the prctl-based switching mechanism. Move everything which is not required in nospec-branch.h to spec-ctrl.h and fix up the includes in the relevant files.
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
Reviewed-by: Ingo Molnar mingo@kernel.org
Signed-off-by: Ben Hutchings ben@decadent.org.uk
---
 arch/x86/include/asm/nospec-branch.h | 14 --------------
 arch/x86/include/asm/spec-ctrl.h | 21 +++++++++++++++++++++
 arch/x86/kernel/cpu/amd.c | 2 +-
 arch/x86/kernel/cpu/bugs.c | 2 +-
 arch/x86/kvm/svm.c | 2 +-
 arch/x86/kvm/vmx.c | 2 +-
 6 files changed, 25 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/spec-ctrl.h
--- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -183,26 +183,12 @@ enum spectre_v2_mitigation { extern void x86_spec_ctrl_set(u64); extern u64 x86_spec_ctrl_get_default(void);
-/* - * On VMENTER we must preserve whatever view of the SPEC_CTRL MSR - * the guest has, while on VMEXIT we restore the host view. This - * would be easier if SPEC_CTRL were architecturally maskable or - * shadowable for guests but this is not (currently) the case. - * Takes the guest view of SPEC_CTRL MSR as a parameter. - */ -extern void x86_spec_ctrl_set_guest(u64); -extern void x86_spec_ctrl_restore_host(u64); - /* The Speculative Store Bypass disable variants */ enum ssb_mitigation { SPEC_STORE_BYPASS_NONE, SPEC_STORE_BYPASS_DISABLE, };
-/* AMD specific Speculative Store Bypass MSR data */ -extern u64 x86_amd_ls_cfg_base; -extern u64 x86_amd_ls_cfg_rds_mask; - extern char __indirect_thunk_start[]; extern char __indirect_thunk_end[];
--- /dev/null +++ b/arch/x86/include/asm/spec-ctrl.h @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_SPECCTRL_H_ +#define _ASM_X86_SPECCTRL_H_ + +#include <asm/nospec-branch.h> + +/* + * On VMENTER we must preserve whatever view of the SPEC_CTRL MSR + * the guest has, while on VMEXIT we restore the host view. This + * would be easier if SPEC_CTRL were architecturally maskable or + * shadowable for guests but this is not (currently) the case. + * Takes the guest view of SPEC_CTRL MSR as a parameter. + */ +extern void x86_spec_ctrl_set_guest(u64); +extern void x86_spec_ctrl_restore_host(u64); + +/* AMD specific Speculative Store Bypass MSR data */ +extern u64 x86_amd_ls_cfg_base; +extern u64 x86_amd_ls_cfg_rds_mask; + +#endif --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -8,7 +8,7 @@ #include <asm/processor.h> #include <asm/apic.h> #include <asm/cpu.h> -#include <asm/nospec-branch.h> +#include <asm/spec-ctrl.h> #include <asm/pci-direct.h>
#ifdef CONFIG_X86_64 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -12,7 +12,7 @@ #include <linux/cpu.h> #include <linux/module.h>
-#include <asm/nospec-branch.h> +#include <asm/spec-ctrl.h> #include <asm/cmdline.h> #include <asm/bugs.h> #include <asm/processor.h> --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -37,7 +37,7 @@ #include <asm/debugreg.h> #include <asm/kvm_para.h> #include <asm/microcode.h> -#include <asm/nospec-branch.h> +#include <asm/spec-ctrl.h>
#include <asm/virtext.h> #include "trace.h" --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -47,7 +47,7 @@ #include <asm/debugreg.h> #include <asm/kexec.h> #include <asm/microcode.h> -#include <asm/nospec-branch.h> +#include <asm/spec-ctrl.h>
#include "trace.h"
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andi Kleen ak@linux.intel.com
commit 6b28baca9b1f0d4a42b865da7a05b1c81424bd5c upstream.
When PTEs are set to PROT_NONE the kernel just clears the Present bit and preserves the PFN, which creates attack surface for L1TF speculation attacks.
This is important inside guests, because L1TF speculation bypasses physical page remapping. While the host has its own mitigations preventing leaking data from other VMs into the guest, this would still risk leaking the wrong page inside the current guest.
This uses the same technique as Linus' swap entry patch: while an entry is in PROTNONE state, invert the complete PFN part of it. This ensures that the highest bit will point to non-existing memory.
The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and pte/pmd/pud_pfn undo it.
This assume that no code path touches the PFN part of a PTE directly without using these primitives.
This doesn't handle the case where MMIO is at the top of CPU physical memory. If such an MMIO region were exposed by an unprivileged driver for mmap it would be possible to attack some real memory. However, this situation is rather unlikely.
For 32bit non PAE the inversion is not done because there are really not enough bits to protect anything.
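To illustrate the inversion scheme (not part of the patch), here is a toy userspace model of the helpers added below; the flag bit positions and the PTE_PFN_MASK value are assumptions matching x86-64 with 52 physical address bits:

/* Toy model of the PROT_NONE PFN inversion; userspace sketch, not kernel code. */
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_PROTNONE	(1ULL << 8)	/* bit number assumed for the model */
#define PTE_PFN_MASK	0x000ffffffffff000ULL

static int pte_needs_invert(uint64_t val)
{
        return (val & (_PAGE_PRESENT | _PAGE_PROTNONE)) == _PAGE_PROTNONE;
}

static uint64_t protnone_mask(uint64_t val)
{
        return pte_needs_invert(val) ? ~0ULL : 0;
}

/* Invert the PFN bits when transitioning between NONE and !NONE, as in pte_modify(). */
static uint64_t flip_protnone_guard(uint64_t oldval, uint64_t val, uint64_t mask)
{
        if (pte_needs_invert(oldval) != pte_needs_invert(val))
                val = (val & ~mask) | (~val & mask);
        return val;
}

static uint64_t pte_pfn(uint64_t pte)
{
        uint64_t pfn = pte ^ protnone_mask(pte);

        return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
}

int main(void)
{
        uint64_t pfn = 0x12345;
        uint64_t pte = (pfn << PAGE_SHIFT) | _PAGE_PRESENT;

        /* mprotect(PROT_NONE): PRESENT -> PROTNONE flips the stored PFN bits */
        uint64_t none = flip_protnone_guard(pte,
                        (pte & ~_PAGE_PRESENT) | _PAGE_PROTNONE, PTE_PFN_MASK);

        assert(((none & PTE_PFN_MASK) >> PAGE_SHIFT) != pfn); /* raw bits differ */
        assert(pte_pfn(none) == pfn);                         /* decode still matches */
        return 0;
}

The stored PFN of a PROT_NONE entry points at non-existent memory, while pte_pfn() still returns the real PFN because it xors the inversion back out.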
Q: Why does the guest need to be protected when the hypervisor already has L1TF mitigations?
A: Here's an example:
Physical pages 1 2 get mapped into a guest as
  GPA 1 -> PA 2
  GPA 2 -> PA 1
through EPT.
The L1TF speculation ignores the EPT remapping.
Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and they belong to different users and should be isolated.
A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and gets read access to the underlying physical page, which in this case points to PA 2, so it can read process B's data if it happened to be in L1. Isolation inside the guest is thus broken.
There's nothing the hypervisor can do about this. This mitigation has to be done in the guest itself.
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen ak@linux.intel.com
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Reviewed-by: Josh Poimboeuf jpoimboe@redhat.com
Acked-by: Michal Hocko mhocko@suse.com
Acked-by: Vlastimil Babka vbabka@suse.cz
Acked-by: Dave Hansen dave.hansen@intel.com
[bwh: Backported to 3.16:
 - s/check_pgprot/massage_pgprot/
 - Keep using PTE_PFN_MASK to extract PFN from pmd_pfn() and pud_pfn(),
   as we don't need to worry about the PAT bit being set here]
Signed-off-by: Ben Hutchings ben@decadent.org.uk
---
 arch/x86/include/asm/pgtable-2level.h | 17 +++++++++++
 arch/x86/include/asm/pgtable-3level.h | 2 ++
 arch/x86/include/asm/pgtable-invert.h | 32 +++++++++++++++++++
 arch/x86/include/asm/pgtable.h | 44 +++++++++++++++++++--------
 arch/x86/include/asm/pgtable_64.h | 2 ++
 5 files changed, 84 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/include/asm/pgtable-invert.h
--- a/arch/x86/include/asm/pgtable-2level.h +++ b/arch/x86/include/asm/pgtable-2level.h @@ -77,4 +77,21 @@ static inline unsigned long pte_bitop(un #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low }) #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
+/* No inverted PFNs on 2 level page tables */ + +static inline u64 protnone_mask(u64 val) +{ + return 0; +} + +static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask) +{ + return val; +} + +static inline bool __pte_needs_invert(u64 val) +{ + return false; +} + #endif /* _ASM_X86_PGTABLE_2LEVEL_H */ --- a/arch/x86/include/asm/pgtable-3level.h +++ b/arch/x86/include/asm/pgtable-3level.h @@ -184,4 +184,6 @@ static inline pmd_t native_pmdp_get_and_ #define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high }) #define __swp_entry_to_pte(x) ((pte_t){ { .pte_high = (x).val } })
+#include <asm/pgtable-invert.h> + #endif /* _ASM_X86_PGTABLE_3LEVEL_H */ --- /dev/null +++ b/arch/x86/include/asm/pgtable-invert.h @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_PGTABLE_INVERT_H +#define _ASM_PGTABLE_INVERT_H 1 + +#ifndef __ASSEMBLY__ + +static inline bool __pte_needs_invert(u64 val) +{ + return (val & (_PAGE_PRESENT|_PAGE_PROTNONE)) == _PAGE_PROTNONE; +} + +/* Get a mask to xor with the page table entry to get the correct pfn. */ +static inline u64 protnone_mask(u64 val) +{ + return __pte_needs_invert(val) ? ~0ull : 0; +} + +static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask) +{ + /* + * When a PTE transitions from NONE to !NONE or vice-versa + * invert the PFN part to stop speculation. + * pte_pfn undoes this when needed. + */ + if (__pte_needs_invert(oldval) != __pte_needs_invert(val)) + val = (val & ~mask) | (~val & mask); + return val; +} + +#endif /* __ASSEMBLY__ */ + +#endif --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -141,19 +141,29 @@ static inline int pte_special(pte_t pte) (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE)); }
+/* Entries that were set to PROT_NONE are inverted */ + +static inline u64 protnone_mask(u64 val); + static inline unsigned long pte_pfn(pte_t pte) { - return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT; + unsigned long pfn = pte_val(pte); + pfn ^= protnone_mask(pfn); + return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT; }
static inline unsigned long pmd_pfn(pmd_t pmd) { - return (pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT; + unsigned long pfn = pmd_val(pmd); + pfn ^= protnone_mask(pfn); + return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT; }
static inline unsigned long pud_pfn(pud_t pud) { - return (pud_val(pud) & PTE_PFN_MASK) >> PAGE_SHIFT; + unsigned long pfn = pud_val(pud); + pfn ^= protnone_mask(pfn); + return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT; }
#define pte_page(pte) pfn_to_page(pte_pfn(pte)) @@ -361,25 +371,33 @@ static inline pgprotval_t massage_pgprot
static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot) { - return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) | - massage_pgprot(pgprot)); + phys_addr_t pfn = page_nr << PAGE_SHIFT; + pfn ^= protnone_mask(pgprot_val(pgprot)); + pfn &= PTE_PFN_MASK; + return __pte(pfn | massage_pgprot(pgprot)); }
static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot) { - return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) | - massage_pgprot(pgprot)); + phys_addr_t pfn = page_nr << PAGE_SHIFT; + pfn ^= protnone_mask(pgprot_val(pgprot)); + pfn &= PTE_PFN_MASK; + return __pmd(pfn | massage_pgprot(pgprot)); }
static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot) { - return __pud(((phys_addr_t)page_nr << PAGE_SHIFT) | - massage_pgprot(pgprot)); + phys_addr_t pfn = page_nr << PAGE_SHIFT; + pfn ^= protnone_mask(pgprot_val(pgprot)); + pfn &= PTE_PFN_MASK; + return __pud(pfn | massage_pgprot(pgprot)); }
+static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask); + static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { - pteval_t val = pte_val(pte); + pteval_t val = pte_val(pte), oldval = val;
/* * Chop off the NX bit (if present), and add the NX portion of @@ -387,17 +405,17 @@ static inline pte_t pte_modify(pte_t pte */ val &= _PAGE_CHG_MASK; val |= massage_pgprot(newprot) & ~_PAGE_CHG_MASK; - + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); return __pte(val); }
static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) { - pmdval_t val = pmd_val(pmd); + pmdval_t val = pmd_val(pmd), oldval = val;
val &= _HPAGE_CHG_MASK; val |= massage_pgprot(newprot) & ~_HPAGE_CHG_MASK; - + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); return __pmd(val); }
--- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -239,6 +239,8 @@ extern void cleanup_highmap(void); extern void init_extra_mapping_uc(unsigned long phys, unsigned long size); extern void init_extra_mapping_wb(unsigned long phys, unsigned long size);
+#include <asm/pgtable-invert.h> + #endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_PGTABLE_64_H */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andy Lutomirski luto@kernel.org
commit 1745cbc5d0dee0749a6bc0ea8e872c5db0074061 upstream.
The x86 vvar vma contains pages with differing cacheability flags. x86 currently implements this by manually inserting all the ptes using (io_)remap_pfn_range when the vma is set up.
x86 wants to move to using .fault with VM_FAULT_NOPAGE to set up the mappings as needed. The correct API to use to insert a pfn in .fault is vm_insert_pfn(), but vm_insert_pfn() can't override the vma's cache mode, and the HPET page in particular needs to be uncached despite the fact that the rest of the VMA is cached.
Add vm_insert_pfn_prot() to support varying cacheability within the same non-COW VMA in a more sane manner.
x86 could alternatively use multiple VMAs, but that's messy, would break CRIU, and would create unnecessary VMAs that would waste memory.
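As an illustration of the intended use (not part of this patch), a 3.16-era .fault handler could look roughly like the sketch below; my_dev, base_pfn and UNCACHED_PAGE_INDEX are made-up names, not anything defined by this change:

/* Sketch only: give one page in a VMA a different cache mode from .fault. */
#include <linux/mm.h>

struct my_dev {				/* made-up example device */
        unsigned long base_pfn;		/* first pfn of the mapped region */
};

#define UNCACHED_PAGE_INDEX 1		/* made-up: page 1 needs UC, rest cached */

static int my_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        struct my_dev *dev = vma->vm_private_data;
        unsigned long pfn = dev->base_pfn + vmf->pgoff;
        pgprot_t prot = vma->vm_page_prot;

        /* One special page (an HPET-like register page) must be uncached */
        if (vmf->pgoff == UNCACHED_PAGE_INDEX)
                prot = pgprot_noncached(prot);

        if (vm_insert_pfn_prot(vma, (unsigned long)vmf->virtual_address,
                               pfn, prot))
                return VM_FAULT_SIGBUS;

        return VM_FAULT_NOPAGE;
}

The rest of the VMA keeps the default vm_page_prot, so only the page that needs it pays the uncached cost.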
Signed-off-by: Andy Lutomirski luto@kernel.org
Reviewed-by: Kees Cook keescook@chromium.org
Acked-by: Andrew Morton akpm@linux-foundation.org
Cc: Andy Lutomirski luto@amacapital.net
Cc: Borislav Petkov bp@alien8.de
Cc: Dave Hansen dave.hansen@linux.intel.com
Cc: Fenghua Yu fenghua.yu@intel.com
Cc: H. Peter Anvin hpa@zytor.com
Cc: Linus Torvalds torvalds@linux-foundation.org
Cc: Oleg Nesterov oleg@redhat.com
Cc: Peter Zijlstra peterz@infradead.org
Cc: Quentin Casasnovas quentin.casasnovas@oracle.com
Cc: Thomas Gleixner tglx@linutronix.de
Link: http://lkml.kernel.org/r/d2938d1eb37be7a5e4f86182db646551f11e45aa.1451446564...
Signed-off-by: Ingo Molnar mingo@kernel.org
Signed-off-by: Ben Hutchings ben@decadent.org.uk
---
 include/linux/mm.h | 2 ++
 mm/memory.c | 25 +++++++++++++++++++++++--
 2 files changed, 25 insertions(+), 2 deletions(-)
--- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1965,6 +1965,8 @@ int remap_pfn_range(struct vm_area_struc int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); +int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, pgprot_t pgprot); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len); --- a/mm/memory.c +++ b/mm/memory.c @@ -1574,8 +1574,29 @@ out: int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) { + return vm_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot); +} +EXPORT_SYMBOL(vm_insert_pfn); + +/** + * vm_insert_pfn_prot - insert single pfn into user vma with specified pgprot + * @vma: user vma to map to + * @addr: target user address of this page + * @pfn: source kernel pfn + * @pgprot: pgprot flags for the inserted page + * + * This is exactly like vm_insert_pfn, except that it allows drivers to + * to override pgprot on a per-page basis. + * + * This only makes sense for IO mappings, and it makes no sense for + * cow mappings. In general, using multiple vmas is preferable; + * vm_insert_pfn_prot should only be used if using multiple VMAs is + * impractical. + */ +int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, pgprot_t pgprot) +{ int ret; - pgprot_t pgprot = vma->vm_page_prot; /* * Technically, architectures with pte_special can avoid all these * restrictions (same for remap_pfn_range). However we would like @@ -1597,7 +1618,7 @@ int vm_insert_pfn(struct vm_area_struct
return ret; } -EXPORT_SYMBOL(vm_insert_pfn); +EXPORT_SYMBOL(vm_insert_pfn_prot);
int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit 4a28bfe3267b68e22c663ac26185aa16c9b879ef upstream.
Combine the various pieces of logic which go through all those x86_cpu_id matching structures into one function.
Suggested-by: Borislav Petkov bp@suse.de
Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Reviewed-by: Borislav Petkov bp@suse.de
Reviewed-by: Ingo Molnar mingo@kernel.org
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings ben@decadent.org.uk
---
 arch/x86/kernel/cpu/common.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)
--- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -803,21 +803,27 @@ static const __initconst struct x86_cpu_ {} };
-static bool __init cpu_vulnerable_to_meltdown(struct cpuinfo_x86 *c) +static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c) { u64 ia32_cap = 0;
+ if (x86_match_cpu(cpu_no_speculation)) + return; + + setup_force_cpu_bug(X86_BUG_SPECTRE_V1); + setup_force_cpu_bug(X86_BUG_SPECTRE_V2); + if (x86_match_cpu(cpu_no_meltdown)) - return false; + return;
if (cpu_has(c, X86_FEATURE_ARCH_CAPABILITIES)) rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap);
/* Rogue Data Cache Load? No! */ if (ia32_cap & ARCH_CAP_RDCL_NO) - return false; + return;
- return true; + setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN); }
/* @@ -868,12 +874,7 @@ static void __init early_identify_cpu(st
setup_force_cpu_cap(X86_FEATURE_ALWAYS);
- if (!x86_match_cpu(cpu_no_speculation)) { - if (cpu_vulnerable_to_meltdown(c)) - setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN); - setup_force_cpu_bug(X86_BUG_SPECTRE_V1); - setup_force_cpu_bug(X86_BUG_SPECTRE_V2); - } + cpu_set_bug_bits(c); }
void __init early_cpu_init(void)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit d1059518b4789cabe34bb4b714d07e6089c82ca1 upstream.
Those sysfs functions have a similar preamble; as such, make common code to handle them.
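For reference (not part of the patch), the strings built by cpu_show_common() in the change below are what userspace reads from the standard sysfs vulnerability files, for example:

/* Userspace sketch: read what cpu_show_common() reports. */
#include <stdio.h>

int main(void)
{
        static const char *files[] = {
                "/sys/devices/system/cpu/vulnerabilities/meltdown",
                "/sys/devices/system/cpu/vulnerabilities/spectre_v1",
                "/sys/devices/system/cpu/vulnerabilities/spectre_v2",
        };
        char buf[256];
        size_t i;

        for (i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
                FILE *f = fopen(files[i], "r");

                if (!f || !fgets(buf, sizeof(buf), f))
                        printf("%s: unreadable\n", files[i]);
                else
                        printf("%s: %s", files[i], buf);
                if (f)
                        fclose(f);
        }
        return 0;
}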
Suggested-by: Borislav Petkov bp@suse.de
Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Reviewed-by: Borislav Petkov bp@suse.de
Reviewed-by: Ingo Molnar mingo@kernel.org
[bwh: Backported to 3.16: s/X86_FEATURE_PTI/X86_FEATURE_KAISER/]
Signed-off-by: Ben Hutchings ben@decadent.org.uk
---
 arch/x86/kernel/cpu/bugs.c | 46 ++++++++++++++++++++++++++------------
 1 file changed, 32 insertions(+), 14 deletions(-)
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -352,30 +352,48 @@ retpoline_auto: #undef pr_fmt
#ifdef CONFIG_SYSFS -ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, char *buf) + +ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr, + char *buf, unsigned int bug) { - if (!boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN)) + if (!boot_cpu_has_bug(bug)) return sprintf(buf, "Not affected\n"); - if (boot_cpu_has(X86_FEATURE_KAISER)) - return sprintf(buf, "Mitigation: PTI\n"); + + switch (bug) { + case X86_BUG_CPU_MELTDOWN: + if (boot_cpu_has(X86_FEATURE_KAISER)) + return sprintf(buf, "Mitigation: PTI\n"); + + break; + + case X86_BUG_SPECTRE_V1: + return sprintf(buf, "Mitigation: __user pointer sanitization\n"); + + case X86_BUG_SPECTRE_V2: + return sprintf(buf, "%s%s%s%s\n", spectre_v2_strings[spectre_v2_enabled], + boot_cpu_has(X86_FEATURE_USE_IBPB) ? ", IBPB" : "", + boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? ", IBRS_FW" : "", + spectre_v2_module_string()); + + default: + break; + } + return sprintf(buf, "Vulnerable\n"); }
+ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, char *buf) +{ + return cpu_show_common(dev, attr, buf, X86_BUG_CPU_MELTDOWN); +} + ssize_t cpu_show_spectre_v1(struct device *dev, struct device_attribute *attr, char *buf) { - if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V1)) - return sprintf(buf, "Not affected\n"); - return sprintf(buf, "Mitigation: __user pointer sanitization\n"); + return cpu_show_common(dev, attr, buf, X86_BUG_SPECTRE_V1); }
ssize_t cpu_show_spectre_v2(struct device *dev, struct device_attribute *attr, char *buf) { - if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2)) - return sprintf(buf, "Not affected\n"); - - return sprintf(buf, "%s%s%s%s\n", spectre_v2_strings[spectre_v2_enabled], - boot_cpu_has(X86_FEATURE_USE_IBPB) ? ", IBPB" : "", - boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? ", IBRS_FW" : "", - spectre_v2_module_string()); + return cpu_show_common(dev, attr, buf, X86_BUG_SPECTRE_V2); } #endif
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner <tglx@linutronix.de>
commit 8bf37d8c067bb7eb8e7c381bdadf9bd89182b6bc upstream.
The mitigation control is simpler to implement in architecture code as it avoids the extra function call to check the mode. Aside from that, having an explicit seccomp-enabled mode in the architecture mitigations would require even more workarounds.
Move it into architecture code and provide a weak function in the seccomp code. Remove the 'which' argument as this allows the architecture to decide which mitigations are relevant for seccomp.
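The default-plus-override arrangement relies on ordinary weak-symbol linking; a small userspace illustration of the same pattern (file and function names are invented for the demo):

/* weak_demo.c - default definition, analogous to the empty __weak stub
 * this patch adds to kernel/seccomp.c. */
#include <stdio.h>

__attribute__((weak)) void arch_demo_mitigate(void)
{
	printf("weak default: no architecture hook\n");
}

int main(void)
{
	arch_demo_mitigate();
	return 0;
}

/* arch_demo.c - optional strong definition, analogous to the x86 version
 * of arch_seccomp_spec_mitigate().  Linked in, it silently replaces the
 * weak default:
 *
 *	void arch_demo_mitigate(void)
 *	{
 *		printf("arch override: force-disable speculation\n");
 *	}
 *
 * Build:  cc weak_demo.c              -> prints the weak default
 *         cc weak_demo.c arch_demo.c  -> prints the arch override
 */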
Signed-off-by: Thomas Gleixner tglx@linutronix.de [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kernel/cpu/bugs.c | 29 ++++++++++++++++++----------- include/linux/nospec.h | 2 ++ kernel/seccomp.c | 15 ++------------- 3 files changed, 22 insertions(+), 24 deletions(-)
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -606,6 +606,24 @@ static int ssb_prctl_set(struct task_str return 0; }
+int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which, + unsigned long ctrl) +{ + switch (which) { + case PR_SPEC_STORE_BYPASS: + return ssb_prctl_set(task, ctrl); + default: + return -ENODEV; + } +} + +#ifdef CONFIG_SECCOMP +void arch_seccomp_spec_mitigate(struct task_struct *task) +{ + ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE); +} +#endif + static int ssb_prctl_get(struct task_struct *task) { switch (ssb_mode) { @@ -624,17 +642,6 @@ static int ssb_prctl_get(struct task_str } }
-int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which, - unsigned long ctrl) -{ - switch (which) { - case PR_SPEC_STORE_BYPASS: - return ssb_prctl_set(task, ctrl); - default: - return -ENODEV; - } -} - int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which) { switch (which) { --- a/include/linux/nospec.h +++ b/include/linux/nospec.h @@ -62,5 +62,7 @@ static inline unsigned long array_index_ int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which); int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which, unsigned long ctrl); +/* Speculation control for seccomp enforced mitigation */ +void arch_seccomp_spec_mitigate(struct task_struct *task);
#endif /* _LINUX_NOSPEC_H */ --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -207,18 +207,7 @@ static inline bool seccomp_may_assign_mo return true; }
-/* - * If a given speculation mitigation is opt-in (prctl()-controlled), - * select it, by disabling speculation (enabling mitigation). - */ -static inline void spec_mitigate(struct task_struct *task, - unsigned long which) -{ - int state = arch_prctl_spec_ctrl_get(task, which); - - if (state > 0 && (state & PR_SPEC_PRCTL)) - arch_prctl_spec_ctrl_set(task, which, PR_SPEC_FORCE_DISABLE); -} +void __weak arch_seccomp_spec_mitigate(struct task_struct *task) { }
static inline void seccomp_assign_mode(unsigned long seccomp_mode, unsigned long flags) @@ -226,7 +215,7 @@ static inline void seccomp_assign_mode(u current->seccomp.mode = seccomp_mode; /* Assume default seccomp processes want spec flaw mitigation. */ if ((flags & SECCOMP_FILTER_FLAG_SPEC_ALLOW) == 0) - spec_mitigate(current, PR_SPEC_STORE_BYPASS); + arch_seccomp_spec_mitigate(current); set_tsk_thread_flag(current, TIF_SECCOMP); }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner <tglx@linutronix.de>
commit 356e4bfff2c5489e016fdb925adbf12a1e3950ee upstream.
For certain use cases it is desired to enforce mitigations so they cannot be undone afterwards. That's important for loader stubs which want to prevent a child from disabling the mitigation again. It will also be used for seccomp(). The extra preservation of the prctl state for SSB is a preparatory step for eBPF dynamic speculation control.
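A hedged userspace sketch of the intended semantics, matching the documentation hunk below: once PR_SPEC_FORCE_DISABLE has been applied, a later PR_SPEC_ENABLE must fail with EPERM. The fallback #defines are only there in case the installed headers predate this change:

#include <errno.h>
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_SPECULATION_CTRL
#define PR_SET_SPECULATION_CTRL	53
#endif
#ifndef PR_SPEC_STORE_BYPASS
#define PR_SPEC_STORE_BYPASS	0
#endif
#ifndef PR_SPEC_ENABLE
#define PR_SPEC_ENABLE		(1UL << 1)
#endif
#ifndef PR_SPEC_FORCE_DISABLE
#define PR_SPEC_FORCE_DISABLE	(1UL << 3)
#endif

int main(void)
{
	/* Force-disable SSB speculation for this task and its children. */
	if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
		  PR_SPEC_FORCE_DISABLE, 0, 0))
		perror("force disable");

	/* Re-enabling is expected to be rejected from now on. */
	if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
		  PR_SPEC_ENABLE, 0, 0) == -1 && errno == EPERM)
		printf("re-enable rejected with EPERM, as intended\n");

	return 0;
}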
Signed-off-by: Thomas Gleixner tglx@linutronix.de [bwh: Backported to 3.16: adjust filename, context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- Documentation/spec_ctrl.rst | 34 +++++++++++++++++++++------------- arch/x86/kernel/cpu/bugs.c | 35 +++++++++++++++++++++++++---------- fs/proc/array.c | 3 +++ include/linux/sched.h | 10 +++++++++- include/uapi/linux/prctl.h | 1 + 5 files changed, 59 insertions(+), 24 deletions(-)
--- a/Documentation/spec_ctrl.rst +++ b/Documentation/spec_ctrl.rst @@ -25,19 +25,21 @@ PR_GET_SPECULATION_CTRL -----------------------
PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature -which is selected with arg2 of prctl(2). The return value uses bits 0-2 with +which is selected with arg2 of prctl(2). The return value uses bits 0-3 with the following meaning:
-==== ================ =================================================== -Bit Define Description -==== ================ =================================================== -0 PR_SPEC_PRCTL Mitigation can be controlled per task by - PR_SET_SPECULATION_CTRL -1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is - disabled -2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is - enabled -==== ================ =================================================== +==== ===================== =================================================== +Bit Define Description +==== ===================== =================================================== +0 PR_SPEC_PRCTL Mitigation can be controlled per task by + PR_SET_SPECULATION_CTRL +1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is + disabled +2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is + enabled +3 PR_SPEC_FORCE_DISABLE Same as PR_SPEC_DISABLE, but cannot be undone. A + subsequent prctl(..., PR_SPEC_ENABLE) will fail. +==== ===================== ===================================================
If all bits are 0 the CPU is not affected by the speculation misfeature.
@@ -47,9 +49,11 @@ misfeature will fail.
PR_SET_SPECULATION_CTRL ----------------------- + PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which is selected by arg2 of :manpage:`prctl(2)` per task. arg3 is used to hand -in the control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE. +in the control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE or +PR_SPEC_FORCE_DISABLE.
Common error codes ------------------ @@ -70,10 +74,13 @@ Value Meaning 0 Success
ERANGE arg3 is incorrect, i.e. it's neither PR_SPEC_ENABLE nor - PR_SPEC_DISABLE + PR_SPEC_DISABLE nor PR_SPEC_FORCE_DISABLE
ENXIO Control of the selected speculation misfeature is not possible. See PR_GET_SPECULATION_CTRL. + +EPERM Speculation was disabled with PR_SPEC_FORCE_DISABLE and caller + tried to enable it again. ======= =================================================================
Speculation misfeature controls @@ -84,3 +91,4 @@ Speculation misfeature controls * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0); * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_ENABLE, 0, 0); * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_DISABLE, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_FORCE_DISABLE, 0, 0); --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -570,21 +570,37 @@ static void ssb_select_mitigation()
static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl) { - bool rds = !!test_tsk_thread_flag(task, TIF_RDS); + bool update;
if (ssb_mode != SPEC_STORE_BYPASS_PRCTL) return -ENXIO;
- if (ctrl == PR_SPEC_ENABLE) - clear_tsk_thread_flag(task, TIF_RDS); - else - set_tsk_thread_flag(task, TIF_RDS); + switch (ctrl) { + case PR_SPEC_ENABLE: + /* If speculation is force disabled, enable is not allowed */ + if (task_spec_ssb_force_disable(task)) + return -EPERM; + task_clear_spec_ssb_disable(task); + update = test_and_clear_tsk_thread_flag(task, TIF_RDS); + break; + case PR_SPEC_DISABLE: + task_set_spec_ssb_disable(task); + update = !test_and_set_tsk_thread_flag(task, TIF_RDS); + break; + case PR_SPEC_FORCE_DISABLE: + task_set_spec_ssb_disable(task); + task_set_spec_ssb_force_disable(task); + update = !test_and_set_tsk_thread_flag(task, TIF_RDS); + break; + default: + return -ERANGE; + }
/* * If being set on non-current task, delay setting the CPU * mitigation until it is next scheduled. */ - if (task == current && rds != !!test_tsk_thread_flag(task, TIF_RDS)) + if (task == current && update) speculative_store_bypass_update();
return 0; @@ -596,7 +612,9 @@ static int ssb_prctl_get(struct task_str case SPEC_STORE_BYPASS_DISABLE: return PR_SPEC_DISABLE; case SPEC_STORE_BYPASS_PRCTL: - if (test_tsk_thread_flag(task, TIF_RDS)) + if (task_spec_ssb_force_disable(task)) + return PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE; + if (task_spec_ssb_disable(task)) return PR_SPEC_PRCTL | PR_SPEC_DISABLE; return PR_SPEC_PRCTL | PR_SPEC_ENABLE; default: @@ -609,9 +627,6 @@ static int ssb_prctl_get(struct task_str int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which, unsigned long ctrl) { - if (ctrl != PR_SPEC_ENABLE && ctrl != PR_SPEC_DISABLE) - return -ERANGE; - switch (which) { case PR_SPEC_STORE_BYPASS: return ssb_prctl_set(task, ctrl); --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -335,6 +335,9 @@ static inline void task_seccomp(struct s case PR_SPEC_NOT_AFFECTED: seq_printf(m, "not vulnerable"); break; + case PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE: + seq_printf(m, "thread force mitigated"); + break; case PR_SPEC_PRCTL | PR_SPEC_DISABLE: seq_printf(m, "thread mitigated"); break; --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1973,7 +1973,8 @@ static inline void memalloc_noio_restore #define PFA_NO_NEW_PRIVS 0 /* May not gain new privileges. */ #define PFA_SPREAD_PAGE 1 /* Spread page cache over cpuset */ #define PFA_SPREAD_SLAB 2 /* Spread some slab caches over cpuset */ - +#define PFA_SPEC_SSB_DISABLE 3 /* Speculative Store Bypass disabled */ +#define PFA_SPEC_SSB_FORCE_DISABLE 4 /* Speculative Store Bypass force disabled*/
#define TASK_PFA_TEST(name, func) \ static inline bool task_##func(struct task_struct *p) \ @@ -1996,6 +1997,13 @@ TASK_PFA_TEST(SPREAD_SLAB, spread_slab) TASK_PFA_SET(SPREAD_SLAB, spread_slab) TASK_PFA_CLEAR(SPREAD_SLAB, spread_slab)
+TASK_PFA_TEST(SPEC_SSB_DISABLE, spec_ssb_disable) +TASK_PFA_SET(SPEC_SSB_DISABLE, spec_ssb_disable) +TASK_PFA_CLEAR(SPEC_SSB_DISABLE, spec_ssb_disable) + +TASK_PFA_TEST(SPEC_SSB_FORCE_DISABLE, spec_ssb_force_disable) +TASK_PFA_SET(SPEC_SSB_FORCE_DISABLE, spec_ssb_force_disable) + /* * task->jobctl flags */ --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -162,5 +162,6 @@ # define PR_SPEC_PRCTL (1UL << 0) # define PR_SPEC_ENABLE (1UL << 1) # define PR_SPEC_DISABLE (1UL << 2) +# define PR_SPEC_FORCE_DISABLE (1UL << 3)
#endif /* _LINUX_PRCTL_H */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 6bf63a8ccb1dccd6ab81bc8bc46863493629cdb8 upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
This patch also increases the number of bits available for the swap offset.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: David Howells dhowells@redhat.com Cc: Koichi Yasutake yasutake.koichi@jp.panasonic.com Cc: Hugh Dickins hughd@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/mn10300/include/asm/pgtable.h | 17 +++-------------- 1 file changed, 3 insertions(+), 14 deletions(-)
--- a/arch/mn10300/include/asm/pgtable.h +++ b/arch/mn10300/include/asm/pgtable.h @@ -134,7 +134,6 @@ extern pte_t kernel_vmalloc_ptes[(VMALLO #define _PAGE_NX 0 /* no-execute bit */
/* If _PAGE_VALID is clear, we use these: */ -#define _PAGE_FILE xPTEL2_C /* set:pagecache unset:swap */ #define _PAGE_PROTNONE 0x000 /* If not present */
#define __PAGE_PROT_UWAUX 0x010 @@ -241,11 +240,6 @@ static inline int pte_young(pte_t pte) { static inline int pte_write(pte_t pte) { return pte_val(pte) & __PAGE_PROT_WRITE; } static inline int pte_special(pte_t pte){ return 0; }
-/* - * The following only works if pte_present() is not true. - */ -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } - static inline pte_t pte_rdprotect(pte_t pte) { pte_val(pte) &= ~(__PAGE_PROT_USER|__PAGE_PROT_UWAUX); return pte; @@ -338,16 +332,11 @@ static inline int pte_exec_kernel(pte_t return 1; }
-#define PTE_FILE_MAX_BITS 30 - -#define pte_to_pgoff(pte) (pte_val(pte) >> 2) -#define pgoff_to_pte(off) __pte((off) << 2 | _PAGE_FILE) - /* Encode and de-code a swap entry */ -#define __swp_type(x) (((x).val >> 2) & 0x3f) -#define __swp_offset(x) ((x).val >> 8) +#define __swp_type(x) (((x).val >> 1) & 0x3f) +#define __swp_offset(x) ((x).val >> 7) #define __swp_entry(type, offset) \ - ((swp_entry_t) { ((type) << 2) | ((offset) << 8) }) + ((swp_entry_t) { ((type) << 1) | ((offset) << 7) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) __pte((x).val)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
commit 1115a859f33276fe8afb31c60cf9d8e657872558 upstream.
Intel and AMD SPEC_CTRL (0x48) MSR semantics may differ in the future (or in fact use different MSRs for the same functionality).
As such a run-time mechanism is required to whitelist the appropriate MSR values.
[ tglx: Made the variable __ro_after_init ]
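As a worked illustration of the mask logic in the hunk below (SPEC_CTRL_IBRS is bit 0 and SPEC_CTRL_RDS bit 2 in this series): x86_spec_ctrl_mask holds the bits that are not allowed, so it starts as ~SPEC_CTRL_IBRS = ~0x1 and only bit 0 survives the "val & x86_spec_ctrl_mask" filter in x86_spec_ctrl_set(); anything else triggers the WARN_ONCE. When the Intel SSB mitigation is selected, the mask is ANDed with ~SPEC_CTRL_RDS and becomes ~0x5, whitelisting bit 2 as well, and the AP setup path simply writes x86_spec_ctrl_base with the still-disallowed bits masked off.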
Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org [bwh: Backported to 3.16: - We don't have __ro_after_init - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kernel/cpu/bugs.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -34,6 +34,12 @@ static void __init ssb_select_mitigation */ static u64 x86_spec_ctrl_base;
+/* + * The vendor and possibly platform specific bits which can be modified in + * x86_spec_ctrl_base. + */ +static u64 x86_spec_ctrl_mask = ~SPEC_CTRL_IBRS; + #ifdef CONFIG_X86_32
static double __initdata x = 4195835.0; @@ -179,7 +185,7 @@ static enum spectre_v2_mitigation spectr
void x86_spec_ctrl_set(u64 val) { - if (val & ~(SPEC_CTRL_IBRS | SPEC_CTRL_RDS)) + if (val & x86_spec_ctrl_mask) WARN_ONCE(1, "SPEC_CTRL MSR value 0x%16llx is unknown.\n", val); else wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base | val); @@ -497,6 +503,7 @@ static enum ssb_mitigation_cmd __init __ switch (boot_cpu_data.x86_vendor) { case X86_VENDOR_INTEL: x86_spec_ctrl_base |= SPEC_CTRL_RDS; + x86_spec_ctrl_mask &= ~SPEC_CTRL_RDS; x86_spec_ctrl_set(SPEC_CTRL_RDS); break; case X86_VENDOR_AMD: @@ -520,7 +527,7 @@ static void ssb_select_mitigation() void x86_spec_ctrl_setup_ap(void) { if (boot_cpu_has(X86_FEATURE_IBRS)) - x86_spec_ctrl_set(x86_spec_ctrl_base & (SPEC_CTRL_IBRS | SPEC_CTRL_RDS)); + x86_spec_ctrl_set(x86_spec_ctrl_base & ~x86_spec_ctrl_mask); }
#ifdef CONFIG_SYSFS
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Borislav Petkov <bp@suse.de>
commit cc69b34989210f067b2c51d5539b5f96ebcc3a01 upstream.
The function bodies are very similar and are going to grow more almost-identical code. Add a bool argument to determine whether SPEC_CTRL is being set for the guest or restored to the host.
No functional changes.
Signed-off-by: Borislav Petkov bp@suse.de Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/spec-ctrl.h | 33 +++++++++++++++--- arch/x86/kernel/cpu/bugs.c | 60 ++++++++------------------------ 2 files changed, 44 insertions(+), 49 deletions(-)
--- a/arch/x86/include/asm/spec-ctrl.h +++ b/arch/x86/include/asm/spec-ctrl.h @@ -13,10 +13,35 @@ * Takes the guest view of SPEC_CTRL MSR as a parameter and also * the guest's version of VIRT_SPEC_CTRL, if emulated. */ -extern void x86_spec_ctrl_set_guest(u64 guest_spec_ctrl, - u64 guest_virt_spec_ctrl); -extern void x86_spec_ctrl_restore_host(u64 guest_spec_ctrl, - u64 guest_virt_spec_ctrl); +extern void x86_virt_spec_ctrl(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl, bool guest); + +/** + * x86_spec_ctrl_set_guest - Set speculation control registers for the guest + * @guest_spec_ctrl: The guest content of MSR_SPEC_CTRL + * @guest_virt_spec_ctrl: The guest controlled bits of MSR_VIRT_SPEC_CTRL + * (may get translated to MSR_AMD64_LS_CFG bits) + * + * Avoids writing to the MSR if the content/bits are the same + */ +static inline +void x86_spec_ctrl_set_guest(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl) +{ + x86_virt_spec_ctrl(guest_spec_ctrl, guest_virt_spec_ctrl, true); +} + +/** + * x86_spec_ctrl_restore_host - Restore host speculation control registers + * @guest_spec_ctrl: The guest content of MSR_SPEC_CTRL + * @guest_virt_spec_ctrl: The guest controlled bits of MSR_VIRT_SPEC_CTRL + * (may get translated to MSR_AMD64_LS_CFG bits) + * + * Avoids writing to the MSR if the content/bits are the same + */ +static inline +void x86_spec_ctrl_restore_host(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl) +{ + x86_virt_spec_ctrl(guest_spec_ctrl, guest_virt_spec_ctrl, false); +}
/* AMD specific Speculative Store Bypass MSR data */ extern u64 x86_amd_ls_cfg_base; --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -212,55 +212,25 @@ u64 x86_spec_ctrl_get_default(void) } EXPORT_SYMBOL_GPL(x86_spec_ctrl_get_default);
-/** - * x86_spec_ctrl_set_guest - Set speculation control registers for the guest - * @guest_spec_ctrl: The guest content of MSR_SPEC_CTRL - * @guest_virt_spec_ctrl: The guest controlled bits of MSR_VIRT_SPEC_CTRL - * (may get translated to MSR_AMD64_LS_CFG bits) - * - * Avoids writing to the MSR if the content/bits are the same - */ -void x86_spec_ctrl_set_guest(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl) +void +x86_virt_spec_ctrl(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl, bool setguest) { - u64 host = x86_spec_ctrl_base; + struct thread_info *ti = current_thread_info(); + u64 msr, host = x86_spec_ctrl_base;
/* Is MSR_SPEC_CTRL implemented ? */ - if (!static_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) - return; - - /* SSBD controlled in MSR_SPEC_CTRL */ - if (static_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD)) - host |= ssbd_tif_to_spec_ctrl(current_thread_info()->flags); - - if (host != guest_spec_ctrl) - wrmsrl(MSR_IA32_SPEC_CTRL, guest_spec_ctrl); -} -EXPORT_SYMBOL_GPL(x86_spec_ctrl_set_guest); - -/** - * x86_spec_ctrl_restore_host - Restore host speculation control registers - * @guest_spec_ctrl: The guest content of MSR_SPEC_CTRL - * @guest_virt_spec_ctrl: The guest controlled bits of MSR_VIRT_SPEC_CTRL - * (may get translated to MSR_AMD64_LS_CFG bits) - * - * Avoids writing to the MSR if the content/bits are the same - */ -void x86_spec_ctrl_restore_host(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl) -{ - u64 host = x86_spec_ctrl_base; - - /* Is MSR_SPEC_CTRL implemented ? */ - if (!static_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) - return; - - /* SSBD controlled in MSR_SPEC_CTRL */ - if (static_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD)) - host |= ssbd_tif_to_spec_ctrl(current_thread_info()->flags); - - if (host != guest_spec_ctrl) - wrmsrl(MSR_IA32_SPEC_CTRL, host); + if (static_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) { + /* SSBD controlled in MSR_SPEC_CTRL */ + if (static_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD)) + host |= ssbd_tif_to_spec_ctrl(ti->flags); + + if (host != guest_spec_ctrl) { + msr = setguest ? guest_spec_ctrl : host; + wrmsrl(MSR_IA32_SPEC_CTRL, msr); + } + } } -EXPORT_SYMBOL_GPL(x86_spec_ctrl_restore_host); +EXPORT_SYMBOL_GPL(x86_virt_spec_ctrl);
static void x86_amd_ssb_disable(void) {
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Borislav Petkov <bp@suse.de>
commit dd0792699c4058e63c0715d9a7c2d40226fcdddc upstream.
Fix some typos, improve formulations, and end sentences with a full stop.
Signed-off-by: Borislav Petkov bp@suse.de Signed-off-by: Thomas Gleixner tglx@linutronix.de [bwh: Backported to 3.16: adjust filename] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- Documentation/spec_ctrl.rst | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-)
--- a/Documentation/spec_ctrl.rst +++ b/Documentation/spec_ctrl.rst @@ -2,13 +2,13 @@ Speculation Control ===================
-Quite some CPUs have speculation related misfeatures which are in fact -vulnerabilites causing data leaks in various forms even accross privilege -domains. +Quite some CPUs have speculation-related misfeatures which are in +fact vulnerabilities causing data leaks in various forms even across +privilege domains.
The kernel provides mitigation for such vulnerabilities in various -forms. Some of these mitigations are compile time configurable and some on -the kernel command line. +forms. Some of these mitigations are compile-time configurable and some +can be supplied on the kernel command line.
There is also a class of mitigations which are very expensive, but they can be restricted to a certain set of processes or tasks in controlled @@ -32,18 +32,18 @@ the following meaning: Bit Define Description ==== ===================== =================================================== 0 PR_SPEC_PRCTL Mitigation can be controlled per task by - PR_SET_SPECULATION_CTRL + PR_SET_SPECULATION_CTRL. 1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is - disabled + disabled. 2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is - enabled + enabled. 3 PR_SPEC_FORCE_DISABLE Same as PR_SPEC_DISABLE, but cannot be undone. A subsequent prctl(..., PR_SPEC_ENABLE) will fail. ==== ===================== ===================================================
If all bits are 0 the CPU is not affected by the speculation misfeature.
-If PR_SPEC_PRCTL is set, then the per task control of the mitigation is +If PR_SPEC_PRCTL is set, then the per-task control of the mitigation is available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation misfeature will fail.
@@ -61,9 +61,9 @@ Common error codes Value Meaning ======= ================================================================= EINVAL The prctl is not implemented by the architecture or unused - prctl(2) arguments are not 0 + prctl(2) arguments are not 0.
-ENODEV arg2 is selecting a not supported speculation misfeature +ENODEV arg2 is selecting a not supported speculation misfeature. ======= =================================================================
PR_SET_SPECULATION_CTRL error codes @@ -74,7 +74,7 @@ Value Meaning 0 Success
ERANGE arg3 is incorrect, i.e. it's neither PR_SPEC_ENABLE nor - PR_SPEC_DISABLE nor PR_SPEC_FORCE_DISABLE + PR_SPEC_DISABLE nor PR_SPEC_FORCE_DISABLE.
ENXIO Control of the selected speculation misfeature is not possible. See PR_GET_SPECULATION_CTRL.
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andi Kleen <ak@linux.intel.com>
commit 377eeaa8e11fe815b1d07c81c4a0e2843a8c15eb upstream.
For the L1TF workaround it's necessary to limit the swap file size to below MAX_PA/2, so that the higher bits of the swap offset, once inverted, never point to valid memory.
Add a mechanism for the architecture to override the swap file size check in swapfile.c, and add an x86-specific max swapfile check function that enforces that limit.
The check is only enabled if the CPU is vulnerable to L1TF.
In VMs with a 42-bit MAX_PA the typical limit is now 2TB; on a native system with a 46-bit PA it is 32TB. The limit applies only per individual swap file, so it is always possible to exceed these limits with multiple swap files or partitions.
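A quick sanity check of those numbers, assuming 4KiB pages: with a 46-bit physical address space, MAX_PA/2 = 2^45 bytes = 32TiB, which is 2^33 pages, i.e. l1tf_pfn_limit() + 1 in the hunk below; with a 42-bit MAX_PA the same calculation gives 2^41 bytes = 2TiB.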
Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Josh Poimboeuf jpoimboe@redhat.com Acked-by: Michal Hocko mhocko@suse.com Acked-by: Dave Hansen dave.hansen@intel.com [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -4,6 +4,8 @@ #include <linux/swap.h> #include <linux/memblock.h> #include <linux/bootmem.h> /* for max_low_pfn */ +#include <linux/swapfile.h> +#include <linux/swapops.h>
#include <asm/cacheflush.h> #include <asm/e820.h> @@ -699,3 +701,15 @@ void __init zone_sizes_init(void) free_area_init_nodes(max_zone_pfns); }
+unsigned long max_swapfile_size(void) +{ + unsigned long pages; + + pages = generic_max_swapfile_size(); + + if (boot_cpu_has_bug(X86_BUG_L1TF)) { + /* Limit the swap file size to MAX_PA/2 for L1TF workaround */ + pages = min_t(unsigned long, l1tf_pfn_limit() + 1, pages); + } + return pages; +} --- a/include/linux/swapfile.h +++ b/include/linux/swapfile.h @@ -9,5 +9,7 @@ extern spinlock_t swap_lock; extern struct plist_head swap_active_head; extern struct swap_info_struct *swap_info[]; extern int try_to_unuse(unsigned int, bool, unsigned long); +extern unsigned long generic_max_swapfile_size(void); +extern unsigned long max_swapfile_size(void);
#endif /* _LINUX_SWAPFILE_H */ --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2166,6 +2166,35 @@ static int claim_swapfile(struct swap_in return 0; }
+ +/* + * Find out how many pages are allowed for a single swap device. There + * are two limiting factors: + * 1) the number of bits for the swap offset in the swp_entry_t type, and + * 2) the number of bits in the swap pte, as defined by the different + * architectures. + * + * In order to find the largest possible bit mask, a swap entry with + * swap type 0 and swap offset ~0UL is created, encoded to a swap pte, + * decoded to a swp_entry_t again, and finally the swap offset is + * extracted. + * + * This will mask all the bits from the initial ~0UL mask that can't + * be encoded in either the swp_entry_t or the architecture definition + * of a swap pte. + */ +unsigned long generic_max_swapfile_size(void) +{ + return swp_offset(pte_to_swp_entry( + swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1; +} + +/* Can be overridden by an architecture for additional checks. */ +__weak unsigned long max_swapfile_size(void) +{ + return generic_max_swapfile_size(); +} + static unsigned long read_swap_header(struct swap_info_struct *p, union swap_header *swap_header, struct inode *inode) @@ -2201,22 +2230,7 @@ static unsigned long read_swap_header(st p->cluster_next = 1; p->cluster_nr = 0;
- /* - * Find out how many pages are allowed for a single swap - * device. There are two limiting factors: 1) the number - * of bits for the swap offset in the swp_entry_t type, and - * 2) the number of bits in the swap pte as defined by the - * different architectures. In order to find the - * largest possible bit mask, a swap entry with swap type 0 - * and swap offset ~0UL is created, encoded to a swap pte, - * decoded to a swp_entry_t again, and finally the swap - * offset is extracted. This will mask all the bits from - * the initial ~0UL mask that can't be encoded in either - * the swp_entry_t or the architecture definition of a - * swap pte. - */ - maxpages = swp_offset(pte_to_swp_entry( - swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1; + maxpages = max_swapfile_size(); last_page = swap_header->info.last_page; if (last_page > maxpages) { pr_warn("Truncating oversized swap area, only using %luk out of %luk\n",
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner <tglx@linutronix.de>
commit fa8ac4988249c38476f6ad678a4848a736373403 upstream.
x86_spec_ctrl_base is the system-wide default value for the SPEC_CTRL MSR. x86_spec_ctrl_get_default() returns x86_spec_ctrl_base and was intended to prevent modification of that variable. However, the variable is read-only after init and already globally visible.
Remove the function and export the variable instead.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/nospec-branch.h | 16 +++++----------- arch/x86/include/asm/spec-ctrl.h | 3 --- arch/x86/kernel/cpu/bugs.c | 11 +---------- 3 files changed, 6 insertions(+), 24 deletions(-)
--- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -172,16 +172,7 @@ enum spectre_v2_mitigation { SPECTRE_V2_IBRS, };
-/* - * The Intel specification for the SPEC_CTRL MSR requires that we - * preserve any already set reserved bits at boot time (e.g. for - * future additions that this kernel is not currently aware of). - * We then set any additional mitigation bits that we want - * ourselves and always use this as the base for SPEC_CTRL. - * We also use this when handling guest entry/exit as below. - */ extern void x86_spec_ctrl_set(u64); -extern u64 x86_spec_ctrl_get_default(void);
/* The Speculative Store Bypass disable variants */ enum ssb_mitigation { @@ -232,6 +223,9 @@ static inline void indirect_branch_predi alternative_msr_write(MSR_IA32_PRED_CMD, val, X86_FEATURE_USE_IBPB); }
+/* The Intel SPEC CTRL MSR base value cache */ +extern u64 x86_spec_ctrl_base; + /* * With retpoline, we must use IBRS to restrict branch prediction * before calling into firmware. @@ -240,7 +234,7 @@ static inline void indirect_branch_predi */ #define firmware_restrict_branch_speculation_start() \ do { \ - u64 val = x86_spec_ctrl_get_default() | SPEC_CTRL_IBRS; \ + u64 val = x86_spec_ctrl_base | SPEC_CTRL_IBRS; \ \ preempt_disable(); \ alternative_msr_write(MSR_IA32_SPEC_CTRL, val, \ @@ -249,7 +243,7 @@ do { \
#define firmware_restrict_branch_speculation_end() \ do { \ - u64 val = x86_spec_ctrl_get_default(); \ + u64 val = x86_spec_ctrl_base; \ \ alternative_msr_write(MSR_IA32_SPEC_CTRL, val, \ X86_FEATURE_USE_IBRS_FW); \ --- a/arch/x86/include/asm/spec-ctrl.h +++ b/arch/x86/include/asm/spec-ctrl.h @@ -47,9 +47,6 @@ void x86_spec_ctrl_restore_host(u64 gues extern u64 x86_amd_ls_cfg_base; extern u64 x86_amd_ls_cfg_ssbd_mask;
-/* The Intel SPEC CTRL MSR base value cache */ -extern u64 x86_spec_ctrl_base; - static inline u64 ssbd_tif_to_spec_ctrl(u64 tifn) { BUILD_BUG_ON(TIF_SSBD < SPEC_CTRL_SSBD_SHIFT); --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -35,6 +35,7 @@ static void __init ssb_select_mitigation * writes to SPEC_CTRL contain whatever reserved bits have been set. */ u64 x86_spec_ctrl_base; +EXPORT_SYMBOL_GPL(x86_spec_ctrl_base);
/* * The vendor and possibly platform specific bits which can be modified in @@ -202,16 +203,6 @@ void x86_spec_ctrl_set(u64 val) } EXPORT_SYMBOL_GPL(x86_spec_ctrl_set);
-u64 x86_spec_ctrl_get_default(void) -{ - u64 msrval = x86_spec_ctrl_base; - - if (static_cpu_has(X86_FEATURE_SPEC_CTRL)) - msrval |= ssbd_tif_to_spec_ctrl(current_thread_info()->flags); - return msrval; -} -EXPORT_SYMBOL_GPL(x86_spec_ctrl_get_default); - void x86_virt_spec_ctrl(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl, bool setguest) {
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 7a7d2db4b8b3505a3195178619ffcc80985c4be1 upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Haavard Skinnemoen hskinnemoen@gmail.com Acked-by: Hans-Christian Egtvedt egtvedt@samfundet.no Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/avr32/include/asm/pgtable.h | 25 ------------------------- 1 file changed, 25 deletions(-)
--- a/arch/avr32/include/asm/pgtable.h +++ b/arch/avr32/include/asm/pgtable.h @@ -86,9 +86,6 @@ extern struct page *empty_zero_page; #define _PAGE_BIT_PRESENT 10 #define _PAGE_BIT_ACCESSED 11 /* software: page was accessed */
-/* The following flags are only valid when !PRESENT */ -#define _PAGE_BIT_FILE 0 /* software: pagecache or swap? */ - #define _PAGE_WT (1 << _PAGE_BIT_WT) #define _PAGE_DIRTY (1 << _PAGE_BIT_DIRTY) #define _PAGE_EXECUTE (1 << _PAGE_BIT_EXECUTE) @@ -101,7 +98,6 @@ extern struct page *empty_zero_page; /* Software flags */ #define _PAGE_ACCESSED (1 << _PAGE_BIT_ACCESSED) #define _PAGE_PRESENT (1 << _PAGE_BIT_PRESENT) -#define _PAGE_FILE (1 << _PAGE_BIT_FILE)
/* * Page types, i.e. sizes. _PAGE_TYPE_NONE corresponds to what is @@ -210,14 +206,6 @@ static inline int pte_special(pte_t pte) return 0; }
-/* - * The following only work if pte_present() is not true. - */ -static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & _PAGE_FILE; -} - /* Mutator functions for PTE bits */ static inline pte_t pte_wrprotect(pte_t pte) { @@ -329,7 +317,6 @@ extern void update_mmu_cache(struct vm_a * Encode and decode a swap entry * * Constraints: - * _PAGE_FILE at bit 0 * _PAGE_TYPE_* at bits 2-3 (for emulating _PAGE_PROTNONE) * _PAGE_PRESENT at bit 10 * @@ -346,18 +333,6 @@ extern void update_mmu_cache(struct vm_a #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-/* - * Encode and decode a nonlinear file mapping entry. We have to - * preserve _PAGE_FILE and _PAGE_PRESENT here. _PAGE_TYPE_* isn't - * necessary, since _PAGE_FILE implies !_PAGE_PROTNONE (?) - */ -#define PTE_FILE_MAX_BITS 30 -#define pte_to_pgoff(pte) (((pte_val(pte) >> 1) & 0x1ff) \ - | ((pte_val(pte) >> 11) << 9)) -#define pgoff_to_pte(off) ((pte_t) { ((((off) & 0x1ff) << 1) \ - | (((off) >> 9) << 11) \ - | _PAGE_FILE) }) - typedef pte_t *pte_addr_t;
#define kern_addr_valid(addr) (1)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Vlastimil Babka <vbabka@suse.cz>
commit 6a012288d6906fee1dbc244050ade1dafe4a9c8d upstream.
Two users have reported [1] that they have an "extremely unlikely" system with more than MAX_PA/2 memory and that the L1TF mitigation is not effective there.
Make the warning more helpful by suggesting the proper mem=X kernel boot parameter to make the mitigation effective, and by adding a link to the L1TF document to help decide whether the mitigation is worth the unusable RAM.
[1] https://bugzilla.suse.com/show_bug.cgi?id=1105536
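As a rough, hypothetical example of the resulting hint (not taken from the report): on a CPU with 46-bit physical addressing, half of MAX_PA is about 2^45 bytes, so the printed suggestion would amount to booting with roughly mem=35184372088832, after which the RAM above that boundary is simply not used.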
Suggested-by: Michal Hocko mhocko@suse.com Signed-off-by: Vlastimil Babka vbabka@suse.cz Acked-by: Michal Hocko mhocko@suse.com Cc: "H . Peter Anvin" hpa@zytor.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Andi Kleen ak@linux.intel.com Cc: Dave Hansen dave.hansen@intel.com Link: https://lkml.kernel.org/r/966571f0-9d7f-43dc-92c6-a10eec7a1254@suse.cz Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kernel/cpu/bugs.c | 4 ++++ 1 file changed, 4 insertions(+)
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -717,6 +717,10 @@ static void __init l1tf_select_mitigatio half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT; if (e820_any_mapped(half_pa, ULLONG_MAX - half_pa, E820_RAM)) { pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n"); + pr_info("You may make it effective by booting the kernel with mem=%llu parameter.\n", + half_pa); + pr_info("However, doing so will make a part of your RAM unusable.\n"); + pr_info("Reading https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html might help you decide.\n"); return; }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Dave Airlie <airlied@redhat.com>
commit 7cf321d118a825c1541b43ca45294126fd474efa upstream.
This fixes a regression in all these drivers since the cache mode tracking was fixed for mixed mappings. It uses the new arch API to add the VRAM range to the PAT mapping tracking tables.
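The per-driver changes below all follow the same reserve/free pairing around the VRAM aperture; a condensed, hypothetical sketch of the pattern (driver name and fields are made up):

/* Illustrative fragment only; mirrors the hunks below. */
static int foo_vram_init(struct foo_device *fdev)
{
	/* Record the whole VRAM aperture as write-combined in PAT ... */
	arch_io_reserve_memtype_wc(fdev->vram_base, fdev->vram_size);
	/* ... and keep the MTRR fallback for systems without PAT. */
	fdev->vram_mtrr = arch_phys_wc_add(fdev->vram_base, fdev->vram_size);
	return 0;
}

static void foo_vram_fini(struct foo_device *fdev)
{
	arch_phys_wc_del(fdev->vram_mtrr);
	arch_io_free_memtype_wc(fdev->vram_base, fdev->vram_size);
}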
Fixes: 87744ab3832 (mm: fix cache mode tracking in vm_insert_mixed()) Reviewed-by: Christian König christian.koenig@amd.com. Signed-off-by: Dave Airlie airlied@redhat.com [bwh: Backported to 3.16: - Drop changes in amdgpu - In nouveau, use struct nouveau_device * and nv_device_resource_{start,len}() - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- --- a/drivers/gpu/drm/ast/ast_ttm.c +++ b/drivers/gpu/drm/ast/ast_ttm.c @@ -275,6 +275,8 @@ int ast_mm_init(struct ast_private *ast) return ret; }
+ arch_io_reserve_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); ast->fb_mtrr = arch_phys_wc_add(pci_resource_start(dev->pdev, 0), pci_resource_len(dev->pdev, 0));
@@ -283,11 +285,15 @@ int ast_mm_init(struct ast_private *ast)
void ast_mm_fini(struct ast_private *ast) { + struct drm_device *dev = ast->dev; + ttm_bo_device_release(&ast->ttm.bdev);
ast_ttm_global_release(ast);
arch_phys_wc_del(ast->fb_mtrr); + arch_io_free_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); }
void ast_ttm_placement(struct ast_bo *bo, int domain) --- a/drivers/gpu/drm/cirrus/cirrus_ttm.c +++ b/drivers/gpu/drm/cirrus/cirrus_ttm.c @@ -275,6 +275,9 @@ int cirrus_mm_init(struct cirrus_device return ret; }
+ arch_io_reserve_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); + cirrus->fb_mtrr = arch_phys_wc_add(pci_resource_start(dev->pdev, 0), pci_resource_len(dev->pdev, 0));
@@ -284,6 +287,8 @@ int cirrus_mm_init(struct cirrus_device
void cirrus_mm_fini(struct cirrus_device *cirrus) { + struct drm_device *dev = cirrus->dev; + if (!cirrus->mm_inited) return;
@@ -293,6 +298,8 @@ void cirrus_mm_fini(struct cirrus_device
arch_phys_wc_del(cirrus->fb_mtrr); cirrus->fb_mtrr = 0; + arch_io_free_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); }
void cirrus_ttm_placement(struct cirrus_bo *bo, int domain) --- a/drivers/gpu/drm/mgag200/mgag200_ttm.c +++ b/drivers/gpu/drm/mgag200/mgag200_ttm.c @@ -274,6 +274,9 @@ int mgag200_mm_init(struct mga_device *m return ret; }
+ arch_io_reserve_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); + mdev->fb_mtrr = arch_phys_wc_add(pci_resource_start(dev->pdev, 0), pci_resource_len(dev->pdev, 0));
@@ -282,10 +285,14 @@ int mgag200_mm_init(struct mga_device *m
void mgag200_mm_fini(struct mga_device *mdev) { + struct drm_device *dev = mdev->dev; + ttm_bo_device_release(&mdev->ttm.bdev);
mgag200_ttm_global_release(mdev);
+ arch_io_free_memtype_wc(pci_resource_start(dev->pdev, 0), + pci_resource_len(dev->pdev, 0)); arch_phys_wc_del(mdev->fb_mtrr); mdev->fb_mtrr = 0; } --- a/drivers/gpu/drm/nouveau/nouveau_ttm.c +++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c @@ -397,6 +397,9 @@ nouveau_ttm_init(struct nouveau_drm *drm drm->gem.vram_available = nouveau_fb(drm->device)->ram->size; drm->gem.vram_available -= nouveau_instmem(drm->device)->reserved;
+ arch_io_reserve_memtype_wc(nv_device_resource_start(device, 1), + nv_device_resource_len(device, 1)); + ret = ttm_bo_init_mm(&drm->ttm.bdev, TTM_PL_VRAM, drm->gem.vram_available >> PAGE_SHIFT); if (ret) { @@ -429,6 +432,8 @@ nouveau_ttm_init(struct nouveau_drm *drm void nouveau_ttm_fini(struct nouveau_drm *drm) { + struct nouveau_device *device = nv_device(drm->device); + mutex_lock(&drm->dev->struct_mutex); ttm_bo_clean_mm(&drm->ttm.bdev, TTM_PL_VRAM); ttm_bo_clean_mm(&drm->ttm.bdev, TTM_PL_TT); @@ -440,4 +445,7 @@ nouveau_ttm_fini(struct nouveau_drm *drm
arch_phys_wc_del(drm->ttm.mtrr); drm->ttm.mtrr = 0; + arch_io_free_memtype_wc(nv_device_resource_start(device, 1), + nv_device_resource_len(device, 1)); + } --- a/drivers/gpu/drm/radeon/radeon_object.c +++ b/drivers/gpu/drm/radeon/radeon_object.c @@ -359,6 +359,10 @@ void radeon_bo_force_delete(struct radeo
int radeon_bo_init(struct radeon_device *rdev) { + /* reserve PAT memory space to WC for VRAM */ + arch_io_reserve_memtype_wc(rdev->mc.aper_base, + rdev->mc.aper_size); + /* Add an MTRR for the VRAM */ if (!rdev->fastfb_working) { rdev->mc.vram_mtrr = arch_phys_wc_add(rdev->mc.aper_base, @@ -376,6 +380,7 @@ void radeon_bo_fini(struct radeon_device { radeon_ttm_fini(rdev); arch_phys_wc_del(rdev->mc.vram_mtrr); + arch_io_free_memtype_wc(rdev->mc.aper_base, rdev->mc.aper_size); }
/* Returns how many bytes TTM can move per IB.
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 0a191362058391878cc2a4d4ccddcd8223eb4f79 upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Ingo Molnar mingo@redhat.com Cc: "H. Peter Anvin" hpa@zytor.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/pgtable-2level.h | 38 +-------------------------- arch/x86/include/asm/pgtable-3level.h | 12 --------- arch/x86/include/asm/pgtable.h | 20 -------------- arch/x86/include/asm/pgtable_64.h | 6 +---- arch/x86/include/asm/pgtable_types.h | 3 --- 5 files changed, 2 insertions(+), 77 deletions(-)
--- a/arch/x86/include/asm/pgtable-2level.h +++ b/arch/x86/include/asm/pgtable-2level.h @@ -62,44 +62,8 @@ static inline unsigned long pte_bitop(un return ((value >> rightshift) & mask) << leftshift; }
-/* - * Bits _PAGE_BIT_PRESENT, _PAGE_BIT_FILE and _PAGE_BIT_PROTNONE are taken, - * split up the 29 bits of offset into this range. - */ -#define PTE_FILE_MAX_BITS 29 -#define PTE_FILE_SHIFT1 (_PAGE_BIT_PRESENT + 1) -#define PTE_FILE_SHIFT2 (_PAGE_BIT_FILE + 1) -#define PTE_FILE_SHIFT3 (_PAGE_BIT_PROTNONE + 1) -#define PTE_FILE_BITS1 (PTE_FILE_SHIFT2 - PTE_FILE_SHIFT1 - 1) -#define PTE_FILE_BITS2 (PTE_FILE_SHIFT3 - PTE_FILE_SHIFT2 - 1) - -#define PTE_FILE_MASK1 ((1U << PTE_FILE_BITS1) - 1) -#define PTE_FILE_MASK2 ((1U << PTE_FILE_BITS2) - 1) - -#define PTE_FILE_LSHIFT2 (PTE_FILE_BITS1) -#define PTE_FILE_LSHIFT3 (PTE_FILE_BITS1 + PTE_FILE_BITS2) - -static __always_inline pgoff_t pte_to_pgoff(pte_t pte) -{ - return (pgoff_t) - (pte_bitop(pte.pte_low, PTE_FILE_SHIFT1, PTE_FILE_MASK1, 0) + - pte_bitop(pte.pte_low, PTE_FILE_SHIFT2, PTE_FILE_MASK2, PTE_FILE_LSHIFT2) + - pte_bitop(pte.pte_low, PTE_FILE_SHIFT3, -1UL, PTE_FILE_LSHIFT3)); -} - -static __always_inline pte_t pgoff_to_pte(pgoff_t off) -{ - return (pte_t){ - .pte_low = - pte_bitop(off, 0, PTE_FILE_MASK1, PTE_FILE_SHIFT1) + - pte_bitop(off, PTE_FILE_LSHIFT2, PTE_FILE_MASK2, PTE_FILE_SHIFT2) + - pte_bitop(off, PTE_FILE_LSHIFT3, -1UL, PTE_FILE_SHIFT3) + - _PAGE_FILE, - }; -} - /* Encode and de-code a swap entry */ -#define SWP_TYPE_BITS (_PAGE_BIT_FILE - _PAGE_BIT_PRESENT - 1) +#define SWP_TYPE_BITS 5 #define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 1)
#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS) --- a/arch/x86/include/asm/pgtable-3level.h +++ b/arch/x86/include/asm/pgtable-3level.h @@ -176,18 +176,6 @@ static inline pmd_t native_pmdp_get_and_ #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp) #endif
-/* - * Bits 0, 6 and 7 are taken in the low part of the pte, - * put the 32 bits of offset into the high part. - * - * For soft-dirty tracking 11 bit is taken from - * the low part of pte as well. - */ -#define pte_to_pgoff(pte) ((pte).pte_high) -#define pgoff_to_pte(off) \ - ((pte_t) { { .pte_low = _PAGE_FILE, .pte_high = (off) } }) -#define PTE_FILE_MAX_BITS 32 - /* Encode and de-code a swap entry */ #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5) #define __swp_type(x) (((x).val) & 0x1f) --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -115,11 +115,6 @@ static inline int pte_write(pte_t pte) return pte_flags(pte) & _PAGE_RW; }
-static inline int pte_file(pte_t pte) -{ - return pte_flags(pte) & _PAGE_FILE; -} - static inline int pte_huge(pte_t pte) { return pte_flags(pte) & _PAGE_PSE; @@ -329,21 +324,6 @@ static inline pmd_t pmd_mksoft_dirty(pmd return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY); }
-static inline pte_t pte_file_clear_soft_dirty(pte_t pte) -{ - return pte_clear_flags(pte, _PAGE_SOFT_DIRTY); -} - -static inline pte_t pte_file_mksoft_dirty(pte_t pte) -{ - return pte_set_flags(pte, _PAGE_SOFT_DIRTY); -} - -static inline int pte_file_soft_dirty(pte_t pte) -{ - return pte_flags(pte) & _PAGE_SOFT_DIRTY; -} - #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
/* --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -155,10 +155,6 @@ static inline int pgd_large(pgd_t pgd) { /* PUD - Level3 access */
/* PMD - Level 2 access */ -#define pte_to_pgoff(pte) ((pte_val((pte)) & PHYSICAL_PAGE_MASK) >> PAGE_SHIFT) -#define pgoff_to_pte(off) ((pte_t) { .pte = ((off) << PAGE_SHIFT) | \ - _PAGE_FILE }) -#define PTE_FILE_MAX_BITS __PHYSICAL_MASK_SHIFT
/* PTE - Level 1 access. */
@@ -167,7 +163,7 @@ static inline int pgd_large(pgd_t pgd) { #define pte_unmap(pte) ((void)(pte))/* NOP */
/* Encode and de-code a swap entry */ -#define SWP_TYPE_BITS (_PAGE_BIT_FILE - _PAGE_BIT_PRESENT - 1) +#define SWP_TYPE_BITS 5 #ifdef CONFIG_NUMA_BALANCING /* Automatic NUMA balancing needs to be distinguishable from swap entries */ #define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 2) --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -39,8 +39,6 @@ /* If _PAGE_BIT_PRESENT is clear, we use these: */ /* - if the user mapped it with PROT_NONE; pte_present gives true */ #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL -/* - set: nonlinear file mapping, saved PTE; unset:swap */ -#define _PAGE_BIT_FILE _PAGE_BIT_DIRTY
#define _PAGE_PRESENT (_AT(pteval_t, 1) << _PAGE_BIT_PRESENT) #define _PAGE_RW (_AT(pteval_t, 1) << _PAGE_BIT_RW) @@ -115,7 +113,6 @@ #define _PAGE_NX (_AT(pteval_t, 0)) #endif
-#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE) #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Vincent Pelletier <plr.vincent@gmail.com>
commit 1816494330a83f2a064499d8ed2797045641f92c upstream.
This change has the following effects, in order of decreasing importance:
1) Prevent a stack buffer overflow
2) Do not append an unnecessary NULL byte to an already binary buffer, which writes one byte past client_digest when the caller is: chap_string_to_hex(client_digest, chap_r, strlen(chap_r));
The latter was found by KASAN (see below) when the input value has the expected size (32 hex chars), and further analysis revealed that a stack buffer overflow can happen when the network-received value is longer, allowing an unauthenticated remote attacker to smash up to 17 bytes after the destination buffer (16 attacker-controlled bytes and one NUL). As switching to hex2bin requires specifying the destination buffer length and does not internally append any NUL, it solves both issues.
This addresses CVE-2018-14633.
Beyond this:
- Validate the received value's length and check that hex2bin accepted the input, so the rejection reason gets logged instead of authentication just failing.
- Only log the received CHAP_R and CHAP_C values once they have passed the sanity checks.
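The sanity checks amount to 'exact expected length, hex digits only, before anything is logged or copied'; a small standalone sketch of that validation logic follows. demo_hex2bin() is a simplified stand-in for the kernel's hex2bin(), written here only so the example is self-contained:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MD5_SIGNATURE_SIZE 16	/* bytes, i.e. 32 hex characters on the wire */

/* Simplified stand-in for hex2bin(): 0 on success, -1 on a non-hex digit. */
static int demo_hex2bin(unsigned char *dst, const char *src, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (!isxdigit((unsigned char)src[2 * i]) ||
		    !isxdigit((unsigned char)src[2 * i + 1]))
			return -1;
		sscanf(&src[2 * i], "%2hhx", &dst[i]);
	}
	return 0;
}

/* Mirror of the new CHAP_R handling: reject anything that is not exactly
 * 2 * MD5_SIGNATURE_SIZE hex characters before touching the digest buffer. */
static int parse_chap_r(const char *chap_r, unsigned char *digest)
{
	if (strlen(chap_r) != 2 * MD5_SIGNATURE_SIZE)
		return -1;	/* malformed: wrong length */
	if (demo_hex2bin(digest, chap_r, MD5_SIGNATURE_SIZE) < 0)
		return -1;	/* malformed: non-hex character */
	return 0;
}

int main(void)
{
	unsigned char digest[MD5_SIGNATURE_SIZE];

	printf("expected size: %d\n",
	       parse_chap_r("0123456789abcdef0123456789abcdef", digest));
	printf("too long:      %d\n",
	       parse_chap_r("0123456789abcdef0123456789abcdef00", digest));
	printf("non-hex:       %d\n",
	       parse_chap_r("zz23456789abcdef0123456789abcdef", digest));
	return 0;
}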
================================================================== BUG: KASAN: stack-out-of-bounds in chap_string_to_hex+0x32/0x60 [iscsi_target_mod] Write of size 1 at addr ffff8801090ef7c8 by task kworker/0:0/1021
CPU: 0 PID: 1021 Comm: kworker/0:0 Tainted: G O 4.17.8kasan.sess.connops+ #2 Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 05/19/2014 Workqueue: events iscsi_target_do_login_rx [iscsi_target_mod] Call Trace: dump_stack+0x71/0xac print_address_description+0x65/0x22e ? chap_string_to_hex+0x32/0x60 [iscsi_target_mod] kasan_report.cold.6+0x241/0x2fd chap_string_to_hex+0x32/0x60 [iscsi_target_mod] chap_server_compute_md5.isra.2+0x2cb/0x860 [iscsi_target_mod] ? chap_binaryhex_to_asciihex.constprop.5+0x50/0x50 [iscsi_target_mod] ? ftrace_caller_op_ptr+0xe/0xe ? __orc_find+0x6f/0xc0 ? unwind_next_frame+0x231/0x850 ? kthread+0x1a0/0x1c0 ? ret_from_fork+0x35/0x40 ? ret_from_fork+0x35/0x40 ? iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod] ? deref_stack_reg+0xd0/0xd0 ? iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod] ? is_module_text_address+0xa/0x11 ? kernel_text_address+0x4c/0x110 ? __save_stack_trace+0x82/0x100 ? ret_from_fork+0x35/0x40 ? save_stack+0x8c/0xb0 ? 0xffffffffc1660000 ? iscsi_target_do_login+0x155/0x8d0 [iscsi_target_mod] ? iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod] ? process_one_work+0x35c/0x640 ? worker_thread+0x66/0x5d0 ? kthread+0x1a0/0x1c0 ? ret_from_fork+0x35/0x40 ? iscsi_update_param_value+0x80/0x80 [iscsi_target_mod] ? iscsit_release_cmd+0x170/0x170 [iscsi_target_mod] chap_main_loop+0x172/0x570 [iscsi_target_mod] ? chap_server_compute_md5.isra.2+0x860/0x860 [iscsi_target_mod] ? rx_data+0xd6/0x120 [iscsi_target_mod] ? iscsit_print_session_params+0xd0/0xd0 [iscsi_target_mod] ? cyc2ns_read_begin.part.2+0x90/0x90 ? _raw_spin_lock_irqsave+0x25/0x50 ? memcmp+0x45/0x70 iscsi_target_do_login+0x875/0x8d0 [iscsi_target_mod] ? iscsi_target_check_first_request.isra.5+0x1a0/0x1a0 [iscsi_target_mod] ? del_timer+0xe0/0xe0 ? memset+0x1f/0x40 ? flush_sigqueue+0x29/0xd0 iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod] ? iscsi_target_nego_release+0x80/0x80 [iscsi_target_mod] ? iscsi_target_restore_sock_callbacks+0x130/0x130 [iscsi_target_mod] process_one_work+0x35c/0x640 worker_thread+0x66/0x5d0 ? flush_rcu_work+0x40/0x40 kthread+0x1a0/0x1c0 ? kthread_bind+0x30/0x30 ret_from_fork+0x35/0x40
The buggy address belongs to the page: page:ffffea0004243bc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x17fffc000000000() raw: 017fffc000000000 0000000000000000 0000000000000000 00000000ffffffff raw: ffffea0004243c20 ffffea0004243ba0 0000000000000000 0000000000000000 page dumped because: kasan: bad access detected
Memory state around the buggy address: ffff8801090ef680: f2 f2 f2 f2 f2 f2 f2 01 f2 f2 f2 f2 f2 f2 f2 00 ffff8801090ef700: f2 f2 f2 f2 f2 f2 f2 00 02 f2 f2 f2 f2 f2 f2 00
ffff8801090ef780: 00 f2 f2 f2 f2 f2 f2 00 00 f2 f2 f2 f2 f2 f2 00
^ ffff8801090ef800: 00 f2 f2 f2 f2 f2 f2 00 00 00 00 02 f2 f2 f2 f2 ffff8801090ef880: f2 f2 f2 00 00 00 00 00 00 00 00 f2 f2 f2 f2 00 ==================================================================
Signed-off-by: Vincent Pelletier plr.vincent@gmail.com Reviewed-by: Mike Christie mchristi@redhat.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Ben Hutchings ben@decadent.org.uk --- drivers/target/iscsi/iscsi_target_auth.c | 30 +++++++++++------------- 1 file changed, 14 insertions(+), 16 deletions(-)
--- a/drivers/target/iscsi/iscsi_target_auth.c +++ b/drivers/target/iscsi/iscsi_target_auth.c @@ -26,18 +26,6 @@ #include "iscsi_target_nego.h" #include "iscsi_target_auth.h"
-static int chap_string_to_hex(unsigned char *dst, unsigned char *src, int len) -{ - int j = DIV_ROUND_UP(len, 2), rc; - - rc = hex2bin(dst, src, j); - if (rc < 0) - pr_debug("CHAP string contains non hex digit symbols\n"); - - dst[j] = '\0'; - return j; -} - static void chap_binaryhex_to_asciihex(char *dst, char *src, int src_len) { int i; @@ -241,9 +229,16 @@ static int chap_server_compute_md5( pr_err("Could not find CHAP_R.\n"); goto out; } + if (strlen(chap_r) != MD5_SIGNATURE_SIZE * 2) { + pr_err("Malformed CHAP_R\n"); + goto out; + } + if (hex2bin(client_digest, chap_r, MD5_SIGNATURE_SIZE) < 0) { + pr_err("Malformed CHAP_R\n"); + goto out; + }
pr_debug("[server] Got CHAP_R=%s\n", chap_r); - chap_string_to_hex(client_digest, chap_r, strlen(chap_r));
tfm = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC); if (IS_ERR(tfm)) { @@ -348,9 +343,7 @@ static int chap_server_compute_md5( pr_err("Could not find CHAP_C.\n"); goto out; } - pr_debug("[server] Got CHAP_C=%s\n", challenge); - challenge_len = chap_string_to_hex(challenge_binhex, challenge, - strlen(challenge)); + challenge_len = DIV_ROUND_UP(strlen(challenge), 2); if (!challenge_len) { pr_err("Unable to convert incoming challenge\n"); goto out; @@ -359,6 +352,11 @@ static int chap_server_compute_md5( pr_err("CHAP_C exceeds maximum binary size of 1024 bytes\n"); goto out; } + if (hex2bin(challenge_binhex, challenge, challenge_len) < 0) { + pr_err("Malformed CHAP_C\n"); + goto out; + } + pr_debug("[server] Got CHAP_C=%s\n", challenge); /* * During mutual authentication, the CHAP_C generated by the * initiator must not match the original CHAP_C generated by
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
commit fafaa4264eba49fd10695c193a82760558d093f4 upstream.
The current implementation of the page table walker has a fundamental problem in vma handling, which started when we tried to handle vma(VM_HUGETLB). Because it is done in the pgd loop, considering the vma boundary makes the code complicated and bug-prone.
From the user's viewpoint, a user checks some vma-related condition to determine whether it really does a page walk over the vma.
In order to solve these problems, this patch moves the vma check outside the pgd loop and introduces a new callback, ->test_walk().
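To make the new hook concrete, here is a hedged sketch of what a walker user looks like after this change; the callback names and the skip-file-backed-vmas policy are invented for illustration and do not correspond to an in-tree user:

/* Illustrative fragment only (not a buildable module on its own). */
static int demo_pte_entry(pte_t *pte, unsigned long addr,
			  unsigned long next, struct mm_walk *walk)
{
	unsigned long *count = walk->private;

	if (pte_present(*pte))
		(*count)++;
	return 0;			/* 0: keep walking */
}

static int demo_test_walk(unsigned long start, unsigned long end,
			  struct mm_walk *walk)
{
	/* Per walk_page_test() below: return 1 to skip this vma, 0 to walk it. */
	return walk->vma->vm_file ? 1 : 0;
}

static unsigned long demo_count_present(struct mm_struct *mm)
{
	unsigned long count = 0;
	struct mm_walk walk = {
		.pte_entry	= demo_pte_entry,
		.test_walk	= demo_test_walk,
		.mm		= mm,
		.private	= &count,
	};

	down_read(&mm->mmap_sem);	/* the walker requires mmap_sem held */
	walk_page_range(0, TASK_SIZE, &walk);
	up_read(&mm->mmap_sem);
	return count;
}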
Signed-off-by: Naoya Horiguchi n-horiguchi@ah.jp.nec.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Cyrill Gorcunov gorcunov@openvz.org Cc: Dave Hansen dave.hansen@intel.com Cc: Pavel Emelyanov xemul@parallels.com Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [bwh: Backported to 3.16 as dependency of L1TF mitigation] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- include/linux/mm.h | 15 +++- mm/pagewalk.c | 206 +++++++++++++++++++++++++-------------------- 2 files changed, 129 insertions(+), 92 deletions(-)
--- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1121,10 +1121,16 @@ void unmap_vmas(struct mmu_gather *tlb, * @pte_entry: if set, called for each non-empty PTE (4th-level) entry * @pte_hole: if set, called for each hole at all levels * @hugetlb_entry: if set, called for each hugetlb entry - * *Caution*: The caller must hold mmap_sem() if @hugetlb_entry - * is used. + * @test_walk: caller specific callback function to determine whether + * we walk over the current vma or not. A positive returned + * value means "do page table walk over the current vma," + * and a negative one means "abort current page table walk + * right now." 0 means "skip the current vma." + * @mm: mm_struct representing the target process of page table walk + * @vma: vma currently walked (NULL if walking outside vmas) + * @private: private data for callbacks' usage * - * (see walk_page_range for more details) + * (see the comment on walk_page_range() for more details) */ struct mm_walk { int (*pmd_entry)(pmd_t *pmd, unsigned long addr, @@ -1136,7 +1142,10 @@ struct mm_walk { int (*hugetlb_entry)(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long next, struct mm_walk *walk); + int (*test_walk)(unsigned long addr, unsigned long next, + struct mm_walk *walk); struct mm_struct *mm; + struct vm_area_struct *vma; void *private; };
--- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -59,7 +59,7 @@ again: continue;
split_huge_page_pmd_mm(walk->mm, addr, pmd); - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) + if (pmd_trans_unstable(pmd)) goto again; err = walk_pte_range(pmd, addr, next, walk); if (err) @@ -95,6 +95,32 @@ static int walk_pud_range(pgd_t *pgd, un return err; }
+static int walk_pgd_range(unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + pgd_t *pgd; + unsigned long next; + int err = 0; + + pgd = pgd_offset(walk->mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_or_clear_bad(pgd)) { + if (walk->pte_hole) + err = walk->pte_hole(addr, next, walk); + if (err) + break; + continue; + } + if (walk->pmd_entry || walk->pte_entry) + err = walk_pud_range(pgd, addr, next, walk); + if (err) + break; + } while (pgd++, addr = next, addr != end); + + return err; +} + #ifdef CONFIG_HUGETLB_PAGE static unsigned long hugetlb_entry_end(struct hstate *h, unsigned long addr, unsigned long end) @@ -103,10 +129,10 @@ static unsigned long hugetlb_entry_end(s return boundary < end ? boundary : end; }
-static int walk_hugetlb_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, +static int walk_hugetlb_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { + struct vm_area_struct *vma = walk->vma; struct hstate *h = hstate_vma(vma); unsigned long next; unsigned long hmask = huge_page_mask(h); @@ -119,15 +145,14 @@ static int walk_hugetlb_range(struct vm_ if (pte && walk->hugetlb_entry) err = walk->hugetlb_entry(pte, hmask, addr, next, walk); if (err) - return err; + break; } while (addr = next, addr != end);
- return 0; + return err; }
#else /* CONFIG_HUGETLB_PAGE */ -static int walk_hugetlb_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, +static int walk_hugetlb_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { return 0; @@ -135,112 +160,115 @@ static int walk_hugetlb_range(struct vm_
#endif /* CONFIG_HUGETLB_PAGE */
+/* + * Decide whether we really walk over the current vma on [@start, @end) + * or skip it via the returned value. Return 0 if we do walk over the + * current vma, and return 1 if we skip the vma. Negative values means + * error, where we abort the current walk. + * + * Default check (only VM_PFNMAP check for now) is used when the caller + * doesn't define test_walk() callback. + */ +static int walk_page_test(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma;
+ if (walk->test_walk) + return walk->test_walk(start, end, walk); + + /* + * Do not walk over vma(VM_PFNMAP), because we have no valid struct + * page backing a VM_PFNMAP range. See also commit a9ff785e4437. + */ + if (vma->vm_flags & VM_PFNMAP) + return 1; + return 0; +} + +static int __walk_page_range(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + int err = 0; + struct vm_area_struct *vma = walk->vma; + + if (vma && is_vm_hugetlb_page(vma)) { + if (walk->hugetlb_entry) + err = walk_hugetlb_range(start, end, walk); + } else + err = walk_pgd_range(start, end, walk); + + return err; +}
/** - * walk_page_range - walk a memory map's page tables with a callback - * @addr: starting address - * @end: ending address - * @walk: set of callbacks to invoke for each level of the tree + * walk_page_range - walk page table with caller specific callbacks * - * Recursively walk the page table for the memory area in a VMA, - * calling supplied callbacks. Callbacks are called in-order (first - * PGD, first PUD, first PMD, first PTE, second PTE... second PMD, - * etc.). If lower-level callbacks are omitted, walking depth is reduced. + * Recursively walk the page table tree of the process represented by @walk->mm + * within the virtual address range [@start, @end). During walking, we can do + * some caller-specific works for each entry, by setting up pmd_entry(), + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these + * callbacks, the associated entries/pages are just ignored. + * The return values of these callbacks are commonly defined like below: + * - 0 : succeeded to handle the current entry, and if you don't reach the + * end address yet, continue to walk. + * - >0 : succeeded to handle the current entry, and return to the caller + * with caller specific value. + * - <0 : failed to handle the current entry, and return to the caller + * with error code. * - * Each callback receives an entry pointer and the start and end of the - * associated range, and a copy of the original mm_walk for access to - * the ->private or ->mm fields. + * Before starting to walk page table, some callers want to check whether + * they really want to walk over the current vma, typically by checking + * its vm_flags. walk_page_test() and @walk->test_walk() are used for this + * purpose. * - * Usually no locks are taken, but splitting transparent huge page may - * take page table lock. And the bottom level iterator will map PTE - * directories from highmem if necessary. + * struct mm_walk keeps current values of some common data like vma and pmd, + * which are useful for the access from callbacks. If you want to pass some + * caller-specific data to callbacks, @walk->private should be helpful. * - * If any callback returns a non-zero value, the walk is aborted and - * the return value is propagated back to the caller. Otherwise 0 is returned. - * - * walk->mm->mmap_sem must be held for at least read if walk->hugetlb_entry - * is !NULL. + * Locking: + * Callers of walk_page_range() and walk_page_vma() should hold + * @walk->mm->mmap_sem, because these function traverse vma list and/or + * access to vma's data. */ -int walk_page_range(unsigned long addr, unsigned long end, +int walk_page_range(unsigned long start, unsigned long end, struct mm_walk *walk) { - pgd_t *pgd; - unsigned long next; int err = 0; + unsigned long next; + struct vm_area_struct *vma;
- if (addr >= end) - return err; + if (start >= end) + return -EINVAL;
if (!walk->mm) return -EINVAL;
VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem));
- pgd = pgd_offset(walk->mm, addr); + vma = find_vma(walk->mm, start); do { - struct vm_area_struct *vma = NULL; - - next = pgd_addr_end(addr, end); + if (!vma) { /* after the last vma */ + walk->vma = NULL; + next = end; + } else if (start < vma->vm_start) { /* outside vma */ + walk->vma = NULL; + next = min(end, vma->vm_start); + } else { /* inside vma */ + walk->vma = vma; + next = min(end, vma->vm_end); + vma = vma->vm_next;
- /* - * This function was not intended to be vma based. - * But there are vma special cases to be handled: - * - hugetlb vma's - * - VM_PFNMAP vma's - */ - vma = find_vma(walk->mm, addr); - if (vma) { - /* - * There are no page structures backing a VM_PFNMAP - * range, so do not allow split_huge_page_pmd(). - */ - if ((vma->vm_start <= addr) && - (vma->vm_flags & VM_PFNMAP)) { - if (walk->pte_hole) - err = walk->pte_hole(addr, next, walk); - if (err) - break; - pgd = pgd_offset(walk->mm, next); - continue; - } - /* - * Handle hugetlb vma individually because pagetable - * walk for the hugetlb page is dependent on the - * architecture and we can't handled it in the same - * manner as non-huge pages. - */ - if (walk->hugetlb_entry && (vma->vm_start <= addr) && - is_vm_hugetlb_page(vma)) { - if (vma->vm_end < next) - next = vma->vm_end; - /* - * Hugepage is very tightly coupled with vma, - * so walk through hugetlb entries within a - * given vma. - */ - err = walk_hugetlb_range(vma, addr, next, walk); - if (err) - break; - pgd = pgd_offset(walk->mm, next); + err = walk_page_test(start, next, walk); + if (err > 0) continue; - } - } - - if (pgd_none_or_clear_bad(pgd)) { - if (walk->pte_hole) - err = walk->pte_hole(addr, next, walk); - if (err) + if (err < 0) break; - pgd++; - continue; } - if (walk->pmd_entry || walk->pte_entry) - err = walk_pud_range(pgd, addr, next, walk); + if (walk->vma || walk->pte_hole) + err = __walk_page_range(start, next, walk); if (err) break; - pgd++; - } while (addr = next, addr < end); - + } while (start = next, start < end); return err; }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Linus Torvalds torvalds@linux-foundation.org
commit bcd11afa7adad8d720e7ba5ef58bdcd9775cf45f upstream.
If pages are swapped out, the swap entry is stored in the corresponding PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate on PTE entries which have the present bit set and would treat the swap entry as physical address (PFN). To mitigate that the upper bits of the PTE must be set so the PTE points to non-existent memory.
The swap entry stores the type and the offset of a swapped out page in the PTE. type is stored in bits 9-13 and offset in bits 14-63. The hardware ignores the bits beyond the physical address space limit, so to make the mitigation effective it's required to start 'offset' at the lowest possible bit so that even large swap offsets do not reach into the physical address space limit bits.
Move offset to bit 9-58 and type to bit 59-63 which are the bits that hardware generally doesn't care about.
That, in turn, means that if you are on a desktop chip with only 40 bits of physical addressing, now that the offset starts at bit 9, there need to be 30 bits of offset actually *in use* until bit 39 ends up being set, which means when inverted it will again point into existing memory.
So that's 4 terabytes of swap space (because the offset is counted in pages, so 30 bits of offset is 42 bits of actual coverage). With bigger physical addressing, that obviously grows further, until the limit of the offset is hit (at 50 bits of offset - 62 bits of actual swap file coverage).
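[Spelling that arithmetic out, illustrative only, assuming 4 KiB pages (PAGE_SHIFT == 12) and a 40-bit physical address limit:]

    offset bits in use before bit 39 is set:  39 - 9 = 30
    swap coverage at that point:              2^30 pages * 4 KiB = 2^42 bytes = 4 TiB
    upper bound of the scheme:                50 offset bits -> 2^62 bytes of coverage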
This is a preparatory change for the actual swap entry inversion to protect against L1TF.
[ AK: Updated description and minor tweaks. Split into two parts ] [ tglx: Massaged changelog ]
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Andi Kleen ak@linux.intel.com Reviewed-by: Josh Poimboeuf jpoimboe@redhat.com Acked-by: Michal Hocko mhocko@suse.com Acked-by: Vlastimil Babka vbabka@suse.cz Acked-by: Dave Hansen dave.hansen@intel.com [bwh: Backported to 3.16: Bit 9 may be reserved for PAGE_BIT_NUMA here] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/pgtable_64.h | 31 ++++++++++++++++++++----------- 1 file changed, 20 insertions(+), 11 deletions(-)
--- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -167,7 +167,7 @@ static inline int pgd_large(pgd_t pgd) { * * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names - * | OFFSET (15->63) | TYPE (10-14) | 0 |0|0|X|X| X| X|X|SD|0| <- swp entry + * | TYPE (59-63) | OFFSET (10-58) | 0 |0|0|X|X| X| X|X|SD|0| <- swp entry * * G (8) is aliased and used as a PROT_NONE indicator for * !present ptes. We need to start storing swap entries above @@ -181,24 +181,33 @@ static inline int pgd_large(pgd_t pgd) { * Bit 7 in swp entry should be 0 because pmd_present checks not only P, * but also L and G. */ +#define SWP_TYPE_BITS 5 + #ifdef CONFIG_NUMA_BALANCING /* Automatic NUMA balancing needs to be distinguishable from swap entries */ -#define SWP_TYPE_FIRST_SHIFT (_PAGE_BIT_PROTNONE + 2) +#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 2) #else -#define SWP_TYPE_FIRST_SHIFT (_PAGE_BIT_PROTNONE + 1) +#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1) #endif -#define SWP_TYPE_BITS 5 -/* Place the offset above the type: */ -#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS) + +/* We always extract/encode the offset by shifting it all the way up, and then down again */ +#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
-#define __swp_type(x) (((x).val >> (SWP_TYPE_FIRST_BIT)) \ - & ((1U << SWP_TYPE_BITS) - 1)) -#define __swp_offset(x) ((x).val >> SWP_OFFSET_FIRST_BIT) -#define __swp_entry(type, offset) ((swp_entry_t) { \ - ((type) << (SWP_TYPE_FIRST_BIT)) \ - | ((offset) << SWP_OFFSET_FIRST_BIT) }) +/* Extract the high bits for type */ +#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS)) + +/* Shift up (to get rid of type), then down to get value */ +#define __swp_offset(x) ((x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT) + +/* + * Shift the offset up "too far" by TYPE bits, then down again + */ +#define __swp_entry(type, offset) ((swp_entry_t) { \ + ((unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \ + | ((unsigned long)(type) << (64-SWP_TYPE_BITS)) }) + #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) }) #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Naoya Horiguchi n-horiguchi@ah.jp.nec.com
commit eee4818baac0f2b37848fdf90e4b16430dc536ac upstream.
_PAGE_PSE is used to distinguish between a truly non-present (_PAGE_PRESENT=0) PMD, and a PMD which is undergoing a THP split and should be treated as present.
But _PAGE_SWP_SOFT_DIRTY currently uses the _PAGE_PSE bit, which would cause confusion between one of those PMDs undergoing a THP split, and a soft-dirty PMD. Dropping _PAGE_PSE check in pmd_present() does not work well, because it can hurt optimization of tlb handling in thp split.
Thus, we need to move the bit.
In the current kernel, bits 1-4 are not used in the non-present format since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1. Bit 7 is used as reserved (always clear), so please don't use it for other purposes.
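[In terms of the x86 bit definitions, the pgtable_types.h hunk below boils down to the following sketch; on x86 _PAGE_PSE is bit 7 and _PAGE_RW is bit 1:]

    /* old: aliased the PSE bit, which the THP-split handling relies on */
    #define _PAGE_SWP_SOFT_DIRTY    _PAGE_PSE   /* bit 7 */

    /* new: reuse the RW bit, unused in non-present entries since 00839ee3b299 */
    #define _PAGE_SWP_SOFT_DIRTY    _PAGE_RW    /* bit 1 */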
Link: http://lkml.kernel.org/r/20170717193955.20207-3-zi.yan@sent.com Signed-off-by: Naoya Horiguchi n-horiguchi@ah.jp.nec.com Signed-off-by: Zi Yan zi.yan@cs.rutgers.edu Acked-by: Dave Hansen dave.hansen@intel.com Cc: "H. Peter Anvin" hpa@zytor.com Cc: Anshuman Khandual khandual@linux.vnet.ibm.com Cc: David Nellans dnellans@nvidia.com Cc: Ingo Molnar mingo@elte.hu Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Minchan Kim minchan@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vlastimil Babka vbabka@suse.cz Cc: Andrea Arcangeli aarcange@redhat.com Cc: Michal Hocko mhocko@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [bwh: Backported to 3.16: Bit 9 may be reserved for PAGE_BIT_NUMA here] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/pgtable_64.h | 12 +++++++++--- arch/x86/include/asm/pgtable_types.h | 10 +++++----- 2 files changed, 14 insertions(+), 8 deletions(-)
--- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -165,15 +165,21 @@ static inline int pgd_large(pgd_t pgd) { /* * Encode and de-code a swap entry * - * | ... | 11| 10| 9|8|7|6|5| 4| 3|2|1|0| <- bit number - * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names - * | OFFSET (15->63) | TYPE (10-14) | 0 |0|X|X|X| X| X|X|X|0| <- swp entry + * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number + * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names + * | OFFSET (15->63) | TYPE (10-14) | 0 |0|0|X|X| X| X|X|SD|0| <- swp entry * * G (8) is aliased and used as a PROT_NONE indicator for * !present ptes. We need to start storing swap entries above * there. We also need to avoid using A and D because of an * erratum where they can be incorrectly set by hardware on * non-present PTEs. + * + * SD (1) in swp entry is used to store soft dirty bit, which helps us + * remember soft dirty over page migration + * + * Bit 7 in swp entry should be 0 because pmd_present checks not only P, + * but also L and G. */ #ifdef CONFIG_NUMA_BALANCING /* Automatic NUMA balancing needs to be distinguishable from swap entries */ --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -94,15 +94,15 @@ /* * Tracking soft dirty bit when a page goes to a swap is tricky. * We need a bit which can be stored in pte _and_ not conflict - * with swap entry format. On x86 bits 6 and 7 are *not* involved - * into swap entry computation, but bit 6 is used for nonlinear - * file mapping, so we borrow bit 7 for soft dirty tracking. + * with swap entry format. On x86 bits 1-4 are *not* involved + * into swap entry computation, but bit 7 is used for thp migration, + * so we borrow bit 1 for soft dirty tracking. * * Please note that this bit must be treated as swap dirty page - * mark if and only if the PTE has present bit clear! + * mark if and only if the PTE/PMD has present bit clear! */ #ifdef CONFIG_MEM_SOFT_DIRTY -#define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE +#define _PAGE_SWP_SOFT_DIRTY _PAGE_RW #else #define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0)) #endif
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit 24f7fc83b9204d20f878c57cb77d261ae825e033 upstream.
Contemporary high performance processors use a common industry-wide optimization known as "Speculative Store Bypass" in which loads from addresses to which a recent store has occurred may (speculatively) see an older value. Intel refers to this feature as "Memory Disambiguation" which is part of their "Smart Memory Access" capability.
Memory Disambiguation can expose a cache side-channel attack against such speculatively read values. An attacker can create exploit code that allows them to read memory outside of a sandbox environment (for example, malicious JavaScript in a web page), or to perform more complex attacks against code running within the same privilege level, e.g. via the stack.
As a first step to mitigate against such attacks, provide two boot command line control knobs:
nospec_store_bypass_disable spec_store_bypass_disable=[off,auto,on]
By default affected x86 processors will power on with Speculative Store Bypass enabled. Hence the provided kernel parameters are written from the point of view of whether to enable a mitigation or not. The parameters are as follows:
- auto - Kernel detects whether your CPU model contains an implementation of Speculative Store Bypass and picks the most appropriate mitigation.
- on - disable Speculative Store Bypass - off - enable Speculative Store Bypass
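[For example, illustrative usage only, not part of the original changelog: forcing the mitigation on regardless of auto-detection is done by booting with

    spec_store_bypass_disable=on

and opting out of it entirely with

    nospec_store_bypass_disable ]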
[ tglx: Reordered the checks so that the whole evaluation is not done when the CPU does not support RDS ]
Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Ingo Molnar mingo@kernel.org [bwh: Backported to 3.16: - Renumber the feature bit - Adjust filenames, context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- Documentation/kernel-parameters.txt | 33 +++++++++ arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/asm/nospec-branch.h | 6 ++ arch/x86/kernel/cpu/bugs.c | 103 +++++++++++++++++++++++++++ 4 files changed, 143 insertions(+)
--- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2172,6 +2172,9 @@ bytes respectively. Such letter suffixes allow data leaks with this option, which is equivalent to spectre_v2=off.
+ nospec_store_bypass_disable + [HW] Disable all mitigations for the Speculative Store Bypass vulnerability + noxsave [BUGS=X86] Disables x86 extended register state save and restore using xsave. The kernel will fallback to enabling legacy floating-point and sse state. @@ -3194,6 +3197,36 @@ bytes respectively. Such letter suffixes Not specifying this option is equivalent to spectre_v2=auto.
+ spec_store_bypass_disable= + [HW] Control Speculative Store Bypass (SSB) Disable mitigation + (Speculative Store Bypass vulnerability) + + Certain CPUs are vulnerable to an exploit against a + a common industry wide performance optimization known + as "Speculative Store Bypass" in which recent stores + to the same memory location may not be observed by + later loads during speculative execution. The idea + is that such stores are unlikely and that they can + be detected prior to instruction retirement at the + end of a particular speculation execution window. + + In vulnerable processors, the speculatively forwarded + store can be used in a cache side channel attack, for + example to read memory to which the attacker does not + directly have access (e.g. inside sandboxed code). + + This parameter controls whether the Speculative Store + Bypass optimization is used. + + on - Unconditionally disable Speculative Store Bypass + off - Unconditionally enable Speculative Store Bypass + auto - Kernel detects whether the CPU model contains an + implementation of Speculative Store Bypass and + picks the most appropriate mitigation + + Not specifying this option is equivalent to + spec_store_bypass_disable=auto. + spia_io_base= [HW,MTD] spia_fio_base= spia_pedr= --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -191,6 +191,7 @@
#define X86_FEATURE_USE_IBPB (7*32+12) /* "" Indirect Branch Prediction Barrier enabled */ #define X86_FEATURE_USE_IBRS_FW (7*32+13) /* "" Use IBRS during runtime firmware calls */ +#define X86_FEATURE_SPEC_STORE_BYPASS_DISABLE (7*32+14) /* "" Disable Speculative Store Bypass. */
#define X86_FEATURE_RETPOLINE (7*32+29) /* "" Generic Retpoline mitigation for Spectre variant 2 */ #define X86_FEATURE_RETPOLINE_AMD (7*32+30) /* "" AMD Retpoline mitigation for Spectre variant 2 */ --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -193,6 +193,12 @@ extern u64 x86_spec_ctrl_get_default(voi extern void x86_spec_ctrl_set_guest(u64); extern void x86_spec_ctrl_restore_host(u64);
+/* The Speculative Store Bypass disable variants */ +enum ssb_mitigation { + SPEC_STORE_BYPASS_NONE, + SPEC_STORE_BYPASS_DISABLE, +}; + extern char __indirect_thunk_start[]; extern char __indirect_thunk_end[];
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -26,6 +26,7 @@ #include <asm/intel-family.h>
static void __init spectre_v2_select_mitigation(void); +static void __init ssb_select_mitigation(void);
/* * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any @@ -110,6 +111,12 @@ void __init check_bugs(void) /* Select the proper spectre mitigation before patching alternatives */ spectre_v2_select_mitigation();
+ /* + * Select proper mitigation for any exposure to the Speculative Store + * Bypass vulnerability. + */ + ssb_select_mitigation(); + #ifdef CONFIG_X86_32 /* * Check whether we are able to run this kernel safely on SMP. @@ -396,6 +403,99 @@ retpoline_auto: }
#undef pr_fmt +#define pr_fmt(fmt) "Speculative Store Bypass: " fmt + +static enum ssb_mitigation ssb_mode = SPEC_STORE_BYPASS_NONE; + +/* The kernel command line selection */ +enum ssb_mitigation_cmd { + SPEC_STORE_BYPASS_CMD_NONE, + SPEC_STORE_BYPASS_CMD_AUTO, + SPEC_STORE_BYPASS_CMD_ON, +}; + +static const char *ssb_strings[] = { + [SPEC_STORE_BYPASS_NONE] = "Vulnerable", + [SPEC_STORE_BYPASS_DISABLE] = "Mitigation: Speculative Store Bypass disabled" +}; + +static const struct { + const char *option; + enum ssb_mitigation_cmd cmd; +} ssb_mitigation_options[] = { + { "auto", SPEC_STORE_BYPASS_CMD_AUTO }, /* Platform decides */ + { "on", SPEC_STORE_BYPASS_CMD_ON }, /* Disable Speculative Store Bypass */ + { "off", SPEC_STORE_BYPASS_CMD_NONE }, /* Don't touch Speculative Store Bypass */ +}; + +static enum ssb_mitigation_cmd __init ssb_parse_cmdline(void) +{ + enum ssb_mitigation_cmd cmd = SPEC_STORE_BYPASS_CMD_AUTO; + char arg[20]; + int ret, i; + + if (cmdline_find_option_bool(boot_command_line, "nospec_store_bypass_disable")) { + return SPEC_STORE_BYPASS_CMD_NONE; + } else { + ret = cmdline_find_option(boot_command_line, "spec_store_bypass_disable", + arg, sizeof(arg)); + if (ret < 0) + return SPEC_STORE_BYPASS_CMD_AUTO; + + for (i = 0; i < ARRAY_SIZE(ssb_mitigation_options); i++) { + if (!match_option(arg, ret, ssb_mitigation_options[i].option)) + continue; + + cmd = ssb_mitigation_options[i].cmd; + break; + } + + if (i >= ARRAY_SIZE(ssb_mitigation_options)) { + pr_err("unknown option (%s). Switching to AUTO select\n", arg); + return SPEC_STORE_BYPASS_CMD_AUTO; + } + } + + return cmd; +} + +static enum ssb_mitigation_cmd __init __ssb_select_mitigation(void) +{ + enum ssb_mitigation mode = SPEC_STORE_BYPASS_NONE; + enum ssb_mitigation_cmd cmd; + + if (!boot_cpu_has(X86_FEATURE_RDS)) + return mode; + + cmd = ssb_parse_cmdline(); + if (!boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS) && + (cmd == SPEC_STORE_BYPASS_CMD_NONE || + cmd == SPEC_STORE_BYPASS_CMD_AUTO)) + return mode; + + switch (cmd) { + case SPEC_STORE_BYPASS_CMD_AUTO: + case SPEC_STORE_BYPASS_CMD_ON: + mode = SPEC_STORE_BYPASS_DISABLE; + break; + case SPEC_STORE_BYPASS_CMD_NONE: + break; + } + + if (mode != SPEC_STORE_BYPASS_NONE) + setup_force_cpu_cap(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE); + return mode; +} + +static void ssb_select_mitigation() +{ + ssb_mode = __ssb_select_mitigation(); + + if (boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS)) + pr_info("%s\n", ssb_strings[ssb_mode]); +} + +#undef pr_fmt
#ifdef CONFIG_SYSFS
@@ -421,6 +521,9 @@ ssize_t cpu_show_common(struct device *d boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? ", IBRS_FW" : "", spectre_v2_module_string());
+ case X86_BUG_SPEC_STORE_BYPASS: + return sprintf(buf, "%s\n", ssb_strings[ssb_mode]); + default: break; }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Vlastimil Babka vbabka@suse.cz
commit b0a182f875689647b014bc01d36b340217792852 upstream.
Two users have reported [1] that they have an "extremely unlikely" system with more than MAX_PA/2 memory and the L1TF mitigation is not effective. In fact it's a CPU with a 36-bit phys limit (64GB) and 32GB memory, but due to holes in the e820 map, the main region is almost 500MB over the 32GB limit:
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000081effffff] usable
Suggestions to use 'mem=32G' to enable the L1TF mitigation while losing the 500MB revealed that there's an off-by-one error in the check in l1tf_select_mitigation().
l1tf_pfn_limit() returns the last usable pfn (inclusive) and the range check in the mitigation path does not take this into account.
Instead of amending the range check, make l1tf_pfn_limit() return the first PFN which is over the limit, which is less error-prone. Adjust the other users accordingly.
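[To make the off-by-one concrete, illustrative arithmetic for the reported machine, assuming x86_phys_bits == 36 and PAGE_SHIFT == 12:]

    old: l1tf_pfn_limit() = BIT_ULL(36 - 1 - 12) - 1 = 0x7fffff  (last safe pfn, inclusive)
    new: l1tf_pfn_limit() = BIT_ULL(36 - 1 - 12)     = 0x800000  (first pfn above MAX_PA/2 = 32GB)

[so callers now compare with '>=' or use the value directly as an exclusive limit, as in the hunks below.]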
[1] https://bugzilla.suse.com/show_bug.cgi?id=1105536
Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf") Reported-by: George Anchev studio@anchev.net Reported-by: Christopher Snowhill kode54@gmail.com Signed-off-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: "H . Peter Anvin" hpa@zytor.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Andi Kleen ak@linux.intel.com Cc: Dave Hansen dave.hansen@intel.com Cc: Michal Hocko mhocko@kernel.org Link: https://lkml.kernel.org/r/20180823134418.17008-1-vbabka@suse.cz Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/processor.h | 2 +- arch/x86/mm/init.c | 2 +- arch/x86/mm/mmap.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
--- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -167,7 +167,7 @@ extern void fpu_detect(struct cpuinfo_x8
static inline unsigned long long l1tf_pfn_limit(void) { - return BIT_ULL(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1; + return BIT_ULL(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT); }
extern void early_cpu_init(void); --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -710,7 +710,7 @@ unsigned long max_swapfile_size(void)
if (boot_cpu_has_bug(X86_BUG_L1TF)) { /* Limit the swap file size to MAX_PA/2 for L1TF workaround */ - unsigned long long l1tf_limit = l1tf_pfn_limit() + 1; + unsigned long long l1tf_limit = l1tf_pfn_limit(); /* * We encode swap offsets also with 3 bits below those for pfn * which makes the usable limit higher. --- a/arch/x86/mm/mmap.c +++ b/arch/x86/mm/mmap.c @@ -131,7 +131,7 @@ bool pfn_modify_allowed(unsigned long pf /* If it's real memory always allow */ if (pfn_valid(pfn)) return true; - if (pfn > l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN)) + if (pfn >= l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN)) return false; return true; }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@xxxxxxxxxxxxx
commit 024d83cadc6b2af027e473720f3c3da97496c318 upstream.
Mikhail reported the following lockdep splat:
WARNING: possible irq lock inversion dependency detected
CPU 0/KVM/10284 just changed the state of lock:
000000000d538a88 (&st->lock){+...}, at: speculative_store_bypass_update+0x10b/0x170
but this lock was taken by another, HARDIRQ-safe lock in the past:
(&(&sighand->siglock)->rlock){-.-.}
and interrupts could create inverse lock ordering between them.
Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&st->lock);
                               local_irq_disable();
                               lock(&(&sighand->siglock)->rlock);
                               lock(&st->lock);
  <Interrupt>
    lock(&(&sighand->siglock)->rlock);

 *** DEADLOCK ***
The code path which connects those locks is:
  speculative_store_bypass_update()
  ssb_prctl_set()
  do_seccomp()
  do_syscall_64()
In svm_vcpu_run(), speculative_store_bypass_update() is called with interrupts enabled via x86_virt_spec_ctrl_set_guest/host().
This is actually a false positive, because GIF=0 so interrupts are disabled even if IF=1; however, we can easily move the invocations of x86_virt_spec_ctrl_set_guest/host() into the interrupt disabled region to cure it, and it's a good idea to keep the GIF=0/IF=1 area as small and self-contained as possible.
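[The resulting ordering in svm_vcpu_run() is roughly the following; abridged sketch of the code after this patch, see the diff below for the exact change:]

    clgi();
    x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);
    local_irq_enable();
    /* ... VMRUN and register save/restore ... */
    reload_tss(vcpu);
    local_irq_disable();
    x86_spec_ctrl_restore_host(svm->spec_ctrl, svm->virt_spec_ctrl);
    /* ... */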
Fixes: 1f50ddb4f418 ("x86/speculation: Handle HT correctly on AMD") Reported-by: Mikhail Gavrilov mikhail.v.gavrilov@gmail.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Mikhail Gavrilov mikhail.v.gavrilov@gmail.com Cc: Joerg Roedel joro@8bytes.org Cc: Paolo Bonzini pbonzini@redhat.com Cc: Radim Krčmář rkrcmar@redhat.com Cc: Matthew Wilcox willy@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Cc: Tom Lendacky thomas.lendacky@amd.com Cc: kvm@vger.kernel.org Cc: x86@kernel.org Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kvm/svm.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
--- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3983,8 +3983,6 @@ static void svm_vcpu_run(struct kvm_vcpu
clgi();
- local_irq_enable(); - /* * If this vCPU has touched SPEC_CTRL, restore the guest's value if * it's non-zero. Since vmentry is serialising on affected CPUs, there @@ -3993,6 +3991,8 @@ static void svm_vcpu_run(struct kvm_vcpu */ x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);
+ local_irq_enable(); + asm volatile ( "push %%" _ASM_BP "; \n\t" "mov %c[rbx](%[svm]), %%" _ASM_BX " \n\t" @@ -4115,12 +4115,12 @@ static void svm_vcpu_run(struct kvm_vcpu if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))) svm->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
- x86_spec_ctrl_restore_host(svm->spec_ctrl, svm->virt_spec_ctrl); - reload_tss(vcpu);
local_irq_disable();
+ x86_spec_ctrl_restore_host(svm->spec_ctrl, svm->virt_spec_ctrl); + vcpu->arch.cr2 = svm->vmcb->save.cr2; vcpu->arch.regs[VCPU_REGS_RAX] = svm->vmcb->save.rax; vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp;
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 48f7df329474b49d83d0dffec1b6186647f11976 upstream.
Grazvydas Ignotas has reported a regression in remap_file_pages() emulation.
Testcase:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define SIZE (4096 * 3)

    int main(int argc, char **argv)
    {
            unsigned long *p;
            long i;

            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return -1;
            }

            for (i = 0; i < SIZE / 4096; i++)
                    p[i * 4096 / sizeof(*p)] = i;

            if (remap_file_pages(p, 4096, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            assert(p[0] == 1);

            munmap(p, SIZE);

            return 0;
    }
The second remap_file_pages() fails with -EINVAL.
The reason is that the remap_file_pages() emulation assumes that the target vma covers the whole area we want to over-map. That assumption is broken by the first remap_file_pages() call: it splits the area into two vmas.
The solution is to check the next adjacent vmas and whether they map the same file with the same flags.
Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation") Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Reported-by: Grazvydas Ignotas notasas@gmail.com Tested-by: Grazvydas Ignotas notasas@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- mm/mmap.c | 34 +++++++++++++++++++++++++++++----- 1 file changed, 29 insertions(+), 5 deletions(-)
--- a/mm/mmap.c +++ b/mm/mmap.c @@ -2656,12 +2656,29 @@ SYSCALL_DEFINE5(remap_file_pages, unsign if (!vma || !(vma->vm_flags & VM_SHARED)) goto out;
- if (start < vma->vm_start || start + size > vma->vm_end) + if (start < vma->vm_start) goto out;
- if (pgoff == linear_page_index(vma, start)) { - ret = 0; - goto out; + if (start + size > vma->vm_end) { + struct vm_area_struct *next; + + for (next = vma->vm_next; next; next = next->vm_next) { + /* hole between vmas ? */ + if (next->vm_start != next->vm_prev->vm_end) + goto out; + + if (next->vm_file != vma->vm_file) + goto out; + + if (next->vm_flags != vma->vm_flags) + goto out; + + if (start + size <= next->vm_end) + break; + } + + if (!next) + goto out; }
prot |= vma->vm_flags & VM_READ ? PROT_READ : 0; @@ -2671,9 +2688,16 @@ SYSCALL_DEFINE5(remap_file_pages, unsign flags &= MAP_NONBLOCK; flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE; if (vma->vm_flags & VM_LOCKED) { + struct vm_area_struct *tmp; flags |= MAP_LOCKED; + /* drop PG_Mlocked flag for over-mapped range */ - munlock_vma_pages_range(vma, start, start + size); + for (tmp = vma; tmp->vm_start >= start + size; + tmp = tmp->vm_next) { + munlock_vma_pages_range(tmp, + max(tmp->vm_start, start), + min(tmp->vm_end, start + size)); + } }
file = get_file(vma->vm_file);
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit be6fcb5478e95bb1c91f489121238deb3abca46a upstream.
x86_spec_ctrl_mask is intended to mask out bits from a MSR_SPEC_CTRL value which are not to be modified. However, the mask is not really used, and the bitmask was inverted to make a check easier; that check was removed in "x86/bugs: Remove x86_spec_ctrl_set()".
Aside from that, it is missing the STIBP bit if that is supported by the platform, so if the mask were used in x86_virt_spec_ctrl() it would prevent a guest from setting STIBP.
Add the STIBP bit if supported and use the mask in x86_virt_spec_ctrl() to sanitize the value which is supplied by the guest.
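[Concretely, with the hunk below the guest-supplied value is sanitized as follows; comments added for illustration:]

    guestval  = hostval & ~x86_spec_ctrl_mask;         /* keep the host-owned bits */
    guestval |= guest_spec_ctrl & x86_spec_ctrl_mask;  /* guest may only flip the bits in the mask
                                                          (IBRS, plus STIBP/SSBD when supported) */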
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kernel/cpu/bugs.c | 26 +++++++++++++++++++------- 1 file changed, 19 insertions(+), 7 deletions(-)
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -41,7 +41,7 @@ EXPORT_SYMBOL_GPL(x86_spec_ctrl_base); * The vendor and possibly platform specific bits which can be modified in * x86_spec_ctrl_base. */ -static u64 x86_spec_ctrl_mask = ~SPEC_CTRL_IBRS; +static u64 x86_spec_ctrl_mask = SPEC_CTRL_IBRS;
/* * AMD specific MSR info for Speculative Store Bypass control. @@ -125,6 +125,10 @@ void __init check_bugs(void) if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) rdmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
+ /* Allow STIBP in MSR_SPEC_CTRL if supported */ + if (boot_cpu_has(X86_FEATURE_STIBP)) + x86_spec_ctrl_mask |= SPEC_CTRL_STIBP; + /* Select the proper spectre mitigation before patching alternatives */ spectre_v2_select_mitigation();
@@ -197,18 +201,26 @@ static enum spectre_v2_mitigation spectr void x86_virt_spec_ctrl(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl, bool setguest) { + u64 msrval, guestval, hostval = x86_spec_ctrl_base; struct thread_info *ti = current_thread_info(); - u64 msr, host = x86_spec_ctrl_base;
/* Is MSR_SPEC_CTRL implemented ? */ if (static_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) { + /* + * Restrict guest_spec_ctrl to supported values. Clear the + * modifiable bits in the host base value and or the + * modifiable bits from the guest value. + */ + guestval = hostval & ~x86_spec_ctrl_mask; + guestval |= guest_spec_ctrl & x86_spec_ctrl_mask; + /* SSBD controlled in MSR_SPEC_CTRL */ if (static_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD)) - host |= ssbd_tif_to_spec_ctrl(ti->flags); + hostval |= ssbd_tif_to_spec_ctrl(ti->flags);
- if (host != guest_spec_ctrl) { - msr = setguest ? guest_spec_ctrl : host; - wrmsrl(MSR_IA32_SPEC_CTRL, msr); + if (hostval != guestval) { + msrval = setguest ? guestval : hostval; + wrmsrl(MSR_IA32_SPEC_CTRL, msrval); } } } @@ -530,7 +542,7 @@ static enum ssb_mitigation __init __ssb_ switch (boot_cpu_data.x86_vendor) { case X86_VENDOR_INTEL: x86_spec_ctrl_base |= SPEC_CTRL_SSBD; - x86_spec_ctrl_mask &= ~SPEC_CTRL_SSBD; + x86_spec_ctrl_mask |= SPEC_CTRL_SSBD; wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base); break; case X86_VENDOR_AMD:
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Dave Airlie airlied@redhat.com
commit 8ef4227615e158faa4ee85a1d6466782f7e22f2f upstream.
A recent change to the mm code in: 87744ab3832b mm: fix cache mode tracking in vm_insert_mixed()
started enforcing checking the memory type against the registered list for mixed pfn insertion mappings. It happens that the drm drivers for a number of GPUs relied on this being broken. Currently the driver only inserted VRAM mappings into the tracking table when they came from the kernel, and userspace mappings never landed in the table. This led to a regression where all the mappings now end up as UC instead of WC.
I've considered a number of solutions, but since this needs to be fixed in fixes and not next, and some of the solutions were going to introduce overhead that hadn't been there before, I didn't consider them viable at this stage. These mainly concerned hooking into the TTM io reserve APIs, but those APIs have a bunch of fast paths I didn't want to unwind to add this to.
The solution I've decided on is to add a new API like the arch_phys_wc APIs (these would have worked but wc_del didn't take a range), and use them from the drivers to add a WC-compatible mapping to the table for all VRAM on those GPUs. This means we can then create userspace mappings that won't get degraded to UC.
v1.1: use CONFIG_X86_PAT + add some comments in io.h
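[A hypothetical driver-side use of the new API, illustrative sketch only; vram_base, vram_size and vram_mtrr are placeholder names, and the real conversions are done in the separate drm patch in this series:]

    /* at device init, before any userspace VRAM mappings are created */
    arch_io_reserve_memtype_wc(vram_base, vram_size);
    vram_mtrr = arch_phys_wc_add(vram_base, vram_size);

    /* at teardown */
    arch_phys_wc_del(vram_mtrr);
    arch_io_free_memtype_wc(vram_base, vram_size);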
Cc: Toshi Kani toshi.kani@hp.com Cc: Borislav Petkov bp@alien8.de Cc: H. Peter Anvin hpa@zytor.com Cc: Andy Lutomirski luto@kernel.org Cc: Denys Vlasenko dvlasenk@redhat.com Cc: Brian Gerst brgerst@gmail.com Cc: x86@kernel.org Cc: mcgrof@suse.com Cc: Dan Williams dan.j.williams@intel.com Acked-by: Ingo Molnar mingo@kernel.org Reviewed-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Dave Airlie airlied@redhat.com [bwh: Backported to 3.16: Memory types have type unsigned long, and the constant is named _PAGE_CACHE_WC instead of _PAGE_CACHE_MODE_WC.] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/io.h | 6 ++++++ arch/x86/mm/pat.c | 14 ++++++++++++++ include/linux/io.h | 22 ++++++++++++++++++++++ 3 files changed, 42 insertions(+)
--- a/arch/x86/include/asm/io.h +++ b/arch/x86/include/asm/io.h @@ -340,4 +340,10 @@ extern void arch_phys_wc_del(int handle) #define arch_phys_wc_add arch_phys_wc_add #endif
+#ifdef CONFIG_X86_PAT +extern int arch_io_reserve_memtype_wc(resource_size_t start, resource_size_t size); +extern void arch_io_free_memtype_wc(resource_size_t start, resource_size_t size); +#define arch_io_reserve_memtype_wc arch_io_reserve_memtype_wc +#endif + #endif /* _ASM_X86_IO_H */ --- a/arch/x86/mm/pat.c +++ b/arch/x86/mm/pat.c @@ -481,6 +481,20 @@ void io_free_memtype(resource_size_t sta free_memtype(start, end); }
+int arch_io_reserve_memtype_wc(resource_size_t start, resource_size_t size) +{ + unsigned long type = _PAGE_CACHE_WC; + + return io_reserve_memtype(start, start + size, &type); +} +EXPORT_SYMBOL(arch_io_reserve_memtype_wc); + +void arch_io_free_memtype_wc(resource_size_t start, resource_size_t size) +{ + io_free_memtype(start, start + size); +} +EXPORT_SYMBOL(arch_io_free_memtype_wc); + pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, unsigned long size, pgprot_t vma_prot) { --- a/include/linux/io.h +++ b/include/linux/io.h @@ -101,4 +101,26 @@ static inline void arch_phys_wc_del(int #define arch_phys_wc_add arch_phys_wc_add #endif
+/* + * On x86 PAT systems we have memory tracking that keeps track of + * the allowed mappings on memory ranges. This tracking works for + * all the in-kernel mapping APIs (ioremap*), but where the user + * wishes to map a range from a physical device into user memory + * the tracking won't be updated. This API is to be used by + * drivers which remap physical device pages into userspace, + * and wants to make sure they are mapped WC and not UC. + */ +#ifndef arch_io_reserve_memtype_wc +static inline int arch_io_reserve_memtype_wc(resource_size_t base, + resource_size_t size) +{ + return 0; +} + +static inline void arch_io_free_memtype_wc(resource_size_t base, + resource_size_t size) +{ +} +#endif + #endif /* _LINUX_IO_H */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 27ba0644ea9dfe6e7693abc85837b60e40583b96 upstream.
We don't create non-linear mappings anymore. Let's drop code which handles them in rmap.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [bwh: Backported to 3.16: - Deleted code is slightly different - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- --- a/Documentation/cachetlb.txt +++ b/Documentation/cachetlb.txt @@ -317,10 +317,10 @@ maps this page at its virtual address. about doing this.
The idea is, first at flush_dcache_page() time, if - page->mapping->i_mmap is an empty tree and ->i_mmap_nonlinear - an empty list, just mark the architecture private page flag bit. - Later, in update_mmu_cache(), a check is made of this flag bit, - and if set the flush is done and the flag bit is cleared. + page->mapping->i_mmap is an empty tree, just mark the architecture + private page flag bit. Later, in update_mmu_cache(), a check is + made of this flag bit, and if set the flush is done and the flag + bit is cleared.
IMPORTANT NOTE: It is often important, if you defer the flush, that the actual flush occurs on the same CPU --- a/fs/inode.c +++ b/fs/inode.c @@ -352,7 +352,6 @@ void address_space_init_once(struct addr INIT_LIST_HEAD(&mapping->private_list); spin_lock_init(&mapping->private_lock); mapping->i_mmap = RB_ROOT; - INIT_LIST_HEAD(&mapping->i_mmap_nonlinear); } EXPORT_SYMBOL(address_space_init_once);
--- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -395,7 +395,6 @@ struct address_space { spinlock_t tree_lock; /* and lock protecting it */ unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct rb_root i_mmap; /* tree of private and shared mappings */ - struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ struct mutex i_mmap_mutex; /* protect tree, count, list */ /* Protected by tree_lock together with the radix tree */ unsigned long nrpages; /* number of total pages */ @@ -467,8 +466,7 @@ int mapping_tagged(struct address_space */ static inline int mapping_mapped(struct address_space *mapping) { - return !RB_EMPTY_ROOT(&mapping->i_mmap) || - !list_empty(&mapping->i_mmap_nonlinear); + return !RB_EMPTY_ROOT(&mapping->i_mmap); }
/* --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1728,12 +1728,6 @@ struct vm_area_struct *vma_interval_tree for (vma = vma_interval_tree_iter_first(root, start, last); \ vma; vma = vma_interval_tree_iter_next(vma, start, last))
-static inline void vma_nonlinear_insert(struct vm_area_struct *vma, - struct list_head *list) -{ - list_add_tail(&vma->shared.nonlinear, list); -} - void anon_vma_interval_tree_insert(struct anon_vma_chain *node, struct rb_root *root); void anon_vma_interval_tree_remove(struct anon_vma_chain *node, --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -272,15 +272,13 @@ struct vm_area_struct {
/* * For areas with an address space and backing store, - * linkage into the address_space->i_mmap interval tree, or - * linkage of vma in the address_space->i_mmap_nonlinear list. + * linkage into the address_space->i_mmap interval tree. */ union { struct { struct rb_node rb; unsigned long rb_subtree_last; } linear; - struct list_head nonlinear; } shared;
/* --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -232,7 +232,6 @@ int page_mapped_in_vma(struct page *page * arg: passed to rmap_one() and invalid_vma() * rmap_one: executed on each vma where page is mapped * done: for checking traversing termination condition - * file_nonlinear: for handling file nonlinear mapping * anon_lock: for getting anon_lock by optimized way rather than default * invalid_vma: for skipping uninterested vma */ @@ -241,7 +240,6 @@ struct rmap_walk_control { int (*rmap_one)(struct page *page, struct vm_area_struct *vma, unsigned long addr, void *arg); int (*done)(struct page *page); - int (*file_nonlinear)(struct page *, struct address_space *, void *arg); struct anon_vma *(*anon_lock)(struct page *page); bool (*invalid_vma)(struct vm_area_struct *vma, void *arg); }; --- a/kernel/fork.c +++ b/kernel/fork.c @@ -430,12 +430,8 @@ static int dup_mmap(struct mm_struct *mm mapping->i_mmap_writable++; flush_dcache_mmap_lock(mapping); /* insert tmp into the share list, just after mpnt */ - if (unlikely(tmp->vm_flags & VM_NONLINEAR)) - vma_nonlinear_insert(tmp, - &mapping->i_mmap_nonlinear); - else - vma_interval_tree_insert_after(tmp, mpnt, - &mapping->i_mmap); + vma_interval_tree_insert_after(tmp, mpnt, + &mapping->i_mmap); flush_dcache_mmap_unlock(mapping); mutex_unlock(&mapping->i_mmap_mutex); } --- a/mm/migrate.c +++ b/mm/migrate.c @@ -181,37 +181,6 @@ out: }
/* - * Congratulations to trinity for discovering this bug. - * mm/fremap.c's remap_file_pages() accepts any range within a single vma to - * convert that vma to VM_NONLINEAR; and generic_file_remap_pages() will then - * replace the specified range by file ptes throughout (maybe populated after). - * If page migration finds a page within that range, while it's still located - * by vma_interval_tree rather than lost to i_mmap_nonlinear list, no problem: - * zap_pte() clears the temporary migration entry before mmap_sem is dropped. - * But if the migrating page is in a part of the vma outside the range to be - * remapped, then it will not be cleared, and remove_migration_ptes() needs to - * deal with it. Fortunately, this part of the vma is of course still linear, - * so we just need to use linear location on the nonlinear list. - */ -static int remove_linear_migration_ptes_from_nonlinear(struct page *page, - struct address_space *mapping, void *arg) -{ - struct vm_area_struct *vma; - /* hugetlbfs does not support remap_pages, so no huge pgoff worries */ - pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); - unsigned long addr; - - list_for_each_entry(vma, - &mapping->i_mmap_nonlinear, shared.nonlinear) { - - addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); - if (addr >= vma->vm_start && addr < vma->vm_end) - remove_migration_pte(page, vma, addr, arg); - } - return SWAP_AGAIN; -} - -/* * Get rid of all migration entries and replace them by * references to the indicated page. */ @@ -220,7 +189,6 @@ static void remove_migration_ptes(struct struct rmap_walk_control rwc = { .rmap_one = remove_migration_pte, .arg = old, - .file_nonlinear = remove_linear_migration_ptes_from_nonlinear, };
rmap_walk(new, &rwc); --- a/mm/mmap.c +++ b/mm/mmap.c @@ -219,10 +219,7 @@ static void __remove_shared_vm_struct(st mapping->i_mmap_writable--;
flush_dcache_mmap_lock(mapping); - if (unlikely(vma->vm_flags & VM_NONLINEAR)) - list_del_init(&vma->shared.nonlinear); - else - vma_interval_tree_remove(vma, &mapping->i_mmap); + vma_interval_tree_remove(vma, &mapping->i_mmap); flush_dcache_mmap_unlock(mapping); }
@@ -639,10 +636,7 @@ static void __vma_link_file(struct vm_ar mapping->i_mmap_writable++;
flush_dcache_mmap_lock(mapping); - if (unlikely(vma->vm_flags & VM_NONLINEAR)) - vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); - else - vma_interval_tree_insert(vma, &mapping->i_mmap); + vma_interval_tree_insert(vma, &mapping->i_mmap); flush_dcache_mmap_unlock(mapping); } } @@ -777,14 +771,11 @@ again: remove_next = 1 + (end > next->
if (file) { mapping = file->f_mapping; - if (!(vma->vm_flags & VM_NONLINEAR)) { - root = &mapping->i_mmap; - uprobe_munmap(vma, vma->vm_start, vma->vm_end); - - if (adjust_next) - uprobe_munmap(next, next->vm_start, - next->vm_end); - } + root = &mapping->i_mmap; + uprobe_munmap(vma, vma->vm_start, vma->vm_end); + + if (adjust_next) + uprobe_munmap(next, next->vm_start, next->vm_end);
mutex_lock(&mapping->i_mmap_mutex); if (insert) { @@ -3187,8 +3178,7 @@ static void vm_lock_mapping(struct mm_st * * mmap_sem in write mode is required in order to block all operations * that could modify pagetables and free pages without need of - * altering the vma layout (for example populate_range() with - * nonlinear vmas). It's also needed in write mode to avoid new + * altering the vma layout. It's also needed in write mode to avoid new * anon_vmas to be associated with existing vmas. * * A single task can't take more than one mm_take_all_locks() in a row --- a/mm/rmap.c +++ b/mm/rmap.c @@ -597,9 +597,8 @@ unsigned long page_address_in_vma(struct if (!vma->anon_vma || !page__anon_vma || vma->anon_vma->root != page__anon_vma->root) return -EFAULT; - } else if (page->mapping && !(vma->vm_flags & VM_NONLINEAR)) { - if (!vma->vm_file || - vma->vm_file->f_mapping != page->mapping) + } else if (page->mapping) { + if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping) return -EFAULT; } else return -EFAULT; @@ -1286,7 +1285,6 @@ static int try_to_unmap_one(struct page if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, address, pte, swp_pte); - BUG_ON(pte_file(*pte)); } else if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION)) { /* Establish migration entry for a file page */ @@ -1328,207 +1326,6 @@ out_mlock: return ret; }
-/* - * objrmap doesn't work for nonlinear VMAs because the assumption that - * offset-into-file correlates with offset-into-virtual-addresses does not hold. - * Consequently, given a particular page and its ->index, we cannot locate the - * ptes which are mapping that page without an exhaustive linear search. - * - * So what this code does is a mini "virtual scan" of each nonlinear VMA which - * maps the file to which the target page belongs. The ->vm_private_data field - * holds the current cursor into that scan. Successive searches will circulate - * around the vma's virtual address space. - * - * So as more replacement pressure is applied to the pages in a nonlinear VMA, - * more scanning pressure is placed against them as well. Eventually pages - * will become fully unmapped and are eligible for eviction. - * - * For very sparsely populated VMAs this is a little inefficient - chances are - * there there won't be many ptes located within the scan cluster. In this case - * maybe we could scan further - to the end of the pte page, perhaps. - * - * Mlocked pages: check VM_LOCKED under mmap_sem held for read, if we can - * acquire it without blocking. If vma locked, mlock the pages in the cluster, - * rather than unmapping them. If we encounter the "check_page" that vmscan is - * trying to unmap, return SWAP_MLOCK, else default SWAP_AGAIN. - */ -#define CLUSTER_SIZE min(32*PAGE_SIZE, PMD_SIZE) -#define CLUSTER_MASK (~(CLUSTER_SIZE - 1)) - -static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, - struct vm_area_struct *vma, struct page *check_page) -{ - struct mm_struct *mm = vma->vm_mm; - pmd_t *pmd; - pte_t *pte; - pte_t pteval; - spinlock_t *ptl; - struct page *page; - unsigned long address; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ - unsigned long end; - int ret = SWAP_AGAIN; - int locked_vma = 0; - - address = (vma->vm_start + cursor) & CLUSTER_MASK; - end = address + CLUSTER_SIZE; - if (address < vma->vm_start) - address = vma->vm_start; - if (end > vma->vm_end) - end = vma->vm_end; - - pmd = mm_find_pmd(mm, address); - if (!pmd) - return ret; - - mmun_start = address; - mmun_end = end; - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); - - /* - * If we can acquire the mmap_sem for read, and vma is VM_LOCKED, - * keep the sem while scanning the cluster for mlocking pages. - */ - if (down_read_trylock(&vma->vm_mm->mmap_sem)) { - locked_vma = (vma->vm_flags & VM_LOCKED); - if (!locked_vma) - up_read(&vma->vm_mm->mmap_sem); /* don't need it */ - } - - pte = pte_offset_map_lock(mm, pmd, address, &ptl); - - /* Update high watermark before we lower rss */ - update_hiwater_rss(mm); - - for (; address < end; pte++, address += PAGE_SIZE) { - if (!pte_present(*pte)) - continue; - page = vm_normal_page(vma, address, *pte); - BUG_ON(!page || PageAnon(page)); - - if (locked_vma) { - if (page == check_page) { - /* we know we have check_page locked */ - mlock_vma_page(page); - ret = SWAP_MLOCK; - } else if (trylock_page(page)) { - /* - * If we can lock the page, perform mlock. - * Otherwise leave the page alone, it will be - * eventually encountered again later. - */ - mlock_vma_page(page); - unlock_page(page); - } - continue; /* don't unmap */ - } - - if (ptep_clear_flush_young_notify(vma, address, pte)) - continue; - - /* Nuke the page table entry. 
*/ - flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); - - /* If nonlinear, store the file page offset in the pte. */ - if (page->index != linear_page_index(vma, address)) { - pte_t ptfile = pgoff_to_pte(page->index); - if (pte_soft_dirty(pteval)) - ptfile = pte_file_mksoft_dirty(ptfile); - set_pte_at(mm, address, pte, ptfile); - } - - /* Move the dirty bit to the physical page now the pte is gone. */ - if (pte_dirty(pteval)) - set_page_dirty(page); - - page_remove_rmap(page); - page_cache_release(page); - dec_mm_counter(mm, MM_FILEPAGES); - (*mapcount)--; - } - pte_unmap_unlock(pte - 1, ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); - if (locked_vma) - up_read(&vma->vm_mm->mmap_sem); - return ret; -} - -static int try_to_unmap_nonlinear(struct page *page, - struct address_space *mapping, void *arg) -{ - struct vm_area_struct *vma; - int ret = SWAP_AGAIN; - unsigned long cursor; - unsigned long max_nl_cursor = 0; - unsigned long max_nl_size = 0; - unsigned int mapcount; - - list_for_each_entry(vma, - &mapping->i_mmap_nonlinear, shared.nonlinear) { - - cursor = (unsigned long) vma->vm_private_data; - if (cursor > max_nl_cursor) - max_nl_cursor = cursor; - cursor = vma->vm_end - vma->vm_start; - if (cursor > max_nl_size) - max_nl_size = cursor; - } - - if (max_nl_size == 0) { /* all nonlinears locked or reserved ? */ - return SWAP_FAIL; - } - - /* - * We don't try to search for this page in the nonlinear vmas, - * and page_referenced wouldn't have found it anyway. Instead - * just walk the nonlinear vmas trying to age and unmap some. - * The mapcount of the page we came in with is irrelevant, - * but even so use it as a guide to how hard we should try? - */ - mapcount = page_mapcount(page); - if (!mapcount) - return ret; - - cond_resched(); - - max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK; - if (max_nl_cursor == 0) - max_nl_cursor = CLUSTER_SIZE; - - do { - list_for_each_entry(vma, - &mapping->i_mmap_nonlinear, shared.nonlinear) { - - cursor = (unsigned long) vma->vm_private_data; - while (cursor < max_nl_cursor && - cursor < vma->vm_end - vma->vm_start) { - if (try_to_unmap_cluster(cursor, &mapcount, - vma, page) == SWAP_MLOCK) - ret = SWAP_MLOCK; - cursor += CLUSTER_SIZE; - vma->vm_private_data = (void *) cursor; - if ((int)mapcount <= 0) - return ret; - } - vma->vm_private_data = (void *) max_nl_cursor; - } - cond_resched(); - max_nl_cursor += CLUSTER_SIZE; - } while (max_nl_cursor <= max_nl_size); - - /* - * Don't loop forever (perhaps all the remaining pages are - * in locked vmas). Reset cursor on all unreserved nonlinear - * vmas, now forgetting on which ones it had fallen behind. - */ - list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.nonlinear) - vma->vm_private_data = NULL; - - return ret; -} - bool is_vma_temporary_stack(struct vm_area_struct *vma) { int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP); @@ -1574,7 +1371,6 @@ int try_to_unmap(struct page *page, enum .rmap_one = try_to_unmap_one, .arg = (void *)flags, .done = page_not_mapped, - .file_nonlinear = try_to_unmap_nonlinear, .anon_lock = page_lock_anon_vma_read, };
@@ -1620,12 +1416,6 @@ int try_to_munlock(struct page *page) .rmap_one = try_to_unmap_one, .arg = (void *)TTU_MUNLOCK, .done = page_not_mapped, - /* - * We don't bother to try to find the munlocked page in - * nonlinears. It's costly. Instead, later, page reclaim logic - * may call try_to_unmap() and recover PG_mlocked lazily. - */ - .file_nonlinear = NULL, .anon_lock = page_lock_anon_vma_read,
}; @@ -1753,14 +1543,6 @@ static int rmap_walk_file(struct page *p goto done; }
- if (!rwc->file_nonlinear) - goto done; - - if (list_empty(&mapping->i_mmap_nonlinear)) - goto done; - - ret = rwc->file_nonlinear(page, mapping, rwc->arg); - done: mutex_unlock(&mapping->i_mmap_mutex); return ret; --- a/mm/swap.c +++ b/mm/swap.c @@ -1103,10 +1103,8 @@ void __init swap_setup(void)
if (bdi_init(swapper_spaces[0].backing_dev_info)) panic("Failed to init swap bdi"); - for (i = 0; i < MAX_SWAPFILES; i++) { + for (i = 0; i < MAX_SWAPFILES; i++) spin_lock_init(&swapper_spaces[i].tree_lock); - INIT_LIST_HEAD(&swapper_spaces[i].i_mmap_nonlinear); - } #endif
/* Use a smaller cluster for small-memory machines */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Tom Lendacky thomas.lendacky@amd.com
commit bc226f07dcd3c9ef0b7f6236fe356ea4a9cb4769 upstream.
Expose the new virtualized architectural mechanism, VIRT_SSBD, for using speculative store bypass disable (SSBD) under SVM. This will allow guests to use SSBD on hardware that uses non-architectural mechanisms for enabling SSBD.
[ tglx: Folded the migration fixup from Paolo Bonzini ]
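As a sanity check, a guest can probe for the newly advertised bit from userspace. A minimal sketch (not part of the patch, and assuming VIRT_SSBD sits at bit 25 of CPUID 0x80000008 EBX as in the upstream feature definition):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
		return 1;	/* extended leaf not available */

	printf("VIRT_SSBD %s\n", (ebx & (1u << 25)) ? "present" : "absent");
	return 0;
}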
Signed-off-by: Tom Lendacky thomas.lendacky@amd.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [bwh: Backported to 3.16: - There is no SMM support or cpu_has_high_real_mode_segbase operation - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kernel/cpu/common.c | 3 ++- arch/x86/kvm/cpuid.c | 11 +++++++++-- arch/x86/kvm/cpuid.h | 8 ++++++++ arch/x86/kvm/svm.c | 23 +++++++++++++++++++++++ arch/x86/kvm/vmx.c | 12 ++++++++++++ arch/x86/kvm/x86.c | 7 +++---- 7 files changed, 58 insertions(+), 7 deletions(-)
--- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -671,6 +671,7 @@ struct kvm_x86_ops { int (*hardware_setup)(void); /* __init */ void (*hardware_unsetup)(void); /* __exit */ bool (*cpu_has_accelerated_tpr)(void); + bool (*has_emulated_msr)(int index); void (*cpuid_update)(struct kvm_vcpu *vcpu);
/* Create, but do not attach this VCPU */ --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -700,7 +700,8 @@ static void init_speculation_control(str if (cpu_has(c, X86_FEATURE_INTEL_STIBP)) set_cpu_cap(c, X86_FEATURE_STIBP);
- if (cpu_has(c, X86_FEATURE_SPEC_CTRL_SSBD)) + if (cpu_has(c, X86_FEATURE_SPEC_CTRL_SSBD) || + cpu_has(c, X86_FEATURE_VIRT_SSBD)) set_cpu_cap(c, X86_FEATURE_SSBD);
if (cpu_has(c, X86_FEATURE_AMD_IBRS)) { --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -302,7 +302,7 @@ static inline int __do_cpuid_ent(struct
/* cpuid 0x80000008.ebx */ const u32 kvm_cpuid_8000_0008_ebx_x86_features = - F(AMD_IBPB) | F(AMD_IBRS); + F(AMD_IBPB) | F(AMD_IBRS) | F(VIRT_SSBD);
/* cpuid 0xC0000001.edx */ const u32 kvm_supported_word5_x86_features = @@ -524,13 +524,20 @@ static inline int __do_cpuid_ent(struct g_phys_as = phys_as; entry->eax = g_phys_as | (virt_as << 8); entry->edx = 0; - /* IBRS and IBPB aren't necessarily present in hardware cpuid */ + /* + * IBRS, IBPB and VIRT_SSBD aren't necessarily present in + * hardware cpuid + */ if (boot_cpu_has(X86_FEATURE_AMD_IBPB)) entry->ebx |= F(AMD_IBPB); if (boot_cpu_has(X86_FEATURE_AMD_IBRS)) entry->ebx |= F(AMD_IBRS); + if (boot_cpu_has(X86_FEATURE_VIRT_SSBD)) + entry->ebx |= F(VIRT_SSBD); entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features; cpuid_mask(&entry->ebx, 11); + if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD)) + entry->ebx |= F(VIRT_SSBD); break; } case 0x80000019: --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -134,5 +134,13 @@ static inline bool guest_cpuid_has_arch_ return best && (best->edx & bit(X86_FEATURE_ARCH_CAPABILITIES)); }
+static inline bool guest_cpuid_has_virt_ssbd(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0); + return best && (best->ebx & bit(X86_FEATURE_VIRT_SSBD)); +} +
#endif --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3151,6 +3151,13 @@ static int svm_get_msr(struct kvm_vcpu *
msr_info->data = svm->spec_ctrl; break; + case MSR_AMD64_VIRT_SPEC_CTRL: + if (!msr_info->host_initiated && + !guest_cpuid_has_virt_ssbd(vcpu)) + return 1; + + msr_info->data = svm->virt_spec_ctrl; + break; case MSR_IA32_UCODE_REV: msr_info->data = 0x01000065; break; @@ -3266,6 +3273,16 @@ static int svm_set_msr(struct kvm_vcpu * break; set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1); break; + case MSR_AMD64_VIRT_SPEC_CTRL: + if (!msr->host_initiated && + !guest_cpuid_has_virt_ssbd(vcpu)) + return 1; + + if (data & ~SPEC_CTRL_SSBD) + return 1; + + svm->virt_spec_ctrl = data; + break; case MSR_STAR: svm->vmcb->save.star = data; break; @@ -4202,6 +4219,11 @@ static bool svm_cpu_has_accelerated_tpr( return false; }
+static bool svm_has_emulated_msr(int index) +{ + return true; +} + static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; @@ -4468,6 +4490,7 @@ static struct kvm_x86_ops svm_x86_ops = .hardware_enable = svm_hardware_enable, .hardware_disable = svm_hardware_disable, .cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr, + .has_emulated_msr = svm_has_emulated_msr,
.vcpu_create = svm_create_vcpu, .vcpu_free = svm_free_vcpu, --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7356,6 +7356,17 @@ static void vmx_handle_external_intr(str local_irq_enable(); }
+static bool vmx_has_emulated_msr(int index) +{ + switch (index) { + case MSR_AMD64_VIRT_SPEC_CTRL: + /* This is AMD only. */ + return false; + default: + return true; + } +} + static bool vmx_mpx_supported(void) { return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_BNDCFGS) && @@ -9034,6 +9045,7 @@ static struct kvm_x86_ops vmx_x86_ops = .hardware_enable = hardware_enable, .hardware_disable = hardware_disable, .cpu_has_accelerated_tpr = report_flexpriority, + .has_emulated_msr = vmx_has_emulated_msr,
.vcpu_create = vmx_create_vcpu, .vcpu_free = vmx_free_vcpu, --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -906,6 +906,7 @@ static u32 emulated_msrs[] = { MSR_IA32_MISC_ENABLE, MSR_IA32_MCG_STATUS, MSR_IA32_MCG_CTL, + MSR_AMD64_VIRT_SPEC_CTRL, };
static unsigned num_emulated_msrs; @@ -4039,10 +4040,8 @@ static void kvm_init_msr_list(void) num_msrs_to_save = j;
for (i = j = 0; i < ARRAY_SIZE(emulated_msrs); i++) { - switch (emulated_msrs[i]) { - default: - break; - } + if (!kvm_x86_ops->has_emulated_msr(emulated_msrs[i])) + continue;
if (j < i) emulated_msrs[j] = emulated_msrs[i];
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 6e76d4b20bf6b514408ab5bd07f4a76723259b64 upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Acked-by: Martin Schwidefsky schwidefsky@de.ibm.com Cc: Heiko Carstens heiko.carstens@de.ibm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/s390/include/asm/pgtable.h | 29 ++++------------------------- 1 file changed, 4 insertions(+), 25 deletions(-)
--- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -237,10 +237,10 @@ extern unsigned long MODULES_END; _PAGE_DIRTY | _PAGE_YOUNG)
/* - * handle_pte_fault uses pte_present, pte_none and pte_file to find out the - * pte type WITHOUT holding the page table lock. The _PAGE_PRESENT bit - * is used to distinguish present from not-present ptes. It is changed only - * with the page table lock held. + * handle_pte_fault uses pte_present and pte_none to find out the pte type + * WITHOUT holding the page table lock. The _PAGE_PRESENT bit is used to + * distinguish present from not-present ptes. It is changed only with the page + * table lock held. * * The following table gives the different possible bit combinations for * the pte hardware and software bits in the last 12 bits of a pte: @@ -267,7 +267,6 @@ extern unsigned long MODULES_END; * * pte_present is true for the bit pattern .xx...xxxxx1, (pte & 0x001) == 0x001 * pte_none is true for the bit pattern .10...xxxx00, (pte & 0x603) == 0x400 - * pte_file is true for the bit pattern .11...xxxxx0, (pte & 0x601) == 0x600 * pte_swap is true for the bit pattern .10...xxxx10, (pte & 0x603) == 0x402 */
@@ -644,13 +643,6 @@ static inline int pte_swap(pte_t pte) == (_PAGE_INVALID | _PAGE_TYPE); }
-static inline int pte_file(pte_t pte) -{ - /* Bit pattern: (pte & 0x601) == 0x600 */ - return (pte_val(pte) & (_PAGE_INVALID | _PAGE_PROTECT | _PAGE_PRESENT)) - == (_PAGE_INVALID | _PAGE_PROTECT); -} - static inline int pte_special(pte_t pte) { return (pte_val(pte) & _PAGE_SPECIAL); @@ -1710,19 +1702,6 @@ static inline pte_t mk_swap_pte(unsigned #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-#ifndef CONFIG_64BIT -# define PTE_FILE_MAX_BITS 26 -#else /* CONFIG_64BIT */ -# define PTE_FILE_MAX_BITS 59 -#endif /* CONFIG_64BIT */ - -#define pte_to_pgoff(__pte) \ - ((((__pte).pte >> 12) << 7) + (((__pte).pte >> 1) & 0x7f)) - -#define pgoff_to_pte(__off) \ - ((pte_t) { ((((__off) & 0x7f) << 1) + (((__off) >> 7) << 12)) \ - | _PAGE_INVALID | _PAGE_PROTECT }) - #endif /* !__ASSEMBLY__ */
#define kern_addr_valid(addr) (1)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit 7eb8956a7fec3c1f0abc2a5517dada99ccc8a961 upstream.
The availability of the SPEC_CTRL MSR is enumerated by a CPUID bit on Intel and implied by IBRS or STIBP support on AMD. That's just confusing, and if an AMD CPU does not support IBRS because the underlying problem has been fixed, but has another valid bit in the SPEC_CTRL MSR, the thing falls apart.
Add a synthetic feature bit X86_FEATURE_MSR_SPEC_CTRL to denote the availability on both Intel and AMD.
While at it replace the boot_cpu_has() checks with static_cpu_has() where possible. This prevents late microcode loading from exposing SPEC_CTRL, but late loading is already very limited as it does not reevaluate the mitigation options and other bits and pieces. Having static_cpu_has() is the simplest and least fragile solution.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [bwh: Backported to 3.16: - Use the next available bit number in CPU feature word 7 - Adjust filename, context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/cpufeature.h | 1 + arch/x86/kernel/cpu/bugs.c | 18 +++++++++++------- arch/x86/kernel/cpu/common.c | 9 +++++++-- arch/x86/kernel/cpu/intel.c | 1 + 4 files changed, 20 insertions(+), 9 deletions(-)
--- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -196,6 +196,7 @@ #define X86_FEATURE_IBRS (7*32+16) /* Indirect Branch Restricted Speculation */ #define X86_FEATURE_IBPB (7*32+17) /* Indirect Branch Prediction Barrier */ #define X86_FEATURE_STIBP (7*32+18) /* Single Thread Indirect Branch Predictors */ +#define X86_FEATURE_MSR_SPEC_CTRL (7*32+19) /* "" MSR SPEC_CTRL is implemented */
#define X86_FEATURE_RETPOLINE (7*32+29) /* "" Generic Retpoline mitigation for Spectre variant 2 */ #define X86_FEATURE_RETPOLINE_AMD (7*32+30) /* "" AMD Retpoline mitigation for Spectre variant 2 */ --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -121,7 +121,7 @@ void __init check_bugs(void) * have unknown values. AMD64_LS_CFG MSR is cached in the early AMD * init code as it is not enumerated and depends on the family. */ - if (boot_cpu_has(X86_FEATURE_IBRS)) + if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) rdmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
/* Select the proper spectre mitigation before patching alternatives */ @@ -206,7 +206,7 @@ u64 x86_spec_ctrl_get_default(void) { u64 msrval = x86_spec_ctrl_base;
- if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + if (static_cpu_has(X86_FEATURE_SPEC_CTRL)) msrval |= ssbd_tif_to_spec_ctrl(current_thread_info()->flags); return msrval; } @@ -216,10 +216,12 @@ void x86_spec_ctrl_set_guest(u64 guest_s { u64 host = x86_spec_ctrl_base;
- if (!boot_cpu_has(X86_FEATURE_IBRS)) + /* Is MSR_SPEC_CTRL implemented ? */ + if (!static_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) return;
- if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + /* Intel controls SSB in MSR_SPEC_CTRL */ + if (static_cpu_has(X86_FEATURE_SPEC_CTRL)) host |= ssbd_tif_to_spec_ctrl(current_thread_info()->flags);
if (host != guest_spec_ctrl) @@ -231,10 +233,12 @@ void x86_spec_ctrl_restore_host(u64 gues { u64 host = x86_spec_ctrl_base;
- if (!boot_cpu_has(X86_FEATURE_IBRS)) + /* Is MSR_SPEC_CTRL implemented ? */ + if (!static_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) return;
- if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + /* Intel controls SSB in MSR_SPEC_CTRL */ + if (static_cpu_has(X86_FEATURE_SPEC_CTRL)) host |= ssbd_tif_to_spec_ctrl(current_thread_info()->flags);
if (host != guest_spec_ctrl) @@ -668,7 +672,7 @@ int arch_prctl_spec_ctrl_get(struct task
void x86_spec_ctrl_setup_ap(void) { - if (boot_cpu_has(X86_FEATURE_IBRS)) + if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) x86_spec_ctrl_set(x86_spec_ctrl_base & ~x86_spec_ctrl_mask);
if (ssb_mode == SPEC_STORE_BYPASS_DISABLE) --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -694,19 +694,24 @@ static void init_speculation_control(str if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) { set_cpu_cap(c, X86_FEATURE_IBRS); set_cpu_cap(c, X86_FEATURE_IBPB); + set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL); }
if (cpu_has(c, X86_FEATURE_INTEL_STIBP)) set_cpu_cap(c, X86_FEATURE_STIBP);
- if (cpu_has(c, X86_FEATURE_AMD_IBRS)) + if (cpu_has(c, X86_FEATURE_AMD_IBRS)) { set_cpu_cap(c, X86_FEATURE_IBRS); + set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL); + }
if (cpu_has(c, X86_FEATURE_AMD_IBPB)) set_cpu_cap(c, X86_FEATURE_IBPB);
- if (cpu_has(c, X86_FEATURE_AMD_STIBP)) + if (cpu_has(c, X86_FEATURE_AMD_STIBP)) { set_cpu_cap(c, X86_FEATURE_STIBP); + set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL); + } }
void get_cpu_cap(struct cpuinfo_x86 *c) --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -119,6 +119,7 @@ static void early_init_intel(struct cpui setup_clear_cpu_cap(X86_FEATURE_IBPB); setup_clear_cpu_cap(X86_FEATURE_STIBP); setup_clear_cpu_cap(X86_FEATURE_SPEC_CTRL); + setup_clear_cpu_cap(X86_FEATURE_MSR_SPEC_CTRL); setup_clear_cpu_cap(X86_FEATURE_INTEL_STIBP); setup_clear_cpu_cap(X86_FEATURE_SSBD); }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit b007ea798f5c568d3f464d37288220ef570f062c upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
This patch also adjusts __SWP_TYPE_SHIFT, effectively increasing the maximum size of a swap file to 128G.
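A back-of-the-envelope check of the 128G figure (illustrative only, not kernel code): with __SWP_TYPE_SHIFT at 2 and 5 type bits, the offset takes the remaining 25 bits of a 32-bit PTE, and 2^25 pages of 4 KiB each is 128 GiB.

#include <stdio.h>

int main(void)
{
	unsigned int offset_bits = 32 - 5 - 2;			/* 25 */
	unsigned long long bytes = (1ULL << offset_bits) * 4096;	/* pages * page size */

	printf("%llu GiB\n", bytes >> 30);			/* prints 128 */
	return 0;
}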
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Russell King linux@arm.linux.org.uk Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/arm/include/asm/pgtable-2level.h | 1 - arch/arm/include/asm/pgtable-3level.h | 1 - arch/arm/include/asm/pgtable-nommu.h | 2 -- arch/arm/include/asm/pgtable.h | 20 +++----------------- arch/arm/mm/proc-macros.S | 2 +- 5 files changed, 4 insertions(+), 22 deletions(-)
--- a/arch/arm/include/asm/pgtable-2level.h +++ b/arch/arm/include/asm/pgtable-2level.h @@ -118,7 +118,6 @@ #define L_PTE_VALID (_AT(pteval_t, 1) << 0) /* Valid */ #define L_PTE_PRESENT (_AT(pteval_t, 1) << 0) #define L_PTE_YOUNG (_AT(pteval_t, 1) << 1) -#define L_PTE_FILE (_AT(pteval_t, 1) << 2) /* only when !PRESENT */ #define L_PTE_DIRTY (_AT(pteval_t, 1) << 6) #define L_PTE_RDONLY (_AT(pteval_t, 1) << 7) #define L_PTE_USER (_AT(pteval_t, 1) << 8) --- a/arch/arm/include/asm/pgtable-3level.h +++ b/arch/arm/include/asm/pgtable-3level.h @@ -77,7 +77,6 @@ */ #define L_PTE_VALID (_AT(pteval_t, 1) << 0) /* Valid */ #define L_PTE_PRESENT (_AT(pteval_t, 3) << 0) /* Present */ -#define L_PTE_FILE (_AT(pteval_t, 1) << 2) /* only when !PRESENT */ #define L_PTE_USER (_AT(pteval_t, 1) << 6) /* AP[1] */ #define L_PTE_SHARED (_AT(pteval_t, 3) << 8) /* SH[1:0], inner shareable */ #define L_PTE_YOUNG (_AT(pteval_t, 1) << 10) /* AF */ --- a/arch/arm/include/asm/pgtable-nommu.h +++ b/arch/arm/include/asm/pgtable-nommu.h @@ -54,8 +54,6 @@
typedef pte_t *pte_addr_t;
-static inline int pte_file(pte_t pte) { return 0; } - /* * ZERO_PAGE is a global shared page that is always zero: used * for zero-mapped memory areas etc.. --- a/arch/arm/include/asm/pgtable.h +++ b/arch/arm/include/asm/pgtable.h @@ -279,12 +279,12 @@ static inline pte_t pte_modify(pte_t pte * * 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 * 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 - * <--------------- offset ----------------------> < type -> 0 0 0 + * <--------------- offset ------------------------> < type -> 0 0 * - * This gives us up to 31 swap files and 64GB per swap file. Note that + * This gives us up to 31 swap files and 128GB per swap file. Note that * the offset field is always non-zero. */ -#define __SWP_TYPE_SHIFT 3 +#define __SWP_TYPE_SHIFT 2 #define __SWP_TYPE_BITS 5 #define __SWP_TYPE_MASK ((1 << __SWP_TYPE_BITS) - 1) #define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT) @@ -303,20 +303,6 @@ static inline pte_t pte_modify(pte_t pte */ #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > __SWP_TYPE_BITS)
-/* - * Encode and decode a file entry. File entries are stored in the Linux - * page tables as follows: - * - * 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 - * 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 - * <----------------------- offset ------------------------> 1 0 0 - */ -#define pte_file(pte) (pte_val(pte) & L_PTE_FILE) -#define pte_to_pgoff(x) (pte_val(x) >> 3) -#define pgoff_to_pte(x) __pte(((x) << 3) | L_PTE_FILE) - -#define PTE_FILE_MAX_BITS 29 - /* Needs to be defined here and not in linux/mm.h, as it is arch dependent */ /* FIXME: this is not correct */ #define kern_addr_valid(addr) (1) --- a/arch/arm/mm/proc-macros.S +++ b/arch/arm/mm/proc-macros.S @@ -98,7 +98,7 @@ #endif #if !defined (CONFIG_ARM_LPAE) && \ (L_PTE_XN+L_PTE_USER+L_PTE_RDONLY+L_PTE_DIRTY+L_PTE_YOUNG+\ - L_PTE_FILE+L_PTE_PRESENT) > L_PTE_SHARED + L_PTE_PRESENT) > L_PTE_SHARED #error Invalid Linux PTE bit settings #endif #endif /* CONFIG_MMU */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Vlastimil Babka vbabka@suse.cz
commit 9df9516940a61d29aedf4d91b483ca6597e7d480 upstream.
On 32bit PAE kernels on 64bit hardware with enough physical bits, l1tf_pfn_limit() will overflow unsigned long. This in turn affects max_swapfile_size() and can lead to swapon returning -EINVAL. This has been observed in a 32bit guest with 42 bits physical address size, where max_swapfile_size() overflows exactly to 1 << 32, thus zero, and produces the following warning to dmesg:
[ 6.396845] Truncating oversized swap area, only using 0k out of 2047996k
Fix this by using unsigned long long instead.
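A standalone sketch of the overflow (the 42-bit figure is taken from the report above, and a 32-bit unsigned int stands in for the 32-bit kernel's unsigned long; not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int phys_bits = 42, page_shift = 12;
	unsigned int shift = 3;	/* PAGE_SHIFT - SWP_OFFSET_FIRST_BIT on PAE */

	/* l1tf_pfn_limit() + 1, once as the old 32-bit type, once widened */
	unsigned int narrow = ((1u << (phys_bits - 1 - page_shift)) - 1) + 1;
	unsigned long long wide = ((1ULL << (phys_bits - 1 - page_shift)) - 1) + 1;

	/* mimics: l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT */
	narrow <<= shift;	/* 2^32 wraps to 0 */
	wide <<= shift;		/* 2^32 as intended */

	printf("32-bit limit: %u pages, 64-bit limit: %llu pages\n", narrow, wide);
	return 0;
}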
Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf") Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Reported-by: Dominique Leuenberger dimstar@suse.de Reported-by: Adrian Schroeter adrian@suse.de Signed-off-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Andi Kleen ak@linux.intel.com Acked-by: Michal Hocko mhocko@suse.com Cc: "H . Peter Anvin" hpa@zytor.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Dave Hansen dave.hansen@intel.com Cc: Michal Hocko mhocko@kernel.org Link: https://lkml.kernel.org/r/20180820095835.5298-1-vbabka@suse.cz Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/processor.h | 4 ++-- arch/x86/mm/init.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-)
--- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -165,9 +165,9 @@ extern const struct seq_operations cpuin extern void cpu_detect(struct cpuinfo_x86 *c); extern void fpu_detect(struct cpuinfo_x86 *c);
-static inline unsigned long l1tf_pfn_limit(void) +static inline unsigned long long l1tf_pfn_limit(void) { - return BIT(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1; + return BIT_ULL(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1; }
extern void early_cpu_init(void); --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -710,7 +710,7 @@ unsigned long max_swapfile_size(void)
if (boot_cpu_has_bug(X86_BUG_L1TF)) { /* Limit the swap file size to MAX_PA/2 for L1TF workaround */ - unsigned long l1tf_limit = l1tf_pfn_limit() + 1; + unsigned long long l1tf_limit = l1tf_pfn_limit() + 1; /* * We encode swap offsets also with 3 bits below those for pfn * which makes the usable limit higher. @@ -718,7 +718,7 @@ unsigned long max_swapfile_size(void) #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT; #endif - pages = min_t(unsigned long, l1tf_limit, pages); + pages = min_t(unsigned long long, l1tf_limit, pages); } return pages; }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Linus Torvalds torvalds@linux-foundation.org
commit 2f22b4cd45b67b3496f4aa4c7180a1271c6452f6 upstream.
With L1 terminal fault the CPU speculates into unmapped PTEs, and the resulting side effects allow reading the memory the PTE is pointing to, if its contents are still in the L1 cache.
For swapped out pages Linux uses unmapped PTEs and stores a swap entry into them.
To protect against L1TF it must be ensured that the swap entry is not pointing to valid memory, which requires setting higher bits (between bit 36 and bit 45) that are inside the CPUs physical address space, but outside any real memory.
To do this invert the offset to make sure the higher bits are always set, as long as the swap file is not too big.
Note there is no workaround for 32bit !PAE, or on systems which have more than MAX_PA/2 worth of memory. The latter case is very unlikely to happen on real systems.
[ AK: updated description and minor tweaks. Split out from the original patch ]
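A toy illustration of the effect (not the kernel macros; the 46-bit physical width is only an assumption for the example): stored as-is, a small swap offset looks like a low physical frame number, whereas the inverted form has its high bits set and therefore points above any memory that could be in the L1 cache.

#include <stdio.h>

int main(void)
{
	unsigned long long offset = 0x1234;		/* small swap offset */
	unsigned long long mask = (1ULL << 46) - 1;	/* assume 46 physical bits */

	printf("plain:    %#llx\n", offset & mask);	/* low "address" */
	printf("inverted: %#llx\n", ~offset & mask);	/* close to MAX_PA */
	return 0;
}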
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Andi Kleen ak@linux.intel.com Reviewed-by: Josh Poimboeuf jpoimboe@redhat.com Acked-by: Michal Hocko mhocko@suse.com Acked-by: Vlastimil Babka vbabka@suse.cz Acked-by: Dave Hansen dave.hansen@intel.com [bwh: Backported to 3.16: Bit 9 may be reserved for PAGE_BIT_NUMA here] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/pgtable_64.h | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
--- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -167,7 +167,7 @@ static inline int pgd_large(pgd_t pgd) { * * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names - * | TYPE (59-63) | OFFSET (10-58) | 0 |0|0|X|X| X| X|X|SD|0| <- swp entry + * | TYPE (59-63) | ~OFFSET (10-58) | 0 |0|0|X|X| X| X|X|SD|0| <- swp entry * * G (8) is aliased and used as a PROT_NONE indicator for * !present ptes. We need to start storing swap entries above @@ -180,6 +180,9 @@ static inline int pgd_large(pgd_t pgd) { * * Bit 7 in swp entry should be 0 because pmd_present checks not only P, * but also L and G. + * + * The offset is inverted by a binary not operation to make the high + * physical bits set. */ #define SWP_TYPE_BITS 5
@@ -199,13 +202,15 @@ static inline int pgd_large(pgd_t pgd) { #define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))
/* Shift up (to get rid of type), then down to get value */ -#define __swp_offset(x) ((x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT) +#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
/* * Shift the offset up "too far" by TYPE bits, then down again + * The offset is inverted by a binary not operation to make the high + * physical bits set. */ #define __swp_entry(type, offset) ((swp_entry_t) { \ - ((unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \ + (~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \ | ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill@shutemov.name
commit 1eeda0abf4425c91e7ce3ca32f1908c3a51bf84e upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Geert Uytterhoeven geert@linux-m68k.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/m68k/include/asm/mcf_pgtable.h | 23 ++--------------------- arch/m68k/include/asm/motorola_pgtable.h | 15 --------------- arch/m68k/include/asm/pgtable_no.h | 2 -- arch/m68k/include/asm/sun3_pgtable.h | 15 --------------- 4 files changed, 2 insertions(+), 53 deletions(-)
--- a/arch/m68k/include/asm/mcf_pgtable.h +++ b/arch/m68k/include/asm/mcf_pgtable.h @@ -35,7 +35,6 @@ * hitting hardware. */ #define CF_PAGE_DIRTY 0x00000001 -#define CF_PAGE_FILE 0x00000200 #define CF_PAGE_ACCESSED 0x00001000
#define _PAGE_CACHE040 0x020 /* 68040 cache mode, cachable, copyback */ @@ -243,11 +242,6 @@ static inline int pte_young(pte_t pte) return pte_val(pte) & CF_PAGE_ACCESSED; }
-static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & CF_PAGE_FILE; -} - static inline int pte_special(pte_t pte) { return 0; @@ -391,26 +385,13 @@ static inline void cache_page(void *vadd *ptep = pte_mkcache(*ptep); }
-#define PTE_FILE_MAX_BITS 21 -#define PTE_FILE_SHIFT 11 - -static inline unsigned long pte_to_pgoff(pte_t pte) -{ - return pte_val(pte) >> PTE_FILE_SHIFT; -} - -static inline pte_t pgoff_to_pte(unsigned pgoff) -{ - return __pte((pgoff << PTE_FILE_SHIFT) + CF_PAGE_FILE); -} - /* * Encode and de-code a swap entry (must be !pte_none(e) && !pte_present(e)) */ #define __swp_type(x) ((x).val & 0xFF) -#define __swp_offset(x) ((x).val >> PTE_FILE_SHIFT) +#define __swp_offset(x) ((x).val >> 11) #define __swp_entry(typ, off) ((swp_entry_t) { (typ) | \ - (off << PTE_FILE_SHIFT) }) + (off << 11) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) (__pte((x).val))
--- a/arch/m68k/include/asm/motorola_pgtable.h +++ b/arch/m68k/include/asm/motorola_pgtable.h @@ -28,7 +28,6 @@ #define _PAGE_CHG_MASK (PAGE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_NOCACHE)
#define _PAGE_PROTNONE 0x004 -#define _PAGE_FILE 0x008 /* pagecache or swap? */
#ifndef __ASSEMBLY__
@@ -168,7 +167,6 @@ static inline void pgd_set(pgd_t *pgdp, static inline int pte_write(pte_t pte) { return !(pte_val(pte) & _PAGE_RONLY); } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_RONLY; return pte; } @@ -266,19 +264,6 @@ static inline void cache_page(void *vadd } }
-#define PTE_FILE_MAX_BITS 28 - -static inline unsigned long pte_to_pgoff(pte_t pte) -{ - return pte.pte >> 4; -} - -static inline pte_t pgoff_to_pte(unsigned off) -{ - pte_t pte = { (off << 4) + _PAGE_FILE }; - return pte; -} - /* Encode and de-code a swap entry (must be !pte_none(e) && !pte_present(e)) */ #define __swp_type(x) (((x).val >> 4) & 0xff) #define __swp_offset(x) ((x).val >> 12) --- a/arch/m68k/include/asm/pgtable_no.h +++ b/arch/m68k/include/asm/pgtable_no.h @@ -37,8 +37,6 @@ extern void paging_init(void); #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-static inline int pte_file(pte_t pte) { return 0; } - /* * ZERO_PAGE is a global shared page that is always zero: used * for zero-mapped memory areas etc.. --- a/arch/m68k/include/asm/sun3_pgtable.h +++ b/arch/m68k/include/asm/sun3_pgtable.h @@ -38,8 +38,6 @@ #define _PAGE_PRESENT (SUN3_PAGE_VALID) #define _PAGE_ACCESSED (SUN3_PAGE_ACCESSED)
-#define PTE_FILE_MAX_BITS 28 - /* Compound page protection values. */ //todo: work out which ones *should* have SUN3_PAGE_NOCACHE and fix... // is it just PAGE_KERNEL and PAGE_SHARED? @@ -168,7 +166,6 @@ static inline void pgd_clear (pgd_t *pgd static inline int pte_write(pte_t pte) { return pte_val(pte) & SUN3_PAGE_WRITEABLE; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & SUN3_PAGE_MODIFIED; } static inline int pte_young(pte_t pte) { return pte_val(pte) & SUN3_PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & SUN3_PAGE_ACCESSED; } static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_WRITEABLE; return pte; } @@ -202,18 +199,6 @@ static inline pmd_t *pmd_offset (pgd_t * return (pmd_t *) pgd; }
-static inline unsigned long pte_to_pgoff(pte_t pte) -{ - return pte.pte & SUN3_PAGE_PGNUM_MASK; -} - -static inline pte_t pgoff_to_pte(unsigned off) -{ - pte_t pte = { off + SUN3_PAGE_ACCESSED }; - return pte; -} - - /* Find an entry in the third-level pagetable. */ #define pte_index(address) ((address >> PAGE_SHIFT) & (PTRS_PER_PTE-1)) #define pte_offset_kernel(pmd, address) ((pte_t *) __pmd_page(*pmd) + pte_index(address))
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Jiri Kosina jkosina@suse.cz
commit d66d8ff3d21667b41eddbe86b35ab411e40d8c5f upstream.
__ssb_select_mitigation() returns one of the members of enum ssb_mitigation, not ssb_mitigation_cmd; fix the prototype to reflect that.
Fixes: 24f7fc83b9204 ("x86/bugs: Provide boot parameters for the spec_store_bypass_disable mitigation") Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kernel/cpu/bugs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -505,7 +505,7 @@ static enum ssb_mitigation_cmd __init ss return cmd; }
-static enum ssb_mitigation_cmd __init __ssb_select_mitigation(void) +static enum ssb_mitigation __init __ssb_select_mitigation(void) { enum ssb_mitigation mode = SPEC_STORE_BYPASS_NONE; enum ssb_mitigation_cmd cmd;
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andi Kleen ak@linux.intel.com
commit 17dbca119312b4e8173d4e25ff64262119fcef38 upstream.
L1TF core kernel workarounds are cheap and normally always enabled. However, they should still be reported in sysfs if the system is vulnerable or mitigated. Add the necessary CPU feature/bug bits.
- Extend the existing checks for Meltdowns to determine if the system is vulnerable. All CPUs which are not vulnerable to Meltdown are also not vulnerable to L1TF
- Check for 32bit non PAE and emit a warning as there is no practical way for mitigation due to the limited physical address bits
- If the system has more than MAX_PA/2 physical memory the invert page workarounds don't protect the system against the L1TF attack anymore, because an inverted physical address will also point to valid memory. Print a warning in this case and report that the system is vulnerable.
Add a function which returns the PFN limit for the L1TF mitigation, which will be used in follow up patches for sanity and range checks.
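For reference, the new state can be read back from userspace once the sysfs attribute below is in place; a minimal sketch (path taken from the drivers/base/cpu.c hunk, illustrative only):

#include <stdio.h>

int main(void)
{
	char buf[128];
	FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/l1tf", "r");

	if (!f)
		return 1;		/* attribute absent on unpatched kernels */
	if (fgets(buf, sizeof(buf), f))
		printf("l1tf: %s", buf);	/* e.g. "Mitigation: Page Table Inversion" */
	fclose(f);
	return 0;
}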
[ tglx: Renamed the CPU feature bit to L1TF_PTEINV ] [ dwmw2: Backport to 4.9 (cpufeatures.h, E820) ]
Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Josh Poimboeuf jpoimboe@redhat.com Acked-by: Dave Hansen dave.hansen@intel.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [bwh: Backported to 3.16: - Assign the next available bits from feature word 7 and bug word 0 - CONFIG_PGTABLE_LEVELS is not defined; use other config symbols in the condition - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/cpufeature.h | 3 ++- arch/x86/include/asm/processor.h | 5 ++++ arch/x86/kernel/cpu/bugs.c | 40 ++++++++++++++++++++++++++++++ arch/x86/kernel/cpu/common.c | 20 +++++++++++++++ drivers/base/cpu.c | 8 ++++++ include/linux/cpu.h | 2 ++ 6 files changed, 77 insertions(+), 1 deletion(-)
--- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -199,6 +199,7 @@ #define X86_FEATURE_MSR_SPEC_CTRL (7*32+19) /* "" MSR SPEC_CTRL is implemented */ #define X86_FEATURE_SSBD (7*32+20) /* Speculative Store Bypass Disable */ #define X86_FEATURE_ZEN (7*32+21) /* "" CPU is AMD family 0x17 (Zen) */ +#define X86_FEATURE_L1TF_PTEINV (7*32+22) /* "" L1TF workaround PTE inversion */
#define X86_FEATURE_RETPOLINE (7*32+29) /* "" Generic Retpoline mitigation for Spectre variant 2 */ #define X86_FEATURE_RETPOLINE_AMD (7*32+30) /* "" AMD Retpoline mitigation for Spectre variant 2 */ @@ -271,6 +272,7 @@ #define X86_BUG_SPECTRE_V1 X86_BUG(6) /* CPU is affected by Spectre variant 1 attack with conditional branches */ #define X86_BUG_SPECTRE_V2 X86_BUG(7) /* CPU is affected by Spectre variant 2 attack with indirect branches */ #define X86_BUG_SPEC_STORE_BYPASS X86_BUG(8) /* CPU is affected by speculative store bypass attack */ +#define X86_BUG_L1TF X86_BUG(9) /* CPU is affected by L1 Terminal Fault */
#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
--- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -165,6 +165,11 @@ extern const struct seq_operations cpuin extern void cpu_detect(struct cpuinfo_x86 *c); extern void fpu_detect(struct cpuinfo_x86 *c);
+static inline unsigned long l1tf_pfn_limit(void) +{ + return BIT(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1; +} + extern void early_cpu_init(void); extern void identify_boot_cpu(void); extern void identify_secondary_cpu(struct cpuinfo_x86 *); --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -26,9 +26,11 @@ #include <asm/pgtable.h> #include <asm/cacheflush.h> #include <asm/intel-family.h> +#include <asm/e820.h>
static void __init spectre_v2_select_mitigation(void); static void __init ssb_select_mitigation(void); +static void __init l1tf_select_mitigation(void);
/* * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any @@ -138,6 +140,8 @@ void __init check_bugs(void) */ ssb_select_mitigation();
+ l1tf_select_mitigation(); + #ifdef CONFIG_X86_32 /* * Check whether we are able to run this kernel safely on SMP. @@ -266,6 +270,32 @@ static void x86_amd_ssb_disable(void) wrmsrl(MSR_AMD64_LS_CFG, msrval); }
+static void __init l1tf_select_mitigation(void) +{ + u64 half_pa; + + if (!boot_cpu_has_bug(X86_BUG_L1TF)) + return; + +#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE) + pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n"); + return; +#endif + + /* + * This is extremely unlikely to happen because almost all + * systems have far more MAX_PA/2 than RAM can be fit into + * DIMM slots. + */ + half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT; + if (e820_any_mapped(half_pa, ULLONG_MAX - half_pa, E820_RAM)) { + pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n"); + return; + } + + setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV); +} + #ifdef RETPOLINE static bool spectre_v2_bad_module;
@@ -718,6 +748,11 @@ static ssize_t cpu_show_common(struct de case X86_BUG_SPEC_STORE_BYPASS: return sprintf(buf, "%s\n", ssb_strings[ssb_mode]);
+ case X86_BUG_L1TF: + if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV)) + return sprintf(buf, "Mitigation: Page Table Inversion\n"); + break; + default: break; } @@ -744,4 +779,9 @@ ssize_t cpu_show_spec_store_bypass(struc { return cpu_show_common(dev, attr, buf, X86_BUG_SPEC_STORE_BYPASS); } + +ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *buf) +{ + return cpu_show_common(dev, attr, buf, X86_BUG_L1TF); +} #endif --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -842,6 +842,21 @@ static const __initconst struct x86_cpu_ {} };
+static const __initconst struct x86_cpu_id cpu_no_l1tf[] = { + /* in addition to cpu_no_speculation */ + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT2 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MERRIFIELD }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MOOREFIELD }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_DENVERTON }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GEMINI_LAKE }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM }, + {} +}; + static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c) { u64 ia32_cap = 0; @@ -867,6 +882,11 @@ static void __init cpu_set_bug_bits(stru return;
setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN); + + if (x86_match_cpu(cpu_no_l1tf)) + return; + + setup_force_cpu_bug(X86_BUG_L1TF); }
/* --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -444,16 +444,24 @@ ssize_t __weak cpu_show_spec_store_bypas return sprintf(buf, "Not affected\n"); }
+ssize_t __weak cpu_show_l1tf(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return sprintf(buf, "Not affected\n"); +} + static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL); static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL); static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL); static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL); +static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
static struct attribute *cpu_root_vulnerabilities_attrs[] = { &dev_attr_meltdown.attr, &dev_attr_spectre_v1.attr, &dev_attr_spectre_v2.attr, &dev_attr_spec_store_bypass.attr, + &dev_attr_l1tf.attr, NULL };
--- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -47,6 +47,8 @@ extern ssize_t cpu_show_spectre_v2(struc struct device_attribute *attr, char *buf); extern ssize_t cpu_show_spec_store_bypass(struct device *dev, struct device_attribute *attr, char *buf); +extern ssize_t cpu_show_l1tf(struct device *dev, + struct device_attribute *attr, char *buf);
#ifdef CONFIG_HOTPLUG_CPU extern void unregister_cpu(struct cpu *cpu);
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Borislav Petkov bp@suse.de
commit e7c587da125291db39ddf1f49b18e5970adbac17 upstream.
Intel and AMD have different CPUID bits, hence for those use synthetic bits which get set on the respective vendor's CPUs in init_speculation_control(), so that debacles like what the commit message of
c65732e4f721 ("x86/cpu: Restore CPUID_8000_0008_EBX reload")
talks about don't happen anymore.
Signed-off-by: Borislav Petkov bp@suse.de Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Tested-by: Jörg Otte jrg.otte@gmail.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com Link: https://lkml.kernel.org/r/20180504161815.GG9257@pd.tnic Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [bwh: Backported to 3.16: - Use the next available bit numbers in CPU feature word 7 - Adjust filename, context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/cpufeature.h | 9 ++++++--- arch/x86/kernel/cpu/common.c | 14 ++++++++++---- arch/x86/kvm/cpuid.c | 10 +++++----- arch/x86/kvm/cpuid.h | 4 ++-- 4 files changed, 23 insertions(+), 14 deletions(-)
--- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -193,6 +193,9 @@ #define X86_FEATURE_USE_IBRS_FW (7*32+13) /* "" Use IBRS during runtime firmware calls */ #define X86_FEATURE_SPEC_STORE_BYPASS_DISABLE (7*32+14) /* "" Disable Speculative Store Bypass. */ #define X86_FEATURE_AMD_SSBD (7*32+15) /* "" AMD SSBD implementation */ +#define X86_FEATURE_IBRS (7*32+16) /* Indirect Branch Restricted Speculation */ +#define X86_FEATURE_IBPB (7*32+17) /* Indirect Branch Prediction Barrier */ +#define X86_FEATURE_STIBP (7*32+18) /* Single Thread Indirect Branch Predictors */
#define X86_FEATURE_RETPOLINE (7*32+29) /* "" Generic Retpoline mitigation for Spectre variant 2 */ #define X86_FEATURE_RETPOLINE_AMD (7*32+30) /* "" AMD Retpoline mitigation for Spectre variant 2 */ @@ -246,9 +249,9 @@ #define X86_FEATURE_SSBD (10*32+31) /* Speculative Store Bypass Disable */
/* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 11 */ -#define X86_FEATURE_IBPB (11*32+12) /* Indirect Branch Prediction Barrier */ -#define X86_FEATURE_IBRS (11*32+14) /* Indirect Branch Restricted Speculation */ -#define X86_FEATURE_STIBP (11*32+15) /* Single Thread Indirect Branch Predictors */ +#define X86_FEATURE_AMD_IBPB (11*32+12) /* Indirect Branch Prediction Barrier */ +#define X86_FEATURE_AMD_IBRS (11*32+14) /* Indirect Branch Restricted Speculation */ +#define X86_FEATURE_AMD_STIBP (11*32+15) /* Single Thread Indirect Branch Predictors */
/* * BUG word(s) --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -690,17 +690,23 @@ static void init_speculation_control(str * and they also have a different bit for STIBP support. Also, * a hypervisor might have set the individual AMD bits even on * Intel CPUs, for finer-grained selection of what's available. - * - * We use the AMD bits in 0x8000_0008 EBX as the generic hardware - * features, which are visible in /proc/cpuinfo and used by the - * kernel. So set those accordingly from the Intel bits. */ if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) { set_cpu_cap(c, X86_FEATURE_IBRS); set_cpu_cap(c, X86_FEATURE_IBPB); } + if (cpu_has(c, X86_FEATURE_INTEL_STIBP)) set_cpu_cap(c, X86_FEATURE_STIBP); + + if (cpu_has(c, X86_FEATURE_AMD_IBRS)) + set_cpu_cap(c, X86_FEATURE_IBRS); + + if (cpu_has(c, X86_FEATURE_AMD_IBPB)) + set_cpu_cap(c, X86_FEATURE_IBPB); + + if (cpu_has(c, X86_FEATURE_AMD_STIBP)) + set_cpu_cap(c, X86_FEATURE_STIBP); }
void get_cpu_cap(struct cpuinfo_x86 *c) --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -302,7 +302,7 @@ static inline int __do_cpuid_ent(struct
/* cpuid 0x80000008.ebx */ const u32 kvm_cpuid_8000_0008_ebx_x86_features = - F(IBPB) | F(IBRS); + F(AMD_IBPB) | F(AMD_IBRS);
/* cpuid 0xC0000001.edx */ const u32 kvm_supported_word5_x86_features = @@ -525,10 +525,10 @@ static inline int __do_cpuid_ent(struct entry->eax = g_phys_as | (virt_as << 8); entry->edx = 0; /* IBRS and IBPB aren't necessarily present in hardware cpuid */ - if (boot_cpu_has(X86_FEATURE_IBPB)) - entry->ebx |= F(IBPB); - if (boot_cpu_has(X86_FEATURE_IBRS)) - entry->ebx |= F(IBRS); + if (boot_cpu_has(X86_FEATURE_AMD_IBPB)) + entry->ebx |= F(AMD_IBPB); + if (boot_cpu_has(X86_FEATURE_AMD_IBRS)) + entry->ebx |= F(AMD_IBRS); entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features; cpuid_mask(&entry->ebx, 11); break; --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -109,7 +109,7 @@ static inline bool guest_cpuid_has_ibpb( struct kvm_cpuid_entry2 *best;
best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0); - if (best && (best->ebx & bit(X86_FEATURE_IBPB))) + if (best && (best->ebx & bit(X86_FEATURE_AMD_IBPB))) return true; best = kvm_find_cpuid_entry(vcpu, 7, 0); return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL)); @@ -120,7 +120,7 @@ static inline bool guest_cpuid_has_spec_ struct kvm_cpuid_entry2 *best;
best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0); - if (best && (best->ebx & bit(X86_FEATURE_IBRS))) + if (best && (best->ebx & bit(X86_FEATURE_AMD_IBRS))) return true; best = kvm_find_cpuid_entry(vcpu, 7, 0); return best && (best->edx & (bit(X86_FEATURE_SPEC_CTRL) | bit(X86_FEATURE_SSBD)));
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit 0aa48468d00959c8a37cd3ac727284f4f7359151 upstream.
The X86_FEATURE_SSBD is a synthetic CPU feature - that is, its bit location has no relevance to the real CPUID 0x7.EBX[31] bit position. For that we need the new CPU feature name.
Fixes: 52817587e706 ("x86/cpufeatures: Disentangle SSBD enumeration") Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: kvm@vger.kernel.org Cc: "Radim Krčmář" rkrcmar@redhat.com Cc: "H. Peter Anvin" hpa@zytor.com Cc: Paolo Bonzini pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180521215449.26423-2-konrad.wilk@oracle.com [bwh: Backported to 3.16: - Fix guest_cpuid_has_spec_ctrl() as well - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -318,7 +318,7 @@ static inline int __do_cpuid_ent(struct
/* cpuid 7.0.edx*/ const u32 kvm_cpuid_7_0_edx_x86_features = - F(SPEC_CTRL) | F(SSBD) | F(ARCH_CAPABILITIES); + F(SPEC_CTRL) | F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES);
/* all calls to cpuid_count() should be made on the same cpu */ get_cpu(); --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -123,7 +123,7 @@ static inline bool guest_cpuid_has_spec_ if (best && (best->ebx & bit(X86_FEATURE_AMD_IBRS))) return true; best = kvm_find_cpuid_entry(vcpu, 7, 0); - return best && (best->edx & (bit(X86_FEATURE_SPEC_CTRL) | bit(X86_FEATURE_SSBD))); + return best && (best->edx & (bit(X86_FEATURE_SPEC_CTRL) | bit(X86_FEATURE_SPEC_CTRL_SSBD))); }
static inline bool guest_cpuid_has_arch_capabilities(struct kvm_vcpu *vcpu)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit ccbcd2674472a978b48c91c1fbfb66c0ff959f24 upstream.
AMD is proposing a VIRT_SPEC_CTRL MSR to handle the Speculative Store Bypass Disable via MSR_AMD64_LS_CFG so that guests do not have to care about the bit position of the SSBD bit and thus facilitate migration. Also, the sibling coordination on Family 17H CPUs can only be done on the host.
Extend x86_spec_ctrl_set_guest() and x86_spec_ctrl_restore_host() with an extra argument for the VIRT_SPEC_CTRL MSR.
Hand in 0 from VMX and in SVM add a new virt_spec_ctrl member to the CPU data structure which is going to be used in later patches for the actual implementation.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/spec-ctrl.h | 9 ++++++--- arch/x86/kernel/cpu/bugs.c | 20 ++++++++++++++++++-- arch/x86/kvm/svm.c | 11 +++++++++-- arch/x86/kvm/vmx.c | 5 +++-- 4 files changed, 36 insertions(+), 9 deletions(-)
--- a/arch/x86/include/asm/spec-ctrl.h +++ b/arch/x86/include/asm/spec-ctrl.h @@ -10,10 +10,13 @@ * the guest has, while on VMEXIT we restore the host view. This * would be easier if SPEC_CTRL were architecturally maskable or * shadowable for guests but this is not (currently) the case. - * Takes the guest view of SPEC_CTRL MSR as a parameter. + * Takes the guest view of SPEC_CTRL MSR as a parameter and also + * the guest's version of VIRT_SPEC_CTRL, if emulated. */ -extern void x86_spec_ctrl_set_guest(u64); -extern void x86_spec_ctrl_restore_host(u64); +extern void x86_spec_ctrl_set_guest(u64 guest_spec_ctrl, + u64 guest_virt_spec_ctrl); +extern void x86_spec_ctrl_restore_host(u64 guest_spec_ctrl, + u64 guest_virt_spec_ctrl);
/* AMD specific Speculative Store Bypass MSR data */ extern u64 x86_amd_ls_cfg_base; --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -212,7 +212,15 @@ u64 x86_spec_ctrl_get_default(void) } EXPORT_SYMBOL_GPL(x86_spec_ctrl_get_default);
-void x86_spec_ctrl_set_guest(u64 guest_spec_ctrl) +/** + * x86_spec_ctrl_set_guest - Set speculation control registers for the guest + * @guest_spec_ctrl: The guest content of MSR_SPEC_CTRL + * @guest_virt_spec_ctrl: The guest controlled bits of MSR_VIRT_SPEC_CTRL + * (may get translated to MSR_AMD64_LS_CFG bits) + * + * Avoids writing to the MSR if the content/bits are the same + */ +void x86_spec_ctrl_set_guest(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl) { u64 host = x86_spec_ctrl_base;
@@ -229,7 +237,15 @@ void x86_spec_ctrl_set_guest(u64 guest_s } EXPORT_SYMBOL_GPL(x86_spec_ctrl_set_guest);
-void x86_spec_ctrl_restore_host(u64 guest_spec_ctrl) +/** + * x86_spec_ctrl_restore_host - Restore host speculation control registers + * @guest_spec_ctrl: The guest content of MSR_SPEC_CTRL + * @guest_virt_spec_ctrl: The guest controlled bits of MSR_VIRT_SPEC_CTRL + * (may get translated to MSR_AMD64_LS_CFG bits) + * + * Avoids writing to the MSR if the content/bits are the same + */ +void x86_spec_ctrl_restore_host(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl) { u64 host = x86_spec_ctrl_base;
--- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -148,6 +148,12 @@ struct vcpu_svm { } host;
u64 spec_ctrl; + /* + * Contains guest-controlled bits of VIRT_SPEC_CTRL, which will be + * translated into the appropriate L2_CFG bits on the host to + * perform speculative control. + */ + u64 virt_spec_ctrl;
u32 *msrpm;
@@ -1230,6 +1236,7 @@ static void svm_vcpu_reset(struct kvm_vc u32 eax = 1;
svm->spec_ctrl = 0; + svm->virt_spec_ctrl = 0;
init_vmcb(svm);
@@ -3967,7 +3974,7 @@ static void svm_vcpu_run(struct kvm_vcpu * is no need to worry about the conditional branch over the wrmsr * being speculatively taken. */ - x86_spec_ctrl_set_guest(svm->spec_ctrl); + x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);
asm volatile ( "push %%" _ASM_BP "; \n\t" @@ -4091,7 +4098,7 @@ static void svm_vcpu_run(struct kvm_vcpu if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))) svm->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
- x86_spec_ctrl_restore_host(svm->spec_ctrl); + x86_spec_ctrl_restore_host(svm->spec_ctrl, svm->virt_spec_ctrl);
reload_tss(vcpu);
--- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7540,9 +7540,10 @@ static void __noclone vmx_vcpu_run(struc * is no need to worry about the conditional branch over the wrmsr * being speculatively taken. */ - x86_spec_ctrl_set_guest(vmx->spec_ctrl); + x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);
vmx->__launched = vmx->loaded_vmcs->launched; + asm( /* Store host registers */ "push %%" _ASM_DX "; push %%" _ASM_BP ";" @@ -7673,7 +7674,7 @@ static void __noclone vmx_vcpu_run(struc if (unlikely(!msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL))) vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
- x86_spec_ctrl_restore_host(vmx->spec_ctrl); + x86_spec_ctrl_restore_host(vmx->spec_ctrl, 0);
/* Eliminate branch target predictions from guest mode */ vmexit_fill_RSB();
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit b849a812f7eb92e96d1c8239b06581b2cfd8b275 upstream.
Use PR_SPEC_FORCE_DISABLE in seccomp() because seccomp does not allow widening restrictions.
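A userspace sketch of the one-way semantics seccomp relies on here (constants copied from the prctl interface added earlier in this series; illustrative only): after PR_SPEC_FORCE_DISABLE, an attempt to re-enable speculative store bypass fails with EPERM.

#include <sys/prctl.h>
#include <stdio.h>

#ifndef PR_SET_SPECULATION_CTRL
#define PR_SET_SPECULATION_CTRL	53
#define PR_SPEC_STORE_BYPASS	0
#define PR_SPEC_ENABLE		(1UL << 1)
#define PR_SPEC_FORCE_DISABLE	(1UL << 3)
#endif

int main(void)
{
	if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
		  PR_SPEC_FORCE_DISABLE, 0, 0))
		perror("force disable");

	if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
		  PR_SPEC_ENABLE, 0, 0))
		perror("re-enable");	/* expected to fail with EPERM */

	return 0;
}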
Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ben Hutchings ben@decadent.org.uk --- kernel/seccomp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -217,7 +217,7 @@ static inline void spec_mitigate(struct int state = arch_prctl_spec_ctrl_get(task, which);
if (state > 0 && (state & PR_SPEC_PRCTL)) - arch_prctl_spec_ctrl_set(task, which, PR_SPEC_DISABLE); + arch_prctl_spec_ctrl_set(task, which, PR_SPEC_FORCE_DISABLE); }
static inline void seccomp_assign_mode(unsigned long seccomp_mode)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 780fc5642f59b6c6e2b05794de60b2d2ad5f040e upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Paul Mackerras paulus@samba.org Cc: Michael Ellerman mpe@ellerman.id.au Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/powerpc/include/asm/pgtable-ppc32.h | 9 ++------- arch/powerpc/include/asm/pgtable-ppc64.h | 5 +---- arch/powerpc/include/asm/pgtable.h | 1 - arch/powerpc/include/asm/pte-40x.h | 1 - arch/powerpc/include/asm/pte-44x.h | 5 ----- arch/powerpc/include/asm/pte-8xx.h | 1 - arch/powerpc/include/asm/pte-book3e.h | 1 - arch/powerpc/include/asm/pte-fsl-booke.h | 3 --- arch/powerpc/include/asm/pte-hash32.h | 1 - arch/powerpc/include/asm/pte-hash64.h | 1 - arch/powerpc/mm/pgtable_64.c | 2 +- 11 files changed, 4 insertions(+), 26 deletions(-)
--- a/arch/powerpc/include/asm/pgtable-ppc32.h +++ b/arch/powerpc/include/asm/pgtable-ppc32.h @@ -314,8 +314,8 @@ static inline void __ptep_set_access_fla /* * Encode and decode a swap entry. * Note that the bits we use in a PTE for representing a swap entry - * must not include the _PAGE_PRESENT bit, the _PAGE_FILE bit, or the - *_PAGE_HASHPTE bit (if used). -- paulus + * must not include the _PAGE_PRESENT bit or the _PAGE_HASHPTE bit (if used). + * -- paulus */ #define __swp_type(entry) ((entry).val & 0x1f) #define __swp_offset(entry) ((entry).val >> 5) @@ -323,11 +323,6 @@ static inline void __ptep_set_access_fla #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 3 }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val << 3 })
-/* Encode and decode a nonlinear file mapping entry */ -#define PTE_FILE_MAX_BITS 29 -#define pte_to_pgoff(pte) (pte_val(pte) >> 3) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 3) | _PAGE_FILE }) - /* * No page table caches to initialise */ --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -352,9 +352,6 @@ static inline void __ptep_set_access_fla #define __swp_entry(type, offset) ((swp_entry_t){((type)<< 1)|((offset)<<8)}) #define __pte_to_swp_entry(pte) ((swp_entry_t){pte_val(pte) >> PTE_RPN_SHIFT}) #define __swp_entry_to_pte(x) ((pte_t) { (x).val << PTE_RPN_SHIFT }) -#define pte_to_pgoff(pte) (pte_val(pte) >> PTE_RPN_SHIFT) -#define pgoff_to_pte(off) ((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE}) -#define PTE_FILE_MAX_BITS (BITS_PER_LONG - PTE_RPN_SHIFT)
void pgtable_cache_add(unsigned shift, void (*ctor)(void *)); void pgtable_cache_init(void); @@ -389,7 +386,7 @@ void pgtable_cache_init(void); * The last three bits are intentionally left to zero. This memory location * are also used as normal page PTE pointers. So if we have any pointers * left around while we collapse a hugepage, we need to make sure - * _PAGE_PRESENT and _PAGE_FILE bits of that are zero when we look at them + * _PAGE_PRESENT bit of that is zero when we look at them */ static inline unsigned int hpte_valid(unsigned char *hpte_slot_array, int index) { --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -32,7 +32,6 @@ struct mm_struct; static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; } static inline int pte_none(pte_t pte) { return (pte_val(pte) & ~_PTE_NONE_MASK) == 0; } static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); } --- a/arch/powerpc/include/asm/pte-40x.h +++ b/arch/powerpc/include/asm/pte-40x.h @@ -38,7 +38,6 @@ */
#define _PAGE_GUARDED 0x001 /* G: page is guarded from prefetch */ -#define _PAGE_FILE 0x001 /* when !present: nonlinear file mapping */ #define _PAGE_PRESENT 0x002 /* software: PTE contains a translation */ #define _PAGE_NO_CACHE 0x004 /* I: caching is inhibited */ #define _PAGE_WRITETHRU 0x008 /* W: caching is write-through */ --- a/arch/powerpc/include/asm/pte-44x.h +++ b/arch/powerpc/include/asm/pte-44x.h @@ -44,9 +44,6 @@ * - PRESENT *must* be in the bottom three bits because swap cache * entries use the top 29 bits for TLB2. * - * - FILE *must* be in the bottom three bits because swap cache - * entries use the top 29 bits for TLB2. - * * - CACHE COHERENT bit (M) has no effect on original PPC440 cores, * because it doesn't support SMP. However, some later 460 variants * have -some- form of SMP support and so I keep the bit there for @@ -68,7 +65,6 @@ * * There are three protection bits available for SWAP entry: * _PAGE_PRESENT - * _PAGE_FILE * _PAGE_HASHPTE (if HW has) * * So those three bits have to be inside of 0-2nd LSB of PTE. @@ -77,7 +73,6 @@
#define _PAGE_PRESENT 0x00000001 /* S: PTE valid */ #define _PAGE_RW 0x00000002 /* S: Write permission */ -#define _PAGE_FILE 0x00000004 /* S: nonlinear file mapping */ #define _PAGE_EXEC 0x00000004 /* H: Execute permission */ #define _PAGE_ACCESSED 0x00000008 /* S: Page referenced */ #define _PAGE_DIRTY 0x00000010 /* S: Page dirty */ --- a/arch/powerpc/include/asm/pte-8xx.h +++ b/arch/powerpc/include/asm/pte-8xx.h @@ -29,7 +29,6 @@
/* Definitions for 8xx embedded chips. */ #define _PAGE_PRESENT 0x0001 /* Page is valid */ -#define _PAGE_FILE 0x0002 /* when !present: nonlinear file mapping */ #define _PAGE_NO_CACHE 0x0002 /* I: cache inhibit */ #define _PAGE_SHARED 0x0004 /* No ASID (context) compare */ #define _PAGE_SPECIAL 0x0008 /* SW entry, forced to 0 by the TLB miss */ --- a/arch/powerpc/include/asm/pte-book3e.h +++ b/arch/powerpc/include/asm/pte-book3e.h @@ -10,7 +10,6 @@
/* Architected bits */ #define _PAGE_PRESENT 0x000001 /* software: pte contains a translation */ -#define _PAGE_FILE 0x000002 /* (!present only) software: pte holds file offset */ #define _PAGE_SW1 0x000002 #define _PAGE_BAP_SR 0x000004 #define _PAGE_BAP_UR 0x000008 --- a/arch/powerpc/include/asm/pte-fsl-booke.h +++ b/arch/powerpc/include/asm/pte-fsl-booke.h @@ -13,14 +13,11 @@ - PRESENT *must* be in the bottom three bits because swap cache entries use the top 29 bits.
- - FILE *must* be in the bottom three bits because swap cache - entries use the top 29 bits. */
/* Definitions for FSL Book-E Cores */ #define _PAGE_PRESENT 0x00001 /* S: PTE contains a translation */ #define _PAGE_USER 0x00002 /* S: User page (maps to UR) */ -#define _PAGE_FILE 0x00002 /* S: when !present: nonlinear file mapping */ #define _PAGE_RW 0x00004 /* S: Write permission (SW) */ #define _PAGE_DIRTY 0x00008 /* S: Page dirty */ #define _PAGE_EXEC 0x00010 /* H: SX permission */ --- a/arch/powerpc/include/asm/pte-hash32.h +++ b/arch/powerpc/include/asm/pte-hash32.h @@ -18,7 +18,6 @@
#define _PAGE_PRESENT 0x001 /* software: pte contains a translation */ #define _PAGE_HASHPTE 0x002 /* hash_page has made an HPTE for this pte */ -#define _PAGE_FILE 0x004 /* when !present: nonlinear file mapping */ #define _PAGE_USER 0x004 /* usermode access allowed */ #define _PAGE_GUARDED 0x008 /* G: prohibit speculative access */ #define _PAGE_COHERENT 0x010 /* M: enforce memory coherence (SMP systems) */ --- a/arch/powerpc/include/asm/pte-hash64.h +++ b/arch/powerpc/include/asm/pte-hash64.h @@ -16,7 +16,6 @@ */ #define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ #define _PAGE_USER 0x0002 /* matches one of the PP bits */ -#define _PAGE_FILE 0x0002 /* (!present only) software: pte holds file offset */ #define _PAGE_EXEC 0x0004 /* No execute on POWER4 and newer (we invert) */ #define _PAGE_GUARDED 0x0008 /* We can derive Memory coherence from _PAGE_NO_CACHE */ --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -807,7 +807,7 @@ pmd_t pfn_pmd(unsigned long pfn, pgprot_ { pmd_t pmd; /* - * For a valid pte, we would have _PAGE_PRESENT or _PAGE_FILE always + * For a valid pte, we would have _PAGE_PRESENT always * set. We use this to check THP page at pmd level. * leaf pte for huge page, bottom two bits != 00 */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Tom Lendacky thomas.lendacky@amd.com
commit 11fb0683493b2da112cd64c9dada221b52463bf7 upstream.
Some AMD processors only support a non-architectural means of enabling speculative store bypass disable (SSBD). To allow a simplified view of this to a guest, an architectural definition has been created through a new CPUID bit, 0x80000008_EBX[25], and a new MSR, 0xc001011f. With this, a hypervisor can virtualize the existence of this definition and provide an architectural method for using SSBD to a guest.
Add the new CPUID feature, the new MSR and update the existing SSBD support to use this MSR when present.
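As an illustration (not part of the patch), a minimal userspace sketch that checks the new enumeration; the leaf and bit come from the description above, and actually toggling SSBD via MSR 0xc001011f needs ring 0, so only detection is shown:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
		puts("CPUID leaf 0x80000008 not supported");
		return 1;
	}

	/* 0x80000008 EBX[25] is the architectural VIRT_SSBD bit */
	printf("VIRT_SSBD: %s\n", (ebx & (1u << 25)) ? "present" : "absent");
	return 0;
}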
Signed-off-by: Tom Lendacky thomas.lendacky@amd.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [bwh: Backported to 3.16: - This CPUID word is feature word 11 - Adjust filenames, context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/uapi/asm/msr-index.h | 2 ++ arch/x86/kernel/cpu/bugs.c | 4 +++- arch/x86/kernel/process.c | 13 ++++++++++++- 4 files changed, 18 insertions(+), 2 deletions(-)
--- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -255,6 +255,7 @@ #define X86_FEATURE_AMD_IBPB (11*32+12) /* Indirect Branch Prediction Barrier */ #define X86_FEATURE_AMD_IBRS (11*32+14) /* Indirect Branch Restricted Speculation */ #define X86_FEATURE_AMD_STIBP (11*32+15) /* Single Thread Indirect Branch Predictors */ +#define X86_FEATURE_VIRT_SSBD (11*32+25) /* Virtualized Speculative Store Bypass Disable */
/* * BUG word(s) --- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -225,6 +225,8 @@ #define MSR_AMD64_IBSBRTARGET 0xc001103b #define MSR_AMD64_IBS_REG_COUNT_MAX 8 /* includes MSR_AMD64_IBSBRTARGET */
+#define MSR_AMD64_VIRT_SPEC_CTRL 0xc001011f + /* Fam 16h MSRs */ #define MSR_F16H_L2I_PERF_CTL 0xc0010230 #define MSR_F16H_L2I_PERF_CTR 0xc0010231 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -266,7 +266,9 @@ static void x86_amd_ssb_disable(void) { u64 msrval = x86_amd_ls_cfg_base | x86_amd_ls_cfg_ssbd_mask;
- if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD)) + if (boot_cpu_has(X86_FEATURE_VIRT_SSBD)) + wrmsrl(MSR_AMD64_VIRT_SPEC_CTRL, SPEC_CTRL_SSBD); + else if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD)) wrmsrl(MSR_AMD64_LS_CFG, msrval); }
--- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -326,6 +326,15 @@ static __always_inline void amd_set_core } #endif
+static __always_inline void amd_set_ssb_virt_state(unsigned long tifn) +{ + /* + * SSBD has the same definition in SPEC_CTRL and VIRT_SPEC_CTRL, + * so ssbd_tif_to_spec_ctrl() just works. + */ + wrmsrl(MSR_AMD64_VIRT_SPEC_CTRL, ssbd_tif_to_spec_ctrl(tifn)); +} + static __always_inline void intel_set_ssb_state(unsigned long tifn) { u64 msr = x86_spec_ctrl_base | ssbd_tif_to_spec_ctrl(tifn); @@ -335,7 +344,9 @@ static __always_inline void intel_set_ss
static __always_inline void __speculative_store_bypass_update(unsigned long tifn) { - if (static_cpu_has(X86_FEATURE_LS_CFG_SSBD)) + if (static_cpu_has(X86_FEATURE_VIRT_SSBD)) + amd_set_ssb_virt_state(tifn); + else if (static_cpu_has(X86_FEATURE_LS_CFG_SSBD)) amd_set_core_ssb_state(tifn); else intel_set_ssb_state(tifn);
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andi Kleen ak@linux.intel.com
commit 42e4089c7890725fcd329999252dc489b72f2921 upstream.
For L1TF, PROT_NONE mappings are protected by inverting the PFN in the page table entry. This sets the high bits in the CPU's address space, thus making sure that an unmapped entry does not point to valid cached memory.
Some server system BIOSes put the MMIO mappings high up in the physical address space. If such a high mapping was mapped to unprivileged users, they could attack low memory by setting such a mapping to PROT_NONE. This could happen through a special device driver which is not access protected. Normal /dev/mem is of course access protected.
To avoid this, forbid PROT_NONE mappings or mprotect for high MMIO mappings.
Valid page mappings are allowed because the system is then unsafe anyway.
It's not expected that users commonly use PROT_NONE on MMIO. But to minimize any impact, this is only enforced if the mapping actually refers to a high MMIO address (defined as the MAX_PA-1 bit being set), and the check is also skipped for root.
For mmaps this is straightforward and can be handled in vm_insert_pfn and in remap_pfn_range().
For mprotect it's a bit trickier. At the point where the actual PTEs are accessed, a lot of state has been changed and it would be difficult to undo on an error. Since this is an uncommon case, use a separate early page table walk pass for MMIO PROT_NONE mappings that checks for this condition early. For non-MMIO and non-PROT_NONE mappings there are no changes.
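As an illustration of the user-visible effect (not part of the patch), here is a minimal userspace sketch; the device node and the assumption that it maps MMIO above the MAX_PA-1 limit are hypothetical. On an affected CPU, an unprivileged caller should now see EACCES:

/*
 * Illustrative sketch only: "/dev/example_mmio" is a made-up driver node
 * assumed to map MMIO above the MAX_PA-1 limit.  With this patch an
 * unprivileged process should see mprotect(PROT_NONE) fail with EACCES.
 */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/example_mmio", O_RDWR);	/* hypothetical */
	void *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	if (mprotect(p, 4096, PROT_NONE) < 0)
		printf("mprotect(PROT_NONE): %s (EACCES expected here)\n",
		       strerror(errno));

	munmap(p, 4096);
	close(fd);
	return 0;
}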
[dwmw2: Backport to 4.9] [groeck: Backport to 4.4]
Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Josh Poimboeuf jpoimboe@redhat.com Acked-by: Dave Hansen dave.hansen@intel.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/pgtable.h | 8 ++++++ arch/x86/mm/mmap.c | 21 +++++++++++++++ include/asm-generic/pgtable.h | 12 +++++++++ mm/memory.c | 29 +++++++++++++++----- mm/mprotect.c | 49 ++++++++++++++++++++++++++++++++++ 5 files changed, 112 insertions(+), 7 deletions(-)
--- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -940,6 +940,14 @@ static inline pte_t pte_swp_clear_soft_d } #endif
+#define __HAVE_ARCH_PFN_MODIFY_ALLOWED 1 +extern bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot); + +static inline bool arch_has_pfn_modify_check(void) +{ + return boot_cpu_has_bug(X86_BUG_L1TF); +} + #include <asm-generic/pgtable.h> #endif /* __ASSEMBLY__ */
--- a/arch/x86/mm/mmap.c +++ b/arch/x86/mm/mmap.c @@ -114,3 +114,24 @@ void arch_pick_mmap_layout(struct mm_str mm->get_unmapped_area = arch_get_unmapped_area_topdown; } } + +/* + * Only allow root to set high MMIO mappings to PROT_NONE. + * This prevents an unpriv. user to set them to PROT_NONE and invert + * them, then pointing to valid memory for L1TF speculation. + * + * Note: for locked down kernels may want to disable the root override. + */ +bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot) +{ + if (!boot_cpu_has_bug(X86_BUG_L1TF)) + return true; + if (!__pte_needs_invert(pgprot_val(prot))) + return true; + /* If it's real memory always allow */ + if (pfn_valid(pfn)) + return true; + if (pfn > l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN)) + return false; + return true; +} --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -812,4 +812,16 @@ static inline void pmdp_set_numa(struct #define io_remap_pfn_range remap_pfn_range #endif
+#ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED +static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot) +{ + return true; +} + +static inline bool arch_has_pfn_modify_check(void) +{ + return false; +} +#endif + #endif /* _ASM_GENERIC_PGTABLE_H */ --- a/mm/memory.c +++ b/mm/memory.c @@ -1614,6 +1614,9 @@ int vm_insert_pfn_prot(struct vm_area_st if (track_pfn_insert(vma, &pgprot, pfn)) return -EINVAL;
+ if (!pfn_modify_allowed(pfn, pgprot)) + return -EACCES; + ret = insert_pfn(vma, addr, pfn, pgprot);
return ret; @@ -1632,6 +1635,9 @@ int vm_insert_mixed(struct vm_area_struc if (track_pfn_insert(vma, &pgprot, pfn)) return -EINVAL;
+ if (!pfn_modify_allowed(pfn, pgprot)) + return -EACCES; + /* * If we don't have pte special, then we have to use the pfn_valid() * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must* @@ -1660,6 +1666,7 @@ static int remap_pte_range(struct mm_str { pte_t *pte; spinlock_t *ptl; + int err = 0;
pte = pte_alloc_map_lock(mm, pmd, addr, &ptl); if (!pte) @@ -1667,12 +1674,16 @@ static int remap_pte_range(struct mm_str arch_enter_lazy_mmu_mode(); do { BUG_ON(!pte_none(*pte)); + if (!pfn_modify_allowed(pfn, prot)) { + err = -EACCES; + break; + } set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot))); pfn++; } while (pte++, addr += PAGE_SIZE, addr != end); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(pte - 1, ptl); - return 0; + return err; }
static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, @@ -1681,6 +1692,7 @@ static inline int remap_pmd_range(struct { pmd_t *pmd; unsigned long next; + int err;
pfn -= addr >> PAGE_SHIFT; pmd = pmd_alloc(mm, pud, addr); @@ -1689,9 +1701,10 @@ static inline int remap_pmd_range(struct VM_BUG_ON(pmd_trans_huge(*pmd)); do { next = pmd_addr_end(addr, end); - if (remap_pte_range(mm, pmd, addr, next, - pfn + (addr >> PAGE_SHIFT), prot)) - return -ENOMEM; + err = remap_pte_range(mm, pmd, addr, next, + pfn + (addr >> PAGE_SHIFT), prot); + if (err) + return err; } while (pmd++, addr = next, addr != end); return 0; } @@ -1702,6 +1715,7 @@ static inline int remap_pud_range(struct { pud_t *pud; unsigned long next; + int err;
pfn -= addr >> PAGE_SHIFT; pud = pud_alloc(mm, pgd, addr); @@ -1709,9 +1723,10 @@ static inline int remap_pud_range(struct return -ENOMEM; do { next = pud_addr_end(addr, end); - if (remap_pmd_range(mm, pud, addr, next, - pfn + (addr >> PAGE_SHIFT), prot)) - return -ENOMEM; + err = remap_pmd_range(mm, pud, addr, next, + pfn + (addr >> PAGE_SHIFT), prot); + if (err) + return err; } while (pud++, addr = next, addr != end); return 0; } --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -258,6 +258,42 @@ unsigned long change_protection(struct v return pages; }
+static int prot_none_pte_entry(pte_t *pte, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ? + 0 : -EACCES; +} + +static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask, + unsigned long addr, unsigned long next, + struct mm_walk *walk) +{ + return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ? + 0 : -EACCES; +} + +static int prot_none_test(unsigned long addr, unsigned long next, + struct mm_walk *walk) +{ + return 0; +} + +static int prot_none_walk(struct vm_area_struct *vma, unsigned long start, + unsigned long end, unsigned long newflags) +{ + pgprot_t new_pgprot = vm_get_page_prot(newflags); + struct mm_walk prot_none_walk = { + .pte_entry = prot_none_pte_entry, + .hugetlb_entry = prot_none_hugetlb_entry, + .test_walk = prot_none_test, + .mm = current->mm, + .private = &new_pgprot, + }; + + return walk_page_range(start, end, &prot_none_walk); +} + int mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, unsigned long start, unsigned long end, unsigned long newflags) @@ -276,6 +312,19 @@ mprotect_fixup(struct vm_area_struct *vm }
/* + * Do PROT_NONE PFN permission checks here when we can still + * bail out without undoing a lot of state. This is a rather + * uncommon case, so doesn't need to be very optimized. + */ + if (arch_has_pfn_modify_check() && + (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && + (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) { + error = prot_none_walk(vma, start, end, newflags); + if (error) + return error; + } + + /* * If we make a private mapping writable we increase our commit; * but (without finer accounting) cannot reduce our commit if we * make it unwritable again. hugetlb mapping were accounted for
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Dave Hansen dave.hansen@linux.intel.com
commit 00839ee3b299303c6a5e26a0a2485427a3afcbbf upstream.
This erratum can result in Accessed/Dirty getting set by the hardware when we do not expect them to be (on !Present PTEs).
Instead of trying to fix them up after this happens, we just allow the bits to get set and try to ignore them. We do this by shifting the layout of the bits we use for swap offset/type in our 64-bit PTEs.
It looks like this:
bitnrs: | ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0|
names:  | ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P|
before: |         OFFSET (9-63)          |0|X|X| TYPE(1-5) |0|
after:  | OFFSET (14-63) |  TYPE (9-13)  |0|X|X|X| X| X|X|X|0|
Note that D was already a don't care (X) even before. We just move TYPE up and turn its old spot (which could be hit by the A bit) into all don't cares.
We take 5 bits away from the offset, but that still leaves us with 50 bits which lets us index into a 62-bit swapfile (4 EiB). I think that's probably fine for the moment. We could theoretically reclaim 5 of the bits (1, 2, 3, 4, 7) but it doesn't gain us anything.
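As a worked illustration (not part of the patch), the new encoding can be mimicked in userspace. This assumes the simplest layout from the diagram above (type in bits 9-13, offset from bit 14 up); the NUMA-balancing variant shifts everything up by one bit:

#include <stdio.h>
#include <stdint.h>

#define SWP_TYPE_FIRST_BIT	9
#define SWP_TYPE_BITS		5
#define SWP_OFFSET_FIRST_BIT	(SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)

static uint64_t mk_swp_entry(uint64_t type, uint64_t offset)
{
	return (type << SWP_TYPE_FIRST_BIT) | (offset << SWP_OFFSET_FIRST_BIT);
}

int main(void)
{
	uint64_t e = mk_swp_entry(3, 0x12345);

	printf("type=%llu offset=%#llx\n",
	       (unsigned long long)((e >> SWP_TYPE_FIRST_BIT) &
				    ((1u << SWP_TYPE_BITS) - 1)),
	       (unsigned long long)(e >> SWP_OFFSET_FIRST_BIT));

	/* 64 - 14 = 50 offset bits; 2^50 pages * 4 KiB/page = 2^62 B = 4 EiB */
	return 0;
}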
Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Andy Lutomirski luto@kernel.org Cc: Borislav Petkov bp@alien8.de Cc: Brian Gerst brgerst@gmail.com Cc: Dave Hansen dave@sr71.net Cc: Denys Vlasenko dvlasenk@redhat.com Cc: H. Peter Anvin hpa@zytor.com Cc: Josh Poimboeuf jpoimboe@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Luis R. Rodriguez mcgrof@suse.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Toshi Kani toshi.kani@hp.com Cc: dave.hansen@intel.com Cc: linux-mm@kvack.org Cc: mhocko@suse.com Link: http://lkml.kernel.org/r/20160708001911.9A3FD2B6@viggo.jf.intel.com Signed-off-by: Ingo Molnar mingo@kernel.org [bwh: Backported to 3.16: Bit 9 may be reserved for PAGE_BIT_NUMA, which no longer exists upstream. Adjust the bit numbers accordingly, incorporating commit ace7fab7a6cd "x86/mm: Fix swap entry comment and macro".] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -162,23 +162,37 @@ static inline int pgd_large(pgd_t pgd) { #define pte_offset_map(dir, address) pte_offset_kernel((dir), (address)) #define pte_unmap(pte) ((void)(pte))/* NOP */
-/* Encode and de-code a swap entry */ -#define SWP_TYPE_BITS 5 +/* + * Encode and de-code a swap entry + * + * | ... | 11| 10| 9|8|7|6|5| 4| 3|2|1|0| <- bit number + * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names + * | OFFSET (15->63) | TYPE (10-14) | 0 |0|X|X|X| X| X|X|X|0| <- swp entry + * + * G (8) is aliased and used as a PROT_NONE indicator for + * !present ptes. We need to start storing swap entries above + * there. We also need to avoid using A and D because of an + * erratum where they can be incorrectly set by hardware on + * non-present PTEs. + */ #ifdef CONFIG_NUMA_BALANCING /* Automatic NUMA balancing needs to be distinguishable from swap entries */ -#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 2) +#define SWP_TYPE_FIRST_SHIFT (_PAGE_BIT_PROTNONE + 2) #else -#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 1) +#define SWP_TYPE_FIRST_SHIFT (_PAGE_BIT_PROTNONE + 1) #endif +#define SWP_TYPE_BITS 5 +/* Place the offset above the type: */ +#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)
#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
-#define __swp_type(x) (((x).val >> (_PAGE_BIT_PRESENT + 1)) \ +#define __swp_type(x) (((x).val >> (SWP_TYPE_FIRST_BIT)) \ & ((1U << SWP_TYPE_BITS) - 1)) -#define __swp_offset(x) ((x).val >> SWP_OFFSET_SHIFT) +#define __swp_offset(x) ((x).val >> SWP_OFFSET_FIRST_BIT) #define __swp_entry(type, offset) ((swp_entry_t) { \ - ((type) << (_PAGE_BIT_PRESENT + 1)) \ - | ((offset) << SWP_OFFSET_SHIFT) }) + ((type) << (SWP_TYPE_FIRST_BIT)) \ + | ((offset) << SWP_OFFSET_FIRST_BIT) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) }) #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit 52817587e706686fcdb27f14c1b000c92f266c96 upstream.
The SSBD enumeration is, similarly to the other bits, magically shared between Intel and AMD, though the mechanisms are different.
Make X86_FEATURE_SSBD synthetic and set it depending on the vendor specific features or family dependent setup.
Change the Intel bit to X86_FEATURE_SPEC_CTRL_SSBD to denote that SSBD is controlled via MSR_SPEC_CTRL and fix up the usage sites.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [bwh: Backported to 3.16: - Use the next available bit number in CPU feature word 7 - Adjust filename, context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/cpufeature.h | 5 +++-- arch/x86/kernel/cpu/amd.c | 7 +------ arch/x86/kernel/cpu/bugs.c | 10 +++++----- arch/x86/kernel/cpu/common.c | 3 +++ arch/x86/kernel/cpu/intel.c | 1 + arch/x86/kernel/process.c | 2 +- 6 files changed, 14 insertions(+), 14 deletions(-)
--- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -192,11 +192,12 @@ #define X86_FEATURE_USE_IBPB (7*32+12) /* "" Indirect Branch Prediction Barrier enabled */ #define X86_FEATURE_USE_IBRS_FW (7*32+13) /* "" Use IBRS during runtime firmware calls */ #define X86_FEATURE_SPEC_STORE_BYPASS_DISABLE (7*32+14) /* "" Disable Speculative Store Bypass. */ -#define X86_FEATURE_AMD_SSBD (7*32+15) /* "" AMD SSBD implementation */ +#define X86_FEATURE_LS_CFG_SSBD (7*32+15) /* "" AMD SSBD implementation */ #define X86_FEATURE_IBRS (7*32+16) /* Indirect Branch Restricted Speculation */ #define X86_FEATURE_IBPB (7*32+17) /* Indirect Branch Prediction Barrier */ #define X86_FEATURE_STIBP (7*32+18) /* Single Thread Indirect Branch Predictors */ #define X86_FEATURE_MSR_SPEC_CTRL (7*32+19) /* "" MSR SPEC_CTRL is implemented */ +#define X86_FEATURE_SSBD (7*32+20) /* Speculative Store Bypass Disable */
#define X86_FEATURE_RETPOLINE (7*32+29) /* "" Generic Retpoline mitigation for Spectre variant 2 */ #define X86_FEATURE_RETPOLINE_AMD (7*32+30) /* "" AMD Retpoline mitigation for Spectre variant 2 */ @@ -247,7 +248,7 @@ #define X86_FEATURE_SPEC_CTRL (10*32+26) /* "" Speculation Control (IBRS + IBPB) */ #define X86_FEATURE_INTEL_STIBP (10*32+27) /* "" Single Thread Indirect Branch Predictors */ #define X86_FEATURE_ARCH_CAPABILITIES (10*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */ -#define X86_FEATURE_SSBD (10*32+31) /* Speculative Store Bypass Disable */ +#define X86_FEATURE_SPEC_CTRL_SSBD (10*32+31) /* "" Speculative Store Bypass Disable */
/* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 11 */ #define X86_FEATURE_AMD_IBPB (11*32+12) /* Indirect Branch Prediction Barrier */ --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -486,8 +486,8 @@ static void bsp_init_amd(struct cpuinfo_ * avoid RMW. If that faults, do not enable SSBD. */ if (!rdmsrl_safe(MSR_AMD64_LS_CFG, &x86_amd_ls_cfg_base)) { + setup_force_cpu_cap(X86_FEATURE_LS_CFG_SSBD); setup_force_cpu_cap(X86_FEATURE_SSBD); - setup_force_cpu_cap(X86_FEATURE_AMD_SSBD); x86_amd_ls_cfg_ssbd_mask = 1ULL << bit; } } @@ -801,11 +801,6 @@ static void init_amd(struct cpuinfo_x86 set_cpu_bug(c, X86_BUG_AMD_APIC_C1E);
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy); - - if (boot_cpu_has(X86_FEATURE_AMD_SSBD)) { - set_cpu_cap(c, X86_FEATURE_SSBD); - set_cpu_cap(c, X86_FEATURE_AMD_SSBD); - } }
#ifdef CONFIG_X86_32 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -220,8 +220,8 @@ void x86_spec_ctrl_set_guest(u64 guest_s if (!static_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) return;
- /* Intel controls SSB in MSR_SPEC_CTRL */ - if (static_cpu_has(X86_FEATURE_SPEC_CTRL)) + /* SSBD controlled in MSR_SPEC_CTRL */ + if (static_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD)) host |= ssbd_tif_to_spec_ctrl(current_thread_info()->flags);
if (host != guest_spec_ctrl) @@ -237,8 +237,8 @@ void x86_spec_ctrl_restore_host(u64 gues if (!static_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) return;
- /* Intel controls SSB in MSR_SPEC_CTRL */ - if (static_cpu_has(X86_FEATURE_SPEC_CTRL)) + /* SSBD controlled in MSR_SPEC_CTRL */ + if (static_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD)) host |= ssbd_tif_to_spec_ctrl(current_thread_info()->flags);
if (host != guest_spec_ctrl) @@ -250,7 +250,7 @@ static void x86_amd_ssb_disable(void) { u64 msrval = x86_amd_ls_cfg_base | x86_amd_ls_cfg_ssbd_mask;
- if (boot_cpu_has(X86_FEATURE_AMD_SSBD)) + if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD)) wrmsrl(MSR_AMD64_LS_CFG, msrval); }
--- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -700,6 +700,9 @@ static void init_speculation_control(str if (cpu_has(c, X86_FEATURE_INTEL_STIBP)) set_cpu_cap(c, X86_FEATURE_STIBP);
+ if (cpu_has(c, X86_FEATURE_SPEC_CTRL_SSBD)) + set_cpu_cap(c, X86_FEATURE_SSBD); + if (cpu_has(c, X86_FEATURE_AMD_IBRS)) { set_cpu_cap(c, X86_FEATURE_IBRS); set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL); --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -122,6 +122,7 @@ static void early_init_intel(struct cpui setup_clear_cpu_cap(X86_FEATURE_MSR_SPEC_CTRL); setup_clear_cpu_cap(X86_FEATURE_INTEL_STIBP); setup_clear_cpu_cap(X86_FEATURE_SSBD); + setup_clear_cpu_cap(X86_FEATURE_SPEC_CTRL_SSBD); }
/* --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -221,7 +221,7 @@ static __always_inline void __speculativ { u64 msr;
- if (static_cpu_has(X86_FEATURE_AMD_SSBD)) { + if (static_cpu_has(X86_FEATURE_LS_CFG_SSBD)) { msr = x86_amd_ls_cfg_base | ssbd_tif_to_amd_ls_cfg(tifn); wrmsrl(MSR_AMD64_LS_CFG, msr); } else {
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Kees Cook keescook@chromium.org
commit 7bbf1373e228840bb0295a2ca26d548ef37f448e upstream.
Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than current.
This is needed both for /proc/$pid/status queries and for seccomp (since thread-syncing can trigger seccomp in non-current threads).
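For reference (not part of the patch), a minimal userspace sketch of the prctl interface this plumbing feeds into, run against the calling task; the PR_* values are defined locally on the assumption that installed headers may predate them:

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_SPECULATION_CTRL
#define PR_GET_SPECULATION_CTRL	52
#define PR_SET_SPECULATION_CTRL	53
#define PR_SPEC_STORE_BYPASS	0
#define PR_SPEC_PRCTL		(1UL << 0)
#define PR_SPEC_ENABLE		(1UL << 1)
#define PR_SPEC_DISABLE		(1UL << 2)
#endif

int main(void)
{
	long state;

	if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
		  PR_SPEC_DISABLE, 0, 0) < 0)
		perror("PR_SET_SPECULATION_CTRL");

	state = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);
	printf("ssb state: %ld (PR_SPEC_PRCTL|PR_SPEC_DISABLE = %ld)\n",
	       state, (long)(PR_SPEC_PRCTL | PR_SPEC_DISABLE));
	return 0;
}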
Signed-off-by: Kees Cook keescook@chromium.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kernel/cpu/bugs.c | 27 ++++++++++++++++----------- include/linux/nospec.h | 7 +++++-- kernel/sys.c | 9 +++++---- 3 files changed, 26 insertions(+), 17 deletions(-)
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -568,31 +568,35 @@ static void ssb_select_mitigation()
#undef pr_fmt
-static int ssb_prctl_set(unsigned long ctrl) +static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl) { - bool rds = !!test_tsk_thread_flag(current, TIF_RDS); + bool rds = !!test_tsk_thread_flag(task, TIF_RDS);
if (ssb_mode != SPEC_STORE_BYPASS_PRCTL) return -ENXIO;
if (ctrl == PR_SPEC_ENABLE) - clear_tsk_thread_flag(current, TIF_RDS); + clear_tsk_thread_flag(task, TIF_RDS); else - set_tsk_thread_flag(current, TIF_RDS); + set_tsk_thread_flag(task, TIF_RDS);
- if (rds != !!test_tsk_thread_flag(current, TIF_RDS)) + /* + * If being set on non-current task, delay setting the CPU + * mitigation until it is next scheduled. + */ + if (task == current && rds != !!test_tsk_thread_flag(task, TIF_RDS)) speculative_store_bypass_update();
return 0; }
-static int ssb_prctl_get(void) +static int ssb_prctl_get(struct task_struct *task) { switch (ssb_mode) { case SPEC_STORE_BYPASS_DISABLE: return PR_SPEC_DISABLE; case SPEC_STORE_BYPASS_PRCTL: - if (test_tsk_thread_flag(current, TIF_RDS)) + if (test_tsk_thread_flag(task, TIF_RDS)) return PR_SPEC_PRCTL | PR_SPEC_DISABLE; return PR_SPEC_PRCTL | PR_SPEC_ENABLE; default: @@ -602,24 +606,25 @@ static int ssb_prctl_get(void) } }
-int arch_prctl_spec_ctrl_set(unsigned long which, unsigned long ctrl) +int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which, + unsigned long ctrl) { if (ctrl != PR_SPEC_ENABLE && ctrl != PR_SPEC_DISABLE) return -ERANGE;
switch (which) { case PR_SPEC_STORE_BYPASS: - return ssb_prctl_set(ctrl); + return ssb_prctl_set(task, ctrl); default: return -ENODEV; } }
-int arch_prctl_spec_ctrl_get(unsigned long which) +int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which) { switch (which) { case PR_SPEC_STORE_BYPASS: - return ssb_prctl_get(); + return ssb_prctl_get(task); default: return -ENODEV; } --- a/include/linux/nospec.h +++ b/include/linux/nospec.h @@ -7,6 +7,8 @@ #define _LINUX_NOSPEC_H #include <asm/barrier.h>
+struct task_struct; + /** * array_index_mask_nospec() - generate a ~0 mask when index < size, 0 otherwise * @index: array element index @@ -57,7 +59,8 @@ static inline unsigned long array_index_ })
/* Speculation control prctl */ -int arch_prctl_spec_ctrl_get(unsigned long which); -int arch_prctl_spec_ctrl_set(unsigned long which, unsigned long ctrl); +int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which); +int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which, + unsigned long ctrl);
#endif /* _LINUX_NOSPEC_H */ --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1834,12 +1834,13 @@ static int prctl_get_tid_address(struct } #endif
-int __weak arch_prctl_spec_ctrl_get(unsigned long which) +int __weak arch_prctl_spec_ctrl_get(struct task_struct *t, unsigned long which) { return -EINVAL; }
-int __weak arch_prctl_spec_ctrl_set(unsigned long which, unsigned long ctrl) +int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, + unsigned long ctrl) { return -EINVAL; } @@ -2025,12 +2026,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsi case PR_GET_SPECULATION_CTRL: if (arg3 || arg4 || arg5) return -EINVAL; - error = arch_prctl_spec_ctrl_get(arg2); + error = arch_prctl_spec_ctrl_get(me, arg2); break; case PR_SET_SPECULATION_CTRL: if (arg4 || arg5) return -EINVAL; - error = arch_prctl_spec_ctrl_set(arg2, arg3); + error = arch_prctl_spec_ctrl_set(me, arg2, arg3); break; default: error = -EINVAL;
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 5064c8e19dc215afae8ffae95570e7f22062d49c upstream.
All users are gone.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Arnd Bergmann arnd@arndb.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- include/asm-generic/pgtable.h | 15 --------------- 1 file changed, 15 deletions(-)
--- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -445,21 +445,6 @@ static inline pte_t pte_swp_clear_soft_d { return pte; } - -static inline pte_t pte_file_clear_soft_dirty(pte_t pte) -{ - return pte; -} - -static inline pte_t pte_file_mksoft_dirty(pte_t pte) -{ - return pte; -} - -static inline int pte_file_soft_dirty(pte_t pte) -{ - return 0; -} #endif
#ifndef __HAVE_PFNMAP_TRACKING
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Kees Cook keescook@chromium.org
commit f21b53b20c754021935ea43364dbf53778eeba32 upstream.
Unless explicitly opted out of, anything running under seccomp will have SSB mitigations enabled. Choosing the "prctl" mode will disable this.
[ tglx: Adjusted it to the new arch_seccomp_spec_mitigate() mechanism ]
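As an illustration of the behaviour described above (not part of the patch), a sketch that loads a trivial allow-all seccomp filter and then queries the store-bypass state; in the default "seccomp" mode the task should afterwards report itself force-disabled. Constants are again defined locally in case the installed headers predate them:

#include <stdio.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>

#ifndef PR_GET_SPECULATION_CTRL
#define PR_GET_SPECULATION_CTRL	52
#define PR_SPEC_STORE_BYPASS	0
#endif

int main(void)
{
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) < 0)
		perror("PR_SET_SECCOMP");

	printf("ssb state after seccomp: %ld\n",
	       (long)prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
			   0, 0, 0));
	return 0;
}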
Signed-off-by: Kees Cook keescook@chromium.org Signed-off-by: Thomas Gleixner tglx@linutronix.de [bwh: Backported to 3.16: adjust filename] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- Documentation/kernel-parameters.txt | 26 ++++++++++++++-------- arch/x86/include/asm/nospec-branch.h | 1 + arch/x86/kernel/cpu/bugs.c | 32 ++++++++++++++++++++-------- 3 files changed, 41 insertions(+), 18 deletions(-)
--- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -3218,19 +3218,27 @@ bytes respectively. Such letter suffixes This parameter controls whether the Speculative Store Bypass optimization is used.
- on - Unconditionally disable Speculative Store Bypass - off - Unconditionally enable Speculative Store Bypass - auto - Kernel detects whether the CPU model contains an - implementation of Speculative Store Bypass and - picks the most appropriate mitigation. - prctl - Control Speculative Store Bypass per thread - via prctl. Speculative Store Bypass is enabled - for a process by default. The state of the control - is inherited on fork. + on - Unconditionally disable Speculative Store Bypass + off - Unconditionally enable Speculative Store Bypass + auto - Kernel detects whether the CPU model contains an + implementation of Speculative Store Bypass and + picks the most appropriate mitigation. If the + CPU is not vulnerable, "off" is selected. If the + CPU is vulnerable the default mitigation is + architecture and Kconfig dependent. See below. + prctl - Control Speculative Store Bypass per thread + via prctl. Speculative Store Bypass is enabled + for a process by default. The state of the control + is inherited on fork. + seccomp - Same as "prctl" above, but all seccomp threads + will disable SSB unless they explicitly opt out.
Not specifying this option is equivalent to spec_store_bypass_disable=auto.
+ Default mitigations: + X86: If CONFIG_SECCOMP=y "seccomp", otherwise "prctl" + spia_io_base= [HW,MTD] spia_fio_base= spia_pedr= --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -188,6 +188,7 @@ enum ssb_mitigation { SPEC_STORE_BYPASS_NONE, SPEC_STORE_BYPASS_DISABLE, SPEC_STORE_BYPASS_PRCTL, + SPEC_STORE_BYPASS_SECCOMP, };
extern char __indirect_thunk_start[]; --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -453,22 +453,25 @@ enum ssb_mitigation_cmd { SPEC_STORE_BYPASS_CMD_AUTO, SPEC_STORE_BYPASS_CMD_ON, SPEC_STORE_BYPASS_CMD_PRCTL, + SPEC_STORE_BYPASS_CMD_SECCOMP, };
static const char *ssb_strings[] = { [SPEC_STORE_BYPASS_NONE] = "Vulnerable", [SPEC_STORE_BYPASS_DISABLE] = "Mitigation: Speculative Store Bypass disabled", - [SPEC_STORE_BYPASS_PRCTL] = "Mitigation: Speculative Store Bypass disabled via prctl" + [SPEC_STORE_BYPASS_PRCTL] = "Mitigation: Speculative Store Bypass disabled via prctl", + [SPEC_STORE_BYPASS_SECCOMP] = "Mitigation: Speculative Store Bypass disabled via prctl and seccomp", };
static const struct { const char *option; enum ssb_mitigation_cmd cmd; } ssb_mitigation_options[] = { - { "auto", SPEC_STORE_BYPASS_CMD_AUTO }, /* Platform decides */ - { "on", SPEC_STORE_BYPASS_CMD_ON }, /* Disable Speculative Store Bypass */ - { "off", SPEC_STORE_BYPASS_CMD_NONE }, /* Don't touch Speculative Store Bypass */ - { "prctl", SPEC_STORE_BYPASS_CMD_PRCTL }, /* Disable Speculative Store Bypass via prctl */ + { "auto", SPEC_STORE_BYPASS_CMD_AUTO }, /* Platform decides */ + { "on", SPEC_STORE_BYPASS_CMD_ON }, /* Disable Speculative Store Bypass */ + { "off", SPEC_STORE_BYPASS_CMD_NONE }, /* Don't touch Speculative Store Bypass */ + { "prctl", SPEC_STORE_BYPASS_CMD_PRCTL }, /* Disable Speculative Store Bypass via prctl */ + { "seccomp", SPEC_STORE_BYPASS_CMD_SECCOMP }, /* Disable Speculative Store Bypass via prctl and seccomp */ };
static enum ssb_mitigation_cmd __init ssb_parse_cmdline(void) @@ -518,8 +521,15 @@ static enum ssb_mitigation_cmd __init __
switch (cmd) { case SPEC_STORE_BYPASS_CMD_AUTO: - /* Choose prctl as the default mode */ - mode = SPEC_STORE_BYPASS_PRCTL; + case SPEC_STORE_BYPASS_CMD_SECCOMP: + /* + * Choose prctl+seccomp as the default mode if seccomp is + * enabled. + */ + if (IS_ENABLED(CONFIG_SECCOMP)) + mode = SPEC_STORE_BYPASS_SECCOMP; + else + mode = SPEC_STORE_BYPASS_PRCTL; break; case SPEC_STORE_BYPASS_CMD_ON: mode = SPEC_STORE_BYPASS_DISABLE; @@ -567,12 +577,14 @@ static void ssb_select_mitigation() }
#undef pr_fmt +#define pr_fmt(fmt) "Speculation prctl: " fmt
static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl) { bool update;
- if (ssb_mode != SPEC_STORE_BYPASS_PRCTL) + if (ssb_mode != SPEC_STORE_BYPASS_PRCTL && + ssb_mode != SPEC_STORE_BYPASS_SECCOMP) return -ENXIO;
switch (ctrl) { @@ -620,7 +632,8 @@ int arch_prctl_spec_ctrl_set(struct task #ifdef CONFIG_SECCOMP void arch_seccomp_spec_mitigate(struct task_struct *task) { - ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE); + if (ssb_mode == SPEC_STORE_BYPASS_SECCOMP) + ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE); } #endif
@@ -629,6 +642,7 @@ static int ssb_prctl_get(struct task_str switch (ssb_mode) { case SPEC_STORE_BYPASS_DISABLE: return PR_SPEC_DISABLE; + case SPEC_STORE_BYPASS_SECCOMP: case SPEC_STORE_BYPASS_PRCTL: if (task_spec_ssb_force_disable(task)) return PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE;
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit d83a08db5ba6072caa658745881f4baa9bad6a08 upstream.
Nobody uses it anymore.
[akpm@linux-foundation.org: fix filemap_xip.c] Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Wu Fengguang fengguang.wu@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [bwh: Backported to 3.16: - Deleted code is slightly different - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- fs/9p/vfs_file.c | 2 -- fs/btrfs/file.c | 1 - fs/ceph/addr.c | 1 - fs/cifs/file.c | 1 - fs/ext4/file.c | 1 - fs/f2fs/file.c | 1 - fs/fuse/file.c | 1 - fs/gfs2/file.c | 1 - fs/nfs/file.c | 1 - fs/nilfs2/file.c | 1 - fs/ocfs2/mmap.c | 1 - fs/ubifs/file.c | 1 - fs/xfs/xfs_file.c | 1 - include/linux/fs.h | 6 ------ include/linux/mm.h | 3 --- mm/filemap.c | 1 - mm/filemap_xip.c | 1 - mm/shmem.c | 1 - 18 files changed, 26 deletions(-)
--- a/fs/9p/vfs_file.c +++ b/fs/9p/vfs_file.c @@ -831,7 +831,6 @@ static const struct vm_operations_struct .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = v9fs_vm_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
static const struct vm_operations_struct v9fs_mmap_file_vm_ops = { @@ -839,7 +838,6 @@ static const struct vm_operations_struct .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = v9fs_vm_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
--- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2025,7 +2025,6 @@ static const struct vm_operations_struct .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = btrfs_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma) --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1327,7 +1327,6 @@ out: static struct vm_operations_struct ceph_vmops = { .fault = ceph_filemap_fault, .page_mkwrite = ceph_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
int ceph_mmap(struct file *file, struct vm_area_struct *vma) --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -3111,7 +3111,6 @@ static struct vm_operations_struct cifs_ .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = cifs_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
int cifs_file_strict_mmap(struct file *file, struct vm_area_struct *vma) --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -195,7 +195,6 @@ static const struct vm_operations_struct .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = ext4_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) --- a/fs/f2fs/file.c +++ b/fs/f2fs/file.c @@ -87,7 +87,6 @@ static const struct vm_operations_struct .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = f2fs_vm_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
static int get_parent_ino(struct inode *inode, nid_t *pino) --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -2138,7 +2138,6 @@ static const struct vm_operations_struct .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = fuse_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma) --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -496,7 +496,6 @@ static const struct vm_operations_struct .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = gfs2_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
/** --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -618,7 +618,6 @@ static const struct vm_operations_struct .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = nfs_vm_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
static int nfs_need_sync_write(struct file *filp, struct inode *inode) --- a/fs/nilfs2/file.c +++ b/fs/nilfs2/file.c @@ -136,7 +136,6 @@ static const struct vm_operations_struct .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = nilfs_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
static int nilfs_file_mmap(struct file *file, struct vm_area_struct *vma) --- a/fs/ocfs2/mmap.c +++ b/fs/ocfs2/mmap.c @@ -173,7 +173,6 @@ out: static const struct vm_operations_struct ocfs2_file_vm_ops = { .fault = ocfs2_fault, .page_mkwrite = ocfs2_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
int ocfs2_mmap(struct file *file, struct vm_area_struct *vma) --- a/fs/ubifs/file.c +++ b/fs/ubifs/file.c @@ -1556,7 +1556,6 @@ static const struct vm_operations_struct .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = ubifs_vm_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
static int ubifs_file_mmap(struct file *file, struct vm_area_struct *vma) --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1481,5 +1481,4 @@ static const struct vm_operations_struct .fault = xfs_filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = xfs_filemap_page_mkwrite, - .remap_pages = generic_file_remap_pages, }; --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2430,12 +2430,6 @@ extern int sb_min_blocksize(struct super
extern int generic_file_mmap(struct file *, struct vm_area_struct *); extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); -static inline int generic_file_remap_pages(struct vm_area_struct *vma, - unsigned long addr, unsigned long size, pgoff_t pgoff) -{ - BUG(); - return 0; -} int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk); extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *); extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *); --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -268,9 +268,6 @@ struct vm_operations_struct { int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from, const nodemask_t *to, unsigned long flags); #endif - /* called by sys_remap_file_pages() to populate non-linear mapping */ - int (*remap_pages)(struct vm_area_struct *vma, unsigned long addr, - unsigned long size, pgoff_t pgoff); };
struct mmu_gather; --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2078,7 +2078,6 @@ const struct vm_operations_struct generi .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = filemap_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
/* This is used for a general mmap of a disk file */ --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -306,7 +306,6 @@ out: static const struct vm_operations_struct xip_file_vm_ops = { .fault = xip_file_fault, .page_mkwrite = filemap_page_mkwrite, - .remap_pages = generic_file_remap_pages, };
int xip_file_mmap(struct file * file, struct vm_area_struct * vma) --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2793,7 +2793,6 @@ static const struct vm_operations_struct .set_policy = shmem_set_policy, .get_policy = shmem_get_policy, #endif - .remap_pages = generic_file_remap_pages, };
static struct dentry *shmem_mount(struct file_system_type *fs_type,
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 8b70beac99466b6d164de9fe647b3567e6f17e3a upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/sh/include/asm/pgtable_32.h | 30 ++++-------------------------- arch/sh/include/asm/pgtable_64.h | 9 +-------- 2 files changed, 5 insertions(+), 34 deletions(-)
--- a/arch/sh/include/asm/pgtable_32.h +++ b/arch/sh/include/asm/pgtable_32.h @@ -26,8 +26,6 @@ * and timing control which (together with bit 0) are moved into the * old-style PTEA on the parts that support it. * - * XXX: Leave the _PAGE_FILE and _PAGE_WT overhaul for a rainy day. - * * SH-X2 MMUs and extended PTEs * * SH-X2 supports an extended mode TLB with split data arrays due to the @@ -51,7 +49,6 @@ #define _PAGE_PRESENT 0x100 /* V-bit : page is valid */ #define _PAGE_PROTNONE 0x200 /* software: if not present */ #define _PAGE_ACCESSED 0x400 /* software: page referenced */ -#define _PAGE_FILE _PAGE_WT /* software: pagecache or swap? */ #define _PAGE_SPECIAL 0x800 /* software: special page */
#define _PAGE_SZ_MASK (_PAGE_SZ0 | _PAGE_SZ1) @@ -105,14 +102,13 @@ static inline unsigned long copy_ptea_at /* Mask which drops unused bits from the PTEL value */ #if defined(CONFIG_CPU_SH3) #define _PAGE_CLEAR_FLAGS (_PAGE_PROTNONE | _PAGE_ACCESSED| \ - _PAGE_FILE | _PAGE_SZ1 | \ - _PAGE_HW_SHARED) + _PAGE_SZ1 | _PAGE_HW_SHARED) #elif defined(CONFIG_X2TLB) /* Get rid of the legacy PR/SZ bits when using extended mode */ #define _PAGE_CLEAR_FLAGS (_PAGE_PROTNONE | _PAGE_ACCESSED | \ - _PAGE_FILE | _PAGE_PR_MASK | _PAGE_SZ_MASK) + _PAGE_PR_MASK | _PAGE_SZ_MASK) #else -#define _PAGE_CLEAR_FLAGS (_PAGE_PROTNONE | _PAGE_ACCESSED | _PAGE_FILE) +#define _PAGE_CLEAR_FLAGS (_PAGE_PROTNONE | _PAGE_ACCESSED) #endif
#define _PAGE_FLAGS_HARDWARE_MASK (phys_addr_mask() & ~(_PAGE_CLEAR_FLAGS)) @@ -343,7 +339,6 @@ static inline void set_pte(pte_t *ptep, #define pte_not_present(pte) (!((pte).pte_low & _PAGE_PRESENT)) #define pte_dirty(pte) ((pte).pte_low & _PAGE_DIRTY) #define pte_young(pte) ((pte).pte_low & _PAGE_ACCESSED) -#define pte_file(pte) ((pte).pte_low & _PAGE_FILE) #define pte_special(pte) ((pte).pte_low & _PAGE_SPECIAL)
#ifdef CONFIG_X2TLB @@ -445,7 +440,6 @@ static inline pte_t pte_modify(pte_t pte * Encode and de-code a swap entry * * Constraints: - * _PAGE_FILE at bit 0 * _PAGE_PRESENT at bit 8 * _PAGE_PROTNONE at bit 9 * @@ -453,9 +447,7 @@ static inline pte_t pte_modify(pte_t pte * swap offset into bits 10:30. For the 64-bit PTE case, we keep the * preserved bits in the low 32-bits and use the upper 32 as the swap * offset (along with a 5-bit type), following the same approach as x86 - * PAE. This keeps the logic quite simple, and allows for a full 32 - * PTE_FILE_MAX_BITS, as opposed to the 29-bits we're constrained with - * in the pte_low case. + * PAE. This keeps the logic quite simple. * * As is evident by the Alpha code, if we ever get a 64-bit unsigned * long (swp_entry_t) to match up with the 64-bit PTEs, this all becomes @@ -471,13 +463,6 @@ static inline pte_t pte_modify(pte_t pte #define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high }) #define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val })
-/* - * Encode and decode a nonlinear file mapping entry - */ -#define pte_to_pgoff(pte) ((pte).pte_high) -#define pgoff_to_pte(off) ((pte_t) { _PAGE_FILE, (off) }) - -#define PTE_FILE_MAX_BITS 32 #else #define __swp_type(x) ((x).val & 0xff) #define __swp_offset(x) ((x).val >> 10) @@ -485,13 +470,6 @@ static inline pte_t pte_modify(pte_t pte
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 1 }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val << 1 }) - -/* - * Encode and decode a nonlinear file mapping entry - */ -#define PTE_FILE_MAX_BITS 29 -#define pte_to_pgoff(pte) (pte_val(pte) >> 1) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 1) | _PAGE_FILE }) #endif
#endif /* __ASSEMBLY__ */ --- a/arch/sh/include/asm/pgtable_64.h +++ b/arch/sh/include/asm/pgtable_64.h @@ -107,7 +107,6 @@ static __inline__ void set_pte(pte_t *pt #define _PAGE_DEVICE 0x001 /* CB0: if uncacheable, 1->device (i.e. no write-combining or reordering at bus level) */ #define _PAGE_CACHABLE 0x002 /* CB1: uncachable/cachable */ #define _PAGE_PRESENT 0x004 /* software: page referenced */ -#define _PAGE_FILE 0x004 /* software: only when !present */ #define _PAGE_SIZE0 0x008 /* SZ0-bit : size of page */ #define _PAGE_SIZE1 0x010 /* SZ1-bit : size of page */ #define _PAGE_SHARED 0x020 /* software: reflects PTEH's SH */ @@ -129,7 +128,7 @@ static __inline__ void set_pte(pte_t *pt #define _PAGE_WIRED _PAGE_EXT(0x001) /* software: wire the tlb entry */ #define _PAGE_SPECIAL _PAGE_EXT(0x002)
-#define _PAGE_CLEAR_FLAGS (_PAGE_PRESENT | _PAGE_FILE | _PAGE_SHARED | \ +#define _PAGE_CLEAR_FLAGS (_PAGE_PRESENT | _PAGE_SHARED | \ _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_WIRED)
/* Mask which drops software flags */ @@ -260,7 +259,6 @@ static __inline__ void set_pte(pte_t *pt */ static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; } static inline int pte_special(pte_t pte){ return pte_val(pte) & _PAGE_SPECIAL; }
@@ -304,11 +302,6 @@ static inline pte_t pte_modify(pte_t pte #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-/* Encode and decode a nonlinear file mapping entry */ -#define PTE_FILE_MAX_BITS 29 -#define pte_to_pgoff(pte) (pte_val(pte)) -#define pgoff_to_pte(off) ((pte_t) { (off) | _PAGE_FILE }) - #endif /* !__ASSEMBLY__ */
#define pfn_pte(pfn, prot) __pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot))
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 40171798fe11a6dc1d963058b097b2c4c9d34a9c upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Guan Xuetao gxt@mprc.pku.edu.cn Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/unicore32/include/asm/pgtable-hwdef.h | 1 - arch/unicore32/include/asm/pgtable.h | 14 -------------- 2 files changed, 15 deletions(-)
--- a/arch/unicore32/include/asm/pgtable-hwdef.h +++ b/arch/unicore32/include/asm/pgtable-hwdef.h @@ -44,7 +44,6 @@ #define PTE_TYPE_INVALID (3 << 0)
#define PTE_PRESENT (1 << 2) -#define PTE_FILE (1 << 3) /* only when !PRESENT */ #define PTE_YOUNG (1 << 3) #define PTE_DIRTY (1 << 4) #define PTE_CACHEABLE (1 << 5) --- a/arch/unicore32/include/asm/pgtable.h +++ b/arch/unicore32/include/asm/pgtable.h @@ -283,20 +283,6 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD #define MAX_SWAPFILES_CHECK() \ BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > __SWP_TYPE_BITS)
-/* - * Encode and decode a file entry. File entries are stored in the Linux - * page tables as follows: - * - * 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 - * 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 - * <----------------------- offset ----------------------> 1 0 0 0 - */ -#define pte_file(pte) (pte_val(pte) & PTE_FILE) -#define pte_to_pgoff(x) (pte_val(x) >> 4) -#define pgoff_to_pte(x) __pte(((x) << 4) | PTE_FILE) - -#define PTE_FILE_MAX_BITS 28 - /* Needs to be defined here and not in linux/mm.h, as it is arch dependent */ /* FIXME: this is not correct */ #define kern_addr_valid(addr) (1)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andi Kleen ak@linux.intel.com
commit 0768f91530ff46683e0b372df14fd79fe8d156e5 upstream.
Some cases in THP like:

 - MADV_FREE
 - mprotect
 - split

mark the PMD temporarily non-present to prevent races. The window for an L1TF attack in these contexts is very small, but it wants to be fixed for correctness' sake.
Use the proper low level functions for pmd/pud_mknotpresent() to address this.
Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de [bwh: Backported to 3.16: - Drop change to pud_mknotpresent() - pmd_mknotpresent() does not touch _PAGE_NONE] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/pgtable.h | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-)
--- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -308,11 +308,6 @@ static inline pmd_t pmd_mkwrite(pmd_t pm return pmd_set_flags(pmd, _PAGE_RW); }
-static inline pmd_t pmd_mknotpresent(pmd_t pmd) -{ - return pmd_clear_flags(pmd, _PAGE_PRESENT); -} - static inline pud_t pud_set_flags(pud_t pud, pudval_t set) { pudval_t v = native_pud_val(pud); @@ -393,6 +388,12 @@ static inline pud_t pfn_pud(unsigned lon return __pud(pfn | massage_pgprot(pgprot)); }
+static inline pmd_t pmd_mknotpresent(pmd_t pmd) +{ + return pfn_pmd(pmd_pfn(pmd), + __pgprot(pmd_flags(pmd) & ~_PAGE_PRESENT)); +} + static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Jiri Kosina jkosina@suse.cz
commit 7bb4d366cba992904bffa4820d24e70a3de93e76 upstream.
cpu_show_common() is not used outside of arch/x86/kernel/cpu/bugs.c, so make it static.
Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kernel/cpu/bugs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -677,7 +677,7 @@ void x86_spec_ctrl_setup_ap(void)
#ifdef CONFIG_SYSFS
-ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr, +static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr, char *buf, unsigned int bug) { if (!boot_cpu_has_bug(bug))
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Markus Trippelsdorf markus@trippelsdorf.de
commit 7ebb916782949621ff6819acf373a06902df7679 upstream.
gcc-7 warns:
In file included from arch/x86/tools/relocs_64.c:17:0:
arch/x86/tools/relocs.c: In function ‘process_64’:
arch/x86/tools/relocs.c:953:2: warning: argument 1 null where non-null expected [-Wnonnull]
  qsort(r->offset, r->count, sizeof(r->offset[0]), cmp_relocs);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from arch/x86/tools/relocs.h:6:0,
                 from arch/x86/tools/relocs_64.c:1:
/usr/include/stdlib.h:741:13: note: in a call to function ‘qsort’ declared here
 extern void qsort
This happens because relocs16 is not used for ELF_BITS == 64, so there is no point in trying to sort it.
Make the sort_relocs(&relocs16) call 32bit only.
Signed-off-by: Markus Trippelsdorf markus@trippelsdorf.de Link: http://lkml.kernel.org/r/20161215124513.GA289@x4 Signed-off-by: Thomas Gleixner tglx@linutronix.de [bwh: Backported to 3.16: Also make sort_relocs(&relocs64) conditional, which was done upstream in commit 6d24c5f72dfb "x86-64: Handle PC-relative relocations on per-CPU data".] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- --- a/arch/x86/tools/relocs.c +++ b/arch/x86/tools/relocs.c @@ -984,9 +984,12 @@ static void emit_relocs(int as_text, int die("Segment relocations found but --realmode not specified\n");
/* Order the relocations for more efficient processing */ - sort_relocs(&relocs16); sort_relocs(&relocs32); +#if ELF_BITS == 64 sort_relocs(&relocs64); +#else + sort_relocs(&relocs16); +#endif
/* Print the relocations */ if (as_text) {
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 8d55da810f1fabcf1d4c0bbc46205e5f2c0fa84b upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: "James E.J. Bottomley" jejb@parisc-linux.org Cc: Helge Deller deller@gmx.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/parisc/include/asm/pgtable.h | 10 ---------- 1 file changed, 10 deletions(-)
--- a/arch/parisc/include/asm/pgtable.h +++ b/arch/parisc/include/asm/pgtable.h @@ -146,7 +146,6 @@ extern void purge_tlb_entries(struct mm_ #define _PAGE_GATEWAY_BIT 28 /* (0x008) privilege promotion allowed */ #define _PAGE_DMB_BIT 27 /* (0x010) Data Memory Break enable (B bit) */ #define _PAGE_DIRTY_BIT 26 /* (0x020) Page Dirty (D bit) */ -#define _PAGE_FILE_BIT _PAGE_DIRTY_BIT /* overload this bit */ #define _PAGE_REFTRAP_BIT 25 /* (0x040) Page Ref. Trap enable (T bit) */ #define _PAGE_NO_CACHE_BIT 24 /* (0x080) Uncached Page (U bit) */ #define _PAGE_ACCESSED_BIT 23 /* (0x100) Software: Page Accessed */ @@ -167,13 +166,6 @@ extern void purge_tlb_entries(struct mm_ /* PFN_PTE_SHIFT defines the shift of a PTE value to access the PFN field */ #define PFN_PTE_SHIFT 12
- -/* this is how many bits may be used by the file functions */ -#define PTE_FILE_MAX_BITS (BITS_PER_LONG - PTE_SHIFT) - -#define pte_to_pgoff(pte) (pte_val(pte) >> PTE_SHIFT) -#define pgoff_to_pte(off) ((pte_t) { ((off) << PTE_SHIFT) | _PAGE_FILE }) - #define _PAGE_READ (1 << xlate_pabit(_PAGE_READ_BIT)) #define _PAGE_WRITE (1 << xlate_pabit(_PAGE_WRITE_BIT)) #define _PAGE_RW (_PAGE_READ | _PAGE_WRITE) @@ -186,7 +178,6 @@ extern void purge_tlb_entries(struct mm_ #define _PAGE_ACCESSED (1 << xlate_pabit(_PAGE_ACCESSED_BIT)) #define _PAGE_PRESENT (1 << xlate_pabit(_PAGE_PRESENT_BIT)) #define _PAGE_USER (1 << xlate_pabit(_PAGE_USER_BIT)) -#define _PAGE_FILE (1 << xlate_pabit(_PAGE_FILE_BIT))
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE | _PAGE_DIRTY | _PAGE_ACCESSED) #define _PAGE_CHG_MASK (PAGE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY) @@ -344,7 +335,6 @@ static inline void pgd_clear(pgd_t * pgd static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~_PAGE_DIRTY; return pte; }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andi Kleen ak@linux.intel.com
commit f22cc87f6c1f771b57c407555cfefd811cdd9507 upstream.
For kernel mappings PAGE_PROTNONE is not necessarily set for a non-present mapping, but the inversion logic explicitly checks for !PRESENT and PROT_NONE.
Remove the PROT_NONE check and make the inversion unconditional for all not present mappings.
Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/pgtable-invert.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/include/asm/pgtable-invert.h +++ b/arch/x86/include/asm/pgtable-invert.h @@ -6,7 +6,7 @@
 static inline bool __pte_needs_invert(u64 val)
 {
-	return (val & (_PAGE_PRESENT|_PAGE_PROTNONE)) == _PAGE_PROTNONE;
+	return !(val & _PAGE_PRESENT);
 }
/* Get a mask to xor with the page table entry to get the correct pfn. */
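As a review aid, here is a stand-alone model (not part of the patch) of the rule introduced above: any not-present entry gets its PFN bits inverted, whether or not _PAGE_PROTNONE is set. The constants are simplified stand-ins for the real _PAGE_PRESENT bit and physical page mask.

#include <stdint.h>
#include <stdio.h>

#define PRESENT	 0x1ULL				/* stand-in for _PAGE_PRESENT */
#define PFN_MASK 0x000ffffffffff000ULL		/* stand-in for the physical page mask */

/* After this patch: invert for every not-present entry. */
static int needs_invert(uint64_t val)
{
	return !(val & PRESENT);
}

static uint64_t stored_value(uint64_t val)
{
	return needs_invert(val) ? val ^ PFN_MASK : val;
}

int main(void)
{
	uint64_t not_present = 0x1234000ULL;	/* e.g. a PROT_NONE-style entry */

	printf("logical value %#llx\n", (unsigned long long)not_present);
	printf("stored value  %#llx\n", (unsigned long long)stored_value(not_present));
	/* The stored PFN bits no longer point at real, cacheable memory. */
	return 0;
}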
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Kees Cook keescook@chromium.org
commit da029c11e6b12f321f36dac8771e833b65cec962 upstream.
To avoid pathological stack usage or the need to special-case setuid execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB).
Signed-off-by: Kees Cook keescook@chromium.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [bwh: Backported to 3.16: replaced code is slightly different] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- fs/exec.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-)
--- a/fs/exec.c +++ b/fs/exec.c @@ -206,8 +206,7 @@ static struct page *get_arg_page(struct
if (write) { unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start; - unsigned long ptr_size; - struct rlimit *rlim; + unsigned long ptr_size, limit;
/* * Since the stack will hold pointers to the strings, we @@ -236,14 +235,16 @@ static struct page *get_arg_page(struct return page;
/* - * Limit to 1/4-th the stack size for the argv+env strings. + * Limit to 1/4 of the max stack size or 3/4 of _STK_LIM + * (whichever is smaller) for the argv+env strings. * This ensures that: * - the remaining binfmt code will not run out of stack space, * - the program will have a reasonable amount of stack left * to work from. */ - rlim = current->signal->rlim; - if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4) + limit = _STK_LIM / 4 * 3; + limit = min(limit, rlimit(RLIMIT_STACK) / 4); + if (size > limit) goto fail; }
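To make the new clamp concrete, here is a small stand-alone sketch (values assumed, not taken from the patch) for a default 8 MiB stack limit; the argv+env budget ends up as min(6 MiB, RLIMIT_STACK/4), i.e. 2 MiB in this case.

#include <stdio.h>

#define STK_LIM (8UL << 20)	/* kernel's _STK_LIM on x86: 8 MiB */

int main(void)
{
	unsigned long rlim_stack = 8UL << 20;	/* assumed RLIMIT_STACK soft limit */
	unsigned long limit = STK_LIM / 4 * 3;	/* 6 MiB hard ceiling */

	if (rlim_stack / 4 < limit)
		limit = rlim_stack / 4;		/* 2 MiB with the default rlimit */

	printf("argv+env limited to %lu bytes\n", limit);
	return 0;
}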
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit eb12f4872a3845a8803f689646dea5b92a30aff7 upstream.
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Acked-by: Chris Metcalf cmetcalf@ezchip.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/tile/include/asm/pgtable.h | 11 ----------- arch/tile/mm/homecache.c | 4 ---- 2 files changed, 15 deletions(-)
--- a/arch/tile/include/asm/pgtable.h +++ b/arch/tile/include/asm/pgtable.h @@ -285,17 +285,6 @@ extern void start_mm_caching(struct mm_s extern void check_mm_caching(struct mm_struct *prev, struct mm_struct *next);
/* - * Support non-linear file mappings (see sys_remap_file_pages). - * This is defined by CLIENT1 set but CLIENT0 and _PAGE_PRESENT clear, and the - * file offset in the 32 high bits. - */ -#define _PAGE_FILE HV_PTE_CLIENT1 -#define PTE_FILE_MAX_BITS 32 -#define pte_file(pte) (hv_pte_get_client1(pte) && !hv_pte_get_client0(pte)) -#define pte_to_pgoff(pte) ((pte).val >> 32) -#define pgoff_to_pte(off) ((pte_t) { (((long long)(off)) << 32) | _PAGE_FILE }) - -/* * Encode and de-code a swap entry (see <linux/swapops.h>). * We put the swap file type+offset in the 32 high bits; * I believe we can just leave the low bits clear. --- a/arch/tile/mm/homecache.c +++ b/arch/tile/mm/homecache.c @@ -265,10 +265,6 @@ static int pte_to_home(pte_t pte) /* Update the home of a PTE if necessary (can also be used for a pgprot_t). */ pte_t pte_set_home(pte_t pte, int home) { - /* Check for non-linear file mapping "PTEs" and pass them through. */ - if (pte_file(pte)) - return pte; - #if CHIP_HAS_MMIO() /* Check for MMIO mappings and pass them through. */ if (hv_pte_get_mode(pte) == HV_PTE_MODE_MMIO)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Ben Hutchings ben@decadent.org.uk
I noticed that in the upstream kernel PTI is exposed in /proc/cpuinfo and in the 4.4 and 4.9 stable branches KAISER is exposed similarly, but for some reason the backport to 3.16 hid it. Change to match the other branches.
Signed-off-by: Ben Hutchings ben@decadent.org.uk --- --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -203,7 +203,7 @@ #define X86_FEATURE_RETPOLINE (7*32+29) /* "" Generic Retpoline mitigation for Spectre variant 2 */ #define X86_FEATURE_RETPOLINE_AMD (7*32+30) /* "" AMD Retpoline mitigation for Spectre variant 2 */ /* Because the ALTERNATIVE scheme is for members of the X86_FEATURE club... */ -#define X86_FEATURE_KAISER (7*32+31) /* "" CONFIG_PAGE_TABLE_ISOLATION w/o nokaiser */ +#define X86_FEATURE_KAISER (7*32+31) /* CONFIG_PAGE_TABLE_ISOLATION w/o nokaiser */
/* Virtualization flags: Linux defined, word 8 */ #define X86_FEATURE_TPR_SHADOW (8*32+ 0) /* Intel TPR Shadow */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Tyler Hicks tyhicks@canonical.com
The irda_bind() function allocates memory for self->ias_obj without checking to see if the socket is already bound. A userspace process could repeatedly bind the socket, have each new object added into the LM-IAS database, and lose the reference to the old object assigned to the socket to exhaust memory resources. This patch errors out of the bind operation when self->ias_obj is already assigned.
CVE-2018-6554
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Tyler Hicks tyhicks@canonical.com Reviewed-by: Seth Arnold seth.arnold@canonical.com Reviewed-by: Stefan Bader stefan.bader@canonical.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- net/irda/af_irda.c | 7 +++++++ 1 file changed, 7 insertions(+)
--- a/net/irda/af_irda.c +++ b/net/irda/af_irda.c @@ -786,6 +786,13 @@ static int irda_bind(struct socket *sock return -EINVAL;
 	lock_sock(sk);
+
+	/* Ensure that the socket is not already bound */
+	if (self->ias_obj) {
+		err = -EINVAL;
+		goto out;
+	}
+
 #ifdef CONFIG_IRDA_ULTRA
 	/* Special care for Ultra sockets */
 	if ((sk->sk_type == SOCK_DGRAM) &&
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 3513006a5691ae3629eef9ddef0b71a47c40dfbc upstream.
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Jeff Dike jdike@addtoit.com Cc: Richard Weinberger richard@nod.at Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/um/include/asm/pgtable-2level.h | 9 --------- arch/um/include/asm/pgtable-3level.h | 20 -------------------- arch/um/include/asm/pgtable.h | 9 --------- 3 files changed, 38 deletions(-)
--- a/arch/um/include/asm/pgtable-2level.h +++ b/arch/um/include/asm/pgtable-2level.h @@ -41,13 +41,4 @@ static inline void pgd_mkuptodate(pgd_t #define pfn_pte(pfn, prot) __pte(pfn_to_phys(pfn) | pgprot_val(prot)) #define pfn_pmd(pfn, prot) __pmd(pfn_to_phys(pfn) | pgprot_val(prot))
-/* - * Bits 0 through 4 are taken - */ -#define PTE_FILE_MAX_BITS 27 - -#define pte_to_pgoff(pte) (pte_val(pte) >> 5) - -#define pgoff_to_pte(off) ((pte_t) { ((off) << 5) + _PAGE_FILE }) - #endif --- a/arch/um/include/asm/pgtable-3level.h +++ b/arch/um/include/asm/pgtable-3level.h @@ -112,25 +112,5 @@ static inline pmd_t pfn_pmd(pfn_t page_n return __pmd((page_nr << PAGE_SHIFT) | pgprot_val(pgprot)); }
-/* - * Bits 0 through 3 are taken in the low part of the pte, - * put the 32 bits of offset into the high part. - */ -#define PTE_FILE_MAX_BITS 32 - -#ifdef CONFIG_64BIT - -#define pte_to_pgoff(p) ((p).pte >> 32) - -#define pgoff_to_pte(off) ((pte_t) { ((off) << 32) | _PAGE_FILE }) - -#else - -#define pte_to_pgoff(pte) ((pte).pte_high) - -#define pgoff_to_pte(off) ((pte_t) { _PAGE_FILE, (off) }) - -#endif - #endif
--- a/arch/um/include/asm/pgtable.h +++ b/arch/um/include/asm/pgtable.h @@ -18,7 +18,6 @@ #define _PAGE_ACCESSED 0x080 #define _PAGE_DIRTY 0x100 /* If _PAGE_PRESENT is clear, we use these: */ -#define _PAGE_FILE 0x008 /* nonlinear file mapping, saved PTE; unset:swap */ #define _PAGE_PROTNONE 0x010 /* if the user mapped it with PROT_NONE; pte_present gives true */
@@ -151,14 +150,6 @@ static inline int pte_write(pte_t pte) !(pte_get_bits(pte, _PAGE_PROTNONE))); }
-/* - * The following only works if pte_present() is not true. - */ -static inline int pte_file(pte_t pte) -{ - return pte_get_bits(pte, _PAGE_FILE); -} - static inline int pte_dirty(pte_t pte) { return pte_get_bits(pte, _PAGE_DIRTY);
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Vlastimil Babka vbabka@suse.cz
commit 792adb90fa724ce07c0171cbc96b9215af4b1045 upstream.
The introduction of generic_max_swapfile_size and arch-specific versions has broken linking on x86 with CONFIG_SWAP=n due to undefined reference to 'generic_max_swapfile_size'. Fix it by compiling the x86-specific max_swapfile_size() only with CONFIG_SWAP=y.
Reported-by: Tomas Pruzina pruzinat@gmail.com Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Signed-off-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/mm/init.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -701,6 +701,7 @@ void __init zone_sizes_init(void) free_area_init_nodes(max_zone_pfns); }
+#ifdef CONFIG_SWAP
 unsigned long max_swapfile_size(void)
 {
 	unsigned long pages;
@@ -713,3 +714,4 @@ unsigned long max_swapfile_size(void)
 	}
 	return pages;
 }
+#endif
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 1da4b35b001481df99a6dcab12d5d39a876f7056 upstream.
The code that handles non-linear mappings for /proc/PID/{smaps,clear_refs} is unused now. Let's drop it.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [bwh: Backported to 3.16: - Deleted code is slightly different - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -436,7 +436,6 @@ struct mem_size_stats { unsigned long anonymous; unsigned long anonymous_thp; unsigned long swap; - unsigned long nonlinear; u64 pss; };
@@ -446,7 +445,6 @@ static void smaps_pte_entry(pte_t ptent, { struct mem_size_stats *mss = walk->private; struct vm_area_struct *vma = mss->vma; - pgoff_t pgoff = linear_page_index(vma, addr); struct page *page = NULL; int mapcount;
@@ -459,9 +457,6 @@ static void smaps_pte_entry(pte_t ptent, mss->swap += ptent_size; else if (is_migration_entry(swpent)) page = migration_entry_to_page(swpent); - } else if (pte_file(ptent)) { - if (pte_to_pgoff(ptent) != pgoff) - mss->nonlinear += ptent_size; }
if (!page) @@ -470,9 +465,6 @@ static void smaps_pte_entry(pte_t ptent, if (PageAnon(page)) mss->anonymous += ptent_size;
- if (page->index != pgoff) - mss->nonlinear += ptent_size; - mss->resident += ptent_size; /* Accumulate the size in pages that have been accessed. */ if (pte_young(ptent) || PageReferenced(page)) @@ -554,7 +546,6 @@ static void show_smap_vma_flags(struct s [ilog2(VM_ACCOUNT)] = "ac", [ilog2(VM_NORESERVE)] = "nr", [ilog2(VM_HUGETLB)] = "ht", - [ilog2(VM_NONLINEAR)] = "nl", [ilog2(VM_ARCH_1)] = "ar", [ilog2(VM_DONTDUMP)] = "dd", #ifdef CONFIG_MEM_SOFT_DIRTY @@ -628,10 +619,6 @@ static int show_smap(struct seq_file *m, (vma->vm_flags & VM_LOCKED) ? (unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
- if (vma->vm_flags & VM_NONLINEAR) - seq_printf(m, "Nonlinear: %8lu kB\n", - mss.nonlinear >> 10); - show_smap_vma_flags(m, vma);
if (m->count < m->size) /* vma is copied successfully */ @@ -735,8 +722,6 @@ static inline void clear_soft_dirty(stru ptent = pte_clear_flags(ptent, _PAGE_SOFT_DIRTY); } else if (is_swap_pte(ptent)) { ptent = pte_swp_clear_soft_dirty(ptent); - } else if (pte_file(ptent)) { - ptent = pte_file_clear_soft_dirty(ptent); }
set_pte_at(vma->vm_mm, addr, pte, ptent);
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 636a002b704e0a36cefb5f4cf0293fab858fc46c upstream.
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore.
This patch also increases the number of bits available for the swap offset.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Tony Luck tony.luck@intel.com Cc: Fenghua Yu fenghua.yu@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/ia64/include/asm/pgtable.h | 25 +++++-------------------- 1 file changed, 5 insertions(+), 20 deletions(-)
--- a/arch/ia64/include/asm/pgtable.h +++ b/arch/ia64/include/asm/pgtable.h @@ -57,9 +57,6 @@ #define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */ #define _PAGE_PROTNONE (__IA64_UL(1) << 63)
-/* Valid only for a PTE with the present bit cleared: */ -#define _PAGE_FILE (1 << 1) /* see swap & file pte remarks below */ - #define _PFN_MASK _PAGE_PPN_MASK /* Mask of bits which may be changed by pte_modify(); the odd bits are there for _PAGE_PROTNONE */ #define _PAGE_CHG_MASK (_PAGE_P | _PAGE_PROTNONE | _PAGE_PL_MASK | _PAGE_AR_MASK | _PAGE_ED) @@ -300,7 +297,6 @@ extern unsigned long VMALLOC_END; #define pte_exec(pte) ((pte_val(pte) & _PAGE_AR_RX) != 0) #define pte_dirty(pte) ((pte_val(pte) & _PAGE_D) != 0) #define pte_young(pte) ((pte_val(pte) & _PAGE_A) != 0) -#define pte_file(pte) ((pte_val(pte) & _PAGE_FILE) != 0) #define pte_special(pte) 0
/* @@ -472,27 +468,16 @@ extern void paging_init (void); * * Format of swap pte: * bit 0 : present bit (must be zero) - * bit 1 : _PAGE_FILE (must be zero) - * bits 2- 8: swap-type - * bits 9-62: swap offset - * bit 63 : _PAGE_PROTNONE bit - * - * Format of file pte: - * bit 0 : present bit (must be zero) - * bit 1 : _PAGE_FILE (must be one) - * bits 2-62: file_offset/PAGE_SIZE + * bits 1- 7: swap-type + * bits 8-62: swap offset * bit 63 : _PAGE_PROTNONE bit */ -#define __swp_type(entry) (((entry).val >> 2) & 0x7f) -#define __swp_offset(entry) (((entry).val << 1) >> 10) -#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 2) | ((long) (offset) << 9) }) +#define __swp_type(entry) (((entry).val >> 1) & 0x7f) +#define __swp_offset(entry) (((entry).val << 1) >> 9) +#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 1) | ((long) (offset) << 8) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-#define PTE_FILE_MAX_BITS 61 -#define pte_to_pgoff(pte) ((pte_val(pte) << 1) >> 3) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 2) | _PAGE_FILE }) - /* * ZERO_PAGE is a global shared page that is always zero: used * for zero-mapped memory areas etc..
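A stand-alone round trip of the new encoding (type in bits 1-7, offset from bit 8), written as ordinary user-space C purely for illustration; the shift constants mirror the macros above.

#include <stdint.h>
#include <stdio.h>

static uint64_t swp_entry(uint64_t type, uint64_t offset)
{
	return (type << 1) | (offset << 8);
}

static uint64_t swp_type(uint64_t val)
{
	return (val >> 1) & 0x7f;
}

static uint64_t swp_offset(uint64_t val)
{
	return (val << 1) >> 9;		/* also strips bit 63 (_PAGE_PROTNONE) */
}

int main(void)
{
	uint64_t e = swp_entry(5, 123456);

	printf("type %llu offset %llu\n",
	       (unsigned long long)swp_type(e),
	       (unsigned long long)swp_offset(e));	/* type 5 offset 123456 */
	return 0;
}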
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 2bc6ff14d46745a7728ed4ed90c5e0edca91f52e upstream.
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Steven Miao realmz6@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/blackfin/include/asm/pgtable.h | 5 ----- 1 file changed, 5 deletions(-)
--- a/arch/blackfin/include/asm/pgtable.h +++ b/arch/blackfin/include/asm/pgtable.h @@ -45,11 +45,6 @@ extern void paging_init(void); #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-static inline int pte_file(pte_t pte)
-{
-	return 0;
-}
-
 #define set_pte(pteptr, pteval) (*(pteptr) = pteval)
 #define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit 0270be3e34efb05a88bc4c422572ece038ef3608 upstream.
The upcoming support for the virtual SPEC_CTRL MSR on AMD needs to reuse speculative_store_bypass_update() to avoid code duplication. Add an argument for supplying a thread info (TIF) value and create a wrapper speculative_store_bypass_update_current() which is used at the existing call site.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/spec-ctrl.h | 7 ++++++- arch/x86/kernel/cpu/bugs.c | 2 +- arch/x86/kernel/process.c | 4 ++-- 3 files changed, 9 insertions(+), 4 deletions(-)
--- a/arch/x86/include/asm/spec-ctrl.h +++ b/arch/x86/include/asm/spec-ctrl.h @@ -42,6 +42,11 @@ extern void speculative_store_bypass_ht_ static inline void speculative_store_bypass_ht_init(void) { } #endif
-extern void speculative_store_bypass_update(void); +extern void speculative_store_bypass_update(unsigned long tif); + +static inline void speculative_store_bypass_update_current(void) +{ + speculative_store_bypass_update(current_thread_info()->flags); +}
#endif --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -635,7 +635,7 @@ static int ssb_prctl_set(struct task_str * mitigation until it is next scheduled. */ if (task == current && update) - speculative_store_bypass_update(); + speculative_store_bypass_update_current();
return 0; } --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -352,10 +352,10 @@ static __always_inline void __speculativ intel_set_ssb_state(tifn); }
-void speculative_store_bypass_update(void) +void speculative_store_bypass_update(unsigned long tif) { preempt_disable(); - __speculative_store_bypass_update(current_thread_info()->flags); + __speculative_store_bypass_update(tif); preempt_enable(); }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit d9ecee281b8f89da6d3203be62802eda991e37cc upstream.
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Acked-by: Max Filippov jcmvbkbc@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/xtensa/include/asm/pgtable.h | 10 ---------- 1 file changed, 10 deletions(-)
--- a/arch/xtensa/include/asm/pgtable.h +++ b/arch/xtensa/include/asm/pgtable.h @@ -89,8 +89,6 @@ * (PAGE_NONE)| PPN | 0 | 00 | ADW | 01 | 11 | 11 | * +-----------------------------------------+ * swap | index | type | 01 | 11 | 00 | - * +- - - - - - - - - - - - - - - - - - - - -+ - * file | file offset | 01 | 11 | 10 | * +-----------------------------------------+ * * For T1050 hardware and earlier the layout differs for present and (PAGE_NONE) @@ -111,7 +109,6 @@ * index swap offset / PAGE_SIZE (bit 11-31: 21 bits -> 8 GB) * (note that the index is always non-zero) * type swap type (5 bits -> 32 types) - * file offset 26-bit offset into the file, in increments of PAGE_SIZE * * Notes: * - (PROT_NONE) is a special case of 'present' but causes an exception for @@ -144,7 +141,6 @@ #define _PAGE_HW_VALID 0x00 #define _PAGE_NONE 0x0f #endif -#define _PAGE_FILE (1<<1) /* file mapped page, only if !present */
#define _PAGE_USER (1<<4) /* user access (ring=1) */
@@ -260,7 +256,6 @@ static inline void pgtable_cache_init(vo static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITABLE; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_wrprotect(pte_t pte) @@ -388,11 +383,6 @@ ptep_set_wrprotect(struct mm_struct *mm, #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-#define PTE_FILE_MAX_BITS 26 -#define pte_to_pgoff(pte) (pte_val(pte) >> 6) -#define pgoff_to_pte(off) \ - ((pte_t) { ((off) << 6) | _PAGE_CA_INVALID | _PAGE_FILE | _PAGE_USER }) - #endif /* !defined (__ASSEMBLY__) */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit ffed645e3be0e32f8e9ab068d257aee8d0fe8eec upstream.
Fixes: 7bb4d366c ("x86/bugs: Make cpu_show_common() static") Fixes: 24f7fc83b ("x86/bugs: Provide boot parameters for the spec_store_bypass_disable mitigation") Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kernel/cpu/bugs.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -568,7 +568,7 @@ static enum ssb_mitigation __init __ssb_ return mode; }
-static void ssb_select_mitigation()
+static void ssb_select_mitigation(void)
 {
 	ssb_mode = __ssb_select_mitigation();
@@ -678,7 +678,7 @@ void x86_spec_ctrl_setup_ap(void) #ifdef CONFIG_SYSFS
 static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,
-			   char *buf, unsigned int bug)
+			       char *buf, unsigned int bug)
 {
 	if (!boot_cpu_has_bug(bug))
 		return sprintf(buf, "Not affected\n");
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit ca5bfa7b390017f053d7581bc701518b87bc3d43 upstream.
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore.
This patch also increases the number of bits available for the swap offset.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: David Howells dhowells@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/frv/include/asm/pgtable.h | 27 +++++---------------------- 1 file changed, 5 insertions(+), 22 deletions(-)
--- a/arch/frv/include/asm/pgtable.h +++ b/arch/frv/include/asm/pgtable.h @@ -62,10 +62,6 @@ typedef pte_t *pte_addr_t; #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-#ifndef __ASSEMBLY__ -static inline int pte_file(pte_t pte) { return 0; } -#endif - #define ZERO_PAGE(vaddr) ({ BUG(); NULL; })
#define swapper_pg_dir ((pgd_t *) NULL) @@ -298,7 +294,6 @@ static inline pmd_t *pmd_offset(pud_t *d
#define _PAGE_RESERVED_MASK (xAMPRx_RESERVED8 | xAMPRx_RESERVED13)
-#define _PAGE_FILE 0x002 /* set:pagecache unset:swap */ #define _PAGE_PROTNONE 0x000 /* If not present */
#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY) @@ -463,27 +458,15 @@ static inline pte_t pte_modify(pte_t pte * Handle swap and file entries * - the PTE is encoded in the following format: * bit 0: Must be 0 (!_PAGE_PRESENT) - * bit 1: Type: 0 for swap, 1 for file (_PAGE_FILE) - * bits 2-7: Swap type - * bits 8-31: Swap offset - * bits 2-31: File pgoff - */ -#define __swp_type(x) (((x).val >> 2) & 0x1f) -#define __swp_offset(x) ((x).val >> 8) -#define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 2) | ((offset) << 8) }) + * bits 1-6: Swap type + * bits 7-31: Swap offset + */ +#define __swp_type(x) (((x).val >> 1) & 0x1f) +#define __swp_offset(x) ((x).val >> 7) +#define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 1) | ((offset) << 7) }) #define __pte_to_swp_entry(_pte) ((swp_entry_t) { (_pte).pte }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-static inline int pte_file(pte_t pte) -{ - return pte.pte & _PAGE_FILE; -} - -#define PTE_FILE_MAX_BITS 29 - -#define pte_to_pgoff(PTE) ((PTE).pte >> 2) -#define pgoff_to_pte(off) __pte((off) << 2 | _PAGE_FILE) - /* Needs to be defined here and not in linux/mm.h, as it is arch dependent */ #define PageSkip(page) (0) #define kern_addr_valid(addr) (1)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Tyler Hicks tyhicks@canonical.com
The irda_setsockopt() function conditionally allocates memory for a new self->ias_object or, in some cases, reuses the existing self->ias_object. Existing objects were incorrectly reinserted into the LM_IAS database which corrupted the doubly linked list used for the hashbin implementation of the LM_IAS database. When combined with a memory leak in irda_bind(), this issue could be leveraged to create a use-after-free vulnerability in the hashbin list. This patch fixes the issue by only inserting newly allocated objects into the database.
CVE-2018-6555
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Tyler Hicks tyhicks@canonical.com Reviewed-by: Seth Arnold seth.arnold@canonical.com Reviewed-by: Stefan Bader stefan.bader@canonical.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- net/irda/af_irda.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
--- a/net/irda/af_irda.c
+++ b/net/irda/af_irda.c
@@ -2053,7 +2053,11 @@ static int irda_setsockopt(struct socket
 			err = -EINVAL;
 			goto out;
 		}
-		irias_insert_object(ias_obj);
+
+		/* Only insert newly allocated objects */
+		if (free_ias)
+			irias_insert_object(ias_obj);
+
 		kfree(ias_opt);
 		break;
 	case IRLMP_IAS_DEL:
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andy Whitcroft apw@canonical.com
commit 65eea8edc315589d6c993cf12dbb5d0e9ef1fe4e upstream.
The final field of a floppy_struct is the field "name", which is a pointer to a string in kernel memory. The kernel pointer should not be copied to user memory. The FDGETPRM ioctl copies a floppy_struct to user memory, including this "name" field. This pointer cannot be used by the user and it will leak a kernel address to user-space, which will reveal the location of kernel code and data and undermine KASLR protection.
Model this code on the compat ioctl, which copies the returned data to a previously cleared temporary structure on the stack (excluding the name pointer) and copies out to userspace from there. As we already have an inparam union with an appropriate member, and that memory is already cleared even for read-only calls, make use of that as a temporary store.
Based on an initial patch by Brian Belleville.
CVE-2018-7755 Signed-off-by: Andy Whitcroft apw@canonical.com
Broke up long line.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Ben Hutchings ben@decadent.org.uk --- drivers/block/floppy.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3459,6 +3459,9 @@ static int fd_locked_ioctl(struct block_
 					  (struct floppy_struct **)&outparam);
 		if (ret)
 			return ret;
+		memcpy(&inparam.g, outparam,
+		       offsetof(struct floppy_struct, name));
+		outparam = &inparam.g;
 		break;
 	case FDMSGON:
 		UDP->flags |= FTD_MSG;
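The idea behind the fix, shown with a made-up structure purely for illustration: copying only offsetof(..., name) bytes means the trailing kernel pointer never reaches user space.

#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct demo_geometry {		/* stand-in for struct floppy_struct */
	unsigned int size;
	unsigned int sect;
	unsigned int head;
	const char *name;	/* kernel pointer that must not leak */
};

int main(void)
{
	struct demo_geometry g = { 2880, 18, 2, "H1440" };
	struct demo_geometry out = { 0 };	/* pre-cleared, like inparam */

	memcpy(&out, &g, offsetof(struct demo_geometry, name));
	printf("size=%u sect=%u head=%u name=%p\n",
	       out.size, out.sect, out.head, (void *)out.name);	/* name stays NULL */
	return 0;
}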
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit b816157a5366550c5ee29a6431ba1abb88721266 upstream.
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Richard Henderson rth@twiddle.net Cc: Ivan Kokshaysky ink@jurassic.park.msu.ru Cc: Matt Turner mattst88@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/alpha/include/asm/pgtable.h | 7 ------- 1 file changed, 7 deletions(-)
--- a/arch/alpha/include/asm/pgtable.h +++ b/arch/alpha/include/asm/pgtable.h @@ -73,7 +73,6 @@ struct vm_area_struct; /* .. and these are ours ... */ #define _PAGE_DIRTY 0x20000 #define _PAGE_ACCESSED 0x40000 -#define _PAGE_FILE 0x80000 /* set:pagecache, unset:swap */
/* * NOTE! The "accessed" bit isn't necessarily exact: it can be kept exactly @@ -268,7 +267,6 @@ extern inline void pgd_clear(pgd_t * pgd extern inline int pte_write(pte_t pte) { return !(pte_val(pte) & _PAGE_FOW); } extern inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } extern inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -extern inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } extern inline int pte_special(pte_t pte) { return 0; }
extern inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_FOW; return pte; } @@ -345,11 +343,6 @@ extern inline pte_t mk_swap_pte(unsigned #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-#define pte_to_pgoff(pte) (pte_val(pte) >> 32) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 32) | _PAGE_FILE }) - -#define PTE_FILE_MAX_BITS 32 - #ifndef CONFIG_DISCONTIGMEM #define kern_addr_valid(addr) (1) #endif
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Naoya Horiguchi n-horiguchi@ah.jp.nec.com
commit 0b1fbfe50006c41014cc25660c0e735d21c34939 upstream.
Currently no user of the page table walker sets ->pgd_entry() or ->pud_entry(), so checking for their existence in each loop just wastes CPU cycles. Let's remove them to reduce overhead.
Signed-off-by: Naoya Horiguchi n-horiguchi@ah.jp.nec.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Cyrill Gorcunov gorcunov@openvz.org Cc: Dave Hansen dave.hansen@intel.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Pavel Emelyanov xemul@parallels.com Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [bwh: Backported to 3.16 as dependency of L1TF mitigation] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- include/linux/mm.h | 6 ------ mm/pagewalk.c | 9 ++------- 2 files changed, 2 insertions(+), 13 deletions(-)
--- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1114,8 +1114,6 @@ void unmap_vmas(struct mmu_gather *tlb,
/** * mm_walk - callbacks for walk_page_range - * @pgd_entry: if set, called for each non-empty PGD (top-level) entry - * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry * this handler is required to be able to handle * pmd_trans_huge() pmds. They may simply choose to @@ -1129,10 +1127,6 @@ void unmap_vmas(struct mmu_gather *tlb, * (see walk_page_range for more details) */ struct mm_walk { - int (*pgd_entry)(pgd_t *pgd, unsigned long addr, - unsigned long next, struct mm_walk *walk); - int (*pud_entry)(pud_t *pud, unsigned long addr, - unsigned long next, struct mm_walk *walk); int (*pmd_entry)(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk); int (*pte_entry)(pte_t *pte, unsigned long addr, --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -86,9 +86,7 @@ static int walk_pud_range(pgd_t *pgd, un break; continue; } - if (walk->pud_entry) - err = walk->pud_entry(pud, addr, next, walk); - if (!err && (walk->pmd_entry || walk->pte_entry)) + if (walk->pmd_entry || walk->pte_entry) err = walk_pmd_range(pud, addr, next, walk); if (err) break; @@ -237,10 +235,7 @@ int walk_page_range(unsigned long addr, pgd++; continue; } - if (walk->pgd_entry) - err = walk->pgd_entry(pgd, addr, next, walk); - if (!err && - (walk->pud_entry || walk->pmd_entry || walk->pte_entry)) + if (walk->pmd_entry || walk->pte_entry) err = walk_pud_range(pgd, addr, next, walk); if (err) break;
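For context, a kernel-style sketch (not from this series) of a typical walker: only ->pte_entry is populated, which is exactly why the ->pgd_entry and ->pud_entry hooks can be dropped without touching any caller.

#include <linux/mm.h>

static int count_present_pte(pte_t *pte, unsigned long addr,
			     unsigned long next, struct mm_walk *walk)
{
	unsigned long *count = walk->private;

	if (pte_present(*pte))
		(*count)++;
	return 0;
}

/* Caller must hold mm->mmap_sem for read. */
static unsigned long count_present(struct mm_struct *mm,
				   unsigned long start, unsigned long end)
{
	unsigned long count = 0;
	struct mm_walk walk = {
		.pte_entry	= count_present_pte,
		.mm		= mm,
		.private	= &count,
	};

	walk_page_range(start, end, &walk);
	return count;
}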
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 406b16e26d0996516c8d1641008a7d326bf282d6 upstream.
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/m32r/include/asm/pgtable-2level.h | 4 ---- arch/m32r/include/asm/pgtable.h | 11 ----------- 2 files changed, 15 deletions(-)
--- a/arch/m32r/include/asm/pgtable-2level.h +++ b/arch/m32r/include/asm/pgtable-2level.h @@ -70,9 +70,5 @@ static inline pmd_t *pmd_offset(pgd_t * #define pfn_pte(pfn, prot) __pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot)) #define pfn_pmd(pfn, prot) __pmd(((pfn) << PAGE_SHIFT) | pgprot_val(prot))
-#define PTE_FILE_MAX_BITS 29 -#define pte_to_pgoff(pte) (((pte_val(pte) >> 2) & 0x7f) | (((pte_val(pte) >> 10)) << 7)) -#define pgoff_to_pte(off) ((pte_t) { (((off) & 0x7f) << 2) | (((off) >> 7) << 10) | _PAGE_FILE }) - #endif /* __KERNEL__ */ #endif /* _ASM_M32R_PGTABLE_2LEVEL_H */ --- a/arch/m32r/include/asm/pgtable.h +++ b/arch/m32r/include/asm/pgtable.h @@ -80,8 +80,6 @@ extern unsigned long empty_zero_page[102 */
#define _PAGE_BIT_DIRTY 0 /* software: page changed */ -#define _PAGE_BIT_FILE 0 /* when !present: nonlinear file - mapping */ #define _PAGE_BIT_PRESENT 1 /* Valid: page is valid */ #define _PAGE_BIT_GLOBAL 2 /* Global */ #define _PAGE_BIT_LARGE 3 /* Large */ @@ -93,7 +91,6 @@ extern unsigned long empty_zero_page[102 #define _PAGE_BIT_PROTNONE 9 /* software: if not present */
#define _PAGE_DIRTY (1UL << _PAGE_BIT_DIRTY) -#define _PAGE_FILE (1UL << _PAGE_BIT_FILE) #define _PAGE_PRESENT (1UL << _PAGE_BIT_PRESENT) #define _PAGE_GLOBAL (1UL << _PAGE_BIT_GLOBAL) #define _PAGE_LARGE (1UL << _PAGE_BIT_LARGE) @@ -206,14 +203,6 @@ static inline int pte_write(pte_t pte) return pte_val(pte) & _PAGE_WRITE; }
-/* - * The following only works if pte_present() is not true. - */ -static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & _PAGE_FILE; -} - static inline int pte_special(pte_t pte) { return 0;
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Michal Hocko mhocko@suse.cz
commit e14d7dfb41f5807a0c1c26a13f2b8ef16af24935 upstream.
Jan has noticed that pte_pfn and co. resp. pfn_pte are incorrect for CONFIG_PAE because phys_addr_t is wider than unsigned long, and so the pte_val resp. shift left would get truncated. Fix this up by using proper types.
[dwmw2: Backport to 4.9]
Fixes: 6b28baca9b1f ("x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation") Reported-by: Jan Beulich JBeulich@suse.com Signed-off-by: Michal Hocko mhocko@suse.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [bwh: Backported to 3.16: Adjust context. Also restore the fix to pfn_pud().] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/pgtable.h | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
--- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -147,21 +147,21 @@ static inline u64 protnone_mask(u64 val)
static inline unsigned long pte_pfn(pte_t pte) { - unsigned long pfn = pte_val(pte); + phys_addr_t pfn = pte_val(pte); pfn ^= protnone_mask(pfn); return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT; }
static inline unsigned long pmd_pfn(pmd_t pmd) { - unsigned long pfn = pmd_val(pmd); + phys_addr_t pfn = pmd_val(pmd); pfn ^= protnone_mask(pfn); return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT; }
static inline unsigned long pud_pfn(pud_t pud) { - unsigned long pfn = pud_val(pud); + phys_addr_t pfn = pud_val(pud); pfn ^= protnone_mask(pfn); return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT; } @@ -371,7 +371,7 @@ static inline pgprotval_t massage_pgprot
static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot) { - phys_addr_t pfn = page_nr << PAGE_SHIFT; + phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT; pfn ^= protnone_mask(pgprot_val(pgprot)); pfn &= PTE_PFN_MASK; return __pte(pfn | massage_pgprot(pgprot)); @@ -379,7 +379,7 @@ static inline pte_t pfn_pte(unsigned lon
static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot) { - phys_addr_t pfn = page_nr << PAGE_SHIFT; + phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT; pfn ^= protnone_mask(pgprot_val(pgprot)); pfn &= PTE_PFN_MASK; return __pmd(pfn | massage_pgprot(pgprot)); @@ -387,7 +387,7 @@ static inline pmd_t pfn_pmd(unsigned lon
static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot) { - phys_addr_t pfn = page_nr << PAGE_SHIFT; + phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT; pfn ^= protnone_mask(pgprot_val(pgprot)); pfn &= PTE_PFN_MASK; return __pud(pfn | massage_pgprot(pgprot));
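The bug being fixed is easiest to see in isolation: on a 32-bit PAE kernel, unsigned long is 32 bits while physical addresses can be wider, so shifting the pfn in a 32-bit type silently drops the high bits. A stand-alone sketch with an assumed pfn above 4 GiB:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	uint32_t page_nr = 0x120000;	/* pfn of a page at 4.5 GiB */
	uint32_t truncated = page_nr << PAGE_SHIFT;		/* old behaviour */
	uint64_t correct = (uint64_t)page_nr << PAGE_SHIFT;	/* with the cast */

	printf("truncated %#x, correct %#llx\n",
	       truncated, (unsigned long long)correct);
	return 0;
}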
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Kees Cook keescook@chromium.org
commit 5c3070890d06ff82eecb808d02d2ca39169533ef upstream.
When speculation flaw mitigations are opt-in (via prctl), using seccomp will automatically opt-in to these protections, since using seccomp indicates at least some level of sandboxing is desired.
Signed-off-by: Kees Cook keescook@chromium.org Signed-off-by: Thomas Gleixner tglx@linutronix.de [bwh: Backported to 3.16: - Apply to current task - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- kernel/seccomp.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+)
--- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -16,6 +16,8 @@ #include <linux/atomic.h> #include <linux/audit.h> #include <linux/compat.h> +#include <linux/nospec.h> +#include <linux/prctl.h> #include <linux/sched.h> #include <linux/seccomp.h> #include <linux/syscalls.h> @@ -205,9 +207,24 @@ static inline bool seccomp_may_assign_mo return true; }
+/* + * If a given speculation mitigation is opt-in (prctl()-controlled), + * select it, by disabling speculation (enabling mitigation). + */ +static inline void spec_mitigate(struct task_struct *task, + unsigned long which) +{ + int state = arch_prctl_spec_ctrl_get(task, which); + + if (state > 0 && (state & PR_SPEC_PRCTL)) + arch_prctl_spec_ctrl_set(task, which, PR_SPEC_DISABLE); +} + static inline void seccomp_assign_mode(unsigned long seccomp_mode) { current->seccomp.mode = seccomp_mode; + /* Assume seccomp processes want speculation flaw mitigation. */ + spec_mitigate(current, PR_SPEC_STORE_BYPASS); set_tsk_thread_flag(current, TIF_SECCOMP); }
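The effect can be observed from user space with PR_GET_SPECULATION_CTRL. A minimal sketch, assuming headers new enough to define the speculation prctl constants; on an affected CPU with prctl-based SSBD, the second call should report PR_SPEC_PRCTL | PR_SPEC_DISABLE without any explicit PR_SET_SPECULATION_CTRL:

#include <stdio.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/prctl.h>
#include <linux/seccomp.h>

int main(void)
{
	struct sock_filter allow = BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
	struct sock_fprog prog = { .len = 1, .filter = &allow };

	printf("before seccomp: %d\n",
	       (int)prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0));

	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

	printf("after seccomp:  %d\n",
	       (int)prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0));
	return 0;
}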
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit 56563f53d3066afa9e63d6c997bf67e76a8b05c0 upstream.
The pr_warn in l1tf_select_mitigation would have used the prior pr_fmt which was defined as "Spectre V2 : ".
Move the function to be past SSBD and also define the pr_fmt.
Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf") Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kernel/cpu/bugs.c | 55 ++++++++++++++++++++------------------ 1 file changed, 29 insertions(+), 26 deletions(-)
--- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -270,32 +270,6 @@ static void x86_amd_ssb_disable(void) wrmsrl(MSR_AMD64_LS_CFG, msrval); }
-static void __init l1tf_select_mitigation(void) -{ - u64 half_pa; - - if (!boot_cpu_has_bug(X86_BUG_L1TF)) - return; - -#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE) - pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n"); - return; -#endif - - /* - * This is extremely unlikely to happen because almost all - * systems have far more MAX_PA/2 than RAM can be fit into - * DIMM slots. - */ - half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT; - if (e820_any_mapped(half_pa, ULLONG_MAX - half_pa, E820_RAM)) { - pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n"); - return; - } - - setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV); -} - #ifdef RETPOLINE static bool spectre_v2_bad_module;
@@ -721,6 +695,35 @@ void x86_spec_ctrl_setup_ap(void) x86_amd_ssb_disable(); }
+#undef pr_fmt +#define pr_fmt(fmt) "L1TF: " fmt +static void __init l1tf_select_mitigation(void) +{ + u64 half_pa; + + if (!boot_cpu_has_bug(X86_BUG_L1TF)) + return; + +#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE) + pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n"); + return; +#endif + + /* + * This is extremely unlikely to happen because almost all + * systems have far more MAX_PA/2 than RAM can be fit into + * DIMM slots. + */ + half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT; + if (e820_any_mapped(half_pa, ULLONG_MAX - half_pa, E820_RAM)) { + pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n"); + return; + } + + setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV); +} +#undef pr_fmt + #ifdef CONFIG_SYSFS
static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,
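To illustrate why the placement matters, here is a stand-alone model of the pr_fmt()/pr_warn() interaction (plain C, not kernel code): the prefix is picked up at the point of use, so a function defined above the '#define pr_fmt(fmt) "L1TF: " fmt' line would still print with the Spectre V2 prefix.

#include <stdio.h>

#define pr_warn(fmt, ...) printf(pr_fmt(fmt), ##__VA_ARGS__)

#define pr_fmt(fmt) "Spectre V2 : " fmt
static void warn_before_move(void)
{
	pr_warn("System has more than MAX_PA/2 memory.\n");	/* wrong prefix */
}

#undef pr_fmt
#define pr_fmt(fmt) "L1TF: " fmt
static void warn_after_move(void)
{
	pr_warn("System has more than MAX_PA/2 memory.\n");	/* "L1TF: ..." */
}

int main(void)
{
	warn_before_move();
	warn_after_move();
	return 0;
}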
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 917e401ea75478d4f4575bc8b0ef3d14ecf9ef69 upstream.
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore.
This patch also increases the number of bits available for the swap offset.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Chen Liqin liqin.linux@gmail.com Cc: Lennox Wu lennox.wu@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/score/include/asm/pgtable-bits.h | 1 - arch/score/include/asm/pgtable.h | 18 ++---------------- 2 files changed, 2 insertions(+), 17 deletions(-)
--- a/arch/score/include/asm/pgtable-bits.h +++ b/arch/score/include/asm/pgtable-bits.h @@ -6,7 +6,6 @@ #define _PAGE_WRITE (1<<7) /* implemented in software */ #define _PAGE_PRESENT (1<<9) /* implemented in software */ #define _PAGE_MODIFIED (1<<10) /* implemented in software */ -#define _PAGE_FILE (1<<10)
#define _PAGE_GLOBAL (1<<0) #define _PAGE_VALID (1<<1) --- a/arch/score/include/asm/pgtable.h +++ b/arch/score/include/asm/pgtable.h @@ -90,15 +90,6 @@ static inline void pmd_clear(pmd_t *pmdp ((pte_t *)page_address(pmd_page(*(dir))) + __pte_offset(address)) #define pte_unmap(pte) ((void)(pte))
-/* - * Bits 9(_PAGE_PRESENT) and 10(_PAGE_FILE)are taken, - * split up 30 bits of offset into this range: - */ -#define PTE_FILE_MAX_BITS 30 -#define pte_to_pgoff(_pte) \ - (((_pte).pte & 0x1ff) | (((_pte).pte >> 11) << 9)) -#define pgoff_to_pte(off) \ - ((pte_t) {((off) & 0x1ff) | (((off) >> 9) << 11) | _PAGE_FILE}) #define __pte_to_swp_entry(pte) \ ((swp_entry_t) { pte_val(pte)}) #define __swp_entry_to_pte(x) ((pte_t) {(x).val}) @@ -169,8 +160,8 @@ static inline pgprot_t pgprot_noncached( }
#define __swp_type(x) ((x).val & 0x1f) -#define __swp_offset(x) ((x).val >> 11) -#define __swp_entry(type, offset) ((swp_entry_t){(type) | ((offset) << 11)}) +#define __swp_offset(x) ((x).val >> 10) +#define __swp_entry(type, offset) ((swp_entry_t){(type) | ((offset) << 10)})
extern unsigned long empty_zero_page; extern unsigned long zero_page_mask; @@ -198,11 +189,6 @@ static inline int pte_young(pte_t pte) return pte_val(pte) & _PAGE_ACCESSED; }
-static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & _PAGE_FILE; -} - #define pte_special(pte) (0)
static inline pte_t pte_wrprotect(pte_t pte)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit 15e6c22fd8e5a42c5ed6d487b7c9fe44c2517765 upstream.
svm_vcpu_run() invokes x86_spec_ctrl_restore_host() after VMEXIT, but before the host GS is restored. x86_spec_ctrl_restore_host() uses 'current' to determine the host SSBD state of the thread. 'current' is GS based, but host GS is not yet restored and the access causes a triple fault.
Move the call after the host GS restore.
Fixes: 885f82bfbc6f ("x86/process: Allow runtime control of Speculative Store Bypass") Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Acked-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kvm/svm.c | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-)
--- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -4061,6 +4061,18 @@ static void svm_vcpu_run(struct kvm_vcpu #endif );
+ /* Eliminate branch target predictions from guest mode */ + vmexit_fill_RSB(); + +#ifdef CONFIG_X86_64 + wrmsrl(MSR_GS_BASE, svm->host.gs_base); +#else + loadsegment(fs, svm->host.fs); +#ifndef CONFIG_X86_32_LAZY_GS + loadsegment(gs, svm->host.gs); +#endif +#endif + /* * We do not use IBRS in the kernel. If this vCPU has used the * SPEC_CTRL MSR it may have left it on; save the value and @@ -4081,18 +4093,6 @@ static void svm_vcpu_run(struct kvm_vcpu
x86_spec_ctrl_restore_host(svm->spec_ctrl);
- /* Eliminate branch target predictions from guest mode */ - vmexit_fill_RSB(); - -#ifdef CONFIG_X86_64 - wrmsrl(MSR_GS_BASE, svm->host.gs_base); -#else - loadsegment(fs, svm->host.fs); -#ifndef CONFIG_X86_32_LAZY_GS - loadsegment(gs, svm->host.gs); -#endif -#endif - reload_tss(vcpu);
local_irq_disable();
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit e96f46ee8587607a828f783daa6eb5b44d25004d upstream.
The style for the 'status' file is CamelCase or this_one.
Fixes: fae1fa0fc ("proc: Provide details on speculation flaw mitigations") Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- fs/proc/array.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -327,7 +327,7 @@ static inline void task_seccomp(struct s
 #ifdef CONFIG_SECCOMP
 	seq_printf(m, "Seccomp:\t%d\n", p->seccomp.mode);
 #endif
-	seq_printf(m, "Speculation Store Bypass:\t");
+	seq_printf(m, "Speculation_Store_Bypass:\t");
 	switch (arch_prctl_spec_ctrl_get(p, PR_SPEC_STORE_BYPASS)) {
 	case -EINVAL:
 		seq_printf(m, "unknown");
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Ben Hutchings ben@decadent.org.uk
These are extracted from commit a00cc7d9dd93 "mm, x86: add support for PUD-sized transparent hugepages" and will be used by later patches.
Signed-off-by: Ben Hutchings ben@decadent.org.uk ---
--- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -303,6 +303,25 @@ static inline pmd_t pmd_mknotpresent(pmd return pmd_clear_flags(pmd, _PAGE_PRESENT); }
+static inline pud_t pud_set_flags(pud_t pud, pudval_t set) +{ + pudval_t v = native_pud_val(pud); + + return __pud(v | set); +} + +static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear) +{ + pudval_t v = native_pud_val(pud); + + return __pud(v & ~clear); +} + +static inline pud_t pud_mkhuge(pud_t pud) +{ + return pud_set_flags(pud, _PAGE_PSE); +} + #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY static inline int pte_soft_dirty(pte_t pte) { @@ -352,6 +371,12 @@ static inline pmd_t pfn_pmd(unsigned lon massage_pgprot(pgprot)); }
+static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot) +{ + return __pud(((phys_addr_t)page_nr << PAGE_SHIFT) | + massage_pgprot(pgprot)); +} + static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { pteval_t val = pte_val(pte);
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Juergen Gross jgross@suse.com
commit 74899d92e66663dc7671a8017b3146dcd4735f3b upstream.
Commit:
1f50ddb4f418 ("x86/speculation: Handle HT correctly on AMD")
... added speculative_store_bypass_ht_init() to the per-CPU initialization sequence.
speculative_store_bypass_ht_init() needs to be called on each CPU for PV guests, too.
Reported-by: Brian Woods brian.woods@amd.com Tested-by: Brian Woods brian.woods@amd.com Signed-off-by: Juergen Gross jgross@suse.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: boris.ostrovsky@oracle.com Cc: xen-devel@lists.xenproject.org Fixes: 1f50ddb4f4189243c05926b842dc1a0332195f31 ("x86/speculation: Handle HT correctly on AMD") Link: https://lore.kernel.org/lkml/20180621084331.21228-1-jgross@suse.com Signed-off-by: Ingo Molnar mingo@kernel.org [bwh: Backported to 3.16: adjust filename, context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/xen/smp_pv.c | 5 +++++ 1 file changed, 5 insertions(+)
--- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -27,6 +27,7 @@ #include <xen/interface/xen.h> #include <xen/interface/vcpu.h>
+#include <asm/spec-ctrl.h> #include <asm/xen/interface.h> #include <asm/xen/hypercall.h>
@@ -83,6 +84,8 @@ static void cpu_bringup(void) cpu_data(cpu).x86_max_cores = 1; set_cpu_sibling_map(cpu);
+ speculative_store_bypass_ht_init(); + xen_setup_cpu_clockevents();
notify_cpu_starting(cpu); @@ -334,6 +337,8 @@ static void __init xen_smp_prepare_cpus( } set_cpu_sibling_map(0);
+ speculative_store_bypass_ht_init(); + if (xen_smp_intr_init(0)) BUG();
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit d1035d971829dcf80e8686ccde26f94b0a069472 upstream.
Add a ZEN feature bit so family-dependent static_cpu_has() optimizations can be built for ZEN.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [bwh: Backported to 3.16: - Use the next available bit number in CPU feature word 7 - Adjust filename, context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/cpufeature.h | 1 + arch/x86/kernel/cpu/amd.c | 1 + 2 files changed, 2 insertions(+)
--- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -198,6 +198,7 @@ #define X86_FEATURE_STIBP (7*32+18) /* Single Thread Indirect Branch Predictors */ #define X86_FEATURE_MSR_SPEC_CTRL (7*32+19) /* "" MSR SPEC_CTRL is implemented */ #define X86_FEATURE_SSBD (7*32+20) /* Speculative Store Bypass Disable */ +#define X86_FEATURE_ZEN (7*32+21) /* "" CPU is AMD family 0x17 (Zen) */
#define X86_FEATURE_RETPOLINE (7*32+29) /* "" Generic Retpoline mitigation for Spectre variant 2 */ #define X86_FEATURE_RETPOLINE_AMD (7*32+30) /* "" AMD Retpoline mitigation for Spectre variant 2 */ --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -556,6 +556,7 @@ static void init_amd_ln(struct cpuinfo_x
 static void init_amd_zn(struct cpuinfo_x86 *c)
 {
+	set_cpu_cap(c, X86_FEATURE_ZEN);
 	/*
 	 * Fix erratum 1076: CPB feature bit not being set in CPUID. It affects
 	 * all up to and including B1.
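As an illustration only (not code from this series), the kind of check the synthetic bit enables; static_cpu_has() is patched to a constant branch at boot, so a family test like this costs nothing on the fast path:

static void example_family_quirk(void)
{
	if (static_cpu_has(X86_FEATURE_ZEN))
		pr_info("applying Zen-specific handling\n");
	else
		pr_info("applying pre-Zen handling\n");
}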
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Linus Torvalds torvalds@linux-foundation.org
commit 1aa7a5735a41418d8e01fa7c9565eb2657e2ea3f upstream.
The macro is not type safe and I did look for why that "g" constraint for the asm doesn't work: it's because the asm is more fundamentally wrong.
It does
movl %[val], %%eax
but "val" isn't a 32-bit value, so then gcc will pass it in a register, and generate code like
movl %rsi, %eax
and gas will complain about a nonsensical 'mov' instruction (it's moving a 64-bit register to a 32-bit one).
Passing it through memory will just hide the real bug - gcc still thinks the memory location is 64-bit, but the "movl" will only load the first 32 bits and it all happens to work because x86 is little-endian.
Convert it to a type safe inline function with a little trick which hands the feature into the ALTERNATIVE macro.
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/nospec-branch.h | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)
--- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -195,15 +195,16 @@ static inline void vmexit_fill_RSB(void) #endif }
-#define alternative_msr_write(_msr, _val, _feature) \ - asm volatile(ALTERNATIVE("", \ - "movl %[msr], %%ecx\n\t" \ - "movl %[val], %%eax\n\t" \ - "movl $0, %%edx\n\t" \ - "wrmsr", \ - _feature) \ - : : [msr] "i" (_msr), [val] "i" (_val) \ - : "eax", "ecx", "edx", "memory") +static __always_inline +void alternative_msr_write(unsigned int msr, u64 val, unsigned int feature) +{ + asm volatile(ALTERNATIVE("", "wrmsr", %c[feature]) + : : "c" (msr), + "a" (val), + "d" (val >> 32), + [feature] "i" (feature) + : "memory"); +}
static inline void indirect_branch_prediction_barrier(void) {
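[Editorial note] To make the failure mode described above concrete, here is a small user-space sketch (an illustration, not kernel code, x86 with GCC assumed): a movl from a 64-bit memory operand silently reads only the first four bytes, which little-endian layout makes look correct for small values and hides the truncation.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t val = 0x1122334455667788ULL;
        uint32_t lo;

        /* movl with a 64-bit memory operand reads only four bytes;
         * on little-endian x86 that is the low half of the value. */
        asm volatile("movl %1, %0" : "=r" (lo) : "m" (val));

        printf("movl loaded 0x%08x (the upper half 0x11223344 is lost)\n", lo);
        return 0;
}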
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Jiri Kosina jkosina@suse.cz
commit 6c26fcd2abfe0a56bbd95271fce02df2896cfd24 upstream.
pfn_modify_allowed() and arch_has_pfn_modify_check() are outside of the !__ASSEMBLY__ section in include/asm-generic/pgtable.h, which confuses assembler on archs that don't have __HAVE_ARCH_PFN_MODIFY_ALLOWED (e.g. ia64) and breaks build:
include/asm-generic/pgtable.h: Assembler messages: include/asm-generic/pgtable.h:538: Error: Unknown opcode `static inline bool pfn_modify_allowed(unsigned long pfn,pgprot_t prot)' include/asm-generic/pgtable.h:540: Error: Unknown opcode `return true' include/asm-generic/pgtable.h:543: Error: Unknown opcode `static inline bool arch_has_pfn_modify_check(void)' include/asm-generic/pgtable.h:545: Error: Unknown opcode `return false' arch/ia64/kernel/entry.S:69: Error: `mov' does not fit into bundle
Move those two static inlines into the !__ASSEMBLY__ section so that they don't confuse the asm build pass.
Fixes: 42e4089c7890 ("x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings") Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [groeck: Context changes] Signed-off-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Ben Hutchings ben@decadent.org.uk --- include/asm-generic/pgtable.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)
--- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -806,12 +806,6 @@ static inline void pmdp_set_numa(struct
#endif /* CONFIG_MMU */
-#endif /* !__ASSEMBLY__ */ - -#ifndef io_remap_pfn_range -#define io_remap_pfn_range remap_pfn_range -#endif - #ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot) { @@ -822,6 +816,12 @@ static inline bool arch_has_pfn_modify_c { return false; } +#endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */ + +#endif /* !__ASSEMBLY__ */ + +#ifndef io_remap_pfn_range +#define io_remap_pfn_range remap_pfn_range #endif
#endif /* _ASM_GENERIC_PGTABLE_H */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andi Kleen ak@linux.intel.com
commit 958f79b9ee55dfaf00c8106ed1c22a2919e0028b upstream.
set_memory_np() is used to mark kernel mappings not present, but it has its own open-coded mechanism which does not have the L1TF protection of inverting the address bits.
Replace the open coded PTE manipulation with the L1TF protecting low level PTE routines.
Passes the CPA self test.
Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de [bwh: Backported to 3.16: - cpa->pfn is actually a physical address here and needs to be shifted to produce a PFN - Adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -952,7 +952,8 @@ static int populate_pmd(struct cpa_data
pmd = pmd_offset(pud, start);
- set_pmd(pmd, __pmd(cpa->pfn | _PAGE_PSE | massage_pgprot(pgprot))); + set_pmd(pmd, pmd_mkhuge(pfn_pmd(cpa->pfn >> PAGE_SHIFT, + canon_pgprot(pgprot))));
start += PMD_SIZE; cpa->pfn += PMD_SIZE; @@ -1022,7 +1023,8 @@ static int populate_pud(struct cpa_data * Map everything starting from the Gb boundary, possibly with 1G pages */ while (end - start >= PUD_SIZE) { - set_pud(pud, __pud(cpa->pfn | _PAGE_PSE | massage_pgprot(pgprot))); + set_pud(pud, pud_mkhuge(pfn_pud(cpa->pfn >> PAGE_SHIFT, + canon_pgprot(pgprot))));
start += PUD_SIZE; cpa->pfn += PUD_SIZE;
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Daniel Rosenberg drosen@google.com
commit 717adfdaf14704fd3ec7fa2c04520c0723247eac upstream.
If our length is greater than the size of the buffer, we overflow the buffer.
Signed-off-by: Daniel Rosenberg drosen@google.com Reviewed-by: Benjamin Tissoires benjamin.tissoires@redhat.com Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Ben Hutchings ben@decadent.org.uk --- drivers/hid/hid-debug.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
--- a/drivers/hid/hid-debug.c +++ b/drivers/hid/hid-debug.c @@ -1150,6 +1150,8 @@ copy_rest: goto out; if (list->tail > list->head) { len = list->tail - list->head; + if (len > count) + len = count;
if (copy_to_user(buffer + ret, &list->hid_debug_buf[list->head], len)) { ret = -EFAULT; @@ -1159,6 +1161,8 @@ copy_rest: list->head += len; } else { len = HID_DEBUG_BUFSIZE - list->head; + if (len > count) + len = count;
if (copy_to_user(buffer, &list->hid_debug_buf[list->head], len)) { ret = -EFAULT; @@ -1166,7 +1170,9 @@ copy_rest: } list->head = 0; ret += len; - goto copy_rest; + count -= len; + if (count > 0) + goto copy_rest; }
}
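[Editorial note] The pattern being fixed is the classic ring-buffer read: each chunk must be clamped to both the bytes available and the bytes the caller asked for. Below is a minimal stand-alone model of the corrected loop, with memcpy() standing in for copy_to_user() and all names hypothetical; it is a sketch of the idea, not the driver code.

#include <stdio.h>
#include <string.h>
#include <stddef.h>

#define BUFSIZE 8       /* stand-in for HID_DEBUG_BUFSIZE */

/* Copy from a ring buffer (buf, head, tail) into out[], never writing
 * more than 'count' bytes, wrapping once at the end of the buffer. */
static size_t ring_read(const char *buf, size_t *head, size_t tail,
                        char *out, size_t count)
{
        size_t ret = 0, len;

        while (*head != tail && count > 0) {
                if (tail > *head)
                        len = tail - *head;
                else
                        len = BUFSIZE - *head;  /* up to the wrap point */
                if (len > count)
                        len = count;            /* the clamp the fix adds */

                memcpy(out + ret, buf + *head, len);
                *head = (*head + len) % BUFSIZE;
                ret += len;
                count -= len;
        }
        return ret;
}

int main(void)
{
        char ring[BUFSIZE] = "ABCDEFGH";
        size_t head = 6;                /* data wraps: "GH" then "ABC" */
        char out[4];

        size_t n = ring_read(ring, &head, 3, out, sizeof(out));
        printf("read %zu bytes: %.*s\n", n, (int)n, out);      /* GHAB */
        return 0;
}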
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andi Kleen ak@linux.intel.com
commit 10a70416e1f067f6c4efda6ffd8ea96002ac4223 upstream.
The L1TF workaround doesn't make any attempt to mitigate speculative accesses to the first physical page for zeroed PTEs. Normally it only contains some data from the early real mode BIOS.
It's not entirely clear that the first page is reserved in all configurations, so add an extra reservation call to make sure it is really reserved. In most configurations (e.g. with the standard reservations) it's likely a nop.
Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Josh Poimboeuf jpoimboe@redhat.com Acked-by: Dave Hansen dave.hansen@intel.com Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/kernel/setup.c | 6 ++++++ 1 file changed, 6 insertions(+)
--- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -860,6 +860,12 @@ void __init setup_arch(char **cmdline_p) memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text);
+ /* + * Make sure page 0 is always reserved because on systems with + * L1TF its contents can be leaked to user processes. + */ + memblock_reserve(0, PAGE_SIZE); + early_reserve_initrd();
/*
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Ben Hutchings ben@decadent.org.uk
NUMA balancing has its own functions that manipulate the PRESENT flag in PTEs and PMDs. These were not affected by the changes in commit 6b28baca9b1f "x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation".
This is not a problem upstream because NUMA balancing was changed to use {pte,pmd}_modify() in Linux 4.0.
Override the generic implementations for x86 with implementations that do the same inversion as {pte,pmd}_modify().
Signed-off-by: Ben Hutchings ben@decadent.org.uk Cc: x86@kernel.org Cc: Mel Gorman mgorman@suse.de --- --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -419,6 +419,54 @@ static inline pmd_t pmd_modify(pmd_t pmd return __pmd(val); }
+#ifdef CONFIG_NUMA_BALANCING + +static inline pte_t pte_mknonnuma(pte_t pte) +{ + pteval_t val = pte_val(pte), oldval = val; + + val &= ~_PAGE_NUMA; + val |= (_PAGE_PRESENT|_PAGE_ACCESSED); + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); + return __pte(val); +} +#define pte_mknonnuma pte_mknonnuma + +static inline pte_t pte_mknuma(pte_t pte) +{ + pteval_t val = pte_val(pte), oldval = val; + + val &= ~_PAGE_PRESENT; + val |= _PAGE_NUMA; + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); + return __pte(val); +} +#define pte_mknuma pte_mknuma + +static inline pmd_t pmd_mknonnuma(pmd_t pmd) +{ + pmdval_t val = pmd_val(pmd), oldval = val; + + val &= ~_PAGE_NUMA; + val |= (_PAGE_PRESENT|_PAGE_ACCESSED); + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); + return __pmd(val); +} +#define pmd_mknonnuma pmd_mknonnuma + +static inline pmd_t pmd_mknuma(pmd_t pmd) +{ + pmdval_t val = pmd_val(pmd), oldval = val; + + val &= ~_PAGE_PRESENT; + val |= _PAGE_NUMA; + val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); + return __pmd(val); +} +#define pmd_mknuma pmd_mknuma + +#endif /* CONFIG_NUMA_BALANCING */ + /* mprotect needs to preserve PAT bits when updating vm_page_prot */ #define pgprot_modify pgprot_modify static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
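[Editorial note] The idea behind flip_protnone_guard(), modelled very roughly below with made-up masks (this is an illustration of the inversion trick, not the real x86 PTE layout): when the present bit is cleared, the PFN field is stored inverted so that the raw bits no longer point at attacker-chosen physical memory, and inverting again on the way back restores the original PFN.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical 64-bit "PTE" model: bit 0 = PRESENT, bits 12+ = PFN. */
#define PRESENT         0x1ULL
#define PFN_MASK        0x000ffffffffff000ULL

/* Model of the flip done when PRESENT toggles: keep the flag bits,
 * invert the PFN field. Applying it twice restores the original. */
static uint64_t flip_pfn(uint64_t val)
{
        return (val & ~PFN_MASK) | (~val & PFN_MASK);
}

int main(void)
{
        uint64_t pte = (0x1234ULL << 12) | PRESENT;

        /* pte_mknuma-style transition: clear PRESENT, invert the PFN */
        uint64_t numa_pte = flip_pfn(pte & ~PRESENT);

        /* pte_mknonnuma-style transition: invert back, set PRESENT again */
        uint64_t restored = flip_pfn(numa_pte) | PRESENT;

        printf("original              %016llx\n", (unsigned long long)pte);
        printf("not-present, inverted %016llx\n", (unsigned long long)numa_pte);
        printf("restored              %016llx\n", (unsigned long long)restored);
        return 0;
}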
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 22f9bf3950f20d24198791685f2dccac2c4ef38a upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: James Hogan james.hogan@imgtec.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/metag/include/asm/pgtable.h | 6 ------ 1 file changed, 6 deletions(-)
--- a/arch/metag/include/asm/pgtable.h +++ b/arch/metag/include/asm/pgtable.h @@ -47,7 +47,6 @@ */ #define _PAGE_ACCESSED _PAGE_ALWAYS_ZERO_1 #define _PAGE_DIRTY _PAGE_ALWAYS_ZERO_2 -#define _PAGE_FILE _PAGE_ALWAYS_ZERO_3
/* Pages owned, and protected by, the kernel. */ #define _PAGE_KERNEL _PAGE_PRIV @@ -219,7 +218,6 @@ extern unsigned long empty_zero_page; static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &= (~_PAGE_WRITE); return pte; } @@ -327,10 +325,6 @@ static inline void update_mmu_cache(stru #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-#define PTE_FILE_MAX_BITS 22 -#define pte_to_pgoff(x) (pte_val(x) >> 10) -#define pgoff_to_pte(x) __pte(((x) << 10) | _PAGE_FILE) - #define kern_addr_valid(addr) (1)
/*
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Finn Thain fthain@telegraphics.com.au
commit ac39452e942af6a212e8f89e8a36b71354323845 upstream.
The cuda_start() function uses spinlock_irq_save/restore for mutual exclusion. Let's have cuda_poll() do the same when polling the VIA interrupt.
The benefit of disabling local irqs when the interrupt is being polled is that the interrupt handler now has the same timing properties regardless of whether it is invoked normally or from cuda_poll().
This driver was written back when local irqs remained enabled during execution of interrupt handlers and cuda_poll() was probably trying to achieve the same effect by use of enable/disable_irq.
Tested-by: Stan Johnson userm57@yahoo.com Signed-off-by: Finn Thain fthain@telegraphics.com.au Signed-off-by: Michael Ellerman mpe@ellerman.id.au [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- drivers/macintosh/via-cuda.c | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-)
--- a/drivers/macintosh/via-cuda.c +++ b/drivers/macintosh/via-cuda.c @@ -432,26 +432,20 @@ cuda_start(void) void cuda_poll(void) { - /* cuda_interrupt only takes a normal lock, we disable - * interrupts here to avoid re-entering and thus deadlocking. - */ - if (cuda_irq) - disable_irq(cuda_irq); - cuda_interrupt(0, NULL); - if (cuda_irq) - enable_irq(cuda_irq); + cuda_interrupt(0, NULL); }
static irqreturn_t cuda_interrupt(int irq, void *arg) { + unsigned long flags; int status; struct adb_request *req = NULL; unsigned char ibuf[16]; int ibuf_len = 0; int complete = 0;
- spin_lock(&cuda_lock); + spin_lock_irqsave(&cuda_lock, flags);
/* On powermacs, this handler is registered for the VIA IRQ. But they use * just the shift register IRQ -- other VIA interrupt sources are disabled. @@ -464,7 +458,7 @@ cuda_interrupt(int irq, void *arg) #endif { if ((in_8(&via[IFR]) & SR_INT) == 0) { - spin_unlock(&cuda_lock); + spin_unlock_irqrestore(&cuda_lock, flags); return IRQ_NONE; } else { out_8(&via[IFR], SR_INT); @@ -593,7 +587,7 @@ cuda_interrupt(int irq, void *arg) default: printk("cuda_interrupt: unknown cuda_state %d?\n", cuda_state); } - spin_unlock(&cuda_lock); + spin_unlock_irqrestore(&cuda_lock, flags); if (complete && req) { void (*done)(struct adb_request *) = req->done; mb();
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 18747151308f9e0fb63766057957617ec4afa190 upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Acked-by: Vineet Gupta vgupta@synopsys.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/arc/include/asm/pgtable.h | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-)
--- a/arch/arc/include/asm/pgtable.h +++ b/arch/arc/include/asm/pgtable.h @@ -61,7 +61,6 @@ #define _PAGE_WRITE (1<<4) /* Page has user write perm (H) */ #define _PAGE_READ (1<<5) /* Page has user read perm (H) */ #define _PAGE_MODIFIED (1<<6) /* Page modified (dirty) (S) */ -#define _PAGE_FILE (1<<7) /* page cache/ swap (S) */ #define _PAGE_GLOBAL (1<<8) /* Page is global (H) */ #define _PAGE_PRESENT (1<<10) /* TLB entry is valid (H) */
@@ -73,7 +72,6 @@ #define _PAGE_READ (1<<3) /* Page has user read perm (H) */ #define _PAGE_ACCESSED (1<<4) /* Page is accessed (S) */ #define _PAGE_MODIFIED (1<<5) /* Page modified (dirty) (S) */ -#define _PAGE_FILE (1<<6) /* page cache/ swap (S) */ #define _PAGE_GLOBAL (1<<8) /* Page is global (H) */ #define _PAGE_PRESENT (1<<9) /* TLB entry is valid (H) */ #define _PAGE_SHARED_CODE (1<<11) /* Shared Code page with cmn vaddr @@ -269,15 +267,6 @@ static inline void pmd_set(pmd_t *pmdp, pte; \ })
-/* TBD: Non linear mapping stuff */ -static inline int pte_file(pte_t pte) -{ - return pte_val(pte) & _PAGE_FILE; -} - -#define PTE_FILE_MAX_BITS 30 -#define pgoff_to_pte(x) __pte(x) -#define pte_to_pgoff(x) (pte_val(x) >> 2) #define pte_pfn(pte) (pte_val(pte) >> PAGE_SHIFT) #define pfn_pte(pfn, prot) (__pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot))) #define __pte_index(addr) (((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) @@ -365,7 +354,7 @@ void update_mmu_cache(struct vm_area_str
/* Encode swap {type,off} tuple into PTE * We reserve 13 bits for 5-bit @type, keeping bits 12-5 zero, ensuring that - * both PAGE_FILE and PAGE_PRESENT are zero in a PTE holding swap "identifier" + * PAGE_PRESENT is zero in a PTE holding swap "identifier" */ #define __swp_entry(type, off) ((swp_entry_t) { \ ((type) & 0x1f) | ((off) << 13) })
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 937fa39fb22fea1c1d8ca9e5f31c452b91ac7239 upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Michal Simek monstr@monstr.eu Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/microblaze/include/asm/pgtable.h | 11 ----------- 1 file changed, 11 deletions(-)
--- a/arch/microblaze/include/asm/pgtable.h +++ b/arch/microblaze/include/asm/pgtable.h @@ -40,10 +40,6 @@ extern int mem_init_done; #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-#ifndef __ASSEMBLY__ -static inline int pte_file(pte_t pte) { return 0; } -#endif /* __ASSEMBLY__ */ - #define ZERO_PAGE(vaddr) ({ BUG(); NULL; })
#define swapper_pg_dir ((pgd_t *) NULL) @@ -207,7 +203,6 @@ static inline pte_t pte_mkspecial(pte_t
/* Definitions for MicroBlaze. */ #define _PAGE_GUARDED 0x001 /* G: page is guarded from prefetch */ -#define _PAGE_FILE 0x001 /* when !present: nonlinear file mapping */ #define _PAGE_PRESENT 0x002 /* software: PTE contains a translation */ #define _PAGE_NO_CACHE 0x004 /* I: caching is inhibited */ #define _PAGE_WRITETHRU 0x008 /* W: caching is write-through */ @@ -337,7 +332,6 @@ static inline int pte_write(pte_t pte) { static inline int pte_exec(pte_t pte) { return pte_val(pte) & _PAGE_EXEC; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; } static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; } @@ -499,11 +493,6 @@ static inline pmd_t *pmd_offset(pgd_t *d
#define pte_unmap(pte) kunmap_atomic(pte)
-/* Encode and decode a nonlinear file mapping entry */ -#define PTE_FILE_MAX_BITS 29 -#define pte_to_pgoff(pte) (pte_val(pte) >> 3) -#define pgoff_to_pte(off) ((pte_t) { ((off) << 3) | _PAGE_FILE }) - extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
/*
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andi Kleen ak@linux.intel.com
commit 1063711b57393c1999248cccb57bebfaf16739e7 upstream.
The mmio tracer sets io mapping PTEs and PMDs to non-present when enabled, without inverting the address bits, which makes the PTE entry vulnerable to L1TF.
Make it use the right low level macros to actually invert the address bits to protect against L1TF.
In principle this could be avoided because MMIO tracing is not likely to be enabled on production machines, but the fix is straightforward and for consistency's sake it's better to get rid of the open-coded PTE manipulation.
Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/mm/kmmio.c | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-)
--- a/arch/x86/mm/kmmio.c +++ b/arch/x86/mm/kmmio.c @@ -114,24 +114,29 @@ static struct kmmio_fault_page *get_kmmi
static void clear_pmd_presence(pmd_t *pmd, bool clear, pmdval_t *old) { + pmd_t new_pmd; pmdval_t v = pmd_val(*pmd); if (clear) { - *old = v & _PAGE_PRESENT; - v &= ~_PAGE_PRESENT; - } else /* presume this has been called with clear==true previously */ - v |= *old; - set_pmd(pmd, __pmd(v)); + *old = v; + new_pmd = pmd_mknotpresent(*pmd); + } else { + /* Presume this has been called with clear==true previously */ + new_pmd = __pmd(*old); + } + set_pmd(pmd, new_pmd); }
static void clear_pte_presence(pte_t *pte, bool clear, pteval_t *old) { pteval_t v = pte_val(*pte); if (clear) { - *old = v & _PAGE_PRESENT; - v &= ~_PAGE_PRESENT; - } else /* presume this has been called with clear==true previously */ - v |= *old; - set_pte_atomic(pte, __pte(v)); + *old = v; + /* Nothing should care about address */ + pte_clear(&init_mm, 0, pte); + } else { + /* Presume this has been called with clear==true previously */ + set_pte_atomic(pte, __pte(*old)); + } }
static int clear_page_presence(struct kmmio_fault_page *f, bool clear)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit 0cc5fa00b0a88dad140b4e5c2cead9951ad36822 upstream.
Add the CPU feature bit CPUID.7.0.EDX[31] which indicates whether the CPU supports Reduced Data Speculation.
[ tglx: Split it out from a later patch ]
Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org [bwh: Backported to 3.16: - This CPUID word is feature word 10 - Adjust filename] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/cpufeature.h | 1 + 1 file changed, 1 insertion(+)
--- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -241,6 +241,7 @@ #define X86_FEATURE_SPEC_CTRL (10*32+26) /* "" Speculation Control (IBRS + IBPB) */ #define X86_FEATURE_INTEL_STIBP (10*32+27) /* "" Single Thread Indirect Branch Predictors */ #define X86_FEATURE_ARCH_CAPABILITIES (10*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */ +#define X86_FEATURE_RDS (10*32+31) /* Reduced Data Speculation */
/* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 11 */ #define X86_FEATURE_IBPB (11*32+12) /* Indirect Branch Prediction Barrier */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit f5b45de9b00eb53d11ada85c61e4ea1c31ab8218 upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Acked-by: Mark Salter msalter@redhat.com Cc: Aurelien Jacquiot a-jacquiot@ti.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/c6x/include/asm/pgtable.h | 5 ----- 1 file changed, 5 deletions(-)
--- a/arch/c6x/include/asm/pgtable.h +++ b/arch/c6x/include/asm/pgtable.h @@ -50,11 +50,6 @@ extern void paging_init(void); #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-static inline int pte_file(pte_t pte) -{ - return 0; -} - #define set_pte(pteptr, pteval) (*(pteptr) = pteval) #define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval)
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 9b3e661e58b90b0c2d5c2168c23408f1e59e9e35 upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
This patch also adjusts __SWP_TYPE_SHIFT and increases the number of bits available for the swap offset.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Acked-by: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will.deacon@arm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/arm64/include/asm/pgtable.h | 22 ++++------------------ 1 file changed, 4 insertions(+), 18 deletions(-)
--- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -25,7 +25,6 @@ * Software defined PTE bits definition. */ #define PTE_VALID (_AT(pteval_t, 1) << 0) -#define PTE_FILE (_AT(pteval_t, 1) << 2) /* only when !pte_present() */ #define PTE_DIRTY (_AT(pteval_t, 1) << 55) #define PTE_SPECIAL (_AT(pteval_t, 1) << 56) #define PTE_WRITE (_AT(pteval_t, 1) << 57) @@ -402,13 +401,12 @@ extern pgd_t idmap_pg_dir[PTRS_PER_PGD]; /* * Encode and decode a swap entry: * bits 0-1: present (must be zero) - * bit 2: PTE_FILE - * bits 3-8: swap type - * bits 9-57: swap offset + * bits 2-7: swap type + * bits 8-57: swap offset */ -#define __SWP_TYPE_SHIFT 3 +#define __SWP_TYPE_SHIFT 2 #define __SWP_TYPE_BITS 6 -#define __SWP_OFFSET_BITS 49 +#define __SWP_OFFSET_BITS 50 #define __SWP_TYPE_MASK ((1 << __SWP_TYPE_BITS) - 1) #define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT) #define __SWP_OFFSET_MASK ((1UL << __SWP_OFFSET_BITS) - 1) @@ -426,18 +424,6 @@ extern pgd_t idmap_pg_dir[PTRS_PER_PGD]; */ #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > __SWP_TYPE_BITS)
-/* - * Encode and decode a file entry: - * bits 0-1: present (must be zero) - * bit 2: PTE_FILE - * bits 3-57: file offset / PAGE_SIZE - */ -#define pte_file(pte) (pte_val(pte) & PTE_FILE) -#define pte_to_pgoff(x) (pte_val(x) >> 3) -#define pgoff_to_pte(x) __pte(((x) << 3) | PTE_FILE) - -#define PTE_FILE_MAX_BITS 55 - extern int kern_addr_valid(unsigned long addr);
#include <asm-generic/pgtable.h>
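[Editorial note] As a stand-alone sketch of the new layout described in the comment above (type in bits 2-7, offset from bit 8), assuming nothing about the real arm64 macros beyond what the hunk shows; the helper names are illustrative.

#include <stdio.h>
#include <stdint.h>
#include <assert.h>

#define SWP_TYPE_SHIFT          2
#define SWP_TYPE_BITS           6
#define SWP_OFFSET_SHIFT        (SWP_TYPE_SHIFT + SWP_TYPE_BITS)
#define SWP_TYPE_MASK           ((1UL << SWP_TYPE_BITS) - 1)

static uint64_t swp_entry(unsigned int type, uint64_t offset)
{
        return ((uint64_t)(type & SWP_TYPE_MASK) << SWP_TYPE_SHIFT) |
               (offset << SWP_OFFSET_SHIFT);
}

static unsigned int swp_type(uint64_t entry)
{
        return (entry >> SWP_TYPE_SHIFT) & SWP_TYPE_MASK;
}

static uint64_t swp_offset(uint64_t entry)
{
        return entry >> SWP_OFFSET_SHIFT;
}

int main(void)
{
        uint64_t e = swp_entry(5, 0x123456789ULL);

        assert(swp_type(e) == 5);
        assert(swp_offset(e) == 0x123456789ULL);
        assert((e & 0x3) == 0);         /* present bits 0-1 stay clear */
        printf("entry %016llx round-trips\n", (unsigned long long)e);
        return 0;
}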
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Vlastimil Babka vbabka@suse.cz
commit 1a7ed1ba4bba6c075d5ad61bb75e3fbc870840d6 upstream.
The previous patch has limited swap file size so that large offsets cannot clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation.
It assumed that offsets are encoded starting with bit 12, same as pfn. But on x86_64, offsets are encoded starting with bit 9.
Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA and 256TB with 46bit MAX_PA.
Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Signed-off-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/mm/init.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
--- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -710,7 +710,15 @@ unsigned long max_swapfile_size(void)
if (boot_cpu_has_bug(X86_BUG_L1TF)) { /* Limit the swap file size to MAX_PA/2 for L1TF workaround */ - pages = min_t(unsigned long, l1tf_pfn_limit() + 1, pages); + unsigned long l1tf_limit = l1tf_pfn_limit() + 1; + /* + * We encode swap offsets also with 3 bits below those for pfn + * which makes the usable limit higher. + */ +#ifdef CONFIG_X86_64 + l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT; +#endif + pages = min_t(unsigned long, l1tf_limit, pages); } return pages; }
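[Editorial note] The arithmetic behind the 16TB and 256TB figures can be checked with a few lines of C. The constants below (PAGE_SHIFT of 12, x86_64 swap offsets starting at bit 9) are taken from the commit message; the calculation itself is illustrative and does not use the real l1tf_pfn_limit() helper.

#include <stdio.h>

int main(void)
{
        const int page_shift = 12;
        const int swp_offset_first_bit = 9;
        const int max_pa_bits[] = { 42, 46 };

        for (int i = 0; i < 2; i++) {
                /* MAX_PA/2 expressed in pages ... */
                unsigned long long pages =
                        1ULL << (max_pa_bits[i] - 1 - page_shift);
                /* ... raised by the 3 spare bits below the pfn position */
                pages <<= page_shift - swp_offset_first_bit;

                printf("MAX_PA %d bits -> swap limit %llu TB\n",
                       max_pa_bits[i], (pages << page_shift) >> 40);
        }
        return 0;
}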
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 103f3d9a26df944f4c29de190d72dfbf913c71af upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Mikael Starvik starvik@axis.com Cc: Jesper Nilsson jesper.nilsson@axis.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/cris/include/arch-v10/arch/mmu.h | 3 --- arch/cris/include/arch-v32/arch/mmu.h | 3 --- arch/cris/include/asm/pgtable.h | 4 ---- 3 files changed, 10 deletions(-)
--- a/arch/cris/include/arch-v10/arch/mmu.h +++ b/arch/cris/include/arch-v10/arch/mmu.h @@ -58,7 +58,6 @@ typedef struct /* Bits the HW doesn't care about but the kernel uses them in SW */
#define _PAGE_PRESENT (1<<4) /* page present in memory */ -#define _PAGE_FILE (1<<5) /* set: pagecache, unset: swap (when !PRESENT) */ #define _PAGE_ACCESSED (1<<5) /* simulated in software using valid bit */ #define _PAGE_MODIFIED (1<<6) /* simulated in software using we bit */ #define _PAGE_READ (1<<7) /* read-enabled */ @@ -105,6 +104,4 @@ typedef struct #define __S110 PAGE_SHARED #define __S111 PAGE_SHARED
-#define PTE_FILE_MAX_BITS 26 - #endif --- a/arch/cris/include/arch-v32/arch/mmu.h +++ b/arch/cris/include/arch-v32/arch/mmu.h @@ -53,7 +53,6 @@ typedef struct * software. */ #define _PAGE_PRESENT (1 << 5) /* Page is present in memory. */ -#define _PAGE_FILE (1 << 6) /* 1=pagecache, 0=swap (when !present) */ #define _PAGE_ACCESSED (1 << 6) /* Simulated in software using valid bit. */ #define _PAGE_MODIFIED (1 << 7) /* Simulated in software using we bit. */ #define _PAGE_READ (1 << 8) /* Read enabled. */ @@ -108,6 +107,4 @@ typedef struct #define __S110 PAGE_SHARED_EXEC #define __S111 PAGE_SHARED_EXEC
-#define PTE_FILE_MAX_BITS 25 - #endif /* _ASM_CRIS_ARCH_MMU_H */ --- a/arch/cris/include/asm/pgtable.h +++ b/arch/cris/include/asm/pgtable.h @@ -114,7 +114,6 @@ extern unsigned long empty_zero_page; static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_MODIFIED; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_wrprotect(pte_t pte) @@ -290,9 +289,6 @@ static inline void update_mmu_cache(stru */ #define pgtable_cache_init() do { } while (0)
-#define pte_to_pgoff(x) (pte_val(x) >> 6) -#define pgoff_to_pte(x) __pte(((x) << 6) | _PAGE_FILE) - typedef pte_t *pte_addr_t;
#endif /* __ASSEMBLY__ */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit 47c61b3955cf712cadfc25635bf9bc174af030ea upstream.
Add the necessary logic for supporting the emulated VIRT_SPEC_CTRL MSR to x86_virt_spec_ctrl(). If either X86_FEATURE_LS_CFG_SSBD or X86_FEATURE_VIRT_SPEC_CTRL is set then use the new guest_virt_spec_ctrl argument to check whether the state must be modified on the host. The update reuses speculative_store_bypass_update() so the ZEN-specific sibling coordination can be reused.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/spec-ctrl.h | 6 ++++++ arch/x86/kernel/cpu/bugs.c | 30 ++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+)
--- a/arch/x86/include/asm/spec-ctrl.h +++ b/arch/x86/include/asm/spec-ctrl.h @@ -53,6 +53,12 @@ static inline u64 ssbd_tif_to_spec_ctrl( return (tifn & _TIF_SSBD) >> (TIF_SSBD - SPEC_CTRL_SSBD_SHIFT); }
+static inline unsigned long ssbd_spec_ctrl_to_tif(u64 spec_ctrl) +{ + BUILD_BUG_ON(TIF_SSBD < SPEC_CTRL_SSBD_SHIFT); + return (spec_ctrl & SPEC_CTRL_SSBD) << (TIF_SSBD - SPEC_CTRL_SSBD_SHIFT); +} + static inline u64 ssbd_tif_to_amd_ls_cfg(u64 tifn) { return (tifn & _TIF_SSBD) ? x86_amd_ls_cfg_ssbd_mask : 0ULL; --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -223,6 +223,36 @@ x86_virt_spec_ctrl(u64 guest_spec_ctrl, wrmsrl(MSR_IA32_SPEC_CTRL, msrval); } } + + /* + * If SSBD is not handled in MSR_SPEC_CTRL on AMD, update + * MSR_AMD64_L2_CFG or MSR_VIRT_SPEC_CTRL if supported. + */ + if (!static_cpu_has(X86_FEATURE_LS_CFG_SSBD) && + !static_cpu_has(X86_FEATURE_VIRT_SSBD)) + return; + + /* + * If the host has SSBD mitigation enabled, force it in the host's + * virtual MSR value. If its not permanently enabled, evaluate + * current's TIF_SSBD thread flag. + */ + if (static_cpu_has(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE)) + hostval = SPEC_CTRL_SSBD; + else + hostval = ssbd_tif_to_spec_ctrl(ti->flags); + + /* Sanitize the guest value */ + guestval = guest_virt_spec_ctrl & SPEC_CTRL_SSBD; + + if (hostval != guestval) { + unsigned long tif; + + tif = setguest ? ssbd_spec_ctrl_to_tif(guestval) : + ssbd_spec_ctrl_to_tif(hostval); + + speculative_store_bypass_update(tif); + } } EXPORT_SYMBOL_GPL(x86_virt_spec_ctrl);
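[Editorial note] The two helpers simply move one bit between the SPEC_CTRL MSR image and the thread-flag word. Below is a stand-alone round-trip check; SPEC_CTRL_SSBD sits at bit 2 as in the patch, while TIF_SSBD is assumed to be bit 5 purely for illustration (the real value is kernel-specific).

#include <stdio.h>
#include <assert.h>

#define SPEC_CTRL_SSBD_SHIFT    2
#define SPEC_CTRL_SSBD          (1UL << SPEC_CTRL_SSBD_SHIFT)
#define TIF_SSBD                5                       /* assumed */
#define _TIF_SSBD               (1UL << TIF_SSBD)

static unsigned long ssbd_tif_to_spec_ctrl(unsigned long tifn)
{
        return (tifn & _TIF_SSBD) >> (TIF_SSBD - SPEC_CTRL_SSBD_SHIFT);
}

static unsigned long ssbd_spec_ctrl_to_tif(unsigned long spec_ctrl)
{
        return (spec_ctrl & SPEC_CTRL_SSBD) << (TIF_SSBD - SPEC_CTRL_SSBD_SHIFT);
}

int main(void)
{
        /* A set TIF_SSBD flag maps to SPEC_CTRL_SSBD and back. */
        assert(ssbd_tif_to_spec_ctrl(_TIF_SSBD) == SPEC_CTRL_SSBD);
        assert(ssbd_spec_ctrl_to_tif(SPEC_CTRL_SSBD) == _TIF_SSBD);
        printf("TIF <-> SPEC_CTRL SSBD conversion round-trips\n");
        return 0;
}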
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Kees Cook keescook@chromium.org
commit 00a02d0c502a06d15e07b857f8ff921e3e402675 upstream.
If a seccomp user is not interested in Speculative Store Bypass mitigation by default, it can set the new SECCOMP_FILTER_FLAG_SPEC_ALLOW flag when adding filters.
Signed-off-by: Kees Cook keescook@chromium.org Signed-off-by: Thomas Gleixner tglx@linutronix.de [bwh: Backported to 3.16: - We don't support SECCOMP_FILTER_FLAG_TSYNC or SECCOMP_FILTER_FLAG_LOG - Drop selftest changes] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- include/linux/seccomp.h | 2 ++ include/uapi/linux/seccomp.h | 3 +++ kernel/seccomp.c | 14 ++++++++------ 3 files changed, 13 insertions(+), 6 deletions(-)
--- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -3,6 +3,8 @@
#include <uapi/linux/seccomp.h>
+#define SECCOMP_FILTER_FLAG_MASK SECCOMP_FILTER_FLAG_SPEC_ALLOW + #ifdef CONFIG_SECCOMP
#include <linux/thread_info.h> --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -14,6 +14,9 @@ #define SECCOMP_SET_MODE_STRICT 0 #define SECCOMP_SET_MODE_FILTER 1
+/* Valid flags for SECCOMP_SET_MODE_FILTER */ +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) + /* * All BPF programs must return a 32-bit value. * The bottom 16-bits are for optional return data. --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -220,11 +220,13 @@ static inline void spec_mitigate(struct arch_prctl_spec_ctrl_set(task, which, PR_SPEC_FORCE_DISABLE); }
-static inline void seccomp_assign_mode(unsigned long seccomp_mode) +static inline void seccomp_assign_mode(unsigned long seccomp_mode, + unsigned long flags) { current->seccomp.mode = seccomp_mode; - /* Assume seccomp processes want speculation flaw mitigation. */ - spec_mitigate(current, PR_SPEC_STORE_BYPASS); + /* Assume default seccomp processes want spec flaw mitigation. */ + if ((flags & SECCOMP_FILTER_FLAG_SPEC_ALLOW) == 0) + spec_mitigate(current, PR_SPEC_STORE_BYPASS); set_tsk_thread_flag(current, TIF_SECCOMP); }
@@ -524,7 +526,7 @@ static long seccomp_set_mode_strict(void #ifdef TIF_NOTSC disable_TSC(); #endif - seccomp_assign_mode(seccomp_mode); + seccomp_assign_mode(seccomp_mode, 0); ret = 0;
out: @@ -553,7 +555,7 @@ static long seccomp_set_mode_filter(unsi long ret = -EINVAL;
/* Validate flags. */ - if (flags != 0) + if (flags & ~SECCOMP_FILTER_FLAG_MASK) goto out;
if (!seccomp_may_assign_mode(seccomp_mode)) @@ -563,7 +565,7 @@ static long seccomp_set_mode_filter(unsi if (ret) goto out;
- seccomp_assign_mode(seccomp_mode); + seccomp_assign_mode(seccomp_mode, flags); out: return ret; }
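[Editorial note] From user space the new flag would be used roughly as below. This is a sketch under the assumption that the seccomp(2) syscall entry point is available (it is on newer kernels; on a 3.16-era system the plumbing may differ), and the fallback flag value mirrors the definition added above.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef SYS_seccomp
#define SYS_seccomp 317                         /* x86-64 syscall number */
#endif
#ifndef SECCOMP_FILTER_FLAG_SPEC_ALLOW
#define SECCOMP_FILTER_FLAG_SPEC_ALLOW  (1UL << 2)      /* from the patch */
#endif

int main(void)
{
        /* Trivial allow-all filter, just to have something to install. */
        struct sock_filter insns[] = {
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
                .len = sizeof(insns) / sizeof(insns[0]),
                .filter = insns,
        };

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
                perror("prctl");
                return 1;
        }

        /* Opt out of the automatic SSB mitigation while installing. */
        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                    SECCOMP_FILTER_FLAG_SPEC_ALLOW, &prog) != 0) {
                perror("seccomp");
                return 1;
        }
        printf("filter installed without forcing SSBD\n");
        return 0;
}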
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Andy Lutomirski luto@kernel.org
commit 5dd0b16cdaff9b94da06074d5888b03235c0bf17 upstream.
This fixes CONFIG_SMP=n, CONFIG_DEBUG_TLBFLUSH=y without introducing further #ifdef soup. Caught by a Kbuild bot randconfig build.
Signed-off-by: Andy Lutomirski luto@kernel.org Cc: Borislav Petkov bp@alien8.de Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Fixes: ce4a4e565f52 ("x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code") Link: http://lkml.kernel.org/r/76da9a3cc4415996f2ad2c905b93414add322021.1496673616... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- include/linux/vm_event_item.h | 2 -- 1 file changed, 2 deletions(-)
--- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -73,10 +73,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS THP_ZERO_PAGE_ALLOC_FAILED, #endif #ifdef CONFIG_DEBUG_TLBFLUSH -#ifdef CONFIG_SMP NR_TLB_REMOTE_FLUSH, /* cpu tried to flush others' tlbs */ NR_TLB_REMOTE_FLUSH_RECEIVED,/* cpu received ipi for flush */ -#endif /* CONFIG_SMP */ NR_TLB_LOCAL_FLUSH_ALL, NR_TLB_LOCAL_FLUSH_ONE, #endif /* CONFIG_DEBUG_TLBFLUSH */
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit ac51b934f3912582d3c897c6c4d09b32ea57b2c7 upstream.
After removing vma->shared.nonlinear we have only one member of vma->shared union, which doesn't make much sense.
This patch drops the union and moves struct vma->shared.linear to vma->shared.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- include/linux/mm_types.h | 8 +++----- mm/interval_tree.c | 34 +++++++++++++++++----------------- 2 files changed, 20 insertions(+), 22 deletions(-)
--- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -274,11 +274,9 @@ struct vm_area_struct { * For areas with an address space and backing store, * linkage into the address_space->i_mmap interval tree. */ - union { - struct { - struct rb_node rb; - unsigned long rb_subtree_last; - } linear; + struct { + struct rb_node rb; + unsigned long rb_subtree_last; } shared;
/* --- a/mm/interval_tree.c +++ b/mm/interval_tree.c @@ -21,8 +21,8 @@ static inline unsigned long vma_last_pgo return v->vm_pgoff + ((v->vm_end - v->vm_start) >> PAGE_SHIFT) - 1; }
-INTERVAL_TREE_DEFINE(struct vm_area_struct, shared.linear.rb, - unsigned long, shared.linear.rb_subtree_last, +INTERVAL_TREE_DEFINE(struct vm_area_struct, shared.rb, + unsigned long, shared.rb_subtree_last, vma_start_pgoff, vma_last_pgoff,, vma_interval_tree)
/* Insert node immediately after prev in the interval tree */ @@ -36,26 +36,26 @@ void vma_interval_tree_insert_after(stru
VM_BUG_ON(vma_start_pgoff(node) != vma_start_pgoff(prev));
- if (!prev->shared.linear.rb.rb_right) { + if (!prev->shared.rb.rb_right) { parent = prev; - link = &prev->shared.linear.rb.rb_right; + link = &prev->shared.rb.rb_right; } else { - parent = rb_entry(prev->shared.linear.rb.rb_right, - struct vm_area_struct, shared.linear.rb); - if (parent->shared.linear.rb_subtree_last < last) - parent->shared.linear.rb_subtree_last = last; - while (parent->shared.linear.rb.rb_left) { - parent = rb_entry(parent->shared.linear.rb.rb_left, - struct vm_area_struct, shared.linear.rb); - if (parent->shared.linear.rb_subtree_last < last) - parent->shared.linear.rb_subtree_last = last; + parent = rb_entry(prev->shared.rb.rb_right, + struct vm_area_struct, shared.rb); + if (parent->shared.rb_subtree_last < last) + parent->shared.rb_subtree_last = last; + while (parent->shared.rb.rb_left) { + parent = rb_entry(parent->shared.rb.rb_left, + struct vm_area_struct, shared.rb); + if (parent->shared.rb_subtree_last < last) + parent->shared.rb_subtree_last = last; } - link = &parent->shared.linear.rb.rb_left; + link = &parent->shared.rb.rb_left; }
- node->shared.linear.rb_subtree_last = last; - rb_link_node(&node->shared.linear.rb, &parent->shared.linear.rb, link); - rb_insert_augmented(&node->shared.linear.rb, root, + node->shared.rb_subtree_last = last; + rb_link_node(&node->shared.rb, &parent->shared.rb, link); + rb_insert_augmented(&node->shared.rb, root, &vma_interval_tree_augment); }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
commit 240da953fcc6a9008c92fae5b1f727ee5ed167ab upstream.
The "336996 Speculative Execution Side Channel Mitigations" from May defines this as SSB_NO, hence lets sync-up.
Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org [bwh: Backported to 3.16: adjust filename] Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/uapi/asm/msr-index.h | 2 +- arch/x86/kernel/cpu/common.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
--- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -59,7 +59,7 @@ #define MSR_IA32_ARCH_CAPABILITIES 0x0000010a #define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */ #define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */ -#define ARCH_CAP_SSBD_NO (1 << 4) /* +#define ARCH_CAP_SSB_NO (1 << 4) /* * Not susceptible to Speculative Store Bypass * attack, so no Speculative Store Bypass * control required. --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -850,7 +850,7 @@ static void __init cpu_set_bug_bits(stru rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap);
if (!x86_match_cpu(cpu_no_spec_store_bypass) && - !(ia32_cap & ARCH_CAP_SSBD_NO)) + !(ia32_cap & ARCH_CAP_SSB_NO)) setup_force_cpu_bug(X86_BUG_SPEC_STORE_BYPASS);
if (x86_match_cpu(cpu_no_speculation))
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Dan Williams dan.j.williams@intel.com
commit 87744ab3832b83ba71b931f86f9cfdb000d07da5 upstream.
vm_insert_mixed(), unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(), fails to check the pgprot_t it uses for the mapping against the one recorded in the memtype tracking tree. Add the missing call to track_pfn_insert() to preclude cases where incompatible aliased mappings are established for a given physical address range.
[groeck: Backport to v4.4.y]
Link: http://lkml.kernel.org/r/147328717909.35069.14256589123570653697.stgit@dwill... Signed-off-by: Dan Williams dan.j.williams@intel.com Cc: David Airlie airlied@linux.ie Cc: Matthew Wilcox mawilcox@microsoft.com Cc: Ross Zwisler ross.zwisler@linux.intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- mm/memory.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
--- a/mm/memory.c +++ b/mm/memory.c @@ -1623,10 +1623,14 @@ EXPORT_SYMBOL(vm_insert_pfn_prot); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) { + pgprot_t pgprot = vma->vm_page_prot; + BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
if (addr < vma->vm_start || addr >= vma->vm_end) return -EFAULT; + if (track_pfn_insert(vma, &pgprot, pfn)) + return -EINVAL;
/* * If we don't have pte special, then we have to use the pfn_valid() @@ -1639,9 +1643,9 @@ int vm_insert_mixed(struct vm_area_struc struct page *page;
page = pfn_to_page(pfn); - return insert_page(vma, addr, page, vma->vm_page_prot); + return insert_page(vma, addr, page, pgprot); } - return insert_pfn(vma, addr, pfn, vma->vm_page_prot); + return insert_pfn(vma, addr, pfn, pgprot); } EXPORT_SYMBOL(vm_insert_mixed);
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Jim Mattson jmattson@google.com
commit 5f2b745f5e1304f438f9b2cd03ebc8120b6e0d3b upstream.
Cast val and (val >> 32) to (u32), so that they fit in a general-purpose register in both 32-bit and 64-bit code.
[ tglx: Made it u32 instead of uintptr_t ]
Fixes: c65732e4f721 ("x86/cpu: Restore CPUID_8000_0008_EBX reload") Signed-off-by: Jim Mattson jmattson@google.com Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/nospec-branch.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -219,8 +219,8 @@ void alternative_msr_write(unsigned int { asm volatile(ALTERNATIVE("", "wrmsr", %c[feature]) : : "c" (msr), - "a" (val), - "d" (val >> 32), + "a" ((u32)val), + "d" ((u32)(val >> 32)), [feature] "i" (feature) : "memory"); }
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 3824e3cf7e865b2ff0b71de23b16e332fe6a853a upstream.
We've replaced the remap_file_pages(2) implementation with emulation. Nobody creates non-linear mappings anymore.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Jonas Bonn jonas@southpole.se Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/openrisc/include/asm/pgtable.h | 8 -------- arch/openrisc/kernel/head.S | 5 ----- 2 files changed, 13 deletions(-)
--- a/arch/openrisc/include/asm/pgtable.h +++ b/arch/openrisc/include/asm/pgtable.h @@ -125,7 +125,6 @@ extern void paging_init(void); #define _PAGE_CC 0x001 /* software: pte contains a translation */ #define _PAGE_CI 0x002 /* cache inhibit */ #define _PAGE_WBC 0x004 /* write back cache */ -#define _PAGE_FILE 0x004 /* set: pagecache, unset: swap (when !PRESENT) */ #define _PAGE_WOM 0x008 /* weakly ordered memory */
#define _PAGE_A 0x010 /* accessed */ @@ -240,7 +239,6 @@ static inline int pte_write(pte_t pte) { static inline int pte_exec(pte_t pte) { return pte_val(pte) & _PAGE_EXEC; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_special(pte_t pte) { return 0; } static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
@@ -438,12 +436,6 @@ static inline void update_mmu_cache(stru #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
-/* Encode and decode a nonlinear file mapping entry */ - -#define PTE_FILE_MAX_BITS 26 -#define pte_to_pgoff(x) (pte_val(x) >> 6) -#define pgoff_to_pte(x) __pte(((x) << 6) | _PAGE_FILE) - #define kern_addr_valid(addr) (1)
#include <asm-generic/pgtable.h> --- a/arch/openrisc/kernel/head.S +++ b/arch/openrisc/kernel/head.S @@ -754,11 +754,6 @@ _dc_enable:
/* ===============================================[ page table masks ]=== */
-/* bit 4 is used in hardware as write back cache bit. we never use this bit - * explicitly, so we can reuse it as _PAGE_FILE bit and mask it out when - * writing into hardware pte's - */ - #define DTLB_UP_CONVERT_MASK 0x3fa #define ITLB_UP_CONVERT_MASK 0x3a
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner tglx@linutronix.de
commit 4b59bdb569453a60b752b274ca61f009e37f4dae upstream.
x86_spec_ctrl_set() is only used in bugs.c and the extra mask checks there provide no real value as both call sites can just write x86_spec_ctrl_base to MSR_SPEC_CTRL. x86_spec_ctrl_base is valid and does not need any extra masking or checking.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ben Hutchings ben@decadent.org.uk --- arch/x86/include/asm/nospec-branch.h | 2 -- arch/x86/kernel/cpu/bugs.c | 13 ++----------- 2 files changed, 2 insertions(+), 13 deletions(-)
--- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -172,8 +172,6 @@ enum spectre_v2_mitigation { SPECTRE_V2_IBRS, };
-extern void x86_spec_ctrl_set(u64); - /* The Speculative Store Bypass disable variants */ enum ssb_mitigation { SPEC_STORE_BYPASS_NONE, --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -194,15 +194,6 @@ static const char *spectre_v2_strings[]
static enum spectre_v2_mitigation spectre_v2_enabled = SPECTRE_V2_NONE;
-void x86_spec_ctrl_set(u64 val) -{ - if (val & x86_spec_ctrl_mask) - WARN_ONCE(1, "SPEC_CTRL MSR value 0x%16llx is unknown.\n", val); - else - wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base | val); -} -EXPORT_SYMBOL_GPL(x86_spec_ctrl_set); - void x86_virt_spec_ctrl(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl, bool setguest) { @@ -540,7 +531,7 @@ static enum ssb_mitigation __init __ssb_ case X86_VENDOR_INTEL: x86_spec_ctrl_base |= SPEC_CTRL_SSBD; x86_spec_ctrl_mask &= ~SPEC_CTRL_SSBD; - x86_spec_ctrl_set(SPEC_CTRL_SSBD); + wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base); break; case X86_VENDOR_AMD: x86_amd_ssb_disable(); @@ -652,7 +643,7 @@ int arch_prctl_spec_ctrl_get(struct task void x86_spec_ctrl_setup_ap(void) { if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) - x86_spec_ctrl_set(x86_spec_ctrl_base & ~x86_spec_ctrl_mask); + wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
if (ssb_mode == SPEC_STORE_BYPASS_DISABLE) x86_amd_ssb_disable();
3.16.59-rc1 review patch. If anyone has any objections, please let me know.
------------------
From: Kees Cook keescook@chromium.org
commit fae1fa0fc6cca8beee3ab8ed71d54f9a78fa3f64 upstream.
As done with seccomp and no_new_privs, also show speculation flaw mitigation state in /proc/$pid/status.
Signed-off-by: Kees Cook keescook@chromium.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ben Hutchings ben@decadent.org.uk --- fs/proc/array.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+)
--- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -79,6 +79,7 @@ #include <linux/delayacct.h> #include <linux/seq_file.h> #include <linux/pid_namespace.h> +#include <linux/prctl.h> #include <linux/ptrace.h> #include <linux/tracehook.h> #include <linux/user_namespace.h> @@ -326,6 +327,28 @@ static inline void task_seccomp(struct s #ifdef CONFIG_SECCOMP seq_printf(m, "Seccomp:\t%d\n", p->seccomp.mode); #endif + seq_printf(m, "Speculation Store Bypass:\t"); + switch (arch_prctl_spec_ctrl_get(p, PR_SPEC_STORE_BYPASS)) { + case -EINVAL: + seq_printf(m, "unknown"); + break; + case PR_SPEC_NOT_AFFECTED: + seq_printf(m, "not vulnerable"); + break; + case PR_SPEC_PRCTL | PR_SPEC_DISABLE: + seq_printf(m, "thread mitigated"); + break; + case PR_SPEC_PRCTL | PR_SPEC_ENABLE: + seq_printf(m, "thread vulnerable"); + break; + case PR_SPEC_DISABLE: + seq_printf(m, "globally mitigated"); + break; + default: + seq_printf(m, "vulnerable"); + break; + } + seq_putc(m, '\n'); }
static inline void task_context_switch_counts(struct seq_file *m,
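[Editorial note] A process can query the same state the new /proc field reports via prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS); the sketch below maps the return value with the same cases as the switch above. The prctl constants come from the wider SSB series and the fallback defines are only for older headers.

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_GET_SPECULATION_CTRL
#define PR_GET_SPECULATION_CTRL 52
#define PR_SPEC_STORE_BYPASS    0
#define PR_SPEC_NOT_AFFECTED    0
#define PR_SPEC_PRCTL           (1UL << 0)
#define PR_SPEC_ENABLE          (1UL << 1)
#define PR_SPEC_DISABLE         (1UL << 2)
#endif

int main(void)
{
        int v = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);

        /* Same mapping as the switch in fs/proc/array.c above */
        if (v < 0)
                printf("unknown\n");
        else if (v == PR_SPEC_NOT_AFFECTED)
                printf("not vulnerable\n");
        else if (v == (PR_SPEC_PRCTL | PR_SPEC_DISABLE))
                printf("thread mitigated\n");
        else if (v == (PR_SPEC_PRCTL | PR_SPEC_ENABLE))
                printf("thread vulnerable\n");
        else if (v == PR_SPEC_DISABLE)
                printf("globally mitigated\n");
        else
                printf("vulnerable\n");
        return 0;
}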
On 09/29/2018 02:43 PM, Ben Hutchings wrote:
This is the start of the stable review cycle for the 3.16.59 release. There are 131 patches in this series, which will be posted as responses to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Mon Oct 01 21:43:06 UTC 2018. Anything received after that time might be too late.
Build results:
	total: 139 pass: 139 fail: 0
Qemu test results:
	total: 217 pass: 217 fail: 0
Details are available at https://kerneltests.org/builders/.
Guenter
On Sun, 2018-09-30 at 07:06 -0700, Guenter Roeck wrote:
On 09/29/2018 02:43 PM, Ben Hutchings wrote:
This is the start of the stable review cycle for the 3.16.59 release. There are 131 patches in this series, which will be posted as responses to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Mon Oct 01 21:43:06 UTC 2018. Anything received after that time might be too late.
Build results:
	total: 139 pass: 139 fail: 0
Qemu test results:
	total: 217 pass: 217 fail: 0
Details are available at https://kerneltests.org/builders/.
Thanks for checking.
Ben.