The patch titled
Subject: mm, thp: fix mlocking THP page with migration enabled
has been added to the -mm tree. Its filename is
mm-thp-fix-mlocking-thp-page-with-migration-enabled.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-fix-mlocking-thp-page-with-…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-fix-mlocking-thp-page-with-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Subject: mm, thp: fix mlocking THP page with migration enabled
A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.
If a user tries to mlock() part of a huge page, we want the rest of the
page to be reclaimable.
We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
PMD on border of VM_LOCKED VMA will be split into PTE table.
Introduction of THP migration breaks[1] the rules around mlocking THP
pages. If we had a single PMD mapping of the page in mlocked VMA, the
page will get mlocked, regardless of PTE mappings of the page.
For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
remove_migration_pmd().
Anon THP pages can only be shared between processes via fork(). Mlocked
page can only be shared if parent mlocked it before forking, otherwise CoW
will be triggered on mlock().
For Anon-THP, we can fix the issue by munlocking the page on removing PTE
migration entry for the page. PTEs for the page will always come after
mlocked PMD: rmap walks VMAs from oldest to newest.
Test-case:
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/mempolicy.h>
#include <numaif.h>
int main(void)
{
unsigned long nodemask = 4;
void *addr;
addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
if (fork()) {
wait(NULL);
return 0;
}
mlock(addr, 4UL << 10);
mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
&nodemask, 4, MPOL_MF_MOVE);
return 0;
}
[1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL…
Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel…
Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reported-by: Vegard Nossum <vegard.nossum(a)oracle.com>
Reviewed-by: Zi Yan <zi.yan(a)cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 2 +-
mm/migrate.c | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)
--- a/mm/huge_memory.c~mm-thp-fix-mlocking-thp-page-with-migration-enabled
+++ a/mm/huge_memory.c
@@ -2931,7 +2931,7 @@ void remove_migration_pmd(struct page_vm
else
page_add_file_rmap(new, true);
set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
- if (vma->vm_flags & VM_LOCKED)
+ if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
mlock_vma_page(new);
update_mmu_cache_pmd(vma, address, pvmw->pmd);
}
--- a/mm/migrate.c~mm-thp-fix-mlocking-thp-page-with-migration-enabled
+++ a/mm/migrate.c
@@ -275,6 +275,9 @@ static bool remove_migration_pte(struct
if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
mlock_vma_page(new);
+ if (PageTransHuge(page) && PageMlocked(page))
+ clear_page_mlock(page);
+
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, pvmw.address, pvmw.pte);
}
_
Patches currently in -mm which might be from kirill.shutemov(a)linux.intel.com are
mm-thp-fix-mlocking-thp-page-with-migration-enabled.patch
The patch titled
Subject: mm, thp: fix mlocking THP page with migration enabled
has been added to the -mm tree. Its filename is
mm-thp-fix-mlocking-thp-page-with-migration-enabled.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-fix-mlocking-thp-page-with-…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-fix-mlocking-thp-page-with-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Subject: mm, thp: fix mlocking THP page with migration enabled
A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.
If a user tries to mlock() part of a huge page, we want the rest of the
page to be reclaimable.
We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
PMD on border of VM_LOCKED VMA will be split into PTE table.
Introduction of THP migration breaks[1] the rules around mlocking THP
pages. If we had a single PMD mapping of the page in mlocked VMA, the
page will get mlocked, regardless of PTE mappings of the page.
For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
remove_migration_pmd().
Anon THP pages can only be shared between processes via fork(). Mlocked
page can only be shared if parent mlocked it before forking, otherwise CoW
will be triggered on mlock().
For Anon-THP, we can fix the issue by munlocking the page on removing PTE
migration entry for the page. PTEs for the page will always come after
mlocked PMD: rmap walks VMAs from oldest to newest.
Test-case:
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/mempolicy.h>
#include <numaif.h>
int main(void)
{
unsigned long nodemask = 4;
void *addr;
addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
if (fork()) {
wait(NULL);
return 0;
}
mlock(addr, 4UL << 10);
mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
&nodemask, 4, MPOL_MF_MOVE);
return 0;
}
[1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL…
Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel…
Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reported-by: Vegard Nossum <vegard.nossum(a)oracle.com>
Reviewed-by: Zi Yan <zi.yan(a)cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 2 +-
mm/migrate.c | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)
--- a/mm/huge_memory.c~mm-thp-fix-mlocking-thp-page-with-migration-enabled
+++ a/mm/huge_memory.c
@@ -2931,7 +2931,7 @@ void remove_migration_pmd(struct page_vm
else
page_add_file_rmap(new, true);
set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
- if (vma->vm_flags & VM_LOCKED)
+ if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
mlock_vma_page(new);
update_mmu_cache_pmd(vma, address, pvmw->pmd);
}
--- a/mm/migrate.c~mm-thp-fix-mlocking-thp-page-with-migration-enabled
+++ a/mm/migrate.c
@@ -275,6 +275,9 @@ static bool remove_migration_pte(struct
if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
mlock_vma_page(new);
+ if (PageTransHuge(page) && PageMlocked(page))
+ clear_page_mlock(page);
+
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, pvmw.address, pvmw.pte);
}
_
Patches currently in -mm which might be from kirill.shutemov(a)linux.intel.com are
mm-thp-fix-mlocking-thp-page-with-migration-enabled.patch
If on an initiator system a LUN reset is issued while I/O is in
progress with queue depth > 1, avoid that data corruption occurs
as follows:
- The initiator submits a READ (a).
- The initiator submits a LUN reset before READ (a) completes.
- The target responds that the LUN reset succeeded after READ (a)
has been marked as CMD_T_COMPLETE and before .queue_status() has
been called.
- The initiator receives the LUN reset response and frees the
tag used by READ (a).
- The initiator submits READ (b) and reuses the tag of READ (a).
- The initiator receives the response for READ (a) and interprets
this as a completion for READ (b).
- The initiator receives the completion for READ (b) and discards
it.
With the SRP initiator and target drivers and when running fio
concurrently with sg_reset -d it only takes a few minutes to
reproduce this.
Signed-off-by: Bart Van Assche <bvanassche(a)acm.org>
Fixes: commit febe562c20df ("target: Fix LUN_RESET active I/O handling for ACK_KREF")
Cc: Nicholas Bellinger <nab(a)linux-iscsi.org>
Cc: Mike Christie <mchristi(a)redhat.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Hannes Reinecke <hare(a)suse.de>
Cc: <stable(a)vger.kernel.org>
---
drivers/target/target_core_tmr.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c
index 2750a2c7b563..6e419396c1e4 100644
--- a/drivers/target/target_core_tmr.c
+++ b/drivers/target/target_core_tmr.c
@@ -90,7 +90,7 @@ static int target_check_cdb_and_preempt(struct list_head *list,
return 1;
}
-static bool __target_check_io_state(struct se_cmd *se_cmd,
+static bool __target_check_io_state(struct se_cmd *se_cmd, u32 skip_flags,
struct se_session *tmr_sess, int tas)
{
struct se_session *sess = se_cmd->se_sess;
@@ -108,7 +108,7 @@ static bool __target_check_io_state(struct se_cmd *se_cmd,
* long as se_cmd->cmd_kref is still active unless zero.
*/
spin_lock(&se_cmd->t_state_lock);
- if (se_cmd->transport_state & (CMD_T_COMPLETE | CMD_T_FABRIC_STOP)) {
+ if (se_cmd->transport_state & (skip_flags | CMD_T_FABRIC_STOP)) {
pr_debug("Attempted to abort io tag: %llu already complete or"
" fabric stop, skipping\n", se_cmd->tag);
spin_unlock(&se_cmd->t_state_lock);
@@ -165,7 +165,8 @@ void core_tmr_abort_task(
printk("ABORT_TASK: Found referenced %s task_tag: %llu\n",
se_cmd->se_tfo->get_fabric_name(), ref_tag);
- if (!__target_check_io_state(se_cmd, se_sess, 0))
+ if (!__target_check_io_state(se_cmd, CMD_T_COMPLETE, se_sess,
+ 0))
continue;
spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
@@ -349,7 +350,7 @@ static void core_tmr_drain_state_list(
continue;
spin_lock(&sess->sess_cmd_lock);
- rc = __target_check_io_state(cmd, tmr_sess, tas);
+ rc = __target_check_io_state(cmd, 0, tmr_sess, tas);
spin_unlock(&sess->sess_cmd_lock);
if (!rc)
continue;
--
2.18.0
On Mon, 17 Sep 2018, gregkh(a)linuxfoundation.org wrote:
>
> This is a note to let you know that I've just added the patch titled
>
> x86/kexec: Allocate 8k PGDs for PTI
>
> to the 3.18-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> x86-kexec-allocate-8k-pgds-for-pti.patch
> and it can be found in the queue-3.18 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
I believe this commit is an example of the auto-selector being too
eager, and this should not be in *any* of the stable trees. As the
commit message indicates, it's a fix by Joerg for his PTI-x86-32
implementation - which has not been backported to any of the stable
trees (yet), has it?
In several of the recent stable trees, I think this will not do any
actual harm; but it looks as if it will prevent relevant x86-32 configs
from building on 3.18 (I see no definition of PGD_ALLOCATION_ORDER in
linux-3.18.y - you preferred not to have any PTI in that tree), and I
haven't checked whether its definition in older backports will build
correctly here or not.
Hugh
>
>
> From foo@baz Mon Sep 17 11:45:57 CEST 2018
> From: Joerg Roedel <jroedel(a)suse.de>
> Date: Wed, 25 Jul 2018 17:48:03 +0200
> Subject: x86/kexec: Allocate 8k PGDs for PTI
>
> From: Joerg Roedel <jroedel(a)suse.de>
>
> [ Upstream commit ca38dc8f2724d101038b1205122c93a1c7f38f11 ]
>
> Fuzzing the PTI-x86-32 code with trinity showed unhandled
> kernel paging request oops-messages that looked a lot like
> silent data corruption.
>
> Lot's of debugging and testing lead to the kexec-32bit code,
> which is still allocating 4k PGDs when PTI is enabled. But
> since it uses native_set_pud() to build the page-table, it
> will unevitably call into __pti_set_user_pgtbl(), which
> writes beyond the allocated 4k page.
>
> Use PGD_ALLOCATION_ORDER to allocate PGDs in the kexec code
> to fix the issue.
>
> Signed-off-by: Joerg Roedel <jroedel(a)suse.de>
> Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
> Tested-by: David H. Gutteridge <dhgutteridge(a)sympatico.ca>
> Cc: "H . Peter Anvin" <hpa(a)zytor.com>
> Cc: linux-mm(a)kvack.org
> Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
> Cc: Andy Lutomirski <luto(a)kernel.org>
> Cc: Dave Hansen <dave.hansen(a)intel.com>
> Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
> Cc: Juergen Gross <jgross(a)suse.com>
> Cc: Peter Zijlstra <peterz(a)infradead.org>
> Cc: Borislav Petkov <bp(a)alien8.de>
> Cc: Jiri Kosina <jkosina(a)suse.cz>
> Cc: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
> Cc: Brian Gerst <brgerst(a)gmail.com>
> Cc: David Laight <David.Laight(a)aculab.com>
> Cc: Denys Vlasenko <dvlasenk(a)redhat.com>
> Cc: Eduardo Valentin <eduval(a)amazon.com>
> Cc: Greg KH <gregkh(a)linuxfoundation.org>
> Cc: Will Deacon <will.deacon(a)arm.com>
> Cc: aliguori(a)amazon.com
> Cc: daniel.gruss(a)iaik.tugraz.at
> Cc: hughd(a)google.com
> Cc: keescook(a)google.com
> Cc: Andrea Arcangeli <aarcange(a)redhat.com>
> Cc: Waiman Long <llong(a)redhat.com>
> Cc: Pavel Machek <pavel(a)ucw.cz>
> Cc: Arnaldo Carvalho de Melo <acme(a)kernel.org>
> Cc: Alexander Shishkin <alexander.shishkin(a)linux.intel.com>
> Cc: Jiri Olsa <jolsa(a)redhat.com>
> Cc: Namhyung Kim <namhyung(a)kernel.org>
> Cc: joro(a)8bytes.org
> Link: https://lkml.kernel.org/r/1532533683-5988-4-git-send-email-joro@8bytes.org
> Signed-off-by: Sasha Levin <alexander.levin(a)microsoft.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
> ---
> arch/x86/kernel/machine_kexec_32.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/kernel/machine_kexec_32.c
> +++ b/arch/x86/kernel/machine_kexec_32.c
> @@ -69,7 +69,7 @@ static void load_segments(void)
>
> static void machine_kexec_free_page_tables(struct kimage *image)
> {
> - free_page((unsigned long)image->arch.pgd);
> + free_pages((unsigned long)image->arch.pgd, PGD_ALLOCATION_ORDER);
> image->arch.pgd = NULL;
> #ifdef CONFIG_X86_PAE
> free_page((unsigned long)image->arch.pmd0);
> @@ -85,7 +85,8 @@ static void machine_kexec_free_page_tabl
>
> static int machine_kexec_alloc_page_tables(struct kimage *image)
> {
> - image->arch.pgd = (pgd_t *)get_zeroed_page(GFP_KERNEL);
> + image->arch.pgd = (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> + PGD_ALLOCATION_ORDER);
> #ifdef CONFIG_X86_PAE
> image->arch.pmd0 = (pmd_t *)get_zeroed_page(GFP_KERNEL);
> image->arch.pmd1 = (pmd_t *)get_zeroed_page(GFP_KERNEL);
>
>
> Patches currently in stable-queue which might be from jroedel(a)suse.de are
>
> queue-3.18/x86-kexec-allocate-8k-pgds-for-pti.patch
> queue-3.18/x86-mm-remove-in_nmi-warning-from-vmalloc_fault.patch
>
I'm seeing the following i386 build failure with defconfig on 4.4-rc branch:
$ make ARCH=i386
CHK include/config/kernel.release
CHK include/generated/uapi/linux/version.h
CHK include/generated/utsrelease.h
CHK include/generated/bounds.h
CHK include/generated/timeconst.h
CHK include/generated/asm-offsets.h
CALL scripts/checksyscalls.sh
CHK include/generated/compile.h
CC arch/x86/kernel/machine_kexec_32.o
arch/x86/kernel/machine_kexec_32.c: In function ‘machine_kexec_free_page_tables’:
arch/x86/kernel/machine_kexec_32.c:73:45: error: ‘PGD_ALLOCATION_ORDER’ undeclared (first use in this function); did you mean ‘PAGE_ALLOC_COSTLY_ORDER’?
free_pages((unsigned long)image->arch.pgd, PGD_ALLOCATION_ORDER);
^~~~~~~~~~~~~~~~~~~~
PAGE_ALLOC_COSTLY_ORDER
arch/x86/kernel/machine_kexec_32.c:73:45: note: each undeclared identifier is reported only once for each function it appears in
arch/x86/kernel/machine_kexec_32.c: In function ‘machine_kexec_alloc_page_tables’:
arch/x86/kernel/machine_kexec_32.c:90:11: error: ‘PGD_ALLOCATION_ORDER’ undeclared (first use in this function); did you mean ‘PAGE_ALLOC_COSTLY_ORDER’?
PGD_ALLOCATION_ORDER);
^~~~~~~~~~~~~~~~~~~~
PAGE_ALLOC_COSTLY_ORDER
scripts/Makefile.build:269: recipe for target 'arch/x86/kernel/machine_kexec_32.o' failed
make[2]: *** [arch/x86/kernel/machine_kexec_32.o] Error 1
scripts/Makefile.build:476: recipe for target 'arch/x86/kernel' failed
make[1]: *** [arch/x86/kernel] Error 2
Makefile:980: recipe for target 'arch/x86' failed
make: *** [arch/x86] Error 2
Dan