Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses large amounts of vmalloc space with PTI enabled.
The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd folding code into account, so, on a 4-level kernel, the pgd synchronization logic compiles away to exactly nothing.
Interestingly, the problem doesn't trigger with nopti. I assume this is because the kernel is mapped with global pages if we boot with nopti. The sequence of operations when we create a new task is that we first load its mm while still running on the old stack (which crashes if the old stack is unmapped in the new mm unless the TLB saves us), then we call prepare_switch_to(), and then we switch to the new stack. prepare_switch_to() pokes the new stack directly, which will populate the mapping through vmalloc_fault(). I assume that we're getting lucky on non-PTI systems -- the old stack's TLB entry stays alive long enough to make it all the way through prepare_switch_to() and switch_to() so that we make it to a valid stack.
Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Cc: stable@vger.kernel.org
Reported-and-tested-by: Neil Berrington neil.berrington@datacore.com
Signed-off-by: Andy Lutomirski luto@kernel.org
---
 arch/x86/mm/tlb.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a1561957dccb..5bfe61a5e8e3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -151,6 +151,34 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	local_irq_restore(flags);
 }
 
+static void sync_current_stack_to_mm(struct mm_struct *mm)
+{
+	unsigned long sp = current_stack_pointer;
+	pgd_t *pgd = pgd_offset(mm, sp);
+
+	if (CONFIG_PGTABLE_LEVELS > 4) {
+		if (unlikely(pgd_none(*pgd))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+
+			set_pgd(pgd, *pgd_ref);
+		}
+	} else {
+		/*
+		 * "pgd" is faked.  The top level entries are "p4d"s, so sync
+		 * the p4d.  This compiles to approximately the same code as
+		 * the 5-level case.
+		 */
+		p4d_t *p4d = p4d_offset(pgd, sp);
+
+		if (unlikely(p4d_none(*p4d))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+			p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
+
+			set_p4d(p4d, *p4d_ref);
+		}
+	}
+}
+
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			struct task_struct *tsk)
 {
@@ -226,11 +254,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * mapped in the new pgd, we'll double-fault.  Forcibly
 		 * map it.
 		 */
-		unsigned int index = pgd_index(current_stack_pointer);
-		pgd_t *pgd = next->pgd + index;
-
-		if (unlikely(pgd_none(*pgd)))
-			set_pgd(pgd, init_mm.pgd[index]);
+		sync_current_stack_to_mm(next);
 	}

/* Stop remote flushes for the previous mm */
On 01/25/2018 01:12 PM, Andy Lutomirski wrote:
Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses large amounts of vmalloc space with PTI enabled.
The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd folding code into account, so, on a 4-level kernel, the pgd synchronization logic compiles away to exactly nothing.
You don't mention it, but we can normally handle vmalloc() faults in the kernel that are due to unsynchronized page tables. The thing that kills us here is that we have an unmapped stack and we try to use that stack when entering the page fault handler, which double faults. The double fault handler gets a new stack and saves us enough to get an oops out.
Right?
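For reference, the normal lazy fixup is roughly the following. This is a simplified sketch, not the exact upstream vmalloc_fault() code, and the helper name is invented: the kernel's reference entry in init_mm gets copied into the faulting mm's top-level table and the access is retried. None of that helps here, because the #PF entry itself needs a usable stack.

static int sync_vmalloc_pgd(struct mm_struct *mm, unsigned long address)
{
	pgd_t *pgd = pgd_offset(mm, address);
	pgd_t *pgd_ref = pgd_offset_k(address);	/* init_mm's entry */

	if (pgd_none(*pgd_ref))
		return -1;			/* not a mapped vmalloc address */

	if (pgd_none(*pgd))
		set_pgd(pgd, *pgd_ref);		/* sync and retry the access */

	return 0;
}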
+static void sync_current_stack_to_mm(struct mm_struct *mm)
+{
+	unsigned long sp = current_stack_pointer;
+	pgd_t *pgd = pgd_offset(mm, sp);
+
+	if (CONFIG_PGTABLE_LEVELS > 4) {
+		if (unlikely(pgd_none(*pgd))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+
+			set_pgd(pgd, *pgd_ref);
+		}
+	} else {
+		/*
+		 * "pgd" is faked.  The top level entries are "p4d"s, so sync
+		 * the p4d.  This compiles to approximately the same code as
+		 * the 5-level case.
+		 */
+		p4d_t *p4d = p4d_offset(pgd, sp);
+
+		if (unlikely(p4d_none(*p4d))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+			p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
+
+			set_p4d(p4d, *p4d_ref);
+		}
+	}
+}
We keep having to add these. It seems like a real deficiency in the mechanism that we're using for pgd folding. Can't we get a warning or something when we try to do a set_pgd() that's (silently) not doing anything? This exact same pattern bit me more than once with the KPTI/KAISER patches.
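One possible shape for such a check, purely illustrative -- this helper does not exist and the name is invented. The idea: when the p4d level is folded into the pgd, pgd_none() is hard-wired to 0, so a pgd-level "sync" silently turns into dead code; a loud warning in that case would have caught this bug (and the KAISER ones) on the first test boot.

static inline void sync_kernel_top_level(pgd_t *pgd, pgd_t *pgd_ref)
{
	if (PTRS_PER_P4D == 1) {
		/* pgd is folded: writing it can never populate anything. */
		WARN_ONCE(1, "pgd-level sync on a folded pgd; sync the p4d instead\n");
		return;
	}

	if (unlikely(pgd_none(*pgd)))
		set_pgd(pgd, *pgd_ref);
}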
On Thu, Jan 25, 2018 at 1:49 PM, Dave Hansen dave.hansen@intel.com wrote:
On 01/25/2018 01:12 PM, Andy Lutomirski wrote:
Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses large amounts of vmalloc space with PTI enabled.
The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd folding code into account, so, on a 4-level kernel, the pgd synchronization logic compiles away to exactly nothing.
You don't mention it, but we can normally handle vmalloc() faults in the kernel that are due to unsynchronized page tables. The thing that kills us here is that we have an unmapped stack and we try to use that stack when entering the page fault handler, which double faults. The double fault handler gets a new stack and saves us enough to get an oops out.
Right?
Exactly.
There are two special code paths that can't use vmalloc_fault(): this one and switch_to(). The latter avoids explicit page table fiddling and just touches the new stack before loading it into rsp.
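For context, the trick in prepare_switch_to() is just to read one byte of the new stack before %rsp is switched, so any vmalloc fault is taken while we are still on a good stack. Roughly like this -- a simplified sketch of the idea, not the exact signature or comment in arch/x86/include/asm/switch_to.h:

static inline void prepare_switch_to(struct task_struct *next)
{
#ifdef CONFIG_VMAP_STACK
	/*
	 * The next task's stack lives in the vmalloc area and may not be
	 * mapped in the current page tables yet.  Touch it now, while a
	 * vmalloc fault can still be handled; once %rsp points at an
	 * unmapped stack, the fault would escalate to a double fault.
	 */
	READ_ONCE(*(unsigned char *)next->thread.sp);
#endif
}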
+static void sync_current_stack_to_mm(struct mm_struct *mm)
+{
+	unsigned long sp = current_stack_pointer;
+	pgd_t *pgd = pgd_offset(mm, sp);
+
+	if (CONFIG_PGTABLE_LEVELS > 4) {
+		if (unlikely(pgd_none(*pgd))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+
+			set_pgd(pgd, *pgd_ref);
+		}
+	} else {
+		/*
+		 * "pgd" is faked.  The top level entries are "p4d"s, so sync
+		 * the p4d.  This compiles to approximately the same code as
+		 * the 5-level case.
+		 */
+		p4d_t *p4d = p4d_offset(pgd, sp);
+
+		if (unlikely(p4d_none(*p4d))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+			p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
+
+			set_p4d(p4d, *p4d_ref);
+		}
+	}
+}
We keep having to add these. It seems like a real deficiency in the mechanism that we're using for pgd folding. Can't we get a warning or something when we try to do a set_pgd() that's (silently) not doing anything? This exact same pattern bit me more than once with the KPTI/KAISER patches.
Hmm, maybe.
What I'd really like to see is an entirely different API. Maybe:
typedef struct {
	/* opaque, but probably includes: */
	int depth;		/* 0 is root */
	void *table;
} ptbl_ptr;

ptbl_ptr root_table = mm_root_ptbl(mm);

set_ptbl_entry(root_table, pa, prot);

/* walk tables */
ptbl_ptr pt = ...;
ptentry_ptr entry;
while (ptbl_has_children(pt)) {
	pt = pt_next(pt, addr);
}
entry = pt_entry_at(pt, addr);
/* do something with entry */
etc.
Now someone can add a sixth level without changing every code path in the kernel that touches page tables.
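Filled out a bit, the sketch might look something like the following. This is purely illustrative: none of these types or helpers exist in the kernel, and a real design would also need locking, huge-page leaf entries, and so on.

/* Hypothetical level-agnostic page table walker, expanding on the
 * sketch above.  All names, types and semantics are invented. */
typedef struct {
	int depth;		/* 0 is the root table */
	void *table;		/* virtual address of the table page */
} ptbl_ptr;

typedef struct {
	ptbl_ptr table;
	unsigned int index;	/* slot within that table */
} ptentry_ptr;

ptbl_ptr mm_root_ptbl(struct mm_struct *mm);		/* root table of an mm */
bool ptbl_has_children(ptbl_ptr pt);			/* false at the leaf level */
ptbl_ptr pt_next(ptbl_ptr pt, unsigned long addr);	/* descend one level */
ptentry_ptr pt_entry_at(ptbl_ptr pt, unsigned long addr);

/* Example: find the leaf entry for addr without ever naming
 * pgd/p4d/pud/pmd/pte, so adding a level changes nothing here. */
static ptentry_ptr walk_to_leaf(struct mm_struct *mm, unsigned long addr)
{
	ptbl_ptr pt = mm_root_ptbl(mm);

	while (ptbl_has_children(pt))
		pt = pt_next(pt, addr);

	return pt_entry_at(pt, addr);
}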
--Andy
* Andy Lutomirski luto@kernel.org wrote:
What I'd really like to see is an entirely different API. Maybe:
typedef struct {
	/* opaque, but probably includes: */
	int depth;		/* 0 is root */
	void *table;
} ptbl_ptr;

ptbl_ptr root_table = mm_root_ptbl(mm);

set_ptbl_entry(root_table, pa, prot);

/* walk tables */
ptbl_ptr pt = ...;
ptentry_ptr entry;
while (ptbl_has_children(pt)) {
	pt = pt_next(pt, addr);
}
entry = pt_entry_at(pt, addr);
/* do something with entry */
etc.
Now someone can add a sixth level without changing every code path in the kernel that touches page tables.
Iteration-based page table lookups would be neat.
A sixth level is unavoidable on x86-64 I think - we'll get there in a decade or so? The sixth level will also use up the last ~8 bits of virtual memory available on 64-bit.
Thanks,
Ingo
On Thu, Jan 25, 2018 at 02:00:22PM -0800, Andy Lutomirski wrote:
On Thu, Jan 25, 2018 at 1:49 PM, Dave Hansen dave.hansen@intel.com wrote:
On 01/25/2018 01:12 PM, Andy Lutomirski wrote:
Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses large amounts of vmalloc space with PTI enabled.
The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd folding code into account, so, on a 4-level kernel, the pgd synchronization logic compiles away to exactly nothing.
You don't mention it, but we can normally handle vmalloc() faults in the kernel that are due to unsynchronized page tables. The thing that kills us here is that we have an unmapped stack and we try to use that stack when entering the page fault handler, which double faults. The double fault handler gets a new stack and saves us enough to get an oops out.
Right?
Exactly.
There are two special code paths that can't use vmalloc_fault(): this one and switch_to(). The latter avoids explicit page table fiddling and just touches the new stack before loading it into rsp.
+static void sync_current_stack_to_mm(struct mm_struct *mm)
+{
+	unsigned long sp = current_stack_pointer;
+	pgd_t *pgd = pgd_offset(mm, sp);
+
+	if (CONFIG_PGTABLE_LEVELS > 4) {
+		if (unlikely(pgd_none(*pgd))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+
+			set_pgd(pgd, *pgd_ref);
+		}
+	} else {
+		/*
+		 * "pgd" is faked.  The top level entries are "p4d"s, so sync
+		 * the p4d.  This compiles to approximately the same code as
+		 * the 5-level case.
+		 */
+		p4d_t *p4d = p4d_offset(pgd, sp);
+
+		if (unlikely(p4d_none(*p4d))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+			p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
+
+			set_p4d(p4d, *p4d_ref);
+		}
+	}
+}
We keep having to add these. It seems like a real deficiency in the mechanism that we're using for pgd folding. Can't we get a warning or something when we try to do a set_pgd() that's (silently) not doing anything? This exact same pattern bit me more than once with the KPTI/KAISER patches.
Hmm, maybe.
What I'd really like to see is an entirely different API. Maybe:
typedef struct {
	/* opaque, but probably includes: */
	int depth;		/* 0 is root */
	void *table;
} ptbl_ptr;

ptbl_ptr root_table = mm_root_ptbl(mm);

set_ptbl_entry(root_table, pa, prot);

/* walk tables */
ptbl_ptr pt = ...;
ptentry_ptr entry;
while (ptbl_has_children(pt)) {
	pt = pt_next(pt, addr);
}
entry = pt_entry_at(pt, addr);
/* do something with entry */
etc.
I thought about a very similar design, but never got the time to really try it. It's not a one-weekend type of project :/
On Thu, Jan 25, 2018 at 01:12:14PM -0800, Andy Lutomirski wrote:
Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses large amounts of vmalloc space with PTI enabled.
The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd folding code into account, so, on a 4-level kernel, the pgd synchronization logic compiles away to exactly nothing.
Ouch. Sorry for this.
Interestingly, the problem doesn't trigger with nopti. I assume this is because the kernel is mapped with global pages if we boot with nopti. The sequence of operations when we create a new task is that we first load its mm while still running on the old stack (which crashes if the old stack is unmapped in the new mm unless the TLB saves us), then we call prepare_switch_to(), and then we switch to the new stack. prepare_switch_to() pokes the new stack directly, which will populate the mapping through vmalloc_fault(). I assume that we're getting lucky on non-PTI systems -- the old stack's TLB entry stays alive long enough to make it all the way through prepare_switch_to() and switch_to() so that we make it to a valid stack.
Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Cc: stable@vger.kernel.org
Reported-and-tested-by: Neil Berrington neil.berrington@datacore.com
Signed-off-by: Andy Lutomirski luto@kernel.org

 arch/x86/mm/tlb.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a1561957dccb..5bfe61a5e8e3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -151,6 +151,34 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	local_irq_restore(flags);
 }
 
+static void sync_current_stack_to_mm(struct mm_struct *mm)
+{
+	unsigned long sp = current_stack_pointer;
+	pgd_t *pgd = pgd_offset(mm, sp);
+
+	if (CONFIG_PGTABLE_LEVELS > 4) {
Can we have
if (PTRS_PER_P4D > 1)
here instead? This way I wouldn't need to touch the code again for boot-time switching support.
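In other words, something along these lines -- an untested sketch of the suggested change, assuming PTRS_PER_P4D becomes a runtime quantity with boot-time switching, not a tested patch:

static void sync_current_stack_to_mm(struct mm_struct *mm)
{
	unsigned long sp = current_stack_pointer;
	pgd_t *pgd = pgd_offset(mm, sp);

	/*
	 * PTRS_PER_P4D is 1 whenever the p4d level is folded into the
	 * pgd, so this check stays correct even if the number of paging
	 * levels is chosen at boot time rather than at build time.
	 */
	if (PTRS_PER_P4D > 1) {
		if (unlikely(pgd_none(*pgd)))
			set_pgd(pgd, *pgd_offset_k(sp));
	} else {
		p4d_t *p4d = p4d_offset(pgd, sp);

		if (unlikely(p4d_none(*p4d)))
			set_p4d(p4d, *p4d_offset(pgd_offset_k(sp), sp));
	}
}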
On Fri, Jan 26, 2018 at 10:51 AM, Kirill A. Shutemov kirill@shutemov.name wrote:
On Thu, Jan 25, 2018 at 01:12:14PM -0800, Andy Lutomirski wrote:
Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses large amounts of vmalloc space with PTI enabled.
The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd folding code into account, so, on a 4-level kernel, the pgd synchronization logic compiles away to exactly nothing.
Ouch. Sorry for this.
Interestingly, the problem doesn't trigger with nopti. I assume this is because the kernel is mapped with global pages if we boot with nopti. The sequence of operations when we create a new task is that we first load its mm while still running on the old stack (which crashes if the old stack is unmapped in the new mm unless the TLB saves us), then we call prepare_switch_to(), and then we switch to the new stack. prepare_switch_to() pokes the new stack directly, which will populate the mapping through vmalloc_fault(). I assume that we're getting lucky on non-PTI systems -- the old stack's TLB entry stays alive long enough to make it all the way through prepare_switch_to() and switch_to() so that we make it to a valid stack.
Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Cc: stable@vger.kernel.org
Reported-and-tested-by: Neil Berrington neil.berrington@datacore.com
Signed-off-by: Andy Lutomirski luto@kernel.org

 arch/x86/mm/tlb.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a1561957dccb..5bfe61a5e8e3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -151,6 +151,34 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	local_irq_restore(flags);
 }
 
+static void sync_current_stack_to_mm(struct mm_struct *mm)
+{
+	unsigned long sp = current_stack_pointer;
+	pgd_t *pgd = pgd_offset(mm, sp);
+
+	if (CONFIG_PGTABLE_LEVELS > 4) {
Can we have
if (PTRS_PER_P4D > 1)
here instead? This way I wouldn't need to touch the code again for boot-time switching support.
Want to send a patch?
(Also, I haven't noticed a patch to fix up the SYSRET checking for boot-time switching. Have I just missed it?)
--Andy
On Fri, Jan 26, 2018 at 11:02:08AM -0800, Andy Lutomirski wrote:
On Fri, Jan 26, 2018 at 10:51 AM, Kirill A. Shutemov kirill@shutemov.name wrote:
On Thu, Jan 25, 2018 at 01:12:14PM -0800, Andy Lutomirski wrote:
Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses large amounts of vmalloc space with PTI enabled.
The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd folding code into account, so, on a 4-level kernel, the pgd synchronization logic compiles away to exactly nothing.
Ouch. Sorry for this.
Interestingly, the problem doesn't trigger with nopti. I assume this is because the kernel is mapped with global pages if we boot with nopti. The sequence of operations when we create a new task is that we first load its mm while still running on the old stack (which crashes if the old stack is unmapped in the new mm unless the TLB saves us), then we call prepare_switch_to(), and then we switch to the new stack. prepare_switch_to() pokes the new stack directly, which will populate the mapping through vmalloc_fault(). I assume that we're getting lucky on non-PTI systems -- the old stack's TLB entry stays alive long enough to make it all the way through prepare_switch_to() and switch_to() so that we make it to a valid stack.
Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Cc: stable@vger.kernel.org
Reported-and-tested-by: Neil Berrington neil.berrington@datacore.com
Signed-off-by: Andy Lutomirski luto@kernel.org

 arch/x86/mm/tlb.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a1561957dccb..5bfe61a5e8e3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -151,6 +151,34 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	local_irq_restore(flags);
 }
 
+static void sync_current_stack_to_mm(struct mm_struct *mm)
+{
+	unsigned long sp = current_stack_pointer;
+	pgd_t *pgd = pgd_offset(mm, sp);
+
+	if (CONFIG_PGTABLE_LEVELS > 4) {
Can we have
if (PTRS_PER_P4D > 1)
here instead? This way I wouldn't need to touch the code again for boot-time switching support.
Want to send a patch?
I'll send it with the rest of boot-time switching stuff.
(Also, I haven't noticed a patch to fix up the SYSRET checking for boot-time switching. Have I just missed it?)
It's not upstream yet.
There are two patches: initial boot-time switching support and an optimization on top of it.
https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/commit/?h=la57...
https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/commit/?h=la57...