Debug exception handlers may be called for exceptions generated both by
user and kernel code. In many cases, this is checked explicitly, but
in other cases things either happen to work by happy accident or they
go slightly wrong. For example, executing 'brk #4' from userspace will
enter the kprobes code and be ignored, but the instruction will be
retried forever in userspace instead of delivering a SIGTRAP.
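As a concrete illustration, here is a minimal userspace reproducer sketch
(hedged: it builds on the 'brk #4' example above and is not part of the patch):

/* Build as a plain arm64 userspace binary. Without the checks added
 * below, the kprobes handler claims this EL0 trap, the PC is never
 * advanced, and the instruction retries forever instead of raising
 * SIGTRAP. */
int main(void)
{
	asm volatile("brk #4");	/* the BRK immediate reserved for kprobes */
	return 0;		/* never reached on an unfixed kernel */
}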
Fix this issue in the most stable-friendly fashion by simply adding
explicit checks of the triggering exception level to all of our debug
exception handlers.
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Will Deacon <will.deacon(a)arm.com>
---
arch/arm64/kernel/kgdb.c | 14 ++++++++++----
arch/arm64/kernel/probes/kprobes.c | 6 ++++++
2 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kernel/kgdb.c b/arch/arm64/kernel/kgdb.c
index ce46c4cdf368..691854b77c7f 100644
--- a/arch/arm64/kernel/kgdb.c
+++ b/arch/arm64/kernel/kgdb.c
@@ -244,27 +244,33 @@ int kgdb_arch_handle_exception(int exception_vector, int signo,
static int kgdb_brk_fn(struct pt_regs *regs, unsigned int esr)
{
+ if (user_mode(regs))
+ return DBG_HOOK_ERROR;
+
kgdb_handle_exception(1, SIGTRAP, 0, regs);
- return 0;
+ return DBG_HOOK_HANDLED;
}
NOKPROBE_SYMBOL(kgdb_brk_fn)
static int kgdb_compiled_brk_fn(struct pt_regs *regs, unsigned int esr)
{
+ if (user_mode(regs))
+ return DBG_HOOK_ERROR;
+
compiled_break = 1;
kgdb_handle_exception(1, SIGTRAP, 0, regs);
- return 0;
+ return DBG_HOOK_HANDLED;
}
NOKPROBE_SYMBOL(kgdb_compiled_brk_fn);
static int kgdb_step_brk_fn(struct pt_regs *regs, unsigned int esr)
{
- if (!kgdb_single_step)
+ if (user_mode(regs) || !kgdb_single_step)
return DBG_HOOK_ERROR;
kgdb_handle_exception(1, SIGTRAP, 0, regs);
- return 0;
+ return DBG_HOOK_HANDLED;
}
NOKPROBE_SYMBOL(kgdb_step_brk_fn);
diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index f17afb99890c..7fb6f3aa5ceb 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -450,6 +450,9 @@ kprobe_single_step_handler(struct pt_regs *regs, unsigned int esr)
struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
int retval;
+ if (user_mode(regs))
+ return DBG_HOOK_ERROR;
+
/* return error if this is not our step */
retval = kprobe_ss_hit(kcb, instruction_pointer(regs));
@@ -466,6 +469,9 @@ kprobe_single_step_handler(struct pt_regs *regs, unsigned int esr)
int __kprobes
kprobe_breakpoint_handler(struct pt_regs *regs, unsigned int esr)
{
+ if (user_mode(regs))
+ return DBG_HOOK_ERROR;
+
kprobe_handler(regs);
return DBG_HOOK_HANDLED;
}
--
2.11.0
FAR_EL1 is UNKNOWN for all debug exceptions other than those caused by
taking a hardware watchpoint. Unfortunately, if a debug handler returns
a non-zero value, then we will propagate the UNKNOWN FAR value to
userspace via the si_addr field of the SIGTRAP siginfo_t.
Instead, let's set si_addr to the PC of the faulting instruction, which
we have available in the current pt_regs.
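To make the userspace-visible effect concrete, here is a hedged sketch of a
SIGTRAP handler inspecting si_addr (the handler is an illustration I am
assuming, not part of the patch):

#include <signal.h>
#include <stdio.h>

/* With this patch, si_addr for a non-watchpoint debug exception is the
 * faulting PC rather than an UNKNOWN FAR value. A real handler would
 * also advance the PC via the ucontext so the BRK is not retried, and
 * printf() is not async-signal-safe; this is only a sketch. */
static void trap_handler(int sig, siginfo_t *info, void *uc)
{
	printf("SIGTRAP at %p\n", info->si_addr);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction	= trap_handler,
		.sa_flags	= SA_SIGINFO,
	};

	sigaction(SIGTRAP, &sa, NULL);
	asm volatile("brk #0");	/* take a debug exception from EL0 */
	return 0;
}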
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Will Deacon <will.deacon(a)arm.com>
---
arch/arm64/mm/fault.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index efb7b2cbead5..ef46925096f0 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -824,11 +824,12 @@ void __init hook_debug_fault_code(int nr,
debug_fault_info[nr].name = name;
}
-asmlinkage int __exception do_debug_exception(unsigned long addr,
+asmlinkage int __exception do_debug_exception(unsigned long addr_if_watchpoint,
unsigned int esr,
struct pt_regs *regs)
{
const struct fault_info *inf = esr_to_debug_fault_info(esr);
+ unsigned long pc = instruction_pointer(regs);
int rv;
/*
@@ -838,14 +839,14 @@ asmlinkage int __exception do_debug_exception(unsigned long addr,
if (interrupts_enabled(regs))
trace_hardirqs_off();
- if (user_mode(regs) && !is_ttbr0_addr(instruction_pointer(regs)))
+ if (user_mode(regs) && !is_ttbr0_addr(pc))
arm64_apply_bp_hardening();
- if (!inf->fn(addr, esr, regs)) {
+ if (!inf->fn(addr_if_watchpoint, esr, regs)) {
rv = 1;
} else {
arm64_notify_die(inf->name, regs,
- inf->sig, inf->code, (void __user *)addr, esr);
+ inf->sig, inf->code, (void __user *)pc, esr);
rv = 0;
}
--
2.11.0
On 27/02/19 22:31, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> sfc: suppress duplicate nvmem partition types in efx_ef10_mtd_probe
>
> to the 4.20-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> sfc-suppress-duplicate-nvmem-partition-types-in-efx_.patch
> and it can be found in the queue-4.20 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
If you are taking this patch, you also need c65285428b6e ("sfc: initialise
found bitmap in efx_ef10_mtd_probe"), which fixes bugs in the above patch;
I don't currently see it in the stable-queue.
(Also, it's not clear whether the original fix is really needed on stable
kernels; while the bug is present there, it is harmless until a v5.0-rc1
commit, probably c4dfa25ab307 ("mtd: add support for reading MTD devices via
the nvmem API"), interacts with it.)
The above remarks apply to all six stable trees for which this patch has
been queued.
-Ed
Kernel 4.14 fails to build with GCC 8 on powerpc64, due to 'in' being
uninitialised in the epapr_hypercall*() wrappers.
This is fixed upstream by commit 186b8f1587c79c2fa04bfa392fdf08, which
applies cleanly to the 4.14 tree and is already on the 4.19 branch.
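For context, the upstream fix amounts to initialising the hypercall input
array; here is a hedged, abridged sketch of the pattern (from my reading of
the upstream commit, not a verbatim excerpt):

/* arch/powerpc/include/asm/epapr_hcalls.h, abridged. GCC 8 rejects the
 * old version with -Werror=maybe-uninitialized because 'in' is passed
 * to epapr_hypercall() without ever being written. */
static inline long epapr_hypercall0(unsigned int nr)
{
	unsigned long in[8] = {0};	/* was: unsigned long in[8]; */
	unsigned long out[8];

	return epapr_hypercall(in, out, nr);
}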
Best,
--arw
--
A. Wilcox (awilfox)
Project Lead, Adélie Linux
https://www.adelielinux.org
Hi Sasha,
Thanks for the heads-up!
On Tue, Feb 26, 2019 at 09:24:00PM +0000, Sasha Levin wrote:
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: all
>
> The bot has tested the following trees: v4.20.12, v4.19.25, v4.14.103, v4.9.160, v4.4.176, v3.18.136.
>
Lu Baolu, can you please check for which stable trees this commit is
relevant and provide the backports of the patch (with dependencies if
necessary) to the relevant stable trees?
Thanks,
Joerg
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: fix races and page leaks during migration
hugetlb pages should only be migrated if they are 'active'. The routines
set/clear_page_huge_active() modify the active state of hugetlb pages.
When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked. Therefore, another thread could race
and migrate the page while it is being added to page table by the fault
code. This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.
To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.
Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem. For
example, consider a two node system with 4GB worth of huge pages
available. A program mmaps a 2G file in a hugetlbfs filesystem. It then
migrates the pages associated with the file from one node to another.
When the program exits, huge page counts are as follows:
node0
1024 free_hugepages
1024 nr_hugepages
node1
0 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
That is as expected. 2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem. If the file is then removed, the counts become:
node0
1024 free_hugepages
1024 nr_hugepages
node1
1024 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use. The only way to 'fix' the filesystem
accounting is to unmount the filesystem.
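A hedged reproducer sketch for the leak described above (the mount point and
sizes mirror the example; libnuma's migrate_pages(3) performs the node move,
and error handling is omitted):

#include <fcntl.h>
#include <numaif.h>		/* link with -lnuma */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 30;		/* 2G backed by huge pages */
	int fd = open("/var/opt/hugepool/test", O_CREAT | O_RDWR, 0600);
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	unsigned long from = 1UL << 0;	/* node 0 */
	unsigned long to = 1UL << 1;	/* node 1 */

	memset(p, 0, len);		/* fault the huge pages in */
	migrate_pages(getpid(), 8 * sizeof(unsigned long), &from, &to);
	return 0;	/* before the fix, the file's usage count is leaked */
}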
If a hugetlb page is associated with an explicitly mounted filesystem,
this information is contained in the page_private field. At migration
time, this information is not preserved. To fix, simply transfer
page_private from old to new page at migration time if necessary.
There is a related race with removing a huge page from a file and
migration. When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page(). A page could be migrated
while in this state. However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem. To fix, check for this condition
before migrating a huge page. If the condition is detected, return EBUSY
for the page.
Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: <stable(a)vger.kernel.org>
[mike.kravetz(a)oracle.com: v2]
Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
[mike.kravetz(a)oracle.com: update comment and changelog]
Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hugetlbfs/inode.c | 12 ++++++++++++
mm/hugetlb.c | 16 +++++++++++++---
mm/migrate.c | 11 +++++++++++
3 files changed, 36 insertions(+), 3 deletions(-)
--- a/fs/hugetlbfs/inode.c~huegtlbfs-fix-races-and-page-leaks-during-migration
+++ a/fs/hugetlbfs/inode.c
@@ -859,6 +859,18 @@ static int hugetlbfs_migrate_page(struct
rc = migrate_huge_page_move_mapping(mapping, newpage, page);
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
+
+ /*
+ * page_private is subpool pointer in hugetlb pages. Transfer to
+ * new page. PagePrivate is not associated with page_private for
+ * hugetlb pages and can not be set here as only page_huge_active
+ * pages can be migrated.
+ */
+ if (page_private(page)) {
+ set_page_private(newpage, page_private(page));
+ set_page_private(page, 0);
+ }
+
if (mode != MIGRATE_SYNC_NO_COPY)
migrate_page_copy(newpage, page);
else
--- a/mm/hugetlb.c~huegtlbfs-fix-races-and-page-leaks-during-migration
+++ a/mm/hugetlb.c
@@ -3624,7 +3624,6 @@ retry_avoidcopy:
copy_user_huge_page(new_page, old_page, address, vma,
pages_per_huge_page(h));
__SetPageUptodate(new_page);
- set_page_huge_active(new_page);
mmu_notifier_range_init(&range, mm, haddr, haddr + huge_page_size(h));
mmu_notifier_invalidate_range_start(&range);
@@ -3645,6 +3644,7 @@ retry_avoidcopy:
make_huge_pte(vma, new_page, 1));
page_remove_rmap(old_page, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
+ set_page_huge_active(new_page);
/* Make the old page be freed below */
new_page = old_page;
}
@@ -3729,6 +3729,7 @@ static vm_fault_t hugetlb_no_page(struct
pte_t new_pte;
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
+ bool new_page = false;
/*
* Currently, we are forced to kill the process in the event the
@@ -3790,7 +3791,7 @@ retry:
}
clear_huge_page(page, address, pages_per_huge_page(h));
__SetPageUptodate(page);
- set_page_huge_active(page);
+ new_page = true;
if (vma->vm_flags & VM_MAYSHARE) {
int err = huge_add_to_page_cache(page, mapping, idx);
@@ -3861,6 +3862,15 @@ retry:
}
spin_unlock(ptl);
+
+ /*
+ * Only make newly allocated pages active. Existing pages found
+ * in the pagecache could be !page_huge_active() if they have been
+ * isolated for migration.
+ */
+ if (new_page)
+ set_page_huge_active(page);
+
unlock_page(page);
out:
return ret;
@@ -4095,7 +4105,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_s
* the set_pte_at() write.
*/
__SetPageUptodate(page);
- set_page_huge_active(page);
mapping = dst_vma->vm_file->f_mapping;
idx = vma_hugecache_offset(h, dst_vma, dst_addr);
@@ -4163,6 +4172,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_s
update_mmu_cache(dst_vma, dst_addr, dst_pte);
spin_unlock(ptl);
+ set_page_huge_active(page);
if (vm_shared)
unlock_page(page);
ret = 0;
--- a/mm/migrate.c~huegtlbfs-fix-races-and-page-leaks-during-migration
+++ a/mm/migrate.c
@@ -1315,6 +1315,16 @@ static int unmap_and_move_huge_page(new_
lock_page(hpage);
}
+ /*
+ * Check for pages which are in the process of being freed. Without
+ * page_mapping() set, hugetlbfs specific move page routine will not
+ * be called and we could leak usage counts for subpools.
+ */
+ if (page_private(hpage) && !page_mapping(hpage)) {
+ rc = -EBUSY;
+ goto out_unlock;
+ }
+
if (PageAnon(hpage))
anon_vma = page_get_anon_vma(hpage);
@@ -1345,6 +1355,7 @@ put_anon:
put_new_page = NULL;
}
+out_unlock:
unlock_page(hpage);
out:
if (rc != -EAGAIN)
_
On 2/11/19 8:27 PM, Andrew Morton wrote:
> On Mon, 11 Feb 2019 10:02:45 -0800 <rcampbell(a)nvidia.com> wrote:
>
>> From: Ralph Campbell <rcampbell(a)nvidia.com>
>>
>> The system call, get_mempolicy() [1], passes an unsigned long *nodemask
>> pointer and an unsigned long maxnode argument which specifies the
>> length of the user's nodemask array in bits (which is rounded up).
>> The manual page says that if the maxnode value is too small,
>> get_mempolicy will return EINVAL but there is no system call to return
>> this minimum value. To determine this value, some programs search
>> /proc/<pid>/status for a line starting with "Mems_allowed:" and use
>> the number of digits in the mask to determine the minimum value.
>> A recent change to the way this line is formatted [2] causes these
>> programs to compute a value less than MAX_NUMNODES so get_mempolicy()
>> returns EINVAL.
>>
>> Change get_mempolicy(), the older compat version of get_mempolicy(), and
>> the copy_nodes_to_user() function to use nr_node_ids instead of
>> MAX_NUMNODES, thus preserving the defacto method of computing the
>> minimum size for the nodemask array and the maxnode argument.
>>
>> [1] http://man7.org/linux/man-pages/man2/get_mempolicy.2.html
>> [2] https://lore.kernel.org/lkml/1545405631-6808-1-git-send-email-longman@redha…
Please, next time include linux-api and the people involved in the previous
thread [1] in the CC list. There should likely also have been a Suggested-by:
for Alexander.
>>
>
> Ugh, what a mess.
I'm afraid the mess is even somewhat worse now.
> For a start, that's a crazy interface. I wish that had been brought to
> our attention so we could have provided a sane way for userspace to
> determine MAX_NUMNODES.
>
> Secondly, 4fb8e5b89bcbbb ("include/linux/nodemask.h: use nr_node_ids
> (not MAX_NUMNODES) in __nodemask_pr_numnodes()") introduced a
> regression. The proposed get_mempolicy() change appears to be a good
> one, but is a strange way of addressing the regression. I suppose it's
> acceptable, as long as this change is backported into kernels which
> have 4fb8e5b89bcbbb.
There's no such commit; that sha was probably from linux-next. The patch is
still in mmotm [2]. Luckily, I would say. Maybe Linus or some automation could
run a script to check for bogus Fixes: tags before accepting patches?
Based on the non-existent sha, hopefully it wasn't backported anywhere, but
maybe some AI did it anyway. Ah, it seems it indeed made it as far as 4.9, as
a fix for a non-existent commit and without proper linux-api consideration :(
I guess it's too late to revert it for 5.0. Hopefully the change is really
safe and won't break anything, i.e. hopefully nobody was determining
MAX_NUMNODES by increasing the buffer size until get_mempolicy() stopped
returning EINVAL, or hitting some other problem in e.g. a CRIU context.
What about the manpage? It says "The value specified by maxnode is less than
the number of node IDs supported by the system.", which could perhaps apply
to either nr_node_ids or MAX_NUMNODES. Or should we update it?
[1]
https://lore.kernel.org/linux-mm/631c44cc-df2d-40d4-a537-d24864df0679@nvidi…
[2]
https://www.ozlabs.org/~akpm/mmotm/broken-out/include-linux-nodemaskh-use-n…
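For reference, a hedged sketch of the de facto probing pattern discussed
above (the helper name is mine; error handling omitted; link with -lnuma):

#include <ctype.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count the mask digits on the "Mems_allowed:" line of
 * /proc/self/status; each hex digit stands for four node bits. */
static unsigned long probe_maxnode(void)
{
	char line[4096];
	unsigned long bits = 0;
	FILE *f = fopen("/proc/self/status", "r");

	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "Mems_allowed:", 13))
			continue;
		for (char *p = line + 13; *p; p++)
			if (isxdigit((unsigned char)*p))
				bits += 4;
		break;
	}
	fclose(f);
	return bits;
}

int main(void)
{
	unsigned long maxnode = probe_maxnode();
	unsigned long *mask = calloc((maxnode + 63) / 64, sizeof(*mask));
	int mode;

	/* Fails with EINVAL when maxnode is below the kernel's bound;
	 * the change above makes that bound nr_node_ids. */
	if (get_mempolicy(&mode, mask, maxnode, NULL, 0))
		perror("get_mempolicy");
	return 0;
}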
Quoting Kenneth Graunke (2018-01-05 06:06:34)
> On Thursday, January 4, 2018 4:41:35 PM PST Rodrigo Vivi wrote:
> > On Thu, Jan 04, 2018 at 11:39:23PM +0000, Kenneth Graunke wrote:
> > > On Thursday, January 4, 2018 1:23:06 PM PST Chris Wilson wrote:
> > > > Quoting Kenneth Graunke (2018-01-04 19:38:05)
> > > > > Geminilake requires the 3D driver to select whether barriers are
> > > > > intended for compute shaders, or tessellation control shaders, by
> > > > > whacking a "Barrier Mode" bit in SLICE_COMMON_ECO_CHICKEN1 when
> > > > > switching pipelines. Failure to do this properly can result in GPU
> > > > > hangs.
> > > > >
> > > > > Unfortunately, this means it needs to switch mid-batch, so only
> > > > > userspace can properly set it. To facilitate this, the kernel needs
> > > > > to whitelist the register.
> > > > >
> > > > > Signed-off-by: Kenneth Graunke <kenneth(a)whitecape.org>
> > > > > Cc: stable(a)vger.kernel.org
> > > > > ---
> > > > > drivers/gpu/drm/i915/i915_reg.h | 2 ++
> > > > > drivers/gpu/drm/i915/intel_engine_cs.c | 5 +++++
> > > > > 2 files changed, 7 insertions(+)
> > > > >
> > > > > Hello,
> > > > >
> > > > > We unfortunately need to whitelist an extra register for GPU hang fix
> > > > > on Geminilake. Here's the corresponding Mesa patch:
> > > >
> > > > Thankfully it appears to be context saved. Has a w/a name been assigned
> > > > for this?
> > > > -Chris
> > >
> > > There doesn't appear to be one. The workaround page lists it, but there
> > > is no name. The register description has a note saying that you need to
> > > set this, but doesn't call it out as a workaround.
> >
> > It mentions only BXT:ALL, but not mention to GLK.
> >
> > Should we add to both then?
>
> Well, that's irritating. On the workarounds page, it does indeed say
> "BXT" with no mention of GLK. But the workaround text says to set
> "SLICE_COMMON_CHICKEN_ECO1 Barrier Mode [...] (bit 7 of MMIO 0x731C)."
>
> Looking at the register definition for SLICE_COMMON_ECO_CHICKEN1, bit 7
> is "Barrier Mode" on [GLK] only, with no mention of BXT. It's marked
> reserved PBC on [SKL+, not GLK, not KBL]. On KBL it's something else.
>
> I believe Mark saw tessellation control shader hangs on
> Geminilake only, and never saw this issue on Broxton. So, my guess is
> that the workaround really is new on Geminilake, and the BXT tag on the
> workarounds page is incorrect. (Mark, does that sound right to you?)
Hi, I'm back!
This fails a selftest on glk as we can't even write to the register
0x731c, or at least can't read it back.
Did bspec ever get updated to include this register & wa?
-Chris
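For context, a hedged sketch of the userspace side that the whitelist
enables (the register offset and Barrier Mode bit come from the thread; the
LRI encoding and masked-write layout are my assumptions, not the actual Mesa
patch):

#include <stdbool.h>
#include <stdint.h>

#define MI_LOAD_REGISTER_IMM(n)		((0x22 << 23) | (2 * (n) - 1))
#define SLICE_COMMON_ECO_CHICKEN1	0x731C
#define GLK_BARRIER_MODE_GPGPU		(1 << 7)	/* "Barrier Mode" */

/* Emit an LRI when switching pipelines mid-batch; the high 16 bits of
 * the payload select which bits the write may change. */
static void emit_barrier_mode(uint32_t *batch, unsigned *dw, bool gpgpu)
{
	batch[(*dw)++] = MI_LOAD_REGISTER_IMM(1);
	batch[(*dw)++] = SLICE_COMMON_ECO_CHICKEN1;
	batch[(*dw)++] = (GLK_BARRIER_MODE_GPGPU << 16) |
			 (gpgpu ? GLK_BARRIER_MODE_GPGPU : 0);
}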