Hello,
After I've upgraded from 5.7.11 (didn't have access to this machine for
about 10 months) to 5.12.10, I've noticed that anytime I used all my
cores to, for example, compile a project, the system would degrade
significantly in performance and applications would start to stutter.
Compiles are also about 3-4x slower on kernels with the regression vs.
without. After debugging this for the past 24 hours or so, I've narrowed
it down to a change between 5.7.19 and 5.8.1. Sadly, bisect does not
help, because trying to run any of the 5.8 RC kernels causes the kernel
to be stuck before init, without any apparent errors on the screen (and
I don't have a serial cable to dump the kernel output to). I'm listing
all the information I know and my system information below.
Reproduction steps (dunno if this helps, but):
1. Boot with kernel with the regression
2. Do something that uses all cores, like compiling the Linux kernel
3. Observe long compile times and stuttering applications (which doesn't
happen even on full load with a working kernel)
Regression between kernel versions:
5.7 - working
5.7.11 - working
5.7.19 - working
5.8.1 - broken
5.8.18 - broken
5.12.10 - broken
8ecfa36cd4db3275bf3b6c6f32c7e3c6bb537de2 (master on 2021-06-13) - broken
System information:
Gentoo Linux 17.1 amd64
CPU info:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping : 3
microcode : 0x27
cpu MHz : 2042.008
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2
x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm
abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow
vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep
bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb
flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
shadow_vmcs
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
mds swapgs itlb_multihit srbds
bogomips : 7199.96
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
RAM: 16GiB
GPU: VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
Redwood XT [Radeon HD 5670/5690/5730]
cat /proc/sys/kernel/tainted: 0
Thanks in advance.
Best Regards,
Adam Edge
This is the start of the stable review cycle for the 4.9.273 release.
There are 42 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 16 Jun 2021 10:26:30 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.273-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.9.273-rc1
Liangyan <liangyan.peng(a)linux.alibaba.com>
tracing: Correct the length check which causes memory corruption
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
ftrace: Do not blindly read the ip address in ftrace_bug()
Ming Lei <ming.lei(a)redhat.com>
scsi: core: Only put parent device if host state differs from SHOST_CREATED
Ming Lei <ming.lei(a)redhat.com>
scsi: core: Fix error handling of scsi_host_alloc()
Dai Ngo <dai.ngo(a)oracle.com>
NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
Paolo Bonzini <pbonzini(a)redhat.com>
kvm: fix previous commit for 32-bit builds
Leo Yan <leo.yan(a)linaro.org>
perf session: Correct buffer copying when peeking events
Dan Carpenter <dan.carpenter(a)oracle.com>
NFS: Fix a potential NULL dereference in nfs_get_client()
Marco Elver <elver(a)google.com>
perf: Fix data race between pin_count increment/decrement
Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
regulator: core: resolve supply for boot-on/always-on regulators
Maciej Żenczykowski <maze(a)google.com>
usb: fix various gadget panics on 10gbps cabling
Maciej Żenczykowski <maze(a)google.com>
usb: fix various gadgets null ptr deref on 10gbps cabling.
Linyu Yuan <linyyuan(a)codeaurora.com>
usb: gadget: eem: fix wrong eem header operation
Johan Hovold <johan(a)kernel.org>
USB: serial: quatech2: fix control-request directions
Alexandre GRIVEAUX <agriveaux(a)deutnet.info>
USB: serial: omninet: add device id for Zyxel Omni 56K Plus
George McCollister <george.mccollister(a)gmail.com>
USB: serial: ftdi_sio: add NovaTech OrionMX product ID
Marian-Cristian Rotariu <marian.c.rotariu(a)gmail.com>
usb: dwc3: ep0: fix NULL pointer exception
Maciej Żenczykowski <maze(a)google.com>
USB: f_ncm: ncm_bitrate (speed) is unsigned
Alexander Kuznetsov <wwfq(a)yandex-team.ru>
cgroup1: don't allow '\n' in renaming
Ritesh Harjani <riteshh(a)linux.ibm.com>
btrfs: return value from btrfs_mark_extent_written() in case of error
Paolo Bonzini <pbonzini(a)redhat.com>
kvm: avoid speculation-based attacks from out-of-range memslot accesses
Desmond Cheong Zhi Xi <desmondcheongzx(a)gmail.com>
drm: Lock pointer access in drm_master_release()
Chris Packham <chris.packham(a)alliedtelesis.co.nz>
i2c: mpc: implement erratum A-004447 workaround
Chris Packham <chris.packham(a)alliedtelesis.co.nz>
i2c: mpc: Make use of i2c_recover_bus()
Chris Packham <chris.packham(a)alliedtelesis.co.nz>
powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P1010 i2c controllers
Chris Packham <chris.packham(a)alliedtelesis.co.nz>
powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P2041 i2c controllers
Jiapeng Chong <jiapeng.chong(a)linux.alibaba.com>
bnx2x: Fix missing error code in bnx2x_iov_init_one()
Tiezhu Yang <yangtiezhu(a)loongson.cn>
MIPS: Fix kernel hang under FUNCTION_GRAPH_TRACER and PREEMPT_TRACER
Saubhik Mukherjee <saubhik.mukherjee(a)gmail.com>
net: appletalk: cops: Fix data race in cops_probe1
Zong Li <zong.li(a)sifive.com>
net: macb: ensure the device is available before accessing GEMGXL control registers
Dmitry Bogdanov <d.bogdanov(a)yadro.com>
scsi: target: qla2xxx: Wait for stop_phase1 at WWN removal
Matt Wang <wwentao(a)vmware.com>
scsi: vmw_pvscsi: Set correct residual data length
Zheyu Ma <zheyuma97(a)gmail.com>
net/qla3xxx: fix schedule while atomic in ql_sem_spinlock
Sergey Senozhatsky <senozhatsky(a)chromium.org>
wq: handle VM suspension in stall detection
Shakeel Butt <shakeelb(a)google.com>
cgroup: disable controllers at parse time
Dan Carpenter <dan.carpenter(a)oracle.com>
net: mdiobus: get rid of a BUG_ON()
Johannes Berg <johannes.berg(a)intel.com>
netlink: disable IRQs for netlink_lock_table()
Johannes Berg <johannes.berg(a)intel.com>
bonding: init notify_work earlier to avoid uninitialized use
Zheyu Ma <zheyuma97(a)gmail.com>
isdn: mISDN: netjet: Fix crash in nj_probe:
Zou Wei <zou_wei(a)huawei.com>
ASoC: sti-sas: add missing MODULE_DEVICE_TABLE
Jeimon <jjjinmeng.zhou(a)gmail.com>
net/nfc/rawsock.c: fix a permission check bug
Kees Cook <keescook(a)chromium.org>
proc: Track /proc/$pid/attr/ opener mm_struct
-------------
Diffstat:
Makefile | 4 +-
arch/mips/lib/mips-atomic.c | 12 +--
arch/powerpc/boot/dts/fsl/p1010si-post.dtsi | 8 ++
arch/powerpc/boot/dts/fsl/p2041si-post.dtsi | 16 ++++
drivers/gpu/drm/drm_auth.c | 3 +-
drivers/i2c/busses/i2c-mpc.c | 95 ++++++++++++++++++++++-
drivers/isdn/hardware/mISDN/netjet.c | 1 -
drivers/net/appletalk/cops.c | 4 +-
drivers/net/bonding/bond_main.c | 2 +-
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 4 +-
drivers/net/ethernet/cadence/macb.c | 3 +
drivers/net/ethernet/qlogic/qla3xxx.c | 2 +-
drivers/net/phy/mdio_bus.c | 3 +-
drivers/regulator/core.c | 6 ++
drivers/scsi/hosts.c | 25 +++---
drivers/scsi/qla2xxx/qla_target.c | 2 +
drivers/scsi/vmw_pvscsi.c | 8 +-
drivers/usb/dwc3/ep0.c | 3 +
drivers/usb/gadget/config.c | 8 ++
drivers/usb/gadget/function/f_ecm.c | 2 +-
drivers/usb/gadget/function/f_eem.c | 6 +-
drivers/usb/gadget/function/f_loopback.c | 2 +-
drivers/usb/gadget/function/f_ncm.c | 2 +-
drivers/usb/gadget/function/f_printer.c | 3 +-
drivers/usb/gadget/function/f_rndis.c | 2 +-
drivers/usb/gadget/function/f_serial.c | 2 +-
drivers/usb/gadget/function/f_sourcesink.c | 3 +-
drivers/usb/gadget/function/f_subset.c | 2 +-
drivers/usb/gadget/function/f_tcm.c | 3 +-
drivers/usb/serial/ftdi_sio.c | 1 +
drivers/usb/serial/ftdi_sio_ids.h | 1 +
drivers/usb/serial/omninet.c | 2 +
drivers/usb/serial/quatech2.c | 6 +-
fs/btrfs/file.c | 4 +-
fs/nfs/client.c | 2 +-
fs/nfs/nfs4proc.c | 8 ++
fs/proc/base.c | 9 ++-
include/linux/kvm_host.h | 11 ++-
kernel/cgroup.c | 17 ++--
kernel/events/core.c | 2 +
kernel/trace/ftrace.c | 8 +-
kernel/trace/trace.c | 2 +-
kernel/workqueue.c | 12 ++-
net/netlink/af_netlink.c | 6 +-
net/nfc/rawsock.c | 2 +-
sound/soc/codecs/sti-sas.c | 1 +
tools/perf/util/session.c | 1 +
47 files changed, 266 insertions(+), 65 deletions(-)
The patch titled
Subject: mm: page_vma_mapped_walk(): use pmd_read_atomic()
has been removed from the -mm tree. Its filename was
mm-page_vma_mapped_walk-use-pmd_read_atomic.patch
This patch was dropped because it was withdrawn
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: page_vma_mapped_walk(): use pmd_read_atomic()
page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
have a multi-word pmd entry, for which READ_ONCE() is not good enough.
Link: https://lkml.kernel.org/r/594c1f0-d396-5346-1f36-606872cddb18@google.com
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-use-pmd_read_atomic
+++ a/mm/page_vma_mapped.c
@@ -182,13 +182,16 @@ restart:
pud = pud_offset(p4d, pvmw->address);
if (!pud_present(*pud))
return false;
+
pvmw->pmd = pmd_offset(pud, pvmw->address);
/*
* Make sure the pmd value isn't cached in a register by the
* compiler and used as a stale value after we've observed a
* subsequent update.
*/
- pmde = READ_ONCE(*pvmw->pmd);
+ pmde = pmd_read_atomic(pvmw->pmd);
+ barrier();
+
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-fix-__split_huge_pmd_locked-on-shmem-migration-entry.patch
mm-thp-make-is_huge_zero_pmd-safe-and-quicker.patch
mm-thp-try_to_unmap-use-ttu_sync-for-safe-splitting.patch
mm-thp-fix-vma_address-if-virtual-address-below-file-offset.patch
mm-thp-unmap_mapping_page-to-fix-thp-truncate_cleanup_page.patch
mm-page_vma_mapped_walk-use-page-for-pvmw-page.patch
mm-page_vma_mapped_walk-settle-pagehuge-on-entry.patch
mm-page_vma_mapped_walk-use-pmde-for-pvmw-pmd.patch
mm-page_vma_mapped_walk-prettify-pvmw_migration-block.patch
mm-page_vma_mapped_walk-crossing-page-table-boundary.patch
mm-page_vma_mapped_walk-add-a-level-of-indentation.patch
mm-page_vma_mapped_walk-use-goto-instead-of-while-1.patch
mm-page_vma_mapped_walk-get-vma_address_end-earlier.patch
mm-thp-fix-page_vma_mapped_walk-if-thp-mapped-by-ptes.patch
mm-thp-another-pvmw_sync-fix-in-page_vma_mapped_walk.patch
mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page.patch
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm/gup: fix try_grab_compound_head() race with split_huge_page()
has been added to the -mm tree. Its filename is
mm-gup-fix-try_grab_compound_head-race-with-split_huge_page.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-gup-fix-try_grab_compound_head…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-gup-fix-try_grab_compound_head…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Jann Horn <jannh(a)google.com>
Subject: mm/gup: fix try_grab_compound_head() race with split_huge_page()
try_grab_compound_head() is used to grab a reference to a page from
get_user_pages_fast(), which is only protected against concurrent freeing
of page tables (via local_irq_save()), but not against concurrent TLB
flushes, freeing of data pages, or splitting of compound pages.
Because no reference is held to the page when try_grab_compound_head() is
called, the page may have been freed and reallocated by the time its
refcount has been elevated; therefore, once we're holding a stable
reference to the page, the caller re-checks whether the PTE still points
to the same page (with the same access rights).
The problem is that try_grab_compound_head() has to grab a reference on
the head page; but between the time we look up what the head page is and
the time we actually grab a reference on the head page, the compound page
may have been split up (either explicitly through split_huge_page() or by
freeing the compound page to the buddy allocator and then allocating its
individual order-0 pages). If that happens, get_user_pages_fast() may end
up returning the right page but lifting the refcount on a now-unrelated
page, leading to use-after-free of pages.
To fix it: Re-check whether the pages still belong together after lifting
the refcount on the head page. Move anything else that checks
compound_head(page) below the refcount increment.
This can't actually happen on bare-metal x86 (because there, disabling
IRQs locks out remote TLB flushes), but it can happen on virtualized x86
(e.g. under KVM) and probably also on arm64. The race window is pretty
narrow, and constantly allocating and shattering hugepages isn't exactly
fast; for now I've only managed to reproduce this in an x86 KVM guest with
an artificially widened timing window (by adding a loop that repeatedly
calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits, so
that PV TLB flushes are used instead of IPIs).
As requested on the list, also replace the existing VM_BUG_ON_PAGE() with
a warning and bailout. Since the existing code only performed the BUG_ON
check on DEBUG_VM kernels, ensure that the new code also only performs the
check under that configuration - I don't want to mix two logically
separate changes together too much. The macro VM_WARN_ON_ONCE_PAGE()
doesn't return a value on !DEBUG_VM, so wrap the whole check in an #ifdef
block. An alternative would be to change the VM_WARN_ON_ONCE_PAGE()
definition for !DEBUG_VM such that it always returns false, but since that
would differ from the behavior of the normal WARN macros, it might be too
confusing for readers.
Link: https://lkml.kernel.org/r/20210615012014.1100672-1-jannh@google.com
Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
Signed-off-by: Jann Horn <jannh(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Kirill A. Shutemov <kirill(a)shutemov.name>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/gup.c | 58 +++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 43 insertions(+), 15 deletions(-)
--- a/mm/gup.c~mm-gup-fix-try_grab_compound_head-race-with-split_huge_page
+++ a/mm/gup.c
@@ -44,6 +44,23 @@ static void hpage_pincount_sub(struct pa
atomic_sub(refs, compound_pincount_ptr(page));
}
+/* Equivalent to calling put_page() @refs times. */
+static void put_page_refs(struct page *page, int refs)
+{
+#ifdef CONFIG_DEBUG_VM
+ if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
+ return;
+#endif
+
+ /*
+ * Calling put_page() for each ref is unnecessarily slow. Only the last
+ * ref needs a put_page().
+ */
+ if (refs > 1)
+ page_ref_sub(page, refs - 1);
+ put_page(page);
+}
+
/*
* Return the compound head page with ref appropriately incremented,
* or NULL if that failed.
@@ -56,6 +73,21 @@ static inline struct page *try_get_compo
return NULL;
if (unlikely(!page_cache_add_speculative(head, refs)))
return NULL;
+
+ /*
+ * At this point we have a stable reference to the head page; but it
+ * could be that between the compound_head() lookup and the refcount
+ * increment, the compound page was split, in which case we'd end up
+ * holding a reference on a page that has nothing to do with the page
+ * we were given anymore.
+ * So now that the head page is stable, recheck that the pages still
+ * belong together.
+ */
+ if (unlikely(compound_head(page) != head)) {
+ put_page_refs(head, refs);
+ return NULL;
+ }
+
return head;
}
@@ -96,6 +128,14 @@ __maybe_unused struct page *try_grab_com
return NULL;
/*
+ * CAUTION: Don't use compound_head() on the page before this
+ * point, the result won't be stable.
+ */
+ page = try_get_compound_head(page, refs);
+ if (!page)
+ return NULL;
+
+ /*
* When pinning a compound page of order > 1 (which is what
* hpage_pincount_available() checks for), use an exact count to
* track it, via hpage_pincount_add/_sub().
@@ -103,15 +143,10 @@ __maybe_unused struct page *try_grab_com
* However, be sure to *also* increment the normal page refcount
* field at least once, so that the page really is pinned.
*/
- if (!hpage_pincount_available(page))
- refs *= GUP_PIN_COUNTING_BIAS;
-
- page = try_get_compound_head(page, refs);
- if (!page)
- return NULL;
-
if (hpage_pincount_available(page))
hpage_pincount_add(page, refs);
+ else
+ page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));
mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
orig_refs);
@@ -135,14 +170,7 @@ static void put_compound_head(struct pag
refs *= GUP_PIN_COUNTING_BIAS;
}
- VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
- /*
- * Calling put_page() for each ref is unnecessarily slow. Only the last
- * ref needs a put_page().
- */
- if (refs > 1)
- page_ref_sub(page, refs - 1);
- put_page(page);
+ put_page_refs(page, refs);
}
/**
_
Patches currently in -mm which might be from jannh(a)google.com are
mm-gup-fix-try_grab_compound_head-race-with-split_huge_page.patch
The patch titled
Subject: mm, futex: fix shared futex pgoff on shmem huge page
has been added to the -mm tree. Its filename is
mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-futex-fix-shared-futex-pgoff-o…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-futex-fix-shared-futex-pgoff-o…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm, futex: fix shared futex pgoff on shmem huge page
If more than one futex is placed on a shmem huge page, it can happen that
waking the second wakes the first instead, and leaves the second waiting:
the key's shared.pgoff is wrong.
When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when
generating futex_key"), the only shared huge pages came from hugetlbfs,
and the code added to deal with its exceptional page->index was put into
hugetlb source. Then that was missed when 4.8 added shmem huge pages.
page_to_pgoff() is what others use for this nowadays: except that, as
currently written, it gives the right answer on hugetlbfs head, but
nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific
hugetlb_basepage_index() on PageHuge tails as well as on head.
Yes, it's unconventional to declare hugetlb_basepage_index() there in
pagemap.h, rather than in hugetlb.h; but I do not expect anything but
page_to_pgoff() ever to need it.
Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Reported-by: Neel Natu <neelnatu(a)google.com>
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Zhang Yi <wetpzy(a)gmail.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Darren Hart <dvhart(a)infradead.org>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb.h | 16 ----------------
include/linux/pagemap.h | 13 +++++++------
kernel/futex.c | 3 +--
mm/hugetlb.c | 5 +----
4 files changed, 9 insertions(+), 28 deletions(-)
--- a/include/linux/hugetlb.h~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/include/linux/hugetlb.h
@@ -741,17 +741,6 @@ static inline int hstate_index(struct hs
return h - hstates;
}
-pgoff_t __basepage_index(struct page *page);
-
-/* Return page->index in PAGE_SIZE units */
-static inline pgoff_t basepage_index(struct page *page)
-{
- if (!PageCompound(page))
- return page->index;
-
- return __basepage_index(page);
-}
-
extern int dissolve_free_huge_page(struct page *page);
extern int dissolve_free_huge_pages(unsigned long start_pfn,
unsigned long end_pfn);
@@ -988,11 +977,6 @@ static inline int hstate_index(struct hs
return 0;
}
-static inline pgoff_t basepage_index(struct page *page)
-{
- return page->index;
-}
-
static inline int dissolve_free_huge_page(struct page *page)
{
return 0;
--- a/include/linux/pagemap.h~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/include/linux/pagemap.h
@@ -516,7 +516,7 @@ static inline struct page *read_mapping_
}
/*
- * Get index of the page with in radix-tree
+ * Get index of the page within radix-tree (but not for hugetlb pages).
* (TODO: remove once hugetlb pages will have ->index in PAGE_SIZE)
*/
static inline pgoff_t page_to_index(struct page *page)
@@ -536,14 +536,15 @@ static inline pgoff_t page_to_index(stru
}
/*
- * Get the offset in PAGE_SIZE.
- * (TODO: hugepage should have ->index in PAGE_SIZE)
+ * Get the offset in PAGE_SIZE (even for hugetlb pages).
+ * (TODO: hugetlb pages should have ->index in PAGE_SIZE)
*/
static inline pgoff_t page_to_pgoff(struct page *page)
{
- if (unlikely(PageHeadHuge(page)))
- return page->index << compound_order(page);
-
+ if (unlikely(PageHuge(page))) {
+ extern pgoff_t hugetlb_basepage_index(struct page *page);
+ return hugetlb_basepage_index(page);
+ }
return page_to_index(page);
}
--- a/kernel/futex.c~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/kernel/futex.c
@@ -35,7 +35,6 @@
#include <linux/jhash.h>
#include <linux/pagemap.h>
#include <linux/syscalls.h>
-#include <linux/hugetlb.h>
#include <linux/freezer.h>
#include <linux/memblock.h>
#include <linux/fault-inject.h>
@@ -650,7 +649,7 @@ again:
key->both.offset |= FUT_OFF_INODE; /* inode-based key */
key->shared.i_seq = get_inode_sequence_number(inode);
- key->shared.pgoff = basepage_index(tail);
+ key->shared.pgoff = page_to_pgoff(tail);
rcu_read_unlock();
}
--- a/mm/hugetlb.c~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/mm/hugetlb.c
@@ -1588,15 +1588,12 @@ struct address_space *hugetlb_page_mappi
return NULL;
}
-pgoff_t __basepage_index(struct page *page)
+pgoff_t hugetlb_basepage_index(struct page *page)
{
struct page *page_head = compound_head(page);
pgoff_t index = page_index(page_head);
unsigned long compound_idx;
- if (!PageHuge(page_head))
- return page_index(page);
-
if (compound_order(page_head) >= MAX_ORDER)
compound_idx = page_to_pfn(page) - page_to_pfn(page_head);
else
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-fix-__split_huge_pmd_locked-on-shmem-migration-entry.patch
mm-thp-make-is_huge_zero_pmd-safe-and-quicker.patch
mm-thp-try_to_unmap-use-ttu_sync-for-safe-splitting.patch
mm-thp-fix-vma_address-if-virtual-address-below-file-offset.patch
mm-thp-unmap_mapping_page-to-fix-thp-truncate_cleanup_page.patch
mm-page_vma_mapped_walk-use-page-for-pvmw-page.patch
mm-page_vma_mapped_walk-settle-pagehuge-on-entry.patch
mm-page_vma_mapped_walk-use-pmd_read_atomic.patch
mm-page_vma_mapped_walk-use-pmde-for-pvmw-pmd.patch
mm-page_vma_mapped_walk-prettify-pvmw_migration-block.patch
mm-page_vma_mapped_walk-crossing-page-table-boundary.patch
mm-page_vma_mapped_walk-add-a-level-of-indentation.patch
mm-page_vma_mapped_walk-use-goto-instead-of-while-1.patch
mm-page_vma_mapped_walk-get-vma_address_end-earlier.patch
mm-thp-fix-page_vma_mapped_walk-if-thp-mapped-by-ptes.patch
mm-thp-another-pvmw_sync-fix-in-page_vma_mapped_walk.patch
mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page.patch
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
try_grab_compound_head() is used to grab a reference to a page from
get_user_pages_fast(), which is only protected against concurrent
freeing of page tables (via local_irq_save()), but not against
concurrent TLB flushes, freeing of data pages, or splitting of compound
pages.
Because no reference is held to the page when try_grab_compound_head()
is called, the page may have been freed and reallocated by the time its
refcount has been elevated; therefore, once we're holding a stable
reference to the page, the caller re-checks whether the PTE still points
to the same page (with the same access rights).
The problem is that try_grab_compound_head() has to grab a reference on
the head page; but between the time we look up what the head page is and
the time we actually grab a reference on the head page, the compound
page may have been split up (either explicitly through split_huge_page()
or by freeing the compound page to the buddy allocator and then
allocating its individual order-0 pages).
If that happens, get_user_pages_fast() may end up returning the right
page but lifting the refcount on a now-unrelated page, leading to
use-after-free of pages.
To fix it:
Re-check whether the pages still belong together after lifting the
refcount on the head page.
Move anything else that checks compound_head(page) below the refcount
increment.
This can't actually happen on bare-metal x86 (because there, disabling
IRQs locks out remote TLB flushes), but it can happen on virtualized x86
(e.g. under KVM) and probably also on arm64. The race window is pretty
narrow, and constantly allocating and shattering hugepages isn't exactly
fast; for now I've only managed to reproduce this in an x86 KVM guest with
an artificially widened timing window (by adding a loop that repeatedly
calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits,
so that PV TLB flushes are used instead of IPIs).
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Kirill A. Shutemov <kirill(a)shutemov.name>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: stable(a)vger.kernel.org
Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
Signed-off-by: Jann Horn <jannh(a)google.com>
---
resending because linux-mm was down
mm/gup.c | 54 +++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 39 insertions(+), 15 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 3ded6a5f26b2..1f9c0ac15073 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -43,8 +43,21 @@ static void hpage_pincount_sub(struct page *page, int refs)
atomic_sub(refs, compound_pincount_ptr(page));
}
+/* Equivalent to calling put_page() @refs times. */
+static void put_page_refs(struct page *page, int refs)
+{
+ VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
+ /*
+ * Calling put_page() for each ref is unnecessarily slow. Only the last
+ * ref needs a put_page().
+ */
+ if (refs > 1)
+ page_ref_sub(page, refs - 1);
+ put_page(page);
+}
+
/*
* Return the compound head page with ref appropriately incremented,
* or NULL if that failed.
*/
@@ -55,8 +68,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
if (WARN_ON_ONCE(page_ref_count(head) < 0))
return NULL;
if (unlikely(!page_cache_add_speculative(head, refs)))
return NULL;
+
+ /*
+ * At this point we have a stable reference to the head page; but it
+ * could be that between the compound_head() lookup and the refcount
+ * increment, the compound page was split, in which case we'd end up
+ * holding a reference on a page that has nothing to do with the page
+ * we were given anymore.
+ * So now that the head page is stable, recheck that the pages still
+ * belong together.
+ */
+ if (unlikely(compound_head(page) != head)) {
+ put_page_refs(head, refs);
+ return NULL;
+ }
+
return head;
}
/*
@@ -94,25 +122,28 @@ __maybe_unused struct page *try_grab_compound_head(struct page *page,
if (unlikely((flags & FOLL_LONGTERM) &&
!is_pinnable_page(page)))
return NULL;
+ /*
+ * CAUTION: Don't use compound_head() on the page before this
+ * point, the result won't be stable.
+ */
+ page = try_get_compound_head(page, refs);
+ if (!page)
+ return NULL;
+
/*
* When pinning a compound page of order > 1 (which is what
* hpage_pincount_available() checks for), use an exact count to
* track it, via hpage_pincount_add/_sub().
*
* However, be sure to *also* increment the normal page refcount
* field at least once, so that the page really is pinned.
*/
- if (!hpage_pincount_available(page))
- refs *= GUP_PIN_COUNTING_BIAS;
-
- page = try_get_compound_head(page, refs);
- if (!page)
- return NULL;
-
if (hpage_pincount_available(page))
hpage_pincount_add(page, refs);
+ else
+ page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));
mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
orig_refs);
@@ -134,16 +165,9 @@ static void put_compound_head(struct page *page, int refs, unsigned int flags)
else
refs *= GUP_PIN_COUNTING_BIAS;
}
- VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
- /*
- * Calling put_page() for each ref is unnecessarily slow. Only the last
- * ref needs a put_page().
- */
- if (refs > 1)
- page_ref_sub(page, refs - 1);
- put_page(page);
+ put_page_refs(page, refs);
}
/**
* try_grab_page() - elevate a page's refcount by a flag-dependent amount
base-commit: 614124bea77e452aa6df7a8714e8bc820b489922
--
2.32.0.272.g935e593368-goog