After expanding i_mmap_rwsem use for better shared pmd and page fault/
truncation synchronization, remove code that is no longer necessary.
Cc: <stable(a)vger.kernel.org>
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
fs/hugetlbfs/inode.c | 46 +++++++++++++++-----------------------------
mm/hugetlb.c | 21 ++++++++++----------
2 files changed, 25 insertions(+), 42 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3244147fc42b..a9c00c6ef80d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserv map
* deleted. The region/reserv map for ranges without associated
- * pages are not modified. Page faults can race with hole punch.
- * This is indicated if we find a mapped page.
+ * pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
*/
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash;
index = page->index;
- hash = hugetlb_fault_mutex_hash(h, current->mm,
- &pseudo_vma,
- mapping, index, 0);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If page is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * A mapped page is impossible as callers should unmap
+ * all references before calling. And, i_mmap_rwsem
+ * prevents the creation of additional mappings.
*/
- if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
- i_mmap_unlock_write(mapping);
- }
+ VM_BUG_ON(page_mapped(page));
lock_page(page);
/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
}
unlock_page(page);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -624,7 +604,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
- /* mutex taken here, fault path and hole punch */
+ /*
+ * fault mutex taken here, protects against fault path
+ * and hole punch. inode_lock previously taken protects
+ * against truncation.
+ */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 362601b69c56..89e1a253a40b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3760,16 +3760,16 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
/*
- * Use page lock to guard against racing truncation
- * before we get page_table_lock.
+ * We can not race with truncation due to holding i_mmap_rwsem.
+ * Check once here for faults beyond end of file.
*/
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
/*
* Check for page in userfault range
*/
@@ -3859,9 +3859,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3964,8 +3961,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with ptep. This serves two purposes:
+ * 1) It prevents huge_pmd_unshare from being called elsewhere
+ * and making the ptep no longer valid.
+ * 2) It synchronizes us with file truncation.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
--
2.17.2
This is the start of the stable review cycle for the 4.9.146 release.
There are 51 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun Dec 16 11:56:52 UTC 2018.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.146-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.9.146-rc1
Guenter Roeck <linux(a)roeck-us.net>
staging: speakup: Replace strncpy with memcpy
Namhyung Kim <namhyung(a)kernel.org>
pstore: Convert console write to use ->write_buf
Pan Bian <bianpan2016(a)163.com>
ocfs2: fix potential use after free
Qian Cai <cai(a)gmx.us>
debugobjects: avoid recursive calls with kmemleak
Pan Bian <bianpan2016(a)163.com>
hfsplus: do not free node before using
Pan Bian <bianpan2016(a)163.com>
hfs: do not free node before using
Larry Chen <lchen(a)suse.com>
ocfs2: fix deadlock caused by ocfs2_defrag_extent()
Colin Ian King <colin.king(a)canonical.com>
fscache, cachefiles: remove redundant variable 'cache'
NeilBrown <neilb(a)suse.com>
fscache: fix race between enablement and dropping of object
Srikanth Boddepalli <boddepalli.srikanth(a)gmail.com>
xen: xlate_mmu: add missing header to fix 'W=1' warning
Y.C. Chen <yc_chen(a)aspeedtech.com>
drm/ast: fixed reading monitor EDID not stable issue
Pan Bian <bianpan2016(a)163.com>
net: hisilicon: remove unexpected free_netdev
Josh Elsasser <jelsasser(a)appneta.com>
ixgbe: recognize 1000BaseLX SFP modules as 1Gbps
Yunjian Wang <wangyunjian(a)huawei.com>
igb: fix uninitialized variables
Kiran Kumar Modukuri <kiran.modukuri(a)gmail.com>
cachefiles: Fix page leak in cachefiles_read_backing_file while vmscan is active
Lorenzo Bianconi <lorenzo.bianconi(a)redhat.com>
net: thunderx: fix NULL pointer dereference in nic_remove
Yi Wang <wang.yi59(a)zte.com.cn>
x86/kvm/vmx: fix old-style function declaration
Yi Wang <wang.yi59(a)zte.com.cn>
KVM: x86: fix empty-body warnings
Aaro Koskinen <aaro.koskinen(a)iki.fi>
USB: omap_udc: fix USB gadget functionality on Palm Tungsten E
Aaro Koskinen <aaro.koskinen(a)iki.fi>
USB: omap_udc: fix omap_udc_start() on 15xx machines
Aaro Koskinen <aaro.koskinen(a)iki.fi>
USB: omap_udc: fix crashes on probe error and module removal
Aaro Koskinen <aaro.koskinen(a)iki.fi>
USB: omap_udc: use devm_request_irq()
Xin Long <lucien.xin(a)gmail.com>
ipvs: call ip_vs_dst_notifier earlier than ipv6_dev_notf
Martynas Pumputis <m(a)lambda.lt>
bpf: fix check of allowed specifiers in bpf_trace_printk
Pan Bian <bianpan2016(a)163.com>
exportfs: do not read dentry after free
Peter Ujfalusi <peter.ujfalusi(a)ti.com>
ASoC: omap-dmic: Add pm_qos handling to avoid overruns with CPU_IDLE
Peter Ujfalusi <peter.ujfalusi(a)ti.com>
ASoC: omap-mcpdm: Add pm_qos handling to avoid under/overruns with CPU_IDLE
Majd Dibbiny <majd(a)mellanox.com>
RDMA/mlx5: Fix fence type for IB_WR_LOCAL_INV WR
Robbie Ko <robbieko(a)synology.com>
Btrfs: send, fix infinite loop due to directory rename dependencies
Artem Savkov <asavkov(a)redhat.com>
objtool: Fix segfault in .cold detection with -ffunction-sections
Artem Savkov <asavkov(a)redhat.com>
objtool: Fix double-free in .cold detection error path
Huacai Chen <chenhc(a)lemote.com>
hwmon: (w83795) temp4_type has writable permission
Tzung-Bi Shih <tzungbi(a)google.com>
ASoC: dapm: Recalculate audio map forcely when card instantiated
Peter Ujfalusi <peter.ujfalusi(a)ti.com>
ASoC: omap-abe-twl6040: Fix missing audio card caused by deferred probing
Nicolin Chen <nicoleotsuka(a)gmail.com>
hwmon: (ina2xx) Fix current value calculation
Thomas Richter <tmricht(a)linux.ibm.com>
s390/cpum_cf: Reject request for sampling in event initialization
Florian Westphal <fw(a)strlen.de>
selftests: add script to stress-test nft packet path vs. control plane
YueHaibing <yuehaibing(a)huawei.com>
sysv: return 'err' instead of 0 in __sysv_write_inode
Janusz Krzysztofik <jmkrzyszt(a)gmail.com>
ARM: OMAP1: ams-delta: Fix possible use of uninitialized field
Adam Ford <aford173(a)gmail.com>
ARM: dts: logicpd-somlv: Fix interrupt on mmc3_dat1
Nathan Chancellor <natechancellor(a)gmail.com>
ARM: OMAP2+: prm44xx: Fix section annotation on omap44xx_prm_enable_io_wakeup
Stefano Brivio <sbrivio(a)redhat.com>
neighbour: Avoid writing before skb->head in neigh_hh_output()
Nicolas Dichtel <nicolas.dichtel(a)6wind.com>
tun: forbid iface creation with rtnl ops
Yuchung Cheng <ycheng(a)google.com>
tcp: fix NULL ref in tail loss probe
Eric Dumazet <edumazet(a)google.com>
rtnetlink: ndo_dflt_fdb_dump() only work for ARPHRD_ETHER devices
Christoph Paasch <cpaasch(a)apple.com>
net: Prevent invalid access to skb->prev in __qdisc_drop_all
Heiner Kallweit <hkallweit1(a)gmail.com>
net: phy: don't allow __set_phy_supported to add unsupported modes
Tarick Bedeir <tarick(a)google.com>
net/mlx4_core: Correctly set PFC param if global pause is turned off.
Su Yanjun <suyj.fnst(a)cn.fujitsu.com>
net: 8139cp: fix a BUG triggered by changing mtu with network traffic
Stefano Brivio <sbrivio(a)redhat.com>
ipv6: Check available headroom in ip6_xmit() even without options
Jiri Wiesner <jwiesner(a)suse.com>
ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
-------------
Diffstat:
Makefile | 4 +-
arch/arm/boot/dts/logicpd-som-lv.dtsi | 2 +-
arch/arm/mach-omap1/board-ams-delta.c | 3 +
arch/arm/mach-omap2/prm44xx.c | 2 +-
arch/s390/kernel/perf_cpum_cf.c | 2 +
arch/x86/kvm/lapic.c | 2 +-
arch/x86/kvm/vmx.c | 8 +-
drivers/gpu/drm/ast/ast_mode.c | 36 +++++++--
drivers/hwmon/ina2xx.c | 2 +-
drivers/hwmon/w83795.c | 2 +-
drivers/infiniband/hw/mlx5/qp.c | 19 ++---
drivers/net/ethernet/cavium/thunder/nic_main.c | 3 +
drivers/net/ethernet/hisilicon/hip04_eth.c | 4 +-
drivers/net/ethernet/intel/igb/e1000_i210.c | 1 +
drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 4 +-
drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 4 +-
drivers/net/ethernet/realtek/8139cp.c | 5 ++
drivers/net/phy/phy_device.c | 19 ++---
drivers/net/tun.c | 6 +-
drivers/staging/speakup/kobjects.c | 4 +-
drivers/usb/gadget/udc/omap_udc.c | 87 ++++++++--------------
drivers/xen/xlate_mmu.c | 1 +
fs/btrfs/send.c | 11 ++-
fs/cachefiles/rdwr.c | 9 ++-
fs/exportfs/expfs.c | 2 +-
fs/fscache/object.c | 3 +
fs/hfs/btree.c | 3 +-
fs/hfsplus/btree.c | 3 +-
fs/ocfs2/export.c | 2 +-
fs/ocfs2/move_extents.c | 47 ++++++------
fs/pstore/platform.c | 4 +-
fs/sysv/inode.c | 2 +-
include/net/neighbour.h | 28 +++++--
kernel/trace/bpf_trace.c | 8 +-
lib/debugobjects.c | 3 +-
net/core/rtnetlink.c | 3 +
net/ipv4/ip_fragment.c | 7 ++
net/ipv4/tcp_output.c | 12 ++-
net/ipv6/ip6_output.c | 42 +++++------
net/ipv6/netfilter/nf_conntrack_reasm.c | 8 +-
net/ipv6/reassembly.c | 8 +-
net/netfilter/ipvs/ip_vs_ctl.c | 3 +
net/sched/sch_netem.c | 3 +
sound/soc/omap/omap-abe-twl6040.c | 67 ++++++++---------
sound/soc/omap/omap-dmic.c | 9 +++
sound/soc/omap/omap-mcpdm.c | 43 ++++++++++-
sound/soc/soc-core.c | 1 +
tools/objtool/elf.c | 19 ++++-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/netfilter/Makefile | 6 ++
tools/testing/selftests/netfilter/config | 2 +
.../selftests/netfilter/nft_trans_stress.sh | 78 +++++++++++++++++++
52 files changed, 439 insertions(+), 218 deletions(-)
Mr Gleixner,
I was upset when I compiled 4.14.87 and found that SCHED_SMT had been
forced on. At the time, I just reported it on my blog, posted a
question about it to a couple of forums, and stayed with an earlier
kernel.
However, when 4.14.88 came out and the situation was still the same, alarm
bells went off, and I looked through the kernel changelog. I found it in
4.14.86:
"x86/Kconfig: Select SCHED_SMT if SMP enabled"
Then:
"CONFIG_SCHED_SMT is enabled by all distros, so there is not a real point to
have it configurable. ..."
...that is a lie. It would be correct to state that it is true of the
distros you use, and presumably also for all of you guys who signed
off on it.
Puppy Linux is an example of a distro that has mostly not had
SCHED_SMT enabled. Ditto for most of the forks of Puppy. Two distros
that I currently maintain, Quirky and EasyOS (easyos.org), have SMP
enabled but not SCHED_SMT.
The difference between them is important; they should remain
independently settable. I am so surprised that all of you guys went
along with forcing it on.
For the record, my blog post:
http://bkhome.org/news/201812/kernel-41487-compiled.html
Regards,
Barry Kauler