On Mon, Oct 15, 2018 at 06:54:31AM -0700, Omer Tripp wrote:
> Hi Greg and all,
>
> Here is my analysis of the complete gadget, and looking forward to your
> corrections/feedback if there are any inaccuracies:
>
>
> 1.
>
> __close_fd() is reachable via the close() syscall with a user-controlled
> fd.
> 2.
>
> If said bounds check is mispredicted, then a user-controlled address
> fdt->fd[fd] is obtained then dereferenced, and the value of a
> user-controlled address is loaded into the local variable file.
> 3.
>
> file is then passed as an argument to filp_close, where the cache
> lines secret
> + offsetof(f_op) and secret + offsetof(f_mode) are hot and vulnerable to
> a timing channel attack.
>
>
> The mitigation proposed by Greg Hackmann blocks this gadget.
What ever happened to this patch? Did it get reposted? If not, can
someone please do so with this text in the changelog?
thanks,
greg k-h
At least old Xen net backends seem to send frags with no real data
sometimes. In case such a fragment happens to occur with the frag limit
already reached the frontend will BUG currently even if this situation
is easily recoverable.
Modify the BUG_ON() condition accordingly.
Cc: stable(a)vger.kernel.org
Tested-by: Dietmar Hahn <dietmar.hahn(a)ts.fujitsu.com>
Signed-off-by: Juergen Gross <jgross(a)suse.com>
---
drivers/net/xen-netfront.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index f17f602e6171..5b97cc946d70 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -905,7 +905,7 @@ static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS) {
unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
- BUG_ON(pull_to <= skb_headlen(skb));
+ BUG_ON(pull_to < skb_headlen(skb));
__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
}
if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)) {
--
2.16.4
commit d57f9da890696af1484f4a47f7f123560197865a upstream.
struct bioctx includes the ref refcount_t to track the number of I/O
fragments used to process a target BIO as well as ensure that the zone
of the BIO is kept in the active state throughout the lifetime of the
BIO. However, since decrementing of this reference count is done in the
target .end_io method, the function bio_endio() must be called multiple
times for read and write target BIOs, which causes problems with the
value of the __bi_remaining struct bio field for chained BIOs (e.g. the
clone BIO passed by dm core is large and splits into fragments by the
block layer), resulting in incorrect values and inconsistencies with the
BIO_CHAIN flag setting. This is turn triggers the BUG_ON() call:
BUG_ON(atomic_read(&bio->__bi_remaining) <= 0);
in bio_remaining_done() called from bio_endio().
Fix this ensuring that bio_endio() is called only once for any target
BIO by always using internal clone BIOs for processing any read or
write target BIO. This allows reference counting using the target BIO
context counter to trigger the target BIO completion bio_endio() call
once all data, metadata and other zone work triggered by the BIO
complete.
Overall, this simplifies the code too as the target .end_io becomes
unnecessary and differences between read and write BIO issuing and
completion processing disappear.
Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable(a)vger.kernel.org #4.14
Signed-off-by: Damien Le Moal <damien.lemoal(a)wdc.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
---
drivers/md/dm-zoned-target.c | 122 +++++++++++------------------------
1 file changed, 38 insertions(+), 84 deletions(-)
diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
index ba6b0a90ecfb..532bfce7f072 100644
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -20,7 +20,6 @@ struct dmz_bioctx {
struct dm_zone *zone;
struct bio *bio;
atomic_t ref;
- blk_status_t status;
};
/*
@@ -78,65 +77,66 @@ static inline void dmz_bio_endio(struct bio *bio, blk_status_t status)
{
struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
- if (bioctx->status == BLK_STS_OK && status != BLK_STS_OK)
- bioctx->status = status;
- bio_endio(bio);
+ if (status != BLK_STS_OK && bio->bi_status == BLK_STS_OK)
+ bio->bi_status = status;
+
+ if (atomic_dec_and_test(&bioctx->ref)) {
+ struct dm_zone *zone = bioctx->zone;
+
+ if (zone) {
+ if (bio->bi_status != BLK_STS_OK &&
+ bio_op(bio) == REQ_OP_WRITE &&
+ dmz_is_seq(zone))
+ set_bit(DMZ_SEQ_WRITE_ERR, &zone->flags);
+ dmz_deactivate_zone(zone);
+ }
+ bio_endio(bio);
+ }
}
/*
- * Partial clone read BIO completion callback. This terminates the
+ * Completion callback for an internally cloned target BIO. This terminates the
* target BIO when there are no more references to its context.
*/
-static void dmz_read_bio_end_io(struct bio *bio)
+static void dmz_clone_endio(struct bio *clone)
{
- struct dmz_bioctx *bioctx = bio->bi_private;
- blk_status_t status = bio->bi_status;
+ struct dmz_bioctx *bioctx = clone->bi_private;
+ blk_status_t status = clone->bi_status;
- bio_put(bio);
+ bio_put(clone);
dmz_bio_endio(bioctx->bio, status);
}
/*
- * Issue a BIO to a zone. The BIO may only partially process the
+ * Issue a clone of a target BIO. The clone may only partially process the
* original target BIO.
*/
-static int dmz_submit_read_bio(struct dmz_target *dmz, struct dm_zone *zone,
- struct bio *bio, sector_t chunk_block,
- unsigned int nr_blocks)
+static int dmz_submit_bio(struct dmz_target *dmz, struct dm_zone *zone,
+ struct bio *bio, sector_t chunk_block,
+ unsigned int nr_blocks)
{
struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
- sector_t sector;
struct bio *clone;
- /* BIO remap sector */
- sector = dmz_start_sect(dmz->metadata, zone) + dmz_blk2sect(chunk_block);
-
- /* If the read is not partial, there is no need to clone the BIO */
- if (nr_blocks == dmz_bio_blocks(bio)) {
- /* Setup and submit the BIO */
- bio->bi_iter.bi_sector = sector;
- atomic_inc(&bioctx->ref);
- generic_make_request(bio);
- return 0;
- }
-
- /* Partial BIO: we need to clone the BIO */
clone = bio_clone_fast(bio, GFP_NOIO, dmz->bio_set);
if (!clone)
return -ENOMEM;
- /* Setup the clone */
- clone->bi_iter.bi_sector = sector;
+ bio_set_dev(clone, dmz->dev->bdev);
+ clone->bi_iter.bi_sector =
+ dmz_start_sect(dmz->metadata, zone) + dmz_blk2sect(chunk_block);
clone->bi_iter.bi_size = dmz_blk2sect(nr_blocks) << SECTOR_SHIFT;
- clone->bi_end_io = dmz_read_bio_end_io;
+ clone->bi_end_io = dmz_clone_endio;
clone->bi_private = bioctx;
bio_advance(bio, clone->bi_iter.bi_size);
- /* Submit the clone */
atomic_inc(&bioctx->ref);
generic_make_request(clone);
+ if (bio_op(bio) == REQ_OP_WRITE && dmz_is_seq(zone))
+ zone->wp_block += nr_blocks;
+
return 0;
}
@@ -214,7 +214,7 @@ static int dmz_handle_read(struct dmz_target *dmz, struct dm_zone *zone,
if (nr_blocks) {
/* Valid blocks found: read them */
nr_blocks = min_t(unsigned int, nr_blocks, end_block - chunk_block);
- ret = dmz_submit_read_bio(dmz, rzone, bio, chunk_block, nr_blocks);
+ ret = dmz_submit_bio(dmz, rzone, bio, chunk_block, nr_blocks);
if (ret)
return ret;
chunk_block += nr_blocks;
@@ -228,25 +228,6 @@ static int dmz_handle_read(struct dmz_target *dmz, struct dm_zone *zone,
return 0;
}
-/*
- * Issue a write BIO to a zone.
- */
-static void dmz_submit_write_bio(struct dmz_target *dmz, struct dm_zone *zone,
- struct bio *bio, sector_t chunk_block,
- unsigned int nr_blocks)
-{
- struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
-
- /* Setup and submit the BIO */
- bio_set_dev(bio, dmz->dev->bdev);
- bio->bi_iter.bi_sector = dmz_start_sect(dmz->metadata, zone) + dmz_blk2sect(chunk_block);
- atomic_inc(&bioctx->ref);
- generic_make_request(bio);
-
- if (dmz_is_seq(zone))
- zone->wp_block += nr_blocks;
-}
-
/*
* Write blocks directly in a data zone, at the write pointer.
* If a buffer zone is assigned, invalidate the blocks written
@@ -265,7 +246,9 @@ static int dmz_handle_direct_write(struct dmz_target *dmz,
return -EROFS;
/* Submit write */
- dmz_submit_write_bio(dmz, zone, bio, chunk_block, nr_blocks);
+ ret = dmz_submit_bio(dmz, zone, bio, chunk_block, nr_blocks);
+ if (ret)
+ return ret;
/*
* Validate the blocks in the data zone and invalidate
@@ -301,7 +284,9 @@ static int dmz_handle_buffered_write(struct dmz_target *dmz,
return -EROFS;
/* Submit write */
- dmz_submit_write_bio(dmz, bzone, bio, chunk_block, nr_blocks);
+ ret = dmz_submit_bio(dmz, bzone, bio, chunk_block, nr_blocks);
+ if (ret)
+ return ret;
/*
* Validate the blocks in the buffer zone
@@ -600,7 +585,6 @@ static int dmz_map(struct dm_target *ti, struct bio *bio)
bioctx->zone = NULL;
bioctx->bio = bio;
atomic_set(&bioctx->ref, 1);
- bioctx->status = BLK_STS_OK;
/* Set the BIO pending in the flush list */
if (!nr_sectors && bio_op(bio) == REQ_OP_WRITE) {
@@ -623,35 +607,6 @@ static int dmz_map(struct dm_target *ti, struct bio *bio)
return DM_MAPIO_SUBMITTED;
}
-/*
- * Completed target BIO processing.
- */
-static int dmz_end_io(struct dm_target *ti, struct bio *bio, blk_status_t *error)
-{
- struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
-
- if (bioctx->status == BLK_STS_OK && *error)
- bioctx->status = *error;
-
- if (!atomic_dec_and_test(&bioctx->ref))
- return DM_ENDIO_INCOMPLETE;
-
- /* Done */
- bio->bi_status = bioctx->status;
-
- if (bioctx->zone) {
- struct dm_zone *zone = bioctx->zone;
-
- if (*error && bio_op(bio) == REQ_OP_WRITE) {
- if (dmz_is_seq(zone))
- set_bit(DMZ_SEQ_WRITE_ERR, &zone->flags);
- }
- dmz_deactivate_zone(zone);
- }
-
- return DM_ENDIO_DONE;
-}
-
/*
* Get zoned device information.
*/
@@ -946,7 +901,6 @@ static struct target_type dmz_type = {
.ctr = dmz_ctr,
.dtr = dmz_dtr,
.map = dmz_map,
- .end_io = dmz_end_io,
.io_hints = dmz_io_hints,
.prepare_ioctl = dmz_prepare_ioctl,
.postsuspend = dmz_suspend,
--
2.19.2
On Wed, 12 Dec 2018, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen(a)linux.intel.com>
>
> Memory protection key behavior should be the same in a child as it was
> in the parent before a fork. But, there is a bug that resets the
> state in the child at fork instead of preserving it.
>
> Our creation of new mm's is a bit convoluted. At fork(), the code
> does:
>
> 1. memcpy() the parent mm to initialize child
> 2. mm_init() to initalize some select stuff stuff
> 3. dup_mmap() to create true copies that memcpy()
> did not do right.
>
> For pkeys, we need to preserve two bits of state across a fork:
> 'execute_only_pkey' and 'pkey_allocation_map'. Those are preserved by
> the memcpy(), which I thought did the right thing. But, mm_init()
> calls init_new_context(), which I thought was *only* for execve()-time
> and overwrites 'execute_only_pkey' and 'pkey_allocation_map' with
> "new" values. But, alas, init_new_context() is used at execve() and
> fork().
>
> The result is that, after a fork(), the child's pkey state ends up
> looking like it does after an execve(), which is totally wrong. pkeys
> that are already allocated can be allocated again, for instance.
>
> To fix this, add code called by dup_mmap() to copy the pkey state from
> parent to child explicitly. Also add a comment above init_new_context()
> to make it more clear to the next poor sod what this code is used for.
>
> Fixes: e8c24d3a23a ("x86/pkeys: Allocation/free syscalls")
> Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
> Cc: Thomas Gleixner <tglx(a)linutronix.de>
> Cc: Ingo Molnar <mingo(a)redhat.com>
> Cc: Borislav Petkov <bp(a)alien8.de>
> Cc: "H. Peter Anvin" <hpa(a)zytor.com>
> Cc: x86(a)kernel.org
> Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
> Cc: Peter Zijlstra <peterz(a)infradead.org>
> Cc: Michael Ellerman <mpe(a)ellerman.id.au>
> Cc: Will Deacon <will.deacon(a)arm.com>
> Cc: Andy Lutomirski <luto(a)kernel.org>
> Cc: Joerg Roedel <jroedel(a)suse.de>
> Cc: stable(a)vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx(a)linutronix.de>
[ resending without broken headers ]
this is a backport of commit 7aa54be297655 ("locking/qspinlock, x86:
Provide liveness guarantee") for the v4.9 stable tree.
For the v4.4 tree the ARCH_USE_QUEUED_SPINLOCKS option got disabled on
x86.
For v4.9 it has been decided to do a minimal backport of the final fix
(including all its dependencies).
With this backport I can't reproduce the issue in the latest v4.9-RT
tree. I was able to boot (and use) an arm64 box with these patches so it
is not broken in an abvious way.
Sebastian
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken
into account as well.
Instead of writing the required complicated code for this rare
occurrence, just eliminate the race. i_mmap_rwsem is now held in read
mode for the duration of page fault processing. Hold i_mmap_rwsem
longer in truncation and hold punch code to cover the call to
remove_inode_hugepages.
Cc: <stable(a)vger.kernel.org>
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
fs/hugetlbfs/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 32920a10100e..3244147fc42b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -505,8 +505,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
return 0;
}
@@ -540,8 +540,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+ i_mmap_unlock_write(mapping);
inode_unlock(inode);
}
--
2.17.2
this is a backport of commit 7aa54be297655 ("locking/qspinlock, x86:
Provide liveness guarantee") for the v4.19 stable tree.
Initially I assumed that this was merged late in v4.19-rc but actually
it is just part v4.20-rc1.
For v4.19, most things are already in the tree. The GEN_BINARY_RMWcc
macro is still "old" and I skipped the documentation update.
Sebastian
The patch titled
Subject: hugetlbfs: remove unnecessary code after i_mmap_rwsem synchronization
has been removed from the -mm tree. Its filename was
hugetlbfs-remove-unnecessary-code-after-i_mmap_rwsem-synchronization.patch
This patch was dropped because an updated version will be merged
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: remove unnecessary code after i_mmap_rwsem synchronization
After expanding i_mmap_rwsem use for better shared pmd and page fault/
truncation synchronization, remove code that is no longer necessary.
Link: http://lkml.kernel.org/r/20181203200850.6460-4-mike.kravetz@oracle.com
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hugetlbfs/inode.c | 46 +++++++++++++----------------------------
mm/hugetlb.c | 21 ++++++++----------
2 files changed, 25 insertions(+), 42 deletions(-)
--- a/fs/hugetlbfs/inode.c~hugetlbfs-remove-unnecessary-code-after-i_mmap_rwsem-synchronization
+++ a/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cac
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserv map
* deleted. The region/reserv map for ranges without associated
- * pages are not modified. Page faults can race with hole punch.
- * This is indicated if we find a mapped page.
+ * pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
*/
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struc
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash;
index = page->index;
- hash = hugetlb_fault_mutex_hash(h, current->mm,
- &pseudo_vma,
- mapping, index, 0);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If page is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * A mapped page is impossible as callers should unmap
+ * all references before calling. And, i_mmap_rwsem
+ * prevents the creation of additional mappings.
*/
- if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
- i_mmap_unlock_write(mapping);
- }
+ VM_BUG_ON(page_mapped(page));
lock_page(page);
/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struc
}
unlock_page(page);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -624,7 +604,11 @@ static long hugetlbfs_fallocate(struct f
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
- /* mutex taken here, fault path and hole punch */
+ /*
+ * fault mutex taken here, protects against fault path
+ * and hole punch. inode_lock previously taken protects
+ * against truncation.
+ */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
--- a/mm/hugetlb.c~hugetlbfs-remove-unnecessary-code-after-i_mmap_rwsem-synchronization
+++ a/mm/hugetlb.c
@@ -3761,16 +3761,16 @@ static vm_fault_t hugetlb_no_page(struct
}
/*
- * Use page lock to guard against racing truncation
- * before we get page_table_lock.
+ * We can not race with truncation due to holding i_mmap_rwsem.
+ * Check once here for faults beyond end of file.
*/
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
/*
* Check for page in userfault range
*/
@@ -3860,9 +3860,6 @@ retry:
}
ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3965,8 +3962,10 @@ vm_fault_t hugetlb_fault(struct mm_struc
/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with ptep. This serves two purposes:
+ * 1) It prevents huge_pmd_unshare from being called elsewhere
+ * and making the ptep no longer valid.
+ * 2) It synchronizes us with file truncation.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
The patch titled
Subject: hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
has been removed from the -mm tree. Its filename was
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch
This patch was dropped because an updated version will be merged
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken into
account as well.
Instead of writing the required complicated code for this rare occurrence,
just eliminate the race. i_mmap_rwsem is now held in read mode for the
duration of page fault processing. Hold i_mmap_rwsem longer in truncation
and hold punch code to cover the call to remove_inode_hugepages.
Link: http://lkml.kernel.org/r/20181203200850.6460-3-mike.kravetz@oracle.com
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hugetlbfs/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/fs/hugetlbfs/inode.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race
+++ a/fs/hugetlbfs/inode.c
@@ -505,8 +505,8 @@ static int hugetlb_vmtruncate(struct ino
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
return 0;
}
@@ -540,8 +540,8 @@ static long hugetlbfs_punch_hole(struct
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+ i_mmap_unlock_write(mapping);
inode_unlock(inode);
}
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
hugetlbfs-remove-unnecessary-code-after-i_mmap_rwsem-synchronization.patch