hrtimer_cancel() busy-waits for the hrtimer callback to stop,
pretty much like del_timer_sync(). This creates a possible deadlock
scenario where we hold a spinlock before calling hrtimer_cancel()
while the callback tries to acquire the same spinlock.
This kind of deadlock is already known and is catchable by lockdep;
as with del_timer_sync(), we could add lockdep annotations, but they
are still missing for hrtimer_cancel(). (I have a WIP patch to add
them for hrtimer_cancel(), but it breaks booting.)
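For reference, the problematic pattern looks roughly like this (a minimal
illustrative sketch, not code from this patch):

    static enum hrtimer_restart foo_timer_fn(struct hrtimer *t)
    {
            spin_lock(&foo_lock);           /* B: spins here while A holds the lock */
            /* ... */
            spin_unlock(&foo_lock);
            return HRTIMER_NORESTART;
    }

    /* elsewhere, with foo_timer_fn() possibly running concurrently: */
    spin_lock(&foo_lock);                   /* A: lock held ... */
    hrtimer_cancel(&foo_timer);             /* ... while busy-waiting for B to finish */
    spin_unlock(&foo_lock);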
There is such a deadlock scenario in kernel/events/core.c too; in fact
it is an even simpler version: the hrtimer callback ends up waiting
for itself to finish on the same CPU! That sounds silly, but it is not
obvious at all, because it hides deep in the perf event code:
cpu_clock_event_init():
perf_swevent_init_hrtimer():
hwc->hrtimer.function = perf_swevent_hrtimer;
perf_swevent_hrtimer():
__perf_event_overflow():
__perf_event_account_interrupt():
perf_adjust_period():
pmu->stop():
cpu_clock_event_stop():
perf_swevent_cancel():
hrtimer_cancel()
Getting stuck in a timer does not sound very scary, but in this case
the consequences are disastrous:
perf_event_overflow(), which calls __perf_event_overflow(), is also
called from the NMI handler, so it races with the hrtimer callback,
since disabling IRQs cannot possibly disable NMIs. This means that once
this hrtimer callback is interrupted by an NMI handler, it could
deadlock within the NMI!
As a further consequence, other IRQ handling is blocked too, notably
IPI handling, especially when smp_call_function_*() waits synchronously
for its callbacks. This is why we saw so many soft lockups in
smp_call_function_single(), given how widely it is used in the kernel.
Ironically, the perf event code itself uses synchronous
smp_call_function_single() heavily too.
The fix is not easy. To minimize the impact, ideally we should just
avoid busy-waiting when hrtimer_cancel() is called from within the
hrtimer callback on the same CPU; there is no reason for the callback
to wait for itself to finish anyway. It probably does not even need to
cancel itself either, since it will be restarted by pmu->start() later.
There are two possible fixes here:
1. Modify the hrtimer API to detect whether an hrtimer callback is
running on the current CPU. This does not look pretty, though.
2. Pass some information from perf_swevent_hrtimer() down to
perf_swevent_cancel().
I picked the latter approach, as it is simple and straightforward: when
waiting is not allowed, perf_swevent_cancel_hrtimer() now uses
hrtimer_try_to_cancel(), which does not wait for a running callback to
finish, instead of hrtimer_cancel().
Note that perf_swevent_hrtimer() currently still races with
perf_event_overflow() in NMI context on the same CPU anyway, since
there is no locking around it and locking probably would not even help.
But that is nothing new, and the race itself is not that bad either: at
most we get some inconsistent updates of the event sample period.
Fixes: abd50713944c ("perf: Reimplement frequency driven sampling")
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme(a)kernel.org>
Cc: Alexander Shishkin <alexander.shishkin(a)linux.intel.com>
Cc: Jiri Olsa <jolsa(a)redhat.com>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong(a)gmail.com>
---
include/linux/perf_event.h | 3 +++
kernel/events/core.c | 43 +++++++++++++++++++++++++++----------------
2 files changed, 30 insertions(+), 16 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1fa12887ec02..aab39b8aa720 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -310,6 +310,9 @@ struct pmu {
#define PERF_EF_START 0x01 /* start the counter when adding */
#define PERF_EF_RELOAD 0x02 /* reload the counter when starting */
#define PERF_EF_UPDATE 0x04 /* update the counter when stopping */
+#define PERF_EF_NO_WAIT 0x08 /* do not wait when stopping, for
+ * example, waiting for a timer
+ */
/*
* Adds/Removes a counter to/from the PMU, can be done inside a
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8f0434a9951a..f15832346b35 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3555,7 +3555,8 @@ do { \
static DEFINE_PER_CPU(int, perf_throttled_count);
static DEFINE_PER_CPU(u64, perf_throttled_seq);
-static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bool disable)
+static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count,
+ bool disable, bool nowait)
{
struct hw_perf_event *hwc = &event->hw;
s64 period, sample_period;
@@ -3574,8 +3575,13 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
hwc->sample_period = sample_period;
if (local64_read(&hwc->period_left) > 8*sample_period) {
- if (disable)
- event->pmu->stop(event, PERF_EF_UPDATE);
+ if (disable) {
+ int flags = PERF_EF_UPDATE;
+
+ if (nowait)
+ flags |= PERF_EF_NO_WAIT;
+ event->pmu->stop(event, flags);
+ }
local64_set(&hwc->period_left, 0);
@@ -3645,7 +3651,7 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
* twice.
*/
if (delta > 0)
- perf_adjust_period(event, period, delta, false);
+ perf_adjust_period(event, period, delta, false, false);
event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
next:
@@ -7681,7 +7687,8 @@ static void perf_log_itrace_start(struct perf_event *event)
}
static int
-__perf_event_account_interrupt(struct perf_event *event, int throttle)
+__perf_event_account_interrupt(struct perf_event *event, int throttle,
+ bool nowait)
{
struct hw_perf_event *hwc = &event->hw;
int ret = 0;
@@ -7710,7 +7717,8 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)
hwc->freq_time_stamp = now;
if (delta > 0 && delta < 2*TICK_NSEC)
- perf_adjust_period(event, delta, hwc->last_period, true);
+ perf_adjust_period(event, delta, hwc->last_period, true,
+ nowait);
}
return ret;
@@ -7718,7 +7726,7 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)
int perf_event_account_interrupt(struct perf_event *event)
{
- return __perf_event_account_interrupt(event, 1);
+ return __perf_event_account_interrupt(event, 1, false);
}
/*
@@ -7727,7 +7735,7 @@ int perf_event_account_interrupt(struct perf_event *event)
static int __perf_event_overflow(struct perf_event *event,
int throttle, struct perf_sample_data *data,
- struct pt_regs *regs)
+ struct pt_regs *regs, bool nowait)
{
int events = atomic_read(&event->event_limit);
int ret = 0;
@@ -7739,7 +7747,7 @@ static int __perf_event_overflow(struct perf_event *event,
if (unlikely(!is_sampling_event(event)))
return 0;
- ret = __perf_event_account_interrupt(event, throttle);
+ ret = __perf_event_account_interrupt(event, throttle, nowait);
/*
* XXX event_limit might not quite work as expected on inherited
@@ -7768,7 +7776,7 @@ int perf_event_overflow(struct perf_event *event,
struct perf_sample_data *data,
struct pt_regs *regs)
{
- return __perf_event_overflow(event, 1, data, regs);
+ return __perf_event_overflow(event, 1, data, regs, true);
}
/*
@@ -7831,7 +7839,7 @@ static void perf_swevent_overflow(struct perf_event *event, u64 overflow,
for (; overflow; overflow--) {
if (__perf_event_overflow(event, throttle,
- data, regs)) {
+ data, regs, false)) {
/*
* We inhibit the overflow from happening when
* hwc->interrupts == MAX_INTERRUPTS.
@@ -9110,7 +9118,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
if (regs && !perf_exclude_event(event, regs)) {
if (!(event->attr.exclude_idle && is_idle_task(current)))
- if (__perf_event_overflow(event, 1, &data, regs))
+ if (__perf_event_overflow(event, 1, &data, regs, true))
ret = HRTIMER_NORESTART;
}
@@ -9141,7 +9149,7 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
HRTIMER_MODE_REL_PINNED);
}
-static void perf_swevent_cancel_hrtimer(struct perf_event *event)
+static void perf_swevent_cancel_hrtimer(struct perf_event *event, bool sync)
{
struct hw_perf_event *hwc = &event->hw;
@@ -9149,7 +9157,10 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
local64_set(&hwc->period_left, ktime_to_ns(remaining));
- hrtimer_cancel(&hwc->hrtimer);
+ if (sync)
+ hrtimer_cancel(&hwc->hrtimer);
+ else
+ hrtimer_try_to_cancel(&hwc->hrtimer);
}
}
@@ -9200,7 +9211,7 @@ static void cpu_clock_event_start(struct perf_event *event, int flags)
static void cpu_clock_event_stop(struct perf_event *event, int flags)
{
- perf_swevent_cancel_hrtimer(event);
+ perf_swevent_cancel_hrtimer(event, flags & PERF_EF_NO_WAIT);
cpu_clock_event_update(event);
}
@@ -9277,7 +9288,7 @@ static void task_clock_event_start(struct perf_event *event, int flags)
static void task_clock_event_stop(struct perf_event *event, int flags)
{
- perf_swevent_cancel_hrtimer(event);
+ perf_swevent_cancel_hrtimer(event, flags & PERF_EF_NO_WAIT);
task_clock_event_update(event, event->ctx->time);
}
--
2.14.4
Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
allows the writeback rate to be faster if there is no I/O request on a
bcache device. It works well if there is only one bcache device attached
to the cache set. If there are many bcache devices attached to a cache
set, it may introduce a performance regression, because the faster
writeback threads of the idle bcache devices will compete for the
btree-level locks with the bcache devices that do have incoming I/O.
This patch fixes the above issue by only permitting fast writeback when
all bcache devices attached to the cache set are idle. If one of the
bcache devices gets a new I/O request, the writeback rate of all bcache
devices is minimized immediately, and the PI controller in
__update_writeback_rate() then decides the upcoming writeback rate for
each bcache device.
Also, when all bcache devices are idle, limiting the writeback rate to a
small number wastes throughput, especially when the backing devices are
slower non-rotational devices (e.g. SATA SSDs). This patch sets a maximum
writeback rate for each backing device if the whole cache set is idle. A
faster writeback rate in idle time means new I/Os may have more available
space for dirty data, so people may observe better write performance.
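As a rough illustration of the timing (assuming the default
writeback_rate_update_seconds of 5, which is an assumption not spelled
out in this message): every run of update_writeback_rate() increments a
per-cache-set idle counter, and the maximum rate only kicks in once that
counter exceeds 10x the number of dirty cached devices, i.e. after
roughly 10 update rounds (about 50 seconds) with no new I/O on any
bcache device attached to the cache set.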
Please note that bcache may change its cache mode at run time, and this
patch still works if the cache mode is switched away from writeback mode
while there is still dirty data on the cache.
Fixes: b1092c9af9ed ("bcache: allow quick writeback when backing idle")
Cc: stable(a)vger.kernel.org #4.16+
Signed-off-by: Coly Li <colyli(a)suse.de>
Tested-by: Kai Krakow <kai(a)kaishome.de>
Cc: Michael Lyle <mlyle(a)lyle.org>
Cc: Stefan Priebe <s.priebe(a)profihost.ag>
---
Changelog:
v2, Fix a deadlock reported by Stefan Priebe.
v1, Initial version.
drivers/md/bcache/bcache.h | 11 ++--
drivers/md/bcache/request.c | 51 ++++++++++++++-
drivers/md/bcache/super.c | 1 +
drivers/md/bcache/sysfs.c | 14 +++--
drivers/md/bcache/util.c | 2 +-
drivers/md/bcache/util.h | 2 +-
drivers/md/bcache/writeback.c | 115 ++++++++++++++++++++++++++--------
7 files changed, 155 insertions(+), 41 deletions(-)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index d6bf294f3907..469ab1a955e0 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -328,13 +328,6 @@ struct cached_dev {
*/
atomic_t has_dirty;
- /*
- * Set to zero by things that touch the backing volume-- except
- * writeback. Incremented by writeback. Used to determine when to
- * accelerate idle writeback.
- */
- atomic_t backing_idle;
-
struct bch_ratelimit writeback_rate;
struct delayed_work writeback_rate_update;
@@ -514,6 +507,8 @@ struct cache_set {
struct cache_accounting accounting;
unsigned long flags;
+ atomic_t idle_counter;
+ atomic_t at_max_writeback_rate;
struct cache_sb sb;
@@ -523,6 +518,8 @@ struct cache_set {
struct bcache_device **devices;
unsigned devices_max_used;
+ /* See set_at_max_writeback_rate() for how it is used */
+ unsigned previous_dirty_dc_nr;
struct list_head cached_devs;
uint64_t cached_dev_sectors;
struct closure caching;
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ae67f5fa8047..1af3d96abfa5 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1104,6 +1104,43 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
/* Cached devices - read & write stuff */
+static void quit_max_writeback_rate(struct cache_set *c,
+ struct cached_dev *this_dc)
+{
+ int i;
+ struct bcache_device *d;
+ struct cached_dev *dc;
+
+ /*
+ * If bch_register_lock is acquired by other attach/detach operations,
+ * waiting here will increase I/O request latency for seconds or more.
+ * To avoid such a situation, only the writeback rate of the current
+ * cached device is set to 1, and __update_writeback_rate() will decide the
+ * writeback rate of the other cached devices (remember c->idle_counter is 0 now).
+ */
+ if (mutex_trylock(&bch_register_lock)){
+ for (i = 0; i < c->devices_max_used; i++) {
+ if (!c->devices[i])
+ continue;
+
+ if (UUID_FLASH_ONLY(&c->uuids[i]))
+ continue;
+
+ d = c->devices[i];
+ dc = container_of(d, struct cached_dev, disk);
+ /*
+ * set writeback rate to default minimum value,
+ * then let update_writeback_rate() decide the
+ * upcoming rate.
+ */
+ atomic64_set(&dc->writeback_rate.rate, 1);
+ }
+
+ mutex_unlock(&bch_register_lock);
+ } else
+ atomic64_set(&this_dc->writeback_rate.rate, 1);
+}
+
static blk_qc_t cached_dev_make_request(struct request_queue *q,
struct bio *bio)
{
@@ -1119,7 +1156,19 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q,
return BLK_QC_T_NONE;
}
- atomic_set(&dc->backing_idle, 0);
+ if (d->c) {
+ atomic_set(&d->c->idle_counter, 0);
+ /*
+ * If at_max_writeback_rate of cache set is true and new I/O
+ * comes, quit max writeback rate of all cached devices
+ * attached to this cache set, and set at_max_writeback_rate
+ * to false.
+ */
+ if (unlikely(atomic_read(&d->c->at_max_writeback_rate) == 1)) {
+ atomic_set(&d->c->at_max_writeback_rate, 0);
+ quit_max_writeback_rate(d->c, dc);
+ }
+ }
generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
bio_set_dev(bio, dc->bdev);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index fa4058e43202..fa532d9f9353 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1687,6 +1687,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
c->block_bits = ilog2(sb->block_size);
c->nr_uuids = bucket_bytes(c) / sizeof(struct uuid_entry);
c->devices_max_used = 0;
+ c->previous_dirty_dc_nr = 0;
c->btree_pages = bucket_pages(c);
if (c->btree_pages > BTREE_MAX_PAGES)
c->btree_pages = max_t(int, c->btree_pages / 4,
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 225b15aa0340..d719021bff81 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -170,7 +170,8 @@ SHOW(__bch_cached_dev)
var_printf(writeback_running, "%i");
var_print(writeback_delay);
var_print(writeback_percent);
- sysfs_hprint(writeback_rate, dc->writeback_rate.rate << 9);
+ sysfs_hprint(writeback_rate,
+ atomic64_read(&dc->writeback_rate.rate) << 9);
sysfs_hprint(io_errors, atomic_read(&dc->io_errors));
sysfs_printf(io_error_limit, "%i", dc->error_limit);
sysfs_printf(io_disable, "%i", dc->io_disable);
@@ -188,7 +189,8 @@ SHOW(__bch_cached_dev)
char change[20];
s64 next_io;
- bch_hprint(rate, dc->writeback_rate.rate << 9);
+ bch_hprint(rate,
+ atomic64_read(&dc->writeback_rate.rate) << 9);
bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9);
bch_hprint(target, dc->writeback_rate_target << 9);
bch_hprint(proportional,dc->writeback_rate_proportional << 9);
@@ -255,8 +257,12 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent, 0, 40);
- sysfs_strtoul_clamp(writeback_rate,
- dc->writeback_rate.rate, 1, INT_MAX);
+ if (attr == &sysfs_writeback_rate) {
+ int v;
+
+ sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
+ atomic64_set(&dc->writeback_rate.rate, v);
+ }
sysfs_strtoul_clamp(writeback_rate_update_seconds,
dc->writeback_rate_update_seconds,
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index fc479b026d6d..84f90c3d996d 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -200,7 +200,7 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done)
{
uint64_t now = local_clock();
- d->next += div_u64(done * NSEC_PER_SEC, d->rate);
+ d->next += div_u64(done * NSEC_PER_SEC, atomic64_read(&d->rate));
/* Bound the time. Don't let us fall further than 2 seconds behind
* (this prevents unnecessary backlog that would make it impossible
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index cced87f8eb27..7e17f32ab563 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -442,7 +442,7 @@ struct bch_ratelimit {
* Rate at which we want to do work, in units per second
* The units here correspond to the units passed to bch_next_delay()
*/
- uint32_t rate;
+ atomic64_t rate;
};
static inline void bch_ratelimit_reset(struct bch_ratelimit *d)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index ad45ebe1a74b..11ffadc3cf8f 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -49,6 +49,80 @@ static uint64_t __calc_target_rate(struct cached_dev *dc)
return (cache_dirty_target * bdev_share) >> WRITEBACK_SHARE_SHIFT;
}
+static bool set_at_max_writeback_rate(struct cache_set *c,
+ struct cached_dev *dc)
+{
+ int i, dirty_dc_nr = 0;
+ struct bcache_device *d;
+
+ /*
+ * bch_register_lock is acquired in cached_dev_detach_finish() before
+ * calling cancel_writeback_rate_update_dwork() to stop the delayed
+ * kworker writeback_rate_update (which is the context we are in now).
+ * Therefore calling mutex_lock() here may introduce a deadlock when
+ * shutting down the bcache device.
+ * c->previous_dirty_dc_nr records the dirty_dc_nr calculated the last
+ * time mutex_trylock() succeeded. If mutex_trylock() fails here, use
+ * c->previous_dirty_dc_nr as the number of dirty cached devices. Of
+ * course it might be inaccurate, but a few more or fewer loops before
+ * setting c->at_max_writeback_rate is much better than a deadlock
+ * here.
+ */
+ if (mutex_trylock(&bch_register_lock)) {
+ for (i = 0; i < c->devices_max_used; i++) {
+ if (!c->devices[i])
+ continue;
+ if (UUID_FLASH_ONLY(&c->uuids[i]))
+ continue;
+ d = c->devices[i];
+ dc = container_of(d, struct cached_dev, disk);
+ if (atomic_read(&dc->has_dirty))
+ dirty_dc_nr++;
+ }
+ c->previous_dirty_dc_nr = dirty_dc_nr;
+
+ mutex_unlock(&bch_register_lock);
+ } else
+ dirty_dc_nr = c->previous_dirty_dc_nr;
+
+ /*
+ * c->idle_counter is increased every time update_writeback_rate()
+ * is rescheduled. If all backing devices attached to the same
+ * cache set have the same dc->writeback_rate_update_seconds value,
+ * then after about 10 rounds of update_writeback_rate() on each
+ * backing device the code falls through to set
+ * c->at_max_writeback_rate to 1 and to set a max writeback rate in
+ * each dc->writeback_rate.rate. This is not very accurate but works
+ * well enough to make sure the whole cache set has had no new I/O
+ * coming before the writeback rate is set to a max number.
+ */
+ if (atomic_inc_return(&c->idle_counter) < dirty_dc_nr * 10)
+ return false;
+
+ if (atomic_read(&c->at_max_writeback_rate) != 1)
+ atomic_set(&c->at_max_writeback_rate, 1);
+
+
+ atomic64_set(&dc->writeback_rate.rate, INT_MAX);
+
+ /* keep writeback_rate_target as existing value */
+ dc->writeback_rate_proportional = 0;
+ dc->writeback_rate_integral_scaled = 0;
+ dc->writeback_rate_change = 0;
+
+ /*
+ * Check c->idle_counter and c->at_max_writeback_rate again in case
+ * new I/O arrives before set_at_max_writeback_rate() returns. In that
+ * case the writeback rate has been set to 1, and its new value should
+ * be decided via __update_writeback_rate().
+ */
+ if (atomic_read(&c->idle_counter) < dirty_dc_nr * 10 ||
+ !atomic_read(&c->at_max_writeback_rate))
+ return false;
+
+ return true;
+}
+
static void __update_writeback_rate(struct cached_dev *dc)
{
/*
@@ -104,8 +178,9 @@ static void __update_writeback_rate(struct cached_dev *dc)
dc->writeback_rate_proportional = proportional_scaled;
dc->writeback_rate_integral_scaled = integral_scaled;
- dc->writeback_rate_change = new_rate - dc->writeback_rate.rate;
- dc->writeback_rate.rate = new_rate;
+ dc->writeback_rate_change = new_rate -
+ atomic64_read(&dc->writeback_rate.rate);
+ atomic64_set(&dc->writeback_rate.rate, new_rate);
dc->writeback_rate_target = target;
}
@@ -138,9 +213,16 @@ static void update_writeback_rate(struct work_struct *work)
down_read(&dc->writeback_lock);
- if (atomic_read(&dc->has_dirty) &&
- dc->writeback_percent)
- __update_writeback_rate(dc);
+ if (atomic_read(&dc->has_dirty) && dc->writeback_percent) {
+ /*
+ * If the whole cache set is idle, set_at_max_writeback_rate()
+ * will set the writeback rate to a max number. Then it is
+ * unnecessary to update the writeback rate for an idle cache
+ * set that is already at the maximum writeback rate.
+ */
+ if (!set_at_max_writeback_rate(c, dc))
+ __update_writeback_rate(dc);
+ }
up_read(&dc->writeback_lock);
@@ -422,27 +504,6 @@ static void read_dirty(struct cached_dev *dc)
delay = writeback_delay(dc, size);
- /* If the control system would wait for at least half a
- * second, and there's been no reqs hitting the backing disk
- * for awhile: use an alternate mode where we have at most
- * one contiguous set of writebacks in flight at a time. If
- * someone wants to do IO it will be quick, as it will only
- * have to contend with one operation in flight, and we'll
- * be round-tripping data to the backing disk as quickly as
- * it can accept it.
- */
- if (delay >= HZ / 2) {
- /* 3 means at least 1.5 seconds, up to 7.5 if we
- * have slowed way down.
- */
- if (atomic_inc_return(&dc->backing_idle) >= 3) {
- /* Wait for current I/Os to finish */
- closure_sync(&cl);
- /* And immediately launch a new set. */
- delay = 0;
- }
- }
-
while (!kthread_should_stop() &&
!test_bit(CACHE_SET_IO_DISABLE, &dc->disk.c->flags) &&
delay) {
@@ -715,7 +776,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_running = true;
dc->writeback_percent = 10;
dc->writeback_delay = 30;
- dc->writeback_rate.rate = 1024;
+ atomic64_set(&dc->writeback_rate.rate, 1024);
dc->writeback_rate_minimum = 8;
dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
--
2.17.1
Changes since v5 [1]:
* Move put_page() before memory_failure() in madvise_inject_error()
(Naoya)
* The previous change uncovered a latent bug / broken assumption in
__put_devmap_managed_page(). We need to preserve page->mapping for
dax pages when they go idle.
* Rename mapping_size() to dev_pagemap_mapping_size() (Naoya)
* Catch and fail attempts to soft-offline dax pages (Naoya)
* Collect Naoya's ack on "mm, memory_failure: Collect mapping size in
collect_procs()"
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-July/016682.html
---
As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine the
size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively accessed
poison protection, mce_unmap_kpfn(), to be reversible and otherwise
allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable for
dax mappings; however, the primary motivation is to allow the system to
survive userspace consumption of hardware poison via dax. Specifically
the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
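For context, the kind of userspace sequence exercised above looks roughly
like this (an illustrative sketch only; the file path and mapping size are
made up, and MADV_HWPOISON requires CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE):

    #include <fcntl.h>
    #include <sys/mman.h>

    int main(void)
    {
            int fd = open("/mnt/dax/file", O_RDWR);
            /* map a 2M dax extent of the file */
            char *p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

            /*
             * Inject poison into the first page of the mapping; with this
             * series the process is expected to be killed with SIGBUS
             * rather than the whole machine going down.
             */
            madvise(p, 4096, MADV_HWPOISON);
            return 0;
    }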
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks.
---
Dan Williams (13):
device-dax: Convert to vmf_insert_mixed and vm_fault_t
device-dax: Enable page_mapping()
device-dax: Set page->index
filesystem-dax: Set page->index
mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
mm, dev_pagemap: Do not clear ->mapping on final put
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, memory_failure: Collect mapping size in collect_procs()
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
x86/mm/pat: Prepare {reserve,free}_memtype() for "decoy" addresses
x86/memory_failure: Introduce {set,clear}_mce_nospec()
libnvdimm, pmem: Restore page attributes when clearing errors
arch/x86/include/asm/set_memory.h | 42 ++++++
arch/x86/kernel/cpu/mcheck/mce-internal.h | 15 --
arch/x86/kernel/cpu/mcheck/mce.c | 38 -----
arch/x86/mm/pat.c | 16 ++
drivers/dax/device.c | 75 +++++++---
drivers/nvdimm/pmem.c | 26 ++++
drivers/nvdimm/pmem.h | 13 ++
fs/dax.c | 125 ++++++++++++++++-
include/linux/dax.h | 13 ++
include/linux/huge_mm.h | 5 -
include/linux/mm.h | 1
include/linux/set_memory.h | 14 ++
kernel/memremap.c | 1
mm/hmm.c | 2
mm/huge_memory.c | 4 -
mm/madvise.c | 16 ++
mm/memory-failure.c | 210 +++++++++++++++++++++++------
17 files changed, 481 insertions(+), 135 deletions(-)
error_entry and error_exit communicate the user vs kernel status of
the frame using %ebx. This is unnecessary -- the information is in
regs->cs. Just use regs->cs.
This makes error_entry simpler and makes error_exit more robust.
It also fixes a nasty bug. Before all the Spectre nonsense, the
xen_failsafe_callback entry point returned like this:
ALLOC_PT_GPREGS_ON_STACK
SAVE_C_REGS
SAVE_EXTRA_REGS
ENCODE_FRAME_POINTER
jmp error_exit
And it did not go through error_entry. This was bogus: RBX
contained garbage, and error_exit expected a flag in RBX.
Fortunately, it generally contained *nonzero* garbage, so the
correct code path was used. As part of the Spectre fixes, code was
added to clear RBX to mitigate certain speculation attacks. Now,
depending on the kernel configuration, RBX gets zeroed and, when running
some Wine workloads, the kernel crashes. This was introduced by:
commit 3ac6d8c787b8 ("x86/entry/64: Clear registers for
exceptions/interrupts, to reduce speculation attack surface")
With this patch applied, RBX is no longer needed as a flag, and the
problem goes away.
I suspect that malicious userspace could use this bug to crash the
kernel even without the offending patch applied, though.
[Historical note: I wrote this patch as a cleanup before I was aware
of the bug it fixed.]
[Note to stable maintainers: this should probably get applied to all
kernels. If you're nervous about that, a more conservative fix to
add xorl %ebx,%ebx; incl %ebx before the jump to error_exit should
also fix the problem.]
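For illustration only, based on the xen_failsafe_callback tail quoted
above, that conservative variant would look roughly like this (an
untested sketch, not the patch below):

    ALLOC_PT_GPREGS_ON_STACK
    SAVE_C_REGS
    SAVE_EXTRA_REGS
    ENCODE_FRAME_POINTER
    xorl    %ebx, %ebx
    incl    %ebx            /* EBX=1: the "kernel mode" flag the old error_exit expects */
    jmp     error_exit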
Cc: Brian Gerst <brgerst(a)gmail.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: xen-devel(a)lists.xenproject.org
Cc: x86(a)kernel.org
Cc: stable(a)vger.kernel.org
Fixes: 3ac6d8c787b8 ("x86/entry/64: Clear registers for exceptions/interrupts, to reduce speculation attack surface")
Reported-and-tested-by: "M. Vefa Bicakci" <m.v.b(a)runbox.com>
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
---
I could also submit the conservative fix tagged for -stable and respin
this on top of it. Ingo, Greg, what do you prefer?
arch/x86/entry/entry_64.S | 18 ++++--------------
1 file changed, 4 insertions(+), 14 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 73a522d53b53..8ae7ffda8f98 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -981,7 +981,7 @@ ENTRY(\sym)
call \do_sym
- jmp error_exit /* %ebx: no swapgs flag */
+ jmp error_exit
.endif
END(\sym)
.endm
@@ -1222,7 +1222,6 @@ END(paranoid_exit)
/*
* Save all registers in pt_regs, and switch GS if needed.
- * Return: EBX=0: came from user mode; EBX=1: otherwise
*/
ENTRY(error_entry)
UNWIND_HINT_FUNC
@@ -1269,7 +1268,6 @@ ENTRY(error_entry)
* for these here too.
*/
.Lerror_kernelspace:
- incl %ebx
leaq native_irq_return_iret(%rip), %rcx
cmpq %rcx, RIP+8(%rsp)
je .Lerror_bad_iret
@@ -1303,28 +1301,20 @@ ENTRY(error_entry)
/*
* Pretend that the exception came from user mode: set up pt_regs
- * as if we faulted immediately after IRET and clear EBX so that
- * error_exit knows that we will be returning to user mode.
+ * as if we faulted immediately after IRET.
*/
mov %rsp, %rdi
call fixup_bad_iret
mov %rax, %rsp
- decl %ebx
jmp .Lerror_entry_from_usermode_after_swapgs
END(error_entry)
-
-/*
- * On entry, EBX is a "return to kernel mode" flag:
- * 1: already in kernel mode, don't need SWAPGS
- * 0: user gsbase is loaded, we need SWAPGS and standard preparation for return to usermode
- */
ENTRY(error_exit)
UNWIND_HINT_REGS
DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_OFF
- testl %ebx, %ebx
- jnz retint_kernel
+ testb $3, CS(%rsp)
+ jz retint_kernel
jmp retint_user
END(error_exit)
--
2.17.1
The patch titled
Subject: mm: memcg: fix use after free in mem_cgroup_iter()
has been removed from the -mm tree. Its filename was
mm-memcg-fix-use-after-free-in-mem_cgroup_iter.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Jing Xia <jing.xia.mail(a)gmail.com>
Subject: mm: memcg: fix use after free in mem_cgroup_iter()
It was reported that a kernel crash happened in mem_cgroup_iter(), which
can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.
Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
......
Call trace:
mem_cgroup_iter+0x2e0/0x6d4
shrink_zone+0x8c/0x324
balance_pgdat+0x450/0x640
kswapd+0x130/0x4b8
kthread+0xe8/0xfc
ret_from_fork+0x10/0x20
mem_cgroup_iter():
......
if (css_tryget(css)) <-- crash here
break;
......
The reason for the crash is that mem_cgroup_iter() uses a memcg object
whose pointer is stored in iter->position, but which has already been
freed and filled with POISON_FREE (0x6b).
The root cause of the use-after-free is that
invalidate_reclaim_iterators() fails to reset iter->position to NULL when
the css of the memcg is released in non-hierarchical mode: the old loop
only walked the ancestors of the dying memcg, but in non-hierarchical
mode the stale pointer lives in the dying memcg's own iterators.
Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.…
Fixes: 6df38689e0e9 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
Signed-off-by: Jing Xia <jing.xia.mail(a)gmail.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: <chunyan.zhang(a)unisoc.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN mm/memcontrol.c~mm-memcg-fix-use-after-free-in-mem_cgroup_iter mm/memcontrol.c
--- a/mm/memcontrol.c~mm-memcg-fix-use-after-free-in-mem_cgroup_iter
+++ a/mm/memcontrol.c
@@ -850,7 +850,7 @@ static void invalidate_reclaim_iterators
int nid;
int i;
- while ((memcg = parent_mem_cgroup(memcg))) {
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
for_each_node(nid) {
mz = mem_cgroup_nodeinfo(memcg, nid);
for (i = 0; i <= DEF_PRIORITY; i++) {
_
Patches currently in -mm which might be from jing.xia.mail(a)gmail.com are
The patch titled
Subject: mm/huge_memory.c: fix data loss when splitting a file pmd
has been removed from the -mm tree. Its filename was
thp-fix-data-loss-when-splitting-a-file-pmd.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/huge_memory.c: fix data loss when splitting a file pmd
__split_huge_pmd_locked() must check if the cleared huge pmd was dirty,
and propagate that to PageDirty: otherwise, data may be lost when a huge
tmpfs page is modified then split then reclaimed.
How has this taken so long to be noticed? Because there was no problem
when the huge page is written by a write system call (shmem_write_end()
calls set_page_dirty()), nor when the page is allocated for a write fault
(fault_dirty_shared_page() calls set_page_dirty()); but when allocated for
a read fault (which MAP_POPULATE simulates), no set_page_dirty().
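A rough reproducer sketch of that read-fault case (illustrative only; the
mount point, file name and sizes are made up and assume a huge-page backed
tmpfs mapping):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/mnt/tmpfs/file", O_RDWR | O_CREAT, 0600);
            ftruncate(fd, 2UL << 20);

            /* MAP_POPULATE pre-faults the huge page as a read fault... */
            char *p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_POPULATE, fd, 0);

            /* ...so this write only dirties the pmd, not PageDirty */
            p[0] = 1;

            /*
             * If the huge pmd is later split and the page reclaimed
             * without the fix, the write above can be lost.
             */
            return 0;
    }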
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1807111741430.1106@eggly.anvils
Fixes: d21b9e57c74c ("thp: handle file pages in split_huge_pmd()")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reported-by: Ashwin Chaugule <ashwinch(a)google.com>
Reviewed-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: "Huang, Ying" <ying.huang(a)intel.com>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 2 ++
1 file changed, 2 insertions(+)
diff -puN mm/huge_memory.c~thp-fix-data-loss-when-splitting-a-file-pmd mm/huge_memory.c
--- a/mm/huge_memory.c~thp-fix-data-loss-when-splitting-a-file-pmd
+++ a/mm/huge_memory.c
@@ -2084,6 +2084,8 @@ static void __split_huge_pmd_locked(stru
if (vma_is_dax(vma))
return;
page = pmd_page(_pmd);
+ if (!PageDirty(page) && pmd_dirty(_pmd))
+ set_page_dirty(page);
if (!PageReferenced(page) && pmd_young(_pmd))
SetPageReferenced(page);
page_remove_rmap(page, true);
_
Patches currently in -mm which might be from hughd(a)google.com are
Linux expects that if a CPU modifies a memory location, then that
modification will eventually become visible to other CPUs in the system.
On Loongson-3 processors with an SFB (Store Fill Buffer), loads may be
prioritised over stores, so it is possible for a store operation to be
postponed if a polling loop immediately follows it. If the variable
being polled indirectly depends on the outstanding store [for example,
another CPU may be polling the variable that is pending modification],
then there is the potential for deadlock if interrupts are disabled.
This deadlock occurs in qspinlock code.
This patch changes the definition of cpu_relax() to smp_mb() for
Loongson-3, forcing a flush of the SFB on SMP systems before the
next load takes place. If the kernel is not compiled with SMP support,
this expands to a barrier() as before.
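The problematic pattern looks roughly like this (a minimal illustrative
sketch, not code from this patch; flag_a/flag_b are made-up names):

    /* CPU A */
    WRITE_ONCE(flag_a, 1);          /* store CPU B is waiting for */
    while (!READ_ONCE(flag_b))      /* poll a variable CPU B only sets
                                     * after it has seen flag_a */
            cpu_relax();

    /* CPU B */
    while (!READ_ONCE(flag_a))
            cpu_relax();
    WRITE_ONCE(flag_b, 1);

If the store to flag_a can sit in the SFB while CPU A keeps issuing
loads, and interrupts are disabled, neither CPU makes progress; defining
cpu_relax() as smp_mb() forces the pending store out before the next poll.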
References: 534be1d5a2da940 (ARM: 6194/1: change definition of cpu_relax() for ARM11MPCore)
Cc: stable(a)vger.kernel.org
Signed-off-by: Huacai Chen <chenhc(a)lemote.com>
---
arch/mips/include/asm/processor.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
index af34afb..a8c4a3a 100644
--- a/arch/mips/include/asm/processor.h
+++ b/arch/mips/include/asm/processor.h
@@ -386,7 +386,17 @@ unsigned long get_wchan(struct task_struct *p);
#define KSTK_ESP(tsk) (task_pt_regs(tsk)->regs[29])
#define KSTK_STATUS(tsk) (task_pt_regs(tsk)->cp0_status)
+#ifdef CONFIG_CPU_LOONGSON3
+/*
+ * Loongson-3's SFB (Store-Fill-Buffer) may get starved when stuck in a read
+ * loop. Since spin loops of any kind should have a cpu_relax() in them, force
+ * a Store-Fill-Buffer flush from cpu_relax() such that any pending writes will
+ * become available as expected.
+ */
+#define cpu_relax() smp_mb()
+#else
#define cpu_relax() barrier()
+#endif
/*
* Return_address is a replacement for __builtin_return_address(count)
--
2.7.0