This is backport of the upstream patch
41c73a49df31151f4ff868f28fe4f129f113fa2c. It should be backported to
stable kernels because this performance problem was seen on the android
4.4 kernel.
commit 41c73a49df31151f4ff868f28fe4f129f113fa2c
Author: Mikulas Patocka <mpatocka(a)redhat.com>
Date: Wed Nov 23 17:04:00 2016 -0500
dm bufio: drop the lock when doing GFP_NOIO allocation
If the first allocation attempt using GFP_NOWAIT fails, drop the lock
and retry using GFP_NOIO allocation (lock is dropped because the
allocation can take some time).
Note that we won't do GFP_NOIO allocation when we loop for the second
time, because the lock shouldn't be dropped between __wait_for_free_buffer
and __get_unclaimed_buffer.
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 9ef88f0e2382..1c2e1dd7ca16 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -822,6 +822,7 @@ enum new_flag {
static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
{
struct dm_buffer *b;
+ bool tried_noio_alloc = false;
/*
* dm-bufio is resistant to allocation failures (it just keeps
@@ -846,6 +847,15 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
if (nf == NF_PREFETCH)
return NULL;
+ if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
+ dm_bufio_unlock(c);
+ b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ dm_bufio_lock(c);
+ if (b)
+ return b;
+ tried_noio_alloc = true;
+ }
+
if (!list_empty(&c->reserved_buffers)) {
b = list_entry(c->reserved_buffers.next,
struct dm_buffer, lru_list);
This is backport of the upstream patch
9ea61cac0b1ad0c09022f39fd97e9b99a2cfc2dc. It should be backported to
stable kernels because this performance problem was seen on the android
4.4 kernel.
commit 9ea61cac0b1ad0c09022f39fd97e9b99a2cfc2dc
Author: Douglas Anderson <dianders(a)chromium.org>
Date: Thu Nov 17 11:24:20 2016 -0800
dm bufio: avoid sleeping while holding the dm_bufio lock
We've seen in-field reports showing _lots_ (18 in one case, 41 in
another) of tasks all sitting there blocked on:
mutex_lock+0x4c/0x68
dm_bufio_shrink_count+0x38/0x78
shrink_slab.part.54.constprop.65+0x100/0x464
shrink_zone+0xa8/0x198
In the two cases analyzed, we see one task that looks like this:
Workqueue: kverityd verity_prefetch_io
__switch_to+0x9c/0xa8
__schedule+0x440/0x6d8
schedule+0x94/0xb4
schedule_timeout+0x204/0x27c
schedule_timeout_uninterruptible+0x44/0x50
wait_iff_congested+0x9c/0x1f0
shrink_inactive_list+0x3a0/0x4cc
shrink_lruvec+0x418/0x5cc
shrink_zone+0x88/0x198
try_to_free_pages+0x51c/0x588
__alloc_pages_nodemask+0x648/0xa88
__get_free_pages+0x34/0x7c
alloc_buffer+0xa4/0x144
__bufio_new+0x84/0x278
dm_bufio_prefetch+0x9c/0x154
verity_prefetch_io+0xe8/0x10c
process_one_work+0x240/0x424
worker_thread+0x2fc/0x424
kthread+0x10c/0x114
...and that looks to be the one holding the mutex.
The problem has been reproduced on fairly easily:
0. Be running Chrome OS w/ verity enabled on the root filesystem
1. Pick test patch: http://crosreview.com/412360
2. Install launchBalloons.sh and balloon.arm from
http://crbug.com/468342
...that's just a memory stress test app.
3. On a 4GB rk3399 machine, run
nice ./launchBalloons.sh 4 900 100000
...that tries to eat 4 * 900 MB of memory and keep accessing.
4. Login to the Chrome web browser and restore many tabs
With that, I've seen printouts like:
DOUG: long bufio 90758 ms
...and stack trace always show's we're in dm_bufio_prefetch().
The problem is that we try to allocate memory with GFP_NOIO while
we're holding the dm_bufio lock. Instead we should be using
GFP_NOWAIT. Using GFP_NOIO can cause us to sleep while holding the
lock and that causes the above problems.
The current behavior explained by David Rientjes:
It will still try reclaim initially because __GFP_WAIT (or
__GFP_KSWAPD_RECLAIM) is set by GFP_NOIO. This is the cause of
contention on dm_bufio_lock() that the thread holds. You want to
pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a
mutex that can be contended by a concurrent slab shrinker (if
count_objects didn't use a trylock, this pattern would trivially
deadlock).
This change significantly increases responsiveness of the system while
in this state. It makes a real difference because it unblocks kswapd.
In the bug report analyzed, kswapd was hung:
kswapd0 D ffffffc000204fd8 0 72 2 0x00000000
Call trace:
[<ffffffc000204fd8>] __switch_to+0x9c/0xa8
[<ffffffc00090b794>] __schedule+0x440/0x6d8
[<ffffffc00090bac0>] schedule+0x94/0xb4
[<ffffffc00090be44>] schedule_preempt_disabled+0x28/0x44
[<ffffffc00090d900>] __mutex_lock_slowpath+0x120/0x1ac
[<ffffffc00090d9d8>] mutex_lock+0x4c/0x68
[<ffffffc000708e7c>] dm_bufio_shrink_count+0x38/0x78
[<ffffffc00030b268>] shrink_slab.part.54.constprop.65+0x100/0x464
[<ffffffc00030dbd8>] shrink_zone+0xa8/0x198
[<ffffffc00030e578>] balance_pgdat+0x328/0x508
[<ffffffc00030eb7c>] kswapd+0x424/0x51c
[<ffffffc00023f06c>] kthread+0x10c/0x114
[<ffffffc000203dd0>] ret_from_fork+0x10/0x40
By unblocking kswapd memory pressure should be reduced.
Suggested-by: David Rientjes <rientjes(a)google.com>
Reviewed-by: Guenter Roeck <linux(a)roeck-us.net>
Signed-off-by: Douglas Anderson <dianders(a)chromium.org>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 125aedc3875f..b5d3428d109e 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -827,7 +827,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
* dm-bufio is resistant to allocation failures (it just keeps
* one buffer reserved in cases all the allocations fail).
* So set flags to not try too hard:
- * GFP_NOIO: don't recurse into the I/O layer
+ * GFP_NOWAIT: don't wait; if we need to sleep we'll release our
+ * mutex and wait ourselves.
* __GFP_NORETRY: don't retry and rather return failure
* __GFP_NOMEMALLOC: don't use emergency reserves
* __GFP_NOWARN: don't print a warning in case of failure
@@ -837,7 +838,7 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
*/
while (1) {
if (dm_bufio_cache_size_latch != 1) {
- b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
if (b)
return b;
}
On a powerpc 8xx, 'btc' fails as follows:
Entering kdb (current=0x(ptrval), pid 282) due to Keyboard Entry
kdb> btc
btc: cpu status: Currently on cpu 0
Available cpus: 0
kdb_getarea: Bad address 0x0
when booting the kernel with 'debug_boot_weak_hash', it fails as well
Entering kdb (current=0xba99ad80, pid 284) due to Keyboard Entry
kdb> btc
btc: cpu status: Currently on cpu 0
Available cpus: 0
kdb_getarea: Bad address 0xba99ad80
On other platforms, Oopses have been observed too, see
https://github.com/linuxppc/linux/issues/139
This is due to btc calling 'btt' with %p pointer as an argument.
This patch replaces %p by %px to get the real pointer value as
expected by 'btt'
Signed-off-by: Christophe Leroy <christophe.leroy(a)c-s.fr>
Cc: <stable(a)vger.kernel.org> # 4.15+
---
kernel/debug/kdb/kdb_bt.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/debug/kdb/kdb_bt.c b/kernel/debug/kdb/kdb_bt.c
index 6ad4a9fcbd6f..7921ae4fca8d 100644
--- a/kernel/debug/kdb/kdb_bt.c
+++ b/kernel/debug/kdb/kdb_bt.c
@@ -179,14 +179,14 @@ kdb_bt(int argc, const char **argv)
kdb_printf("no process for cpu %ld\n", cpu);
return 0;
}
- sprintf(buf, "btt 0x%p\n", KDB_TSK(cpu));
+ sprintf(buf, "btt 0x%px\n", KDB_TSK(cpu));
kdb_parse(buf);
return 0;
}
kdb_printf("btc: cpu status: ");
kdb_parse("cpu\n");
for_each_online_cpu(cpu) {
- sprintf(buf, "btt 0x%p\n", KDB_TSK(cpu));
+ sprintf(buf, "btt 0x%px\n", KDB_TSK(cpu));
kdb_parse(buf);
touch_nmi_watchdog();
}
--
2.13.3
On Fri, Oct 12, 2018 at 7:31 PM, Alexei Starovoitov
<alexei.starovoitov(a)gmail.com> wrote:
> On Fri, Oct 12, 2018 at 03:54:27AM -0700, Daniel Colascione wrote:
>> The map-in-map frequently serves as a mechanism for atomic
>> snapshotting of state that a BPF program might record. The current
>> implementation is dangerous to use in this way, however, since
>> userspace has no way of knowing when all programs that might have
>> retrieved the "old" value of the map may have completed.
>>
>> This change ensures that map update operations on map-in-map map types
>> always wait for all references to the old map to drop before returning
>> to userspace.
>>
>> Signed-off-by: Daniel Colascione <dancol(a)google.com>
>> ---
>> kernel/bpf/syscall.c | 14 ++++++++++++++
>> 1 file changed, 14 insertions(+)
>>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 8339d81cba1d..d7c16ae1e85a 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -741,6 +741,18 @@ static int map_lookup_elem(union bpf_attr *attr)
>> return err;
>> }
>>
>> +static void maybe_wait_bpf_programs(struct bpf_map *map)
>> +{
>> + /* Wait for any running BPF programs to complete so that
>> + * userspace, when we return to it, knows that all programs
>> + * that could be running use the new map value.
>> + */
>> + if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
>> + map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) {
>> + synchronize_rcu();
>> + }
>
> extra {} were not necessary. I removed them while applying to bpf-next.
> Please run checkpatch.pl next time.
> Thanks
Thanks Alexei for taking it. Me and Lorenzo were discussing that not
having this causes incorrect behavior for apps using map-in-map for
this. So I CC'd stable as well.
-Joel
From: Daniel Wagner <daniel.wagner(a)siemens.com>
Sebastian writes:
"""
We reproducibly observe cache line starvation on a Core2Duo E6850 (2
cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
cores).
The problem can be triggered with a v4.9-RT kernel by starting
cyclictest -S -p98 -m -i2000 -b 200
and as "load"
stress-ng --ptrace 4
The reported maximal latency is usually less than 60us. If the problem
triggers then values around 400us, 800us or even more are reported. The
upperlimit is the -i parameter.
Reproduction with 4.9-RT is almost immediate on Core2Duo, ARM64 and SKL,
but it took 7.5 hours to trigger on v4.14-RT on the Core2Duo.
Instrumentation show always the picture:
CPU0 CPU1
=> do_syscall_64 => do_syscall_64
=> SyS_ptrace => syscall_slow_exit_work
=> ptrace_check_attach => ptrace_do_notify / rt_read_unlock
=> wait_task_inactive rt_spin_lock_slowunlock()
-> while task_running() __rt_mutex_unlock_common()
/ check_task_state() mark_wakeup_next_waiter()
| raw_spin_lock_irq(&p->pi_lock); raw_spin_lock(¤t->pi_lock);
| . .
| raw_spin_unlock_irq(&p->pi_lock); .
\ cpu_relax() .
- .
*IRQ* <lock acquired>
In the error case we observe that the while() loop is repeated more than
5000 times which indicates that the pi_lock can be acquired. CPU1 on the
other side does not make progress waiting for the same lock with interrupts
disabled.
This continues until an IRQ hits CPU0. Once CPU0 starts processing the IRQ
the other CPU is able to acquire pi_lock and the situation relaxes.
"""
This matches with the observeration for v4.4-rt on a Core2Duo E6850:
CPU 0:
- no progress for a very long time in rt_mutex_dequeue_pi):
stress-n-1931 0d..11 5060.891219: function: __try_to_take_rt_mutex
stress-n-1931 0d..11 5060.891219: function: rt_mutex_dequeue
stress-n-1931 0d..21 5060.891220: function: rt_mutex_enqueue_pi
stress-n-1931 0....2 5060.891220: signal_generate: sig=17 errno=0 code=262148 comm=stress-ng-ptrac pid=1928 grp=1 res=1
stress-n-1931 0d..21 5060.894114: function: rt_mutex_dequeue_pi
stress-n-1931 0d.h11 5060.894115: local_timer_entry: vector=239
CPU 1:
- IRQ at 5060.894114 on CPU 1 followed by the IRQ on CPU 0
stress-n-1928 1....0 5060.891215: sys_enter: NR 101 (18, 78b, 0, 0, 17, 788)
stress-n-1928 1d..11 5060.891216: function: __try_to_take_rt_mutex
stress-n-1928 1d..21 5060.891216: function: rt_mutex_enqueue_pi
stress-n-1928 1d..21 5060.891217: function: rt_mutex_dequeue_pi
stress-n-1928 1....1 5060.891217: function: rt_mutex_adjust_prio
stress-n-1928 1d..11 5060.891218: function: __rt_mutex_adjust_prio
stress-n-1928 1d.h10 5060.894114: local_timer_entry: vector=239
Thomas writes:
"""
This has nothing to do with RT. RT is merily exposing the
problem in an observable way. The same issue happens with upstream, it's
harder to trigger and it's harder to observe for obvious reasons.
If you read through the discussions [see the links below] then you
really see that there is an upstream issue with the x86 qrlock
implementation and Peter has posted fixes which resolve it, both at
the practical and the theoretical level.
"""
Backporting all qspinlock related patches is very likely to introduce
regressions on v4.4. Therefore, the recommended solution by Peter and
Thomas is to drop back to ticket spinlocks for v4.4.
Link :https://lkml.kernel.org/r/20180921120226.6xjgr4oiho22ex75@linutronix.de
Link: https://lkml.kernel.org/r/20180926110117.405325143@infradead.org
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Daniel Wagner <daniel.wagner(a)siemens.com>
---
Thomas suggest following plan for fixing the issues on the varous
stable trees:
4.4: Trivial by switching back to ticket locks.
4.9: Decide whether bringing back ticket locks or backporting all qrlock
fixes. Sebastian has done the latter already and it's probably the
right solution
4.14:
4.18: Backporting the qrlock fixes
4.19: Either the fix ends up in 4.19 final or it needs to be backported
arch/x86/Kconfig | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6df130a37d41..f00cab581e2d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -42,7 +42,6 @@ config X86
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if X86_64
select ARCH_USE_QUEUED_RWLOCKS
- select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANT_FRAME_POINTERS
--
2.14.4
From: Changwei Ge <ge.changwei(a)h3c.com>
Somehow, file system metadata was corrupted, which causes
ocfs2_check_dir_entry() to fail in function ocfs2_dir_foreach_blk_el().
According to the original design intention, if above happens we should
skip the problematic block and continue to retrieve dir entry. But there
is obviouse misuse of brelse around related code.
After failure of ocfs2_check_dir_entry(), currunt code just moves to next
position and uses the problematic buffer head again and again during
which the problematic buffer head is released for multiple times. I
suppose, this a serious issue which is long-lived in ocfs2. This may
cause other file systems which is also used in a the same host insane.
So we should also consider about bakcporting this patch into
linux -stable.
Suggested-by: Changkuo Shi <shi.changkuo(a)h3c.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Changwei Ge <ge.changwei(a)h3c.com>
---
fs/ocfs2/dir.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index b048d4f..c121abb 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -1897,8 +1897,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
/* On error, skip the f_pos to the
next block. */
ctx->pos = (ctx->pos | (sb->s_blocksize - 1)) + 1;
- brelse(bh);
- continue;
+ break;
}
if (le64_to_cpu(de->inode)) {
unsigned char d_type = DT_UNKNOWN;
--
2.7.4
crypto_cfb_decrypt_segment() incorrectly XOR'ed generated keystream with
IV, rather than with data stream, resulting in incorrect decryption.
Test vectors will be added in the next patch.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov(a)gmail.com>
Cc: stable(a)vger.kernel.org
---
crypto/cfb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/crypto/cfb.c b/crypto/cfb.c
index a0d68c09e1b9..fd4e8500e121 100644
--- a/crypto/cfb.c
+++ b/crypto/cfb.c
@@ -144,7 +144,7 @@ static int crypto_cfb_decrypt_segment(struct skcipher_walk *walk,
do {
crypto_cfb_encrypt_one(tfm, iv, dst);
- crypto_xor(dst, iv, bsize);
+ crypto_xor(dst, src, bsize);
iv = src;
src += bsize;
--
2.19.1