Hi Greg,
For your consideration, a few upstream fixes picked up from Qcom's android-4.9 BSP tree for the OnePlus 6 device.
Cherry-picked and build tested for ARCH=x86_64/mips on v4.9.124. A few patches are applicable to 4.4.y and 3.18.y as well and are explicitly marked as such.
Regards,
Amit Pundir
Daniel Micay (1):
  staging/rts5208: Fix read overflow in memcpy

Jason A. Donenfeld (1):
  random: convert get_random_int/long into get_random_u32/u64

Jia-Ju Bai (1):
  staging: rt5208: Fix a sleep-in-atomic bug in xd_copy_page

Johannes Berg (1):
  nl80211: fix null-ptr dereference on invalid mesh configuration

Johannes Weiner (1):
  mm: remove seemingly spurious reclaimability check from laptop_mode gating

Kees Cook (1):
  IB/rxe: do not copy extra stack memory to skb

Mel Gorman (1):
  mm, vmscan: clear PGDAT_WRITEBACK when zone is balanced

Michal Hocko (1):
  selinux: use GFP_NOWAIT in the AVC kmem_caches

Prateek Sood (2):
  locking/rwsem-xadd: Fix missed wakeup due to reordering of load
  locking/osq_lock: Fix osq_lock queue corruption

Ritesh Harjani (1):
  cfq: Give a chance for arming slice idle timer in case of group_idle

Tejun Heo (1):
  block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

Vegard Nossum (2):
  kthread: Fix use-after-free if kthread fork fails
  kthread: fix boot hang (regression) on MIPS/OpenRISC
 arch/mips/kernel/process.c           |  1 -
 arch/openrisc/kernel/process.c       |  2 --
 block/blk-cgroup.c                   |  9 +++---
 block/cfq-iosched.c                  |  6 ++--
 drivers/char/random.c                | 55 ++++++++++++++++++------------------
 drivers/infiniband/sw/rxe/rxe_resp.c |  4 ++-
 drivers/staging/rts5208/rtsx_scsi.c  |  2 +-
 drivers/staging/rts5208/xd.c         |  2 +-
 include/linux/random.h               | 17 +++++++++--
 kernel/fork.c                        | 17 +++++++----
 kernel/locking/osq_lock.c            | 13 +++++++++
 kernel/locking/rwsem-xadd.c          | 27 ++++++++++++++++++
 mm/vmscan.c                          |  3 +-
 net/wireless/nl80211.c               |  3 ++
 security/selinux/avc.c               | 14 ++++-----
 15 files changed, 119 insertions(+), 56 deletions(-)
From: Ritesh Harjani <riteshh@codeaurora.org>
commit b3193bc0dca9bb69c8ba1ec1a318105c76eb4172 upstream.
In the below scenario blkio cgroups do not work as per their assigned weights:
1. When the underlying device is nonrotational with a single HW queue
   with depth of >= CFQ_HW_QUEUE_MIN
2. When the use case is forming two blkio cgroups cg1(weight 1000) &
   cg2(weight 100) and two processes(file1 and file2) doing sync IO in
   their respective blkio cgroups.
For the above use case, result of fio (without this patch):
file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan 1 19:41:49 1970
  write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec)
<...>
file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan 1 19:41:49 1970
  write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec)
<...>
// Both processes get equal BW even though they belong to different
// cgroups with weights of 1000 (cg1) and 100 (cg2).
In the above case (for nonrotational NCQ devices), as soon as the request from cg1 is completed, even though cg1 is provided with a higher set_slice=10, the CFQ algorithm expires this group when the driver tries to fetch the next request, without providing any idle time or weight priority, and schedules another cfq group (in this case cg2). Thus both cfq groups (cg1 & cg2) keep alternating to get the disk time and the cgroup weight based scheduling is lost.
The patch below gives the cfq algorithm (cfq_arm_slice_timer) a chance to arm the slice timer in case group_idle is enabled. If group_idle is also not required (including for nonrotational NCQ drives), group_idle = 0 needs to be set explicitly from sysfs for such cases.
With this patch, result of fio (for the above use case):
file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan 1 00:06:08 1970
  write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec)
<..>
file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan 1 00:06:08 1970
  write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec)
<..>
// Here the processes' BW is as per their respective cgroup weights.
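Side note, not part of the upstream commit: a minimal sketch of how group_idle can be disabled from userspace when it is not wanted. The disk name "sda" is an assumption for illustration only; the tunable lives under the cfq iosched directory of whichever disk is being tuned, and writing to it needs root.

#include <stdio.h>

int main(void)
{
        /* Hypothetical device; substitute the disk actually using cfq. */
        const char *path = "/sys/block/sda/queue/iosched/group_idle";
        FILE *f = fopen(path, "w");

        if (!f) {
                perror("group_idle");
                return 1;
        }
        fputs("0\n", f);        /* 0 disables group idling entirely */
        return fclose(f) ? 1 : 0;
}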
Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
To be applied on 4.4.y and 3.18.y as well. Build tested on v4.4.153 and
v3.18.120.
 block/cfq-iosched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c7c3d4e6bc27..b2dc1c1f08c6 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2951,7 +2951,8 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
          * for devices that support queuing, otherwise we still have a problem
          * with sync vs async workloads.
          */
-        if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
+        if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag &&
+            !cfqd->cfq_group_idle)
                 return;
WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
From: Vegard Nossum <vegard.nossum@oracle.com>
commit 4d6501dce079c1eb6bf0b1d8f528a5e81770109e upstream.
If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but fails in copy_process() between calling dup_task_struct() and setting p->set_child_tid, then the value of p->set_child_tid will be inherited from the parent and get prematurely freed by free_kthread_struct().
kthread()
 - worker_thread()
 - process_one_work()
   |  - call_usermodehelper_exec_work()
   |    - kernel_thread()
   |      - _do_fork()
   |        - copy_process()
   |          - dup_task_struct()
   |            - arch_dup_task_struct()
   |              - tsk->set_child_tid = current->set_child_tid // implied
   |          - ...
   |          - goto bad_fork_*
   |          - ...
   |          - free_task(tsk)
   |            - free_kthread_struct(tsk)
   |              - kfree(tsk->set_child_tid)
 - ...
 - schedule()
   - __schedule()
     - wq_worker_sleeping()
       - kthread_data(task)->flags // UAF
The problem started showing up with commit 1da5c46fa965 since it reused ->set_child_tid for the kthread worker data.
A better long-term solution might be to get rid of the ->set_child_tid abuse. The comment in set_kthread_struct() also looks slightly wrong.
Debugged-by: Jamie Iles <jamie.iles@oracle.com>
Fixes: 1da5c46fa965 ("kthread: Make struct kthread kmalloc'ed")
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jamie Iles <jamie.iles@oracle.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
To be applied on 4.4.y and 3.18.y as well. Build tested on v4.4.153 and
v3.18.120.
 kernel/fork.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 2c98b987808d..46a6b0311ca3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1530,6 +1530,18 @@ static __latent_entropy struct task_struct *copy_process(
         if (!p)
                 goto fork_out;

+        /*
+         * This _must_ happen before we call free_task(), i.e. before we jump
+         * to any of the bad_fork_* labels. This is to avoid freeing
+         * p->set_child_tid which is (ab)used as a kthread's data pointer for
+         * kernel threads (PF_KTHREAD).
+         */
+        p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
+        /*
+         * Clear TID on mm_release()?
+         */
+        p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
+
         ftrace_graph_init_task(p);

         rt_mutex_init_task(p);
@@ -1691,11 +1703,6 @@ static __latent_entropy struct task_struct *copy_process(
                 }
         }

-        p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
-        /*
-         * Clear TID on mm_release()?
-         */
-        p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
 #ifdef CONFIG_BLOCK
         p->plug = NULL;
 #endif
From: Vegard Nossum <vegard.nossum@oracle.com>
commit b0f5a8f32e8bbdaae1abb8abe2d3cbafaba57e08 upstream.
This fixes a regression in commit 4d6501dce079 where I didn't notice that MIPS and OpenRISC were reinitialising p->{set,clear}_child_tid to NULL after our initialisation in copy_process().
We can simply get rid of the arch-specific initialisation here since it is now always done in copy_process() before hitting copy_thread{,_tls}().
Review notes:
- As far as I can tell, copy_process() is the only user of copy_thread_tls(), which is the only caller of copy_thread() for architectures that don't implement copy_thread_tls().
- After this patch, there is no arch-specific code touching p->set_child_tid or p->clear_child_tid whatsoever.
- It may look like MIPS/OpenRISC wanted to always have these fields be NULL, but that's not true, as copy_process() would unconditionally set them again _after_ calling copy_thread_tls() before commit 4d6501dce079.
Fixes: 4d6501dce079c1eb6bf0b1d8f528a5e81770109e ("kthread: Fix use-after-free if kthread fork fails")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net> # MIPS only
Acked-by: Stafford Horne <shorne@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: openrisc@lists.librecores.org
Cc: Jamie Iles <jamie.iles@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
To be applied on 4.4.y and 3.18.y as well. Build tested on v4.4.153 and
v3.18.120.
 arch/mips/kernel/process.c     | 1 -
 arch/openrisc/kernel/process.c | 2 --
 2 files changed, 3 deletions(-)
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 513a63b9b991..ba315e523b33 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -118,7 +118,6 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
         struct thread_info *ti = task_thread_info(p);
         struct pt_regs *childregs, *regs = current_pt_regs();
         unsigned long childksp;
-        p->set_child_tid = p->clear_child_tid = NULL;

         childksp = (unsigned long)task_stack_page(p) + THREAD_SIZE - 32;

diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index 7095dfe7666b..962372143fda 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -152,8 +152,6 @@ copy_thread(unsigned long clone_flags, unsigned long usp,

         top_of_kernel_stack = sp;

-        p->set_child_tid = p->clear_child_tid = NULL;
-
         /* Locate userspace context on stack... */
         sp -= STACK_FRAME_OVERHEAD;        /* redzone */
         sp -= sizeof(struct pt_regs);
From: "Jason A. Donenfeld" Jason@zx2c4.com
commit c440408cf6901eeb2c09563397e24a9097907078 upstream.
Many times, when a user wants a random number, he wants a random number of a guaranteed size. So, thinking of get_random_int and get_random_long in terms of get_random_u32 and get_random_u64 makes it much easier to achieve this. It also makes the code simpler.
On 32-bit platforms, get_random_int and get_random_long are both aliased to get_random_u32. On 64-bit platforms, int->u32 and long->u64.
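As a usage illustration only (not part of the patch), a caller that needs identifiers of a known width would now do something like the following; the function name here is made up:

#include <linux/types.h>
#include <linux/random.h>

/* Illustrative sketch: fixed-width random values regardless of word size. */
static void example_random_ids(void)
{
        u32 id32 = get_random_u32();        /* always 32 bits */
        u64 id64 = get_random_u64();        /* always 64 bits */
        unsigned long cookie = get_random_long();  /* still works, now an inline wrapper */

        (void)id32;
        (void)id64;
        (void)cookie;
}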
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
 drivers/char/random.c  | 55 +++++++++++++++++++++++++-------------------------
 include/linux/random.h | 17 ++++++++++++++--
 2 files changed, 42 insertions(+), 30 deletions(-)
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 81b65d0e7563..464b95a63dc5 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -2110,8 +2110,8 @@ struct ctl_table random_table[] = {
 struct batched_entropy {
         union {
-                unsigned long entropy_long[CHACHA20_BLOCK_SIZE / sizeof(unsigned long)];
-                unsigned int entropy_int[CHACHA20_BLOCK_SIZE / sizeof(unsigned int)];
+                u64 entropy_u64[CHACHA20_BLOCK_SIZE / sizeof(u64)];
+                u32 entropy_u32[CHACHA20_BLOCK_SIZE / sizeof(u32)];
         };
         unsigned int position;
 };
@@ -2121,52 +2121,51 @@ struct batched_entropy {
  * number is either as good as RDRAND or as good as /dev/urandom, with the
  * goal of being quite fast and not depleting entropy.
  */
-static DEFINE_PER_CPU(struct batched_entropy, batched_entropy_long);
-unsigned long get_random_long(void)
+static DEFINE_PER_CPU(struct batched_entropy, batched_entropy_u64);
+u64 get_random_u64(void)
 {
-        unsigned long ret;
+        u64 ret;
         struct batched_entropy *batch;

-        if (arch_get_random_long(&ret))
+#if BITS_PER_LONG == 64
+        if (arch_get_random_long((unsigned long *)&ret))
                 return ret;
+#else
+        if (arch_get_random_long((unsigned long *)&ret) &&
+            arch_get_random_long((unsigned long *)&ret + 1))
+                return ret;
+#endif

-        batch = &get_cpu_var(batched_entropy_long);
-        if (batch->position % ARRAY_SIZE(batch->entropy_long) == 0) {
-                extract_crng((u8 *)batch->entropy_long);
+        batch = &get_cpu_var(batched_entropy_u64);
+        if (batch->position % ARRAY_SIZE(batch->entropy_u64) == 0) {
+                extract_crng((u8 *)batch->entropy_u64);
                 batch->position = 0;
         }
-        ret = batch->entropy_long[batch->position++];
-        put_cpu_var(batched_entropy_long);
+        ret = batch->entropy_u64[batch->position++];
+        put_cpu_var(batched_entropy_u64);
         return ret;
 }
-EXPORT_SYMBOL(get_random_long);
+EXPORT_SYMBOL(get_random_u64);

-#if BITS_PER_LONG == 32
-unsigned int get_random_int(void)
-{
-        return get_random_long();
-}
-#else
-static DEFINE_PER_CPU(struct batched_entropy, batched_entropy_int);
-unsigned int get_random_int(void)
+static DEFINE_PER_CPU(struct batched_entropy, batched_entropy_u32);
+u32 get_random_u32(void)
 {
-        unsigned int ret;
+        u32 ret;
         struct batched_entropy *batch;

         if (arch_get_random_int(&ret))
                 return ret;

-        batch = &get_cpu_var(batched_entropy_int);
-        if (batch->position % ARRAY_SIZE(batch->entropy_int) == 0) {
-                extract_crng((u8 *)batch->entropy_int);
+        batch = &get_cpu_var(batched_entropy_u32);
+        if (batch->position % ARRAY_SIZE(batch->entropy_u32) == 0) {
+                extract_crng((u8 *)batch->entropy_u32);
                 batch->position = 0;
         }
-        ret = batch->entropy_int[batch->position++];
-        put_cpu_var(batched_entropy_int);
+        ret = batch->entropy_u32[batch->position++];
+        put_cpu_var(batched_entropy_u32);
         return ret;
 }
-#endif
-EXPORT_SYMBOL(get_random_int);
+EXPORT_SYMBOL(get_random_u32);

 /**
  * randomize_page - Generate a random, page aligned address
diff --git a/include/linux/random.h b/include/linux/random.h
index 16ab429735a7..ed5c3838780d 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -42,8 +42,21 @@ extern void get_random_bytes_arch(void *buf, int nbytes);
 extern const struct file_operations random_fops, urandom_fops;
 #endif

-unsigned int get_random_int(void);
-unsigned long get_random_long(void);
+u32 get_random_u32(void);
+u64 get_random_u64(void);
+static inline unsigned int get_random_int(void)
+{
+        return get_random_u32();
+}
+static inline unsigned long get_random_long(void)
+{
+#if BITS_PER_LONG == 64
+        return get_random_u64();
+#else
+        return get_random_u32();
+#endif
+}
+
 unsigned long randomize_page(unsigned long start, unsigned long range);
u32 prandom_u32(void);
On Wed, Aug 29, 2018 at 01:43:15AM +0530, Amit Pundir wrote:
From: "Jason A. Donenfeld" Jason@zx2c4.com
commit c440408cf6901eeb2c09563397e24a9097907078 upstream.
Many times, when a user wants a random number, he wants a random number of a guaranteed size. So, thinking of get_random_int and get_random_long in terms of get_random_u32 and get_random_u64 makes it much easier to achieve this. It also makes the code simpler.
On 32-bit platforms, get_random_int and get_random_long are both aliased to get_random_u32. On 64-bit platforms, int->u32 and long->u64.
What bug is this fixing that it needs to be in a stable kernel tree? The end result is the same before and after this patch, right?
thanks,
greg k-h
From: Jia-Ju Bai <baijiaju1990@163.com>
commit 498c4b4e9c23855d17ecc2a108d949bb68020481 upstream.
The driver may sleep under a spin lock, and the function call path is:

rtsx_exclusive_enter_ss (acquire the lock by spin_lock)
  rtsx_enter_ss
    rtsx_power_off_card
      xd_cleanup_work
        xd_delay_write
          xd_finish_write
            xd_copy_page
              wait_timeout
                schedule_timeout --> may sleep
To fix it, "wait_timeout" is replaced with mdelay in xd_copy_page.
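For readers less familiar with this bug class, a minimal sketch (illustration only, not the driver code): anything that may sleep, such as schedule_timeout() hidden behind wait_timeout(), must not run while a spinlock is held, whereas mdelay() busy-waits and is therefore safe there.

#include <linux/spinlock.h>
#include <linux/delay.h>

static DEFINE_SPINLOCK(demo_lock);

static void demo_atomic_section(void)
{
        spin_lock(&demo_lock);
        /* msleep()/schedule_timeout() here would be a sleep-in-atomic bug. */
        mdelay(100);        /* busy-waits, legal with the lock held */
        spin_unlock(&demo_lock);
}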
Signed-off-by: Jia-Ju Bai <baijiaju1990@163.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
To be applied on 4.4.y and 3.18.y as well. Build tested on v4.4.153 and
v3.18.120.
 drivers/staging/rts5208/xd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/staging/rts5208/xd.c b/drivers/staging/rts5208/xd.c
index 1de02bb98839..647f6beb4c65 100644
--- a/drivers/staging/rts5208/xd.c
+++ b/drivers/staging/rts5208/xd.c
@@ -1247,7 +1247,7 @@ static int xd_copy_page(struct rtsx_chip *chip, u32 old_blk, u32 new_blk,
                         reg = 0;
                         rtsx_read_register(chip, XD_CTL, &reg);
                         if (reg & (XD_ECC1_ERROR | XD_ECC2_ERROR)) {
-                                wait_timeout(100);
+                                mdelay(100);
if (detect_card_cd(chip, XD_CARD) != STATUS_SUCCESS) {
From: Daniel Micay <danielmicay@gmail.com>
commit 88a5b39b69ab1828fd4130e2baadd184109cea69 upstream.
Noticed by FORTIFY_SOURCE, this swaps memcpy() for strncpy() to zero-value fill the end of the buffer instead of over-reading a string from .rodata.
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
[kees: wrote commit log]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Wayne Porter <wporter82@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
To be applied on 4.4.y and 3.18.y as well. Build tested on v4.4.153 and
v3.18.120.
 drivers/staging/rts5208/rtsx_scsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/staging/rts5208/rtsx_scsi.c b/drivers/staging/rts5208/rtsx_scsi.c
index b3790334fd3f..f50076d0fb6c 100644
--- a/drivers/staging/rts5208/rtsx_scsi.c
+++ b/drivers/staging/rts5208/rtsx_scsi.c
@@ -536,7 +536,7 @@ static int inquiry(struct scsi_cmnd *srb, struct rtsx_chip *chip)

         if (sendbytes > 8) {
                 memcpy(buf, inquiry_buf, 8);
-                memcpy(buf + 8, inquiry_string, sendbytes - 8);
+                strncpy(buf + 8, inquiry_string, sendbytes - 8);
                 if (pro_formatter_flag) {
                         /* Additional Length */
                         buf[4] = 0x33;
From: Kees Cook <keescook@chromium.org>
commit 4c93496f18ce5044d78e4f7f9e018682a4f44b3d upstream.
This fixes an over-read condition detected by FORTIFY_SOURCE for this line:
memcpy(SKB_TO_PKT(skb), &ack_pkt, sizeof(skb->cb));
The error was:
In file included from ./include/linux/bitmap.h:8:0,
                 from ./include/linux/cpumask.h:11,
                 from ./include/linux/mm_types_task.h:13,
                 from ./include/linux/mm_types.h:4,
                 from ./include/linux/kmemcheck.h:4,
                 from ./include/linux/skbuff.h:18,
                 from drivers/infiniband/sw/rxe/rxe_resp.c:34:
In function 'memcpy',
    inlined from 'send_atomic_ack.constprop' at drivers/infiniband/sw/rxe/rxe_resp.c:998:2,
    inlined from 'acknowledge' at drivers/infiniband/sw/rxe/rxe_resp.c:1026:3,
    inlined from 'rxe_responder' at drivers/infiniband/sw/rxe/rxe_resp.c:1286:10:
./include/linux/string.h:309:4: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter
   __read_overflow2();
Daniel Micay noted that struct rxe_pkt_info is 32 bytes on 32-bit architectures, but skb->cb is still 64. The memcpy() over-reads 32 bytes. This fixes it by zeroing the unused bytes in skb->cb.
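The pattern the fix establishes can be summarised in a small, self-contained sketch (illustration only, with made-up names): copy only as many bytes as the source object actually has, then zero the remainder of the fixed-size destination instead of over-reading the source.

#include <string.h>

/* cb_len is the fixed destination size (64 for skb->cb); pkt_len is the
 * size of the real source object and must not exceed cb_len.
 */
static void copy_and_pad(unsigned char *cb, size_t cb_len,
                         const void *pkt, size_t pkt_len)
{
        memcpy(cb, pkt, pkt_len);                       /* no over-read  */
        memset(cb + pkt_len, 0, cb_len - pkt_len);      /* zero the rest */
}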
Link: http://lkml.kernel.org/r/1497903987-21002-5-git-send-email-keescook@chromium...
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Moni Shoua <monis@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
 drivers/infiniband/sw/rxe/rxe_resp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 0d25dc84d294..2152c71a99d3 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -978,7 +978,9 @@ static int send_atomic_ack(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
         free_rd_atomic_resource(qp, res);
         rxe_advance_resp_resource(qp);

-        memcpy(SKB_TO_PKT(skb), &ack_pkt, sizeof(skb->cb));
+        memcpy(SKB_TO_PKT(skb), &ack_pkt, sizeof(ack_pkt));
+        memset((unsigned char *)SKB_TO_PKT(skb) + sizeof(ack_pkt), 0,
+               sizeof(skb->cb) - sizeof(ack_pkt));

         res->type = RXE_ATOMIC_MASK;
         res->atomic.skb = skb;
From: Tejun Heo <tj@kernel.org>
commit e00f4f4d0ff7e13b9115428a245b49108d625f09 upstream.
blkcg allocates some per-cgroup data structures with GFP_NOWAIT and when that fails falls back to operations which aren't specific to the cgroup. Occasional failures are expected under pressure and falling back to non-cgroup operation is the right thing to do.
Unfortunately, I forgot to add __GFP_NOWARN to these allocations and these expected failures end up creating a lot of noise. Add __GFP_NOWARN.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marc MERLIN <marc@merlins.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
To be applied on 4.4.y as well. Build tested on v4.4.153.
 block/blk-cgroup.c  | 9 +++++----
 block/cfq-iosched.c | 3 ++-
 2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 6cd839c1f507..f570f387034d 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -185,7 +185,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
         }

         wb_congested = wb_congested_get_create(&q->backing_dev_info,
-                                               blkcg->css.id, GFP_NOWAIT);
+                                               blkcg->css.id,
+                                               GFP_NOWAIT | __GFP_NOWARN);
         if (!wb_congested) {
                 ret = -ENOMEM;
                 goto err_put_css;
@@ -193,7 +194,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,

         /* allocate */
         if (!new_blkg) {
-                new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT);
+                new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT | __GFP_NOWARN);
                 if (unlikely(!new_blkg)) {
                         ret = -ENOMEM;
                         goto err_put_congested;
@@ -1022,7 +1023,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
         }

         spin_lock_init(&blkcg->lock);
-        INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT);
+        INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT | __GFP_NOWARN);
         INIT_HLIST_HEAD(&blkcg->blkg_list);
 #ifdef CONFIG_CGROUP_WRITEBACK
         INIT_LIST_HEAD(&blkcg->cgwb_list);
@@ -1238,7 +1239,7 @@ int blkcg_activate_policy(struct request_queue *q,
                 if (blkg->pd[pol->plid])
                         continue;

-                pd = pol->pd_alloc_fn(GFP_NOWAIT, q->node);
+                pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q->node);
                 if (!pd)
                         swap(pd, pd_prealloc);
                 if (!pd) {
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b2dc1c1f08c6..a4e2d0104af4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3868,7 +3868,8 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
                 goto out;
         }

-        cfqq = kmem_cache_alloc_node(cfq_pool, GFP_NOWAIT | __GFP_ZERO,
+        cfqq = kmem_cache_alloc_node(cfq_pool,
+                                     GFP_NOWAIT | __GFP_ZERO | __GFP_NOWARN,
                                      cfqd->queue->node);
         if (!cfqq) {
                 cfqq = &cfqd->oom_cfqq;
From: Johannes Berg <johannes.berg@intel.com>
commit 265698d7e6132a2d41471135534f4f36ad15b09c upstream.
If TX rates are specified during mesh join, the channel must also be specified. Check the channel pointer to avoid a null pointer dereference if it isn't.
Reported-by: Jouni Malinen <j@w1.fi>
Fixes: 8564e38206de ("cfg80211: add checks for beacon rate, extend to mesh")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
 net/wireless/nl80211.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index 5b75468b5acd..d8002be808f2 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -9480,6 +9480,9 @@ static int nl80211_join_mesh(struct sk_buff *skb, struct genl_info *info)
                 if (err)
                         return err;

+                if (!setup.chandef.chan)
+                        return -EINVAL;
+
                 err = validate_beacon_tx_rate(rdev, setup.chandef.chan->band,
                                               &setup.beacon_rate);
                 if (err)
From: Prateek Sood <prsood@codeaurora.org>
commit 9c29c31830a4eca724e137a9339137204bbb31be upstream.
If a spinner is present, there is a chance that the load of rwsem_has_spinner() in rwsem_wake() can be reordered with respect to decrement of rwsem count in __up_write() leading to wakeup being missed:
 spinning writer                  up_write caller
 ---------------                  -----------------------
 [S] osq_unlock()                 [L] osq
 spin_lock(wait_lock)
 sem->count=0xFFFFFFFF00000001
           +0xFFFFFFFF00000000
 count=sem->count
 MB
                                  sem->count=0xFFFFFFFE00000001
                                            -0xFFFFFFFF00000001
                                  spin_trylock(wait_lock)
                                  return
 rwsem_try_write_lock(count)
 spin_unlock(wait_lock)
 schedule()
Reordering of atomic_long_sub_return_release() in __up_write() and rwsem_has_spinner() in rwsem_wake() can cause the wakeup in the up_write() context to be missed. In the spinning writer, sem->count and the local variable count are 0xFFFFFFFE00000001, which results in rwsem_try_write_lock() failing to acquire the rwsem and the spinning writer going to sleep in rwsem_down_write_failed().
The smp_rmb() will make sure that the spinner state is consulted after sem->count is updated in up_write context.
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: longman@redhat.com
Cc: parri.andrea@gmail.com
Cc: sramana@codeaurora.org
Link: http://lkml.kernel.org/r/1504794658-15397-1-git-send-email-prsood@codeaurora...
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
To be applied on 4.4.y as well. Build tested on v4.4.153.
 kernel/locking/rwsem-xadd.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4bb2366..a4112dfcd0fb 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -574,6 +574,33 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
         WAKE_Q(wake_q);

         /*
+         * __rwsem_down_write_failed_common(sem)
+         *   rwsem_optimistic_spin(sem)
+         *     osq_unlock(sem->osq)
+         *   ...
+         *   atomic_long_add_return(&sem->count)
+         *
+         *      - VS -
+         *
+         *              __up_write()
+         *                if (atomic_long_sub_return_release(&sem->count) < 0)
+         *                  rwsem_wake(sem)
+         *                    osq_is_locked(&sem->osq)
+         *
+         * And __up_write() must observe !osq_is_locked() when it observes the
+         * atomic_long_add_return() in order to not miss a wakeup.
+         *
+         * This boils down to:
+         *
+         * [S.rel] X = 1                [RmW] r0 = (Y += 0)
+         *         MB                         RMB
+         * [RmW]   Y += 1               [L]   r1 = X
+         *
+         * exists (r0=1 /\ r1=0)
+         */
+        smp_rmb();
+
+        /*
          * If a spinner is present, it is not necessary to do the wakeup.
          * Try to do wakeup only if the trylock succeeds to minimize
          * spinlock contention which may introduce too much delay in the
From: Michal Hocko <mhocko@kernel.org>
commit 476accbe2f6ef69caeebe99f52a286e12ac35aee upstream.
There is a strange __GFP_NOMEMALLOC usage pattern in SELinux, specifically GFP_ATOMIC | __GFP_NOMEMALLOC, which doesn't make much sense. GFP_ATOMIC on its own allows access to memory reserves while __GFP_NOMEMALLOC dictates we cannot use memory reserves. Replace this with the much more sane GFP_NOWAIT in the AVC code, as we can tolerate memory allocation failures in that code.
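A before/after sketch of the allocation call (illustration only, not the actual AVC code): GFP_NOWAIT neither sleeps nor dips into the emergency reserves, which is what the contradictory flag combination appeared to be aiming for.

#include <linux/slab.h>
#include <linux/gfp.h>

static void *alloc_avc_like_node(struct kmem_cache *cachep)
{
        /*
         * Before: kmem_cache_zalloc(cachep, GFP_ATOMIC | __GFP_NOMEMALLOC),
         * i.e. reserves allowed and forbidden at the same time.
         * After: plain GFP_NOWAIT, since allocation failure is tolerable here.
         */
        return kmem_cache_zalloc(cachep, GFP_NOWAIT);
}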
Signed-off-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
To be applied on 4.4.y as well. Build tested on v4.4.153.
 security/selinux/avc.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/security/selinux/avc.c b/security/selinux/avc.c
index e60c79de13e1..52f3c550abcc 100644
--- a/security/selinux/avc.c
+++ b/security/selinux/avc.c
@@ -348,27 +348,26 @@ static struct avc_xperms_decision_node
         struct avc_xperms_decision_node *xpd_node;
         struct extended_perms_decision *xpd;

-        xpd_node = kmem_cache_zalloc(avc_xperms_decision_cachep,
-                                     GFP_ATOMIC | __GFP_NOMEMALLOC);
+        xpd_node = kmem_cache_zalloc(avc_xperms_decision_cachep, GFP_NOWAIT);
         if (!xpd_node)
                 return NULL;

         xpd = &xpd_node->xpd;
         if (which & XPERMS_ALLOWED) {
                 xpd->allowed = kmem_cache_zalloc(avc_xperms_data_cachep,
-                                                GFP_ATOMIC | __GFP_NOMEMALLOC);
+                                                GFP_NOWAIT);
                 if (!xpd->allowed)
                         goto error;
         }
         if (which & XPERMS_AUDITALLOW) {
                 xpd->auditallow = kmem_cache_zalloc(avc_xperms_data_cachep,
-                                                GFP_ATOMIC | __GFP_NOMEMALLOC);
+                                                GFP_NOWAIT);
                 if (!xpd->auditallow)
                         goto error;
         }
         if (which & XPERMS_DONTAUDIT) {
                 xpd->dontaudit = kmem_cache_zalloc(avc_xperms_data_cachep,
-                                                GFP_ATOMIC | __GFP_NOMEMALLOC);
+                                                GFP_NOWAIT);
                 if (!xpd->dontaudit)
                         goto error;
         }
@@ -396,8 +395,7 @@ static struct avc_xperms_node *avc_xperms_alloc(void)
 {
         struct avc_xperms_node *xp_node;

-        xp_node = kmem_cache_zalloc(avc_xperms_cachep,
-                                GFP_ATOMIC|__GFP_NOMEMALLOC);
+        xp_node = kmem_cache_zalloc(avc_xperms_cachep, GFP_NOWAIT);
         if (!xp_node)
                 return xp_node;
         INIT_LIST_HEAD(&xp_node->xpd_head);
@@ -550,7 +548,7 @@ static struct avc_node *avc_alloc_node(void)
 {
         struct avc_node *node;

-        node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
+        node = kmem_cache_zalloc(avc_node_cachep, GFP_NOWAIT);
         if (!node)
                 goto out;
From: Prateek Sood <prsood@codeaurora.org>
commit 50972fe78f24f1cd0b9d7bbf1f87d2be9e4f412e upstream.
Fix ordering of link creation between node->prev and prev->next in osq_lock(). Consider a case in which the status of the optimistic spin queue is CPU6->CPU2, where CPU6 has acquired the lock.
        tail
          v
 ,-. <- ,-.
 |6|    |2|
 `-' -> `-'
At this point if CPU0 comes in to acquire osq_lock, it will update the tail count.
CPU2                     CPU0
----------------------------------

               tail
                 v
 ,-. <- ,-.    ,-.
 |6|    |2|    |0|
 `-' -> `-'    `-'
After tail count update if CPU2 starts to unqueue itself from optimistic spin queue, it will find an updated tail count with CPU0 and update CPU2 node->next to NULL in osq_wait_next().
unqueue-A
               tail
                 v
 ,-. <- ,-.    ,-.
 |6|    |2|    |0|
 `-'    `-'    `-'
unqueue-B
->tail != curr && !node->next
If reordering of the following stores happens, then prev->next, where prev is CPU2, would be updated to point to the CPU0 node:
               tail
                 v
 ,-. <- ,-.    ,-.
 |6|    |2|    |0|
 `-'    `-' -> `-'
        osq_wait_next()
          node->next <- 0
          xchg(node->next, NULL)
               tail
                 v
 ,-. <- ,-.    ,-.
 |6|    |2|    |0|
 `-'    `-'    `-'
unqueue-C
At this point, if the next instruction

        WRITE_ONCE(next->prev, prev);

in the CPU2 path is committed before the update of CPU0 node->prev = prev, then CPU0 node->prev will point to the CPU6 node.
               tail
   v----------.  v
 ,-. <- ,-.    ,-.
 |6|    |2|    |0|
 `-'    `-'    `-'
   `----------^
At this point CPU0 path's node->prev = prev is committed, resulting in a change of CPU0 prev back to the CPU2 node. CPU2 node->next is currently NULL,
               tail
                 v
 ,-. <- ,-. <- ,-.
 |6|    |2|    |0|
 `-'    `-'    `-'
   `----------^
so if CPU0 gets into the unqueue path of osq_lock() it will keep spinning in an infinite loop, as the condition prev->next == node will never be true.
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
[ Added pictures, rewrote comments. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sramana@codeaurora.org
Link: http://lkml.kernel.org/r/1500040076-27626-1-git-send-email-prsood@codeaurora...
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
To be applied on 4.4.y as well. Build tested on v4.4.153.
 kernel/locking/osq_lock.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a37857ab55..8d7047ecef4e 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -104,6 +104,19 @@ bool osq_lock(struct optimistic_spin_queue *lock)

         prev = decode_cpu(old);
         node->prev = prev;
+
+        /*
+         * osq_lock()                   unqueue
+         *
+         * node->prev = prev            osq_wait_next()
+         * WMB                          MB
+         * prev->next = node            next->prev = prev // unqueue-C
+         *
+         * Here 'node->prev' and 'next->prev' are the same variable and we need
+         * to ensure these stores happen in-order to avoid corrupting the list.
+         */
+        smp_wmb();
+
         WRITE_ONCE(prev->next, node);
/*
From: Mel Gorman <mgorman@suse.de>
commit c2f83143f1c67d186520b72b6cefbf0aa07a34ee upstream.
Hillf Danton pointed out that since commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes") PGDAT_WRITEBACK is no longer cleared.
It was not noticed as triggering it requires pages under writeback to cycle twice through the LRU and before kswapd gets stalled. Historically, such issues tended to occur on small machines writing heavily to slow storage such as a USB stick.
Once kswapd stalls, direct reclaim stalls may be higher, but due to the fact that memory pressure is required, it would not be very noticeable.
Michal Hocko suggested removing the flag entirely but the conservative fix is to restore the intended PGDAT_WRITEBACK behaviour and clear the flag when a suitable zone is balanced.
Fixes: 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
Link: http://lkml.kernel.org/r/20170203203222.gq7hk66yc36lpgtb@suse.de
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
 mm/vmscan.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f03ca5ab86b1..cfffef1f26a8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3123,6 +3123,7 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
          */
         clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
         clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
+        clear_bit(PGDAT_WRITEBACK, &zone->zone_pgdat->flags);

         return true;
 }
From: Johannes Weiner <hannes@cmpxchg.org>
commit 047d72c30eedcb953222810f1e7dcaae663aa452 upstream.
Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes") allowed laptop_mode=1 to start writing not just when the priority drops to DEF_PRIORITY - 2 but also when the node is unreclaimable.
That appears to be a spurious change in this patch as I doubt the series was tested with laptop_mode, and neither is that particular change mentioned in the changelog. Remove it, it's still recent.
Link: http://lkml.kernel.org/r/20170228214007.5621-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cfffef1f26a8..4e5846b8b5eb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3301,7 +3301,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
                  * If we're getting trouble reclaiming, start doing writepage
                  * even in laptop mode.
                  */
-                if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
+                if (sc.priority < DEF_PRIORITY - 2)
                         sc.may_writepage = 1;
/* Call soft limit reclaim before calling shrink_node. */
On Wed, Aug 29, 2018 at 01:43:11AM +0530, Amit Pundir wrote:
> Hi Greg,
>
> For your consideration, a few upstream fixes picked up from Qcom's
> android-4.9 BSP tree for the OnePlus 6 device.
>
> Cherry-picked and build tested for ARCH=x86_64/mips on v4.9.124. A few
> patches are applicable to 4.4.y and 3.18.y as well and are explicitly
> marked as such.
All but the random patch are now queued up, thanks.
greg k-h