Linux-stable-mirror February 2018

linux-stable-mirror@lists.linaro.org

304 participants
3308 discussions

[Linux-stable-mirror] [PATCH 4.4 1/3] bpf: fix branch pruning logic

by Ben Hutchings

commit c131187db2d3fa2f8bf32fdf4e9a4ef805168467 upstream.

When the verifier detects that a register contains a runtime constant
and it is compared with another constant, it will prune exploration
of the branch that is guaranteed not to be taken at runtime.
This is all correct, but a malicious program may be constructed
in such a way that it always has a constant comparison and
the other branch is never taken under any conditions.
In this case such a path through the program will not be explored
by the verifier. It won't be taken at run-time either, but since
all instructions are JITed the malicious program may cause JITs
to complain about using reserved fields, etc.
To fix the issue we have to track the instructions explored by
the verifier and sanitize instructions that are dead at run time
with NOPs. We cannot reject such dead code, since llvm generates
it for valid C code: llvm doesn't do as much data flow analysis
as the verifier does.

Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Acked-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
[bwh: Backported to 4.4:
 - s/bpf_verifier_env/verifier_env/
 - Adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings(a)codethink.co.uk>
---
 kernel/bpf/verifier.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 014c2d759916..a62679711de0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -191,6 +191,7 @@ struct bpf_insn_aux_data {
 		enum bpf_reg_type ptr_type;	/* pointer type for load/store insns */
 		struct bpf_map *map_ptr;	/* pointer for call insn into lookup_elem */
 	};
+	bool seen; /* this insn was processed by the verifier */
 };
 
 #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
@@ -1793,6 +1794,7 @@ static int do_check(struct verifier_env *env)
 			print_bpf_insn(env, insn);
 		}
 
+		env->insn_aux_data[insn_idx].seen = true;
 		if (class == BPF_ALU || class == BPF_ALU64) {
 			err = check_alu_op(env, insn);
 			if (err)
@@ -1988,6 +1990,7 @@ process_bpf_exit:
 					return err;
 
 				insn_idx++;
+				env->insn_aux_data[insn_idx].seen = true;
 			} else {
 				verbose("invalid BPF_LD mode\n");
 				return -EINVAL;
@@ -2125,6 +2128,7 @@ static int adjust_insn_aux_data(struct verifier_env *env, u32 prog_len,
 				u32 off, u32 cnt)
 {
 	struct bpf_insn_aux_data *new_data, *old_data = env->insn_aux_data;
+	int i;
 
 	if (cnt == 1)
 		return 0;
@@ -2134,6 +2138,8 @@ static int adjust_insn_aux_data(struct verifier_env *env, u32 prog_len,
 	memcpy(new_data, old_data, sizeof(struct bpf_insn_aux_data) * off);
 	memcpy(new_data + off + cnt - 1, old_data + off,
 	       sizeof(struct bpf_insn_aux_data) * (prog_len - off - cnt + 1));
+	for (i = off; i < off + cnt - 1; i++)
+		new_data[i].seen = true;
 	env->insn_aux_data = new_data;
 	vfree(old_data);
 	return 0;
@@ -2152,6 +2158,25 @@ static struct bpf_prog *bpf_patch_insn_data(struct verifier_env *env, u32 off,
 	return new_prog;
 }
 
+/* The verifier does more data flow analysis than llvm and will not explore
+ * branches that are dead at run time. Malicious programs can have dead code
+ * too. Therefore replace all dead at-run-time code with nops.
+ */
+static void sanitize_dead_code(struct verifier_env *env)
+{
+	struct bpf_insn_aux_data *aux_data = env->insn_aux_data;
+	struct bpf_insn nop = BPF_MOV64_REG(BPF_REG_0, BPF_REG_0);
+	struct bpf_insn *insn = env->prog->insnsi;
+	const int insn_cnt = env->prog->len;
+	int i;
+
+	for (i = 0; i < insn_cnt; i++) {
+		if (aux_data[i].seen)
+			continue;
+		memcpy(insn + i, &nop, sizeof(nop));
+	}
+}
+
 /* convert load instructions that access fields of 'struct __sk_buff'
  * into sequence of instructions that access fields of 'struct sk_buff'
  */
@@ -2370,6 +2395,9 @@ skip_full_check:
 	while (pop_stack(env, NULL) >= 0);
 	free_states(env);
 
+	if (ret == 0)
+		sanitize_dead_code(env);
+
 	if (ret == 0)
 		/* program is valid, convert *(u32*)(ctx + off) accesses */
 		ret = convert_ctx_accesses(env);
-- 
2.15.0.rc0
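Editorial illustration (not part of the patch): the "mark what was explored, then overwrite the rest" idea can be tried outside the kernel with a minimal userspace sketch. The names `toy_insn`, `toy_aux`, `TOY_NOP` and `toy_sanitize_dead_code()` are hypothetical stand-ins, not kernel identifiers, and the opcode values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's instruction and aux-data types. */
struct toy_insn {
	uint8_t opcode;
	int32_t imm;
};

struct toy_aux {
	bool seen;	/* set when the (simulated) verifier visited this insn */
};

#define TOY_NOP 0x00	/* arbitrary opcode used as the no-op filler */

/*
 * Overwrite every instruction that was never visited with a NOP.
 * Returns the number of instructions that were rewritten.
 */
static size_t toy_sanitize_dead_code(struct toy_insn *insns,
				     const struct toy_aux *aux, size_t cnt)
{
	const struct toy_insn nop = { .opcode = TOY_NOP, .imm = 0 };
	size_t i, patched = 0;

	assert(insns != NULL && aux != NULL);

	for (i = 0; i < cnt; i++) {
		if (aux[i].seen)
			continue;	/* reachable: leave untouched */
		insns[i] = nop;		/* dead at run time: neutralize */
		patched++;
	}
	return patched;
}
```

Calling `toy_sanitize_dead_code(prog, aux, n)` on a program whose middle instruction was never marked `seen` rewrites only that instruction, so a later pass that walks every slot never encounters unvetted encodings.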

7 years, 5 months

[Linux-stable-mirror] [PATCH 4.4] x86/pti: Make unpoison of pgd for trusted boot work for real

by Hugh Dickins

From: Dave Hansen <dave.hansen(a)linux.intel.com>

commit 445b69e3b75e42362a5bdc13c8b8f61599e2228a upstream

The initial fix for trusted boot and PTI potentially misses the pgd clearing
if pud_alloc() sets a PGD.  It probably works in *practice* because for two
adjacent calls to map_tboot_page() that share a PGD entry, the first will
clear NX, *then* allocate and set the PGD (without NX clear).  The second
call will *not* allocate but will clear the NX bit.

Defer the NX clearing to a point after it is known that all top-level
allocations have occurred.  Add a comment to clarify why.

[ tglx: Massaged changelog ]

Fixes: 262b6b30087 ("x86/tboot: Unbreak tboot with PTI enabled")
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Jon Masters <jcm(a)redhat.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: gnomes(a)lxorguk.ukuu.org.uk
Cc: peterz(a)infradead.org
Cc: ning.sun(a)intel.com
Cc: tboot-devel(a)lists.sourceforge.net
Cc: andi(a)firstfloor.org
Cc: luto(a)kernel.org
Cc: law(a)redhat.com
Cc: pbonzini(a)redhat.com
Cc: torvalds(a)linux-foundation.org
Cc: gregkh(a)linux-foundation.org
Cc: dwmw(a)amazon.co.uk
Cc: nickc(a)redhat.com
Cc: stable(a)vger.kernel.org
Link: https://lkml.kernel.org/r/20180110224939.2695CD47@viggo.jf.intel.com
Cc: Jiri Kosina <jkosina(a)suse.cz>
Signed-off-by: Hugh Dickins <hughd(a)google.com>

hughd notes: I have not tested tboot, but this looks to me as necessary
and as safe in old-Kaiser backports as it is upstream; I'm not submitting
the commit-to-be-fixed 262b6b30087, since it was undone by 445b69e3b75e,
and makes conflict trouble because of 5-level's p4d versus 4-level's pgd.
---
 arch/x86/kernel/tboot.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 91a4496db434..c77ab1f51fbe 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -140,6 +140,16 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
 		return -1;
 	set_pte_at(&tboot_mm, vaddr, pte, pfn_pte(pfn, prot));
 	pte_unmap(pte);
+
+	/*
+	 * PTI poisons low addresses in the kernel page tables in the
+	 * name of making them unusable for userspace.  To execute
+	 * code at such a low address, the poison must be cleared.
+	 *
+	 * Note: 'pgd' actually gets set in pud_alloc().
+	 */
+	pgd->pgd &= ~_PAGE_NX;
+
 	return 0;
 }
 
-- 
2.16.0.rc1.238.g530d649a79-goog
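Editorial illustration (not part of the patch): the ordering problem described in the changelog can be modelled with a single 64-bit entry. This is a toy model under stated assumptions: `toy_alloc_entry()` stands in for an allocation that populates an empty top-level entry and is assumed to hand it back with NX set; all `toy_*`/`TOY_*` names are hypothetical. Only the NX bit position (bit 63) mirrors x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_PRESENT (1ULL << 0)
#define TOY_PAGE_NX      (1ULL << 63)	/* same bit position as x86-64 NX */

/*
 * Simulated allocation: populates an empty entry. Freshly populated
 * entries are assumed to come back "poisoned" with NX set.
 */
static void toy_alloc_entry(uint64_t *entry)
{
	assert(entry != NULL);
	if (!(*entry & TOY_PAGE_PRESENT))
		*entry = TOY_PAGE_PRESENT | TOY_PAGE_NX;
}

/* Problematic order: clear NX first, then (possibly) allocate. */
static uint64_t toy_map_clear_early(uint64_t entry)
{
	entry &= ~TOY_PAGE_NX;
	toy_alloc_entry(&entry);	/* may overwrite the cleared bit */
	return entry;
}

/* Order used by the fix: allocate first, clear NX afterwards. */
static uint64_t toy_map_clear_late(uint64_t entry)
{
	toy_alloc_entry(&entry);
	entry &= ~TOY_PAGE_NX;
	return entry;
}
```

Starting from an empty entry, `toy_map_clear_early(0)` still has NX set while `toy_map_clear_late(0)` does not. When the entry is already present, both orders clear NX, which matches the changelog's remark that the old code probably worked in practice on the second adjacent call.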

7 years, 5 months

[Linux-stable-mirror] [PATCH stable 4.4 0/9] BPF stable patches

by Daniel Borkmann

All for 4.4 backported and (limited) testing.

Thanks!

Alexei Starovoitov (3):
  bpf: fix bpf_tail_call() x64 JIT
  bpf: introduce BPF_JIT_ALWAYS_ON config
  bpf: fix 32-bit divide by zero

Daniel Borkmann (4):
  bpf: fix branch pruning logic
  bpf: arsh is not supported in 32 bit alu thus reject it
  bpf: avoid false sharing of map refcount with max_entries
  bpf: reject stores into ctx via st and xadd

Eric Dumazet (2):
  x86: bpf_jit: small optimization in emit_bpf_tail_call()
  bpf: fix divides by zero

 arch/arm64/Kconfig          |  1 +
 arch/s390/Kconfig           |  1 +
 arch/x86/Kconfig            |  1 +
 arch/x86/net/bpf_jit_comp.c | 13 ++++-----
 include/linux/bpf.h         | 16 ++++++++---
 init/Kconfig                |  7 +++++
 kernel/bpf/core.c           | 30 ++++++++++++++++---
 kernel/bpf/verifier.c       | 70 +++++++++++++++++++++++++++++++++++++++++++++
 lib/test_bpf.c              | 13 +++++----
 net/Kconfig                 |  3 ++
 net/core/filter.c           |  8 +++++-
 net/core/sysctl_net_core.c  |  6 ++++
 net/socket.c                |  9 ++++++
 13 files changed, 157 insertions(+), 21 deletions(-)

-- 
2.9.5

7 years, 5 months

[Linux-stable-mirror] Patch "Bluetooth: hci_serdev: Init hci_uart proto_lock to avoid oops" has been added to the 4.14-stable tree

by gregkh＠linuxfoundation.org

This is a note to let you know that I've just added the patch titled

    Bluetooth: hci_serdev: Init hci_uart proto_lock to avoid oops

to the 4.14-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…

The filename of the patch is:
     bluetooth-hci_serdev-init-hci_uart-proto_lock-to-avoid-oops.patch
and it can be found in the queue-4.14 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.


From d73e172816652772114827abaa2dbc053eecbbd7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Fri, 17 Nov 2017 00:54:53 +0100
Subject: Bluetooth: hci_serdev: Init hci_uart proto_lock to avoid oops
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Lukas Wunner <lukas(a)wunner.de>

commit d73e172816652772114827abaa2dbc053eecbbd7 upstream.

John Stultz reports a boot time crash with the HiKey board (which uses
hci_serdev) occurring in hci_uart_tx_wakeup().  That function is
contained in hci_ldisc.c, but also called from the newer hci_serdev.c.
It acquires the proto_lock in struct hci_uart and it turns out that we
forgot to init the lock in the serdev code path, thus causing the crash.

John bisected the crash to commit 67d2f8781b9f ("Bluetooth: hci_ldisc:
Allow sleeping while proto locks are held"), but the issue was present
before and the commit merely exposed it.  (Perhaps by luck, the crash
did not occur with rwlocks.)

Init the proto_lock in the serdev code path to avoid the oops.

Stack trace for posterity:

Unable to handle kernel read from unreadable memory at 406f127000
[000000406f127000] user address but active_mm is swapper
Internal error: Oops: 96000005 [#1] PREEMPT SMP
Hardware name: HiKey Development Board (DT)
Call trace:
 hci_uart_tx_wakeup+0x38/0x148
 hci_uart_send_frame+0x28/0x38
 hci_send_frame+0x64/0xc0
 hci_cmd_work+0x98/0x110
 process_one_work+0x134/0x330
 worker_thread+0x130/0x468
 kthread+0xf8/0x128
 ret_from_fork+0x10/0x18

Link: https://lkml.org/lkml/2017/11/15/908
Reported-and-tested-by: John Stultz <john.stultz(a)linaro.org>
Cc: Ronald Tschalär <ronald(a)innovation.ch>
Cc: Rob Herring <rob.herring(a)linaro.org>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Signed-off-by: Marcel Holtmann <marcel(a)holtmann.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>

---
 drivers/bluetooth/hci_serdev.c |    1 +
 1 file changed, 1 insertion(+)

--- a/drivers/bluetooth/hci_serdev.c
+++ b/drivers/bluetooth/hci_serdev.c
@@ -304,6 +304,7 @@ int hci_uart_register_device(struct hci_
 	hci_set_drvdata(hdev, hu);
 
 	INIT_WORK(&hu->write_work, hci_uart_write_work);
+	percpu_init_rwsem(&hu->proto_lock);
 
 	/* Only when vendor specific setup callback is provided, consider
 	 * the manufacturer information valid. This avoids filling in the


Patches currently in stable-queue which might be from lukas(a)wunner.de are

queue-4.14/bluetooth-hci_serdev-init-hci_uart-proto_lock-to-avoid-oops.patch

7 years, 5 months

[Linux-stable-mirror] Patch "Bluetooth: hci_serdev: Init hci_uart proto_lock to avoid oops" has been added to the 4.15-stable tree

by gregkh＠linuxfoundation.org

This is a note to let you know that I've just added the patch titled

    Bluetooth: hci_serdev: Init hci_uart proto_lock to avoid oops

to the 4.15-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…

The filename of the patch is:
     bluetooth-hci_serdev-init-hci_uart-proto_lock-to-avoid-oops.patch
and it can be found in the queue-4.15 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.


From d73e172816652772114827abaa2dbc053eecbbd7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Fri, 17 Nov 2017 00:54:53 +0100
Subject: Bluetooth: hci_serdev: Init hci_uart proto_lock to avoid oops
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Lukas Wunner <lukas(a)wunner.de>

commit d73e172816652772114827abaa2dbc053eecbbd7 upstream.

John Stultz reports a boot time crash with the HiKey board (which uses
hci_serdev) occurring in hci_uart_tx_wakeup().  That function is
contained in hci_ldisc.c, but also called from the newer hci_serdev.c.
It acquires the proto_lock in struct hci_uart and it turns out that we
forgot to init the lock in the serdev code path, thus causing the crash.

John bisected the crash to commit 67d2f8781b9f ("Bluetooth: hci_ldisc:
Allow sleeping while proto locks are held"), but the issue was present
before and the commit merely exposed it.  (Perhaps by luck, the crash
did not occur with rwlocks.)

Init the proto_lock in the serdev code path to avoid the oops.

Stack trace for posterity:

Unable to handle kernel read from unreadable memory at 406f127000
[000000406f127000] user address but active_mm is swapper
Internal error: Oops: 96000005 [#1] PREEMPT SMP
Hardware name: HiKey Development Board (DT)
Call trace:
 hci_uart_tx_wakeup+0x38/0x148
 hci_uart_send_frame+0x28/0x38
 hci_send_frame+0x64/0xc0
 hci_cmd_work+0x98/0x110
 process_one_work+0x134/0x330
 worker_thread+0x130/0x468
 kthread+0xf8/0x128
 ret_from_fork+0x10/0x18

Link: https://lkml.org/lkml/2017/11/15/908
Reported-and-tested-by: John Stultz <john.stultz(a)linaro.org>
Cc: Ronald Tschalär <ronald(a)innovation.ch>
Cc: Rob Herring <rob.herring(a)linaro.org>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Signed-off-by: Marcel Holtmann <marcel(a)holtmann.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>

---
 drivers/bluetooth/hci_serdev.c |    1 +
 1 file changed, 1 insertion(+)

--- a/drivers/bluetooth/hci_serdev.c
+++ b/drivers/bluetooth/hci_serdev.c
@@ -303,6 +303,7 @@ int hci_uart_register_device(struct hci_
 	hci_set_drvdata(hdev, hu);
 
 	INIT_WORK(&hu->write_work, hci_uart_write_work);
+	percpu_init_rwsem(&hu->proto_lock);
 
 	/* Only when vendor specific setup callback is provided, consider
 	 * the manufacturer information valid. This avoids filling in the


Patches currently in stable-queue which might be from lukas(a)wunner.de are

queue-4.15/bluetooth-hci_serdev-init-hci_uart-proto_lock-to-avoid-oops.patch

7 years, 5 months

Re: [Linux-stable-mirror] [PATCH v2] xfs: preserve i_rdev when recycling a reclaimable inode

by Amir Goldstein

On Thu, Feb 1, 2018 at 2:27 AM, Amir Goldstein <amir73il(a)gmail.com> wrote:
> On Mon, Jan 29, 2018 at 5:50 PM, Darrick J. Wong
> <darrick.wong(a)oracle.com> wrote:
>> On Mon, Jan 29, 2018 at 01:07:36PM +0200, Amir Goldstein wrote:
>>> On Fri, Jan 26, 2018 at 11:44 PM, Darrick J. Wong
>>> <darrick.wong(a)oracle.com> wrote:
>>> > On Fri, Jan 26, 2018 at 09:44:29AM +0200, Amir Goldstein wrote:
>>> >> Commit 66f364649d870 ("xfs: remove if_rdev") moved storing of rdev
>>> >> value for special inodes to VFS inodes, but forgot to preserve the
>>> >> value of i_rdev when recycling a reclaimable xfs_inode.
>>> >>
>>> >> This was detected by xfstest overlay/017 with index=on mount option
>>> >> and xfs base fs. The test does a lookup of overlay chardev and blockdev
>>> >> right after drop caches.
>>> >>
>>> >> Overlayfs inodes hold a reference on underlying xfs inodes when mount
>>> >> option index=on is configured. If drop caches reclaim xfs inodes, before
>>> >> it reclaims overlayfs inodes, that can sometimes leave a reclaimable xfs
>>> >> inode and that test hits that case quite often.
>>> >>
>>> >> When that happens, the xfs inode cache remains broken (zero i_rdev)
>>> >> until the next cycle mount or drop caches.
>>> >>
>>> >> Fixes: 66f364649d870 ("xfs: remove if_rdev")
>>> >> Signed-off-by: Amir Goldstein <amir73il(a)gmail.com>
>>> >
>>> > Looks ok,
>>> > Reviewed-by: Darrick J. Wong <darrick.wong(a)oracle.com>
>>> >
>>>
>>> I reckon that we should now also strap:
>>> Cc: <stable(a)vger.kernel.org> #v4.15
>>>
>>> Can I assume, you'll add it on apply?
>>
>> I'll do a proper backport of this and a couple other critical cow
>> fixes after I get the 4.16 stuff merged.
>>
>
> I am not sure what "proper backport" means in the context of
> this patch.
> This is a v4.15-rc1 regression fix that is based on v4.15-rc8.
> It applied cleanly on v4.15.
>
> CC'ing stable for attention.
>
> This patch is now in master, but due to its timing it did not
> get the CC: stable tag.
>

Now really CC stable.

Amir.
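Editorial illustration (not from the thread): the bug class described in the quoted changelog, a field that a generic re-initialisation wipes while an object is being recycled, can be sketched in userspace as below. The thread does not show the actual fix, so the save-and-restore approach here is an assumption, and `toy_inode`, `toy_inode_reinit()` and `toy_inode_recycle()` are hypothetical names, not XFS or VFS functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified inode; not the kernel's struct inode. */
struct toy_inode {
	uint32_t mode;
	uint32_t rdev;	/* device number of a char/block special file */
	uint32_t state;	/* transient state that should be reset on recycle */
};

/* Stand-in for a generic re-initialisation helper that resets everything. */
static void toy_inode_reinit(struct toy_inode *inode)
{
	memset(inode, 0, sizeof(*inode));
}

/*
 * Recycle a cached inode. Fields that live only in this structure and
 * would be lost by the reinit are saved first and restored afterwards.
 */
static void toy_inode_recycle(struct toy_inode *inode)
{
	uint32_t mode, rdev;

	assert(inode != NULL);

	mode = inode->mode;
	rdev = inode->rdev;

	toy_inode_reinit(inode);

	inode->mode = mode;
	inode->rdev = rdev;	/* without this, a later lookup sees rdev == 0 */
}
```

Dropping the final assignment reproduces the symptom described above: the recycled object stays cached with a zero device number until it is evicted and read back from storage.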

7 years, 5 months

[Linux-stable-mirror] [patch 116/119] mm, memory_hotplug: fix memmap initialization

by akpm＠linux-foundation.org

From: Michal Hocko <mhocko(a)suse.com>
Subject: mm, memory_hotplug: fix memmap initialization

Bharata has noticed that onlining a newly added memory doesn't increase
the total memory, pointing to f7f99100d8d9 ("mm: stop zeroing memory
during allocation in vmemmap") as a culprit.  This commit has changed the
way the memory for memmaps is initialized and moves it from the
allocation time to the initialization time.  This works properly for the
early memmap init path.

It doesn't work for the memory hotplug though because we need to mark page
as reserved when the sparsemem section is created and later initialize it
completely during onlining.  memmap_init_zone is called in the early stage
of onlining.  With the current code it calls __init_single_page and as
such it clears up the whole state and therefore online_pages_range skips
those pages.

Fix this by skipping mm_zero_struct_page in __init_single_page for memory
hotplug path.  This is quite ugly but unifying both early init and memory
hotplug init paths is a large project.  Make sure we plug the regression
at least.

Link: http://lkml.kernel.org/r/20180130101141.GW21609@dhcp22.suse.cz
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Bharata B Rao <bharata(a)linux.vnet.ibm.com>
Tested-by: Bharata B Rao <bharata(a)linux.vnet.ibm.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Cc: Steven Sistare <steven.sistare(a)oracle.com>
Cc: Daniel Jordan <daniel.m.jordan(a)oracle.com>
Cc: Bob Picco <bob.picco(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/page_alloc.c |   22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff -puN mm/page_alloc.c~mm-memory_hotplug-fix-memmap-initialization mm/page_alloc.c
--- a/mm/page_alloc.c~mm-memory_hotplug-fix-memmap-initialization
+++ a/mm/page_alloc.c
@@ -1177,9 +1177,10 @@ static void free_one_page(struct zone *z
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
-				unsigned long zone, int nid)
+				unsigned long zone, int nid, bool zero)
 {
-	mm_zero_struct_page(page);
+	if (zero)
+		mm_zero_struct_page(page);
 	set_page_links(page, zone, nid, pfn);
 	init_page_count(page);
 	page_mapcount_reset(page);
@@ -1194,9 +1195,9 @@ static void __meminit __init_single_page
 }
 
 static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
-					int nid)
+					int nid, bool zero)
 {
-	return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
+	return __init_single_page(pfn_to_page(pfn), pfn, zone, nid, zero);
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -1217,7 +1218,7 @@ static void __meminit init_reserved_page
 		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
 			break;
 	}
-	__init_single_pfn(pfn, zid, nid);
+	__init_single_pfn(pfn, zid, nid, true);
 }
 #else
 static inline void init_reserved_page(unsigned long pfn)
@@ -1534,7 +1535,7 @@ static unsigned long __init deferred_in
 		} else {
 			page++;
 		}
-		__init_single_page(page, pfn, zid, nid);
+		__init_single_page(page, pfn, zid, nid, true);
 		nr_pages++;
 	}
 	return (nr_pages);
@@ -5399,15 +5400,20 @@ not_early:
 		 * can be created for invalid pages (for alignment)
 		 * check here not to call set_pageblock_migratetype() against
 		 * pfn out of zone.
+		 *
+		 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
+		 * because this is done early in sparse_add_one_section
 		 */
 		if (!(pfn & (pageblock_nr_pages - 1))) {
 			struct page *page = pfn_to_page(pfn);
 
-			__init_single_page(page, pfn, zone, nid);
+			__init_single_page(page, pfn, zone, nid,
+					context != MEMMAP_HOTPLUG);
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 			cond_resched();
 		} else {
-			__init_single_pfn(pfn, zone, nid);
+			__init_single_pfn(pfn, zone, nid,
+					context != MEMMAP_HOTPLUG);
 		}
 	}
 }
_
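Editorial illustration (not part of the patch): the effect of the new `zero` argument can be seen in a small userspace model. `toy_page`, `TOY_PG_RESERVED` and `toy_init_single_page()` are hypothetical names; the only behaviour borrowed from the patch is "zero the structure only when asked to, then do the common initialisation".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a page descriptor. */
struct toy_page {
	unsigned long flags;
	int refcount;
};

#define TOY_PG_RESERVED (1UL << 0)	/* set earlier, when the section was added */

/*
 * Common initialisation. With zero == false, state that an earlier
 * stage already placed in the descriptor (here: the reserved flag)
 * survives; with zero == true the descriptor starts from scratch.
 */
static void toy_init_single_page(struct toy_page *page, bool zero)
{
	assert(page != NULL);

	if (zero)
		memset(page, 0, sizeof(*page));
	page->refcount = 1;
}
```

In this model the hotplug path corresponds to passing `false`, so a descriptor that was marked reserved before onlining is still marked reserved afterwards, while the early-boot path keeps passing `true`.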

7 years, 5 months

[Linux-stable-mirror] [patch 013/119] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

by akpm＠linux-foundation.org

From: Gang He <ghe(a)suse.com>
Subject: ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

If we can't get inode lock immediately in the function
ocfs2_inode_lock_with_page() when reading a page, we should not return
directly here, since this will lead to a softlockup problem when the
kernel is built with CONFIG_PREEMPT not set.  The fix is to take a
blocking lock and immediately unlock it before returning.  This avoids
wasting CPU on repeated retries, improves fairness in acquiring the lock
among multiple nodes, and increases efficiency when the same file is
modified frequently from multiple nodes.

The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
looks like:

Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
  <IRQ>
  dump_stack+0x5c/0x82
  panic+0xd5/0x21e
  watchdog_timer_fn+0x208/0x210
  ? watchdog_park_threads+0x70/0x70
  __hrtimer_run_queues+0xcc/0x200
  hrtimer_interrupt+0xa6/0x1f0
  smp_apic_timer_interrupt+0x34/0x50
  apic_timer_interrupt+0x96/0xa0
  </IRQ>
 RIP: 0010:unlock_page+0x17/0x30
 RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
 RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
 RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
 RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
 R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
 R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
  ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
  ocfs2_readpage+0x41/0x2d0 [ocfs2]
  ? pagecache_get_page+0x30/0x200
  filemap_fault+0x12b/0x5c0
  ? recalc_sigpending+0x17/0x50
  ? __set_task_blocked+0x28/0x70
  ? __set_current_blocked+0x3d/0x60
  ocfs2_fault+0x29/0xb0 [ocfs2]
  __do_fault+0x1a/0xa0
  __handle_mm_fault+0xbe8/0x1090
  handle_mm_fault+0xaa/0x1f0
  __do_page_fault+0x235/0x4b0
  trace_do_page_fault+0x3c/0x110
  async_page_fault+0x28/0x30
 RIP: 0033:0x7fa75ded638e
 RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
 RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
 RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
 RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
 R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
 R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000

As for performance, the test runtime is reduced and CPU utilization
decreases; the detailed data follows.  I ran the multi_mmap test case
from the ocfs2-test package in a three-node cluster.

Before applying this patch:
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2754 ocfs2te+  20   0  170248   6980   4856 D 80.73 0.341   0:18.71 multi_mmap
   1505 root      rt   0  222236 123060  97224 S 2.658 6.015   0:01.44 corosync
      5 root      20   0       0      0      0 S 1.329 0.000   0:00.19 kworker/u8:0
     95 root      20   0       0      0      0 S 1.329 0.000   0:00.25 kworker/u8:1
   2728 root      20   0       0      0      0 S 0.997 0.000   0:00.24 jbd2/sda1-33
   2721 root      20   0       0      0      0 S 0.664 0.000   0:00.07 ocfs2dc-3C8CFD4
   2750 ocfs2te+  20   0  142976   4652   3532 S 0.664 0.227   0:00.28 mpirun

ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
Tests with "-b 4096 -C 32768"
Thu Dec 28 14:44:52 CST 2017
multi_mmap..................................................Passed.
Runtime 783 seconds.

After applying this patch:
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2508 ocfs2te+  20   0  170248   6804   4680 R 54.00 0.333   0:55.37 multi_mmap
    155 root      20   0       0      0      0 S 2.667 0.000   0:01.20 kworker/u8:3
     95 root      20   0       0      0      0 S 2.000 0.000   0:01.58 kworker/u8:1
   2504 ocfs2te+  20   0  142976   4604   3480 R 1.667 0.225   0:01.65 mpirun
      5 root      20   0       0      0      0 S 1.000 0.000   0:01.36 kworker/u8:0
   2482 root      20   0       0      0      0 S 1.000 0.000   0:00.86 jbd2/sda1-33
    299 root       0 -20       0      0      0 S 0.333 0.000   0:00.13 kworker/2:1H
    335 root       0 -20       0      0      0 S 0.333 0.000   0:00.17 kworker/1:1H
    535 root      20   0   12140   7268   1456 S 0.333 0.355   0:00.34 haveged
   1282 root      rt   0  222284 123108  97224 S 0.333 6.017   0:01.33 corosync

ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
Tests with "-b 4096 -C 32768"
Thu Dec 28 15:04:12 CST 2017
multi_mmap..................................................Passed.
Runtime 487 seconds.

Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com
Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
Signed-off-by: Gang He <ghe(a)suse.com>
Reviewed-by: Eric Ren <zren(a)suse.com>
Acked-by: alex chen <alex.chen(a)huawei.com>
Acked-by: piaojun <piaojun(a)huawei.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Joseph Qi <jiangqi903(a)gmail.com>
Cc: Changwei Ge <ge.changwei(a)h3c.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 fs/ocfs2/dlmglue.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff -puN fs/ocfs2/dlmglue.c~ocfs2-try-a-blocking-lock-before-return-aop_truncated_page fs/ocfs2/dlmglue.c
--- a/fs/ocfs2/dlmglue.c~ocfs2-try-a-blocking-lock-before-return-aop_truncated_page
+++ a/fs/ocfs2/dlmglue.c
@@ -2486,6 +2486,15 @@ int ocfs2_inode_lock_with_page(struct in
 	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
 	if (ret == -EAGAIN) {
 		unlock_page(page);
+		/*
+		 * If we can't get inode lock immediately, we should not return
+		 * directly here, since this will lead to a softlockup problem.
+		 * The method is to get a blocking lock and immediately unlock
+		 * before returning, this can avoid CPU resource waste due to
+		 * lots of retries, and benefits fairness in getting lock.
+		 */
+		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
+			ocfs2_inode_unlock(inode, ex);
 		ret = AOP_TRUNCATED_PAGE;
 	}
 
_
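Editorial illustration (not part of the patch): the "try without blocking; on contention wait once, release, then tell the caller to retry" pattern can be sketched with a POSIX mutex. This is a userspace analogue under stated assumptions, not the ocfs2 DLM API: `toy_lock_or_retry()` and `TOY_RETRY` are hypothetical, with `TOY_RETRY` playing the role of `AOP_TRUNCATED_PAGE`.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define TOY_RETRY (-1)	/* hypothetical stand-in for AOP_TRUNCATED_PAGE */

/*
 * Returns 0 with the lock held, TOY_RETRY with the lock released
 * (the caller should restart its operation), or a positive errno
 * value if the non-blocking attempt failed for another reason.
 */
static int toy_lock_or_retry(pthread_mutex_t *lock)
{
	int ret;

	assert(lock != NULL);

	ret = pthread_mutex_trylock(lock);
	if (ret == 0)
		return 0;	/* fast path: caller now holds the lock */
	if (ret != EBUSY)
		return ret;

	/*
	 * Contended. Rather than returning at once and letting the
	 * caller spin through immediate retries, sleep until the lock
	 * becomes available, then drop it again before asking for a
	 * retry.
	 */
	if (pthread_mutex_lock(lock) == 0)
		pthread_mutex_unlock(lock);
	return TOY_RETRY;
}
```

The blocking acquire is there only to throttle the caller: by the time `TOY_RETRY` is returned the previous holder has let go, so the retry is likely to succeed on the fast path instead of burning CPU.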

7 years, 5 months
