- Linux-stable-mirror - lists.linaro.org

[merged] userfaultfd-disable-irqs-when-taking-the-waitqueue-lock.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: userfaultfd: disable irqs when taking the waitqueue lock has been removed from the -mm tree. Its filename was userfaultfd-disable-irqs-when-taking-the-waitqueue-lock.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Christoph Hellwig <hch(a)lst.de> Subject: userfaultfd: disable irqs when taking the waitqueue lock userfaultfd contains howe-grown locking of the waitqueue lock, and does not disable interrupts. This relies on the fact that no one else takes it from interrupt context and violates an invariat of the normal waitqueue locking scheme. With aio poll it is easy to trigger other locks that disable interrupts (or are called from interrupt context). Link: http://lkml.kernel.org/r/20181018154101.18750-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch(a)lst.de> Reviewed-by: Andrea Arcangeli <aarcange(a)redhat.com> Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org> Cc: <stable(a)vger.kernel.org> [4.19.x] Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- fs/userfaultfd.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) --- a/fs/userfaultfd.c~userfaultfd-disable-irqs-when-taking-the-waitqueue-lock +++ a/fs/userfaultfd.c @@ -1026,7 +1026,7 @@ static ssize_t userfaultfd_ctx_read(stru struct userfaultfd_ctx *fork_nctx = NULL; /* always take the fd_wqh lock before the fault_pending_wqh lock */ - spin_lock(&ctx->fd_wqh.lock); + spin_lock_irq(&ctx->fd_wqh.lock); __add_wait_queue(&ctx->fd_wqh, &wait); for (;;) { set_current_state(TASK_INTERRUPTIBLE); @@ -1112,13 +1112,13 @@ static ssize_t userfaultfd_ctx_read(stru ret = -EAGAIN; break; } - spin_unlock(&ctx->fd_wqh.lock); + spin_unlock_irq(&ctx->fd_wqh.lock); schedule(); - spin_lock(&ctx->fd_wqh.lock); + spin_lock_irq(&ctx->fd_wqh.lock); } __remove_wait_queue(&ctx->fd_wqh, &wait); __set_current_state(TASK_RUNNING); - spin_unlock(&ctx->fd_wqh.lock); + spin_unlock_irq(&ctx->fd_wqh.lock); if (!ret && msg->event == UFFD_EVENT_FORK) { ret = resolve_userfault_fork(ctx, fork_nctx, msg); _ Patches currently in -mm which might be from hch(a)lst.de are

6 years, 9 months

1
0
0 0

[merged] mm-proc-pid-smaps_rollup-fix-null-pointer-deref-in-smaps_pte_range.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: mm: /proc/pid/smaps_rollup: fix NULL pointer deref in smaps_pte_range() has been removed from the -mm tree. Its filename was mm-proc-pid-smaps_rollup-fix-null-pointer-deref-in-smaps_pte_range.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Vlastimil Babka <vbabka(a)suse.cz> Subject: mm: /proc/pid/smaps_rollup: fix NULL pointer deref in smaps_pte_range() Leonardo reports an apparent regression in 4.19-rc7: BUG: unable to handle kernel NULL pointer dereference at 00000000000000f0 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 6032 Comm: python Not tainted 4.19.0-041900rc7-lowlatency #201810071631 Hardware name: LENOVO 80UG/Toronto 4A2, BIOS 0XCN45WW 08/09/2018 RIP: 0010:smaps_pte_range+0x32d/0x540 Code: 80 00 00 00 00 74 a9 48 89 de 41 f6 40 52 40 0f 85 04 02 00 00 49 2b 30 48 c1 ee 0c 49 03 b0 98 00 00 00 49 8b 80 a0 00 00 00 <48> 8b b8 f0 00 00 00 e8 b7 ef ec ff 48 85 c0 0f 84 71 ff ff ff a8 RSP: 0018:ffffb0cbc484fb88 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000560ddb9e9000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000560ddb9e9 RDI: 0000000000000001 RBP: ffffb0cbc484fbc0 R08: ffff94a5a227a578 R09: ffff94a5a227a578 R10: 0000000000000000 R11: 0000560ddbbe7000 R12: ffffe903098ba728 R13: ffffb0cbc484fc78 R14: ffffb0cbc484fcf8 R15: ffff94a5a2e9cf48 FS: 00007f6dfb683740(0000) GS:ffff94a5aaf80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000f0 CR3: 000000011c118001 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __walk_page_range+0x3c2/0x6f0 walk_page_vma+0x42/0x60 smap_gather_stats+0x79/0xe0 ? gather_pte_stats+0x320/0x320 ? gather_hugetlb_stats+0x70/0x70 show_smaps_rollup+0xcd/0x1c0 seq_read+0x157/0x400 __vfs_read+0x3a/0x180 ? security_file_permission+0x93/0xc0 ? security_file_permission+0x93/0xc0 vfs_read+0x8f/0x140 ksys_read+0x55/0xc0 __x64_sys_read+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Decoded code matched to local compilation+disassembly points to smaps_pte_entry(): } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap && pte_none(*pte))) { page = find_get_entry(vma->vm_file->f_mapping, linear_page_index(vma, addr)); Here, vma->vm_file is NULL. mss->check_shmem_swap should be false in that case, however for smaps_rollup, smap_gather_stats() can set the flag true for one vma and leave it true for subsequent vma's where it should be false. To fix, reset the check_shmem_swap flag to false. There's also related bug which sets mss->swap to shmem_swapped, which in the context of smaps_rollup overwrites any value accumulated from previous vma's. Fix that as well. Note that the report suggests a regression between 4.17.19 and 4.19-rc7, which makes the 4.19 series ending with commit 258f669e7e88 ("mm: /proc/pid/smaps_rollup: convert to single value seq_file") suspicious. But the mss was reused for rollup since 493b0e9d945f ("mm: add /proc/pid/smaps_rollup") so let's play it safe with the stable backport. Link: http://lkml.kernel.org/r/555fbd1f-4ac9-0b58-dcd4-5dc4380ff7ca@suse.cz Link: https://bugzilla.kernel.org/show_bug.cgi?id=201377 Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup") Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz> Reported-by: Leonardo Soares Müller <leozinho29_eu(a)hotmail.com> Tested-by: Leonardo Soares Müller <leozinho29_eu(a)hotmail.com> Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Cc: Daniel Colascione <dancol(a)google.com> Cc: Alexey Dobriyan <adobriyan(a)gmail.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- fs/proc/task_mmu.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) --- a/fs/proc/task_mmu.c~mm-proc-pid-smaps_rollup-fix-null-pointer-deref-in-smaps_pte_range +++ a/fs/proc/task_mmu.c @@ -713,6 +713,8 @@ static void smap_gather_stats(struct vm_ smaps_walk.private = mss; #ifdef CONFIG_SHMEM + /* In case of smaps_rollup, reset the value from previous vma */ + mss->check_shmem_swap = false; if (vma->vm_file && shmem_mapping(vma->vm_file->f_mapping)) { /* * For shared or readonly shmem mappings we know that all @@ -728,7 +730,7 @@ static void smap_gather_stats(struct vm_ if (!shmem_swapped || (vma->vm_flags & VM_SHARED) || !(vma->vm_flags & VM_WRITE)) { - mss->swap = shmem_swapped; + mss->swap += shmem_swapped; } else { mss->check_shmem_swap = true; smaps_walk.pte_hole = smaps_pte_hole; _ Patches currently in -mm which might be from vbabka(a)suse.cz are

6 years, 9 months

1
0
0 0

[PATCH AUTOSEL 4.9 01/98] perf symbols: Fix memory corruption because of zero length symbols

by Sasha Levin

From: Ravi Bangoria <ravi.bangoria(a)linux.vnet.ibm.com> [ Upstream commit 331c7cb307971eac38e9470340e10c87855bf4bc ] Perf top is often crashing at very random locations on powerpc. After investigating, I found the crash only happens when sample is of zero length symbol. Powerpc kernel has many such symbols which does not contain length details in vmlinux binary and thus start and end addresses of such symbols are same. Structure struct sym_hist { u64 nr_samples; u64 period; struct sym_hist_entry addr[0]; }; has last member 'addr[]' of size zero. 'addr[]' is an array of addresses that belongs to one symbol (function). If function consist of 100 instructions, 'addr' points to an array of 100 'struct sym_hist_entry' elements. For zero length symbol, it points to the *empty* array, i.e. no members in the array and thus offset 0 is also invalid for such array. static int __symbol__inc_addr_samples(...) { ... offset = addr - sym->start; h = annotation__histogram(notes, evidx); h->nr_samples++; h->addr[offset].nr_samples++; h->period += sample->period; h->addr[offset].period += sample->period; ... } Here, when 'addr' is same as 'sym->start', 'offset' becomes 0, which is valid for normal symbols but *invalid* for zero length symbols and thus updating h->addr[offset] causes memory corruption. Fix this by adding one dummy element for zero length symbols. Link: https://lkml.org/lkml/2016/10/10/148 Fixes: edee44be5919 ("perf annotate: Don't throw error for zero length symbols") Signed-off-by: Ravi Bangoria <ravi.bangoria(a)linux.vnet.ibm.com> Acked-by: Jiri Olsa <jolsa(a)kernel.org> Acked-by: Namhyung Kim <namhyung(a)kernel.org> Cc: Alexander Shishkin <alexander.shishkin(a)linux.intel.com> Cc: Jin Yao <yao.jin(a)linux.intel.com> Cc: Kim Phillips <kim.phillips(a)arm.com> Cc: Naveen N. Rao <naveen.n.rao(a)linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz(a)infradead.org> Cc: Taeung Song <treeze.taeung(a)gmail.com> Link: http://lkml.kernel.org/r/1508854806-10542-1-git-send-email-ravi.bangoria@li… Signed-off-by: Arnaldo Carvalho de Melo <acme(a)redhat.com> Signed-off-by: Sasha Levin <sashal(a)kernel.org> --- tools/perf/util/annotate.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c index a38227eb5450..3336cbc6ec48 100644 --- a/tools/perf/util/annotate.c +++ b/tools/perf/util/annotate.c @@ -495,9 +495,19 @@ static struct ins *ins__find(const char *name) int symbol__alloc_hist(struct symbol *sym) { struct annotation *notes = symbol__annotation(sym); - const size_t size = symbol__size(sym); + size_t size = symbol__size(sym); size_t sizeof_sym_hist; + /* + * Add buffer of one element for zero length symbol. + * When sample is taken from first instruction of + * zero length symbol, perf still resolves it and + * shows symbol name in perf report and allows to + * annotate it. + */ + if (size == 0) + size = 1; + /* Check for overflow when calculating sizeof_sym_hist */ if (size > (SIZE_MAX - sizeof(struct sym_hist)) / sizeof(u64)) return -1; -- 2.17.1

6 years, 9 months

3
101
0 0

[PATCH] kbuild: set no-integrated-as before incl. arch Makefile

by ndesaulniers＠google.com

From: Stefan Agner <stefan(a)agner.ch> In order to make sure compiler flag detection for ARM works correctly the no-integrated-as flags need to be set before including the arch specific Makefile. commit 0f0e8de334c54c38818a4a5390a39aa09deff5bf upstream Fixes: cfe17c9bbe6a ("kbuild: move cc-option and cc-disable-warning after incl. arch Makefile") Signed-off-by: Stefan Agner <stefan(a)agner.ch> Signed-off-by: Masahiro Yamada <yamada.masahiro(a)socionext.com> Reported-by: Nathan Chancellor <natechancellor(a)gmail.com> Signed-off-by: Nick Desaulniers <ndesaulniers(a)google.com> --- Sending to stable for inclusion in 4.14. Needed for CONFIG_ARM64_LSE_ATOMICS. Makefile | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Makefile b/Makefile index e02d092bc2d6..9e7f69c26aba 100644 --- a/Makefile +++ b/Makefile @@ -487,6 +487,8 @@ CLANG_GCC_TC := --gcc-toolchain=$(GCC_TOOLCHAIN) endif KBUILD_CFLAGS += $(CLANG_TARGET) $(CLANG_GCC_TC) KBUILD_AFLAGS += $(CLANG_TARGET) $(CLANG_GCC_TC) +KBUILD_CFLAGS += $(call cc-option, -no-integrated-as) +KBUILD_AFLAGS += $(call cc-option, -no-integrated-as) endif RETPOLINE_CFLAGS_GCC := -mindirect-branch=thunk-extern -mindirect-branch-register @@ -743,8 +745,6 @@ KBUILD_CFLAGS += $(call cc-disable-warning, tautological-compare) # See modpost pattern 2 KBUILD_CFLAGS += $(call cc-option, -mno-global-merge,) KBUILD_CFLAGS += $(call cc-option, -fcatch-undefined-behavior) -KBUILD_CFLAGS += $(call cc-option, -no-integrated-as) -KBUILD_AFLAGS += $(call cc-option, -no-integrated-as) else # These warnings generated too much noise in a regular build. -- 2.19.1.568.g152ad8e336-goog

6 years, 9 months

3
2
0 0

[PATCH AUTOSEL 4.14 01/46] iwlwifi: mvm: check for short GI only for OFDM

by Sasha Levin

From: Sara Sharon <sara.sharon(a)intel.com> [ Upstream commit 4c59ff5a9a9c54cc26c807dc2fa6933f7e9fa4ef ] This bit will be used in CCK to indicate short preamble. Signed-off-by: Sara Sharon <sara.sharon(a)intel.com> Signed-off-by: Luca Coelho <luciano.coelho(a)intel.com> Signed-off-by: Sasha Levin <sashal(a)kernel.org> --- drivers/net/wireless/intel/iwlwifi/mvm/rx.c | 3 ++- drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c | 4 +++- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/rx.c b/drivers/net/wireless/intel/iwlwifi/mvm/rx.c index 2d14a58cbdd7..c73e4be9bde3 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/rx.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/rx.c @@ -439,7 +439,8 @@ void iwl_mvm_rx_rx_mpdu(struct iwl_mvm *mvm, struct napi_struct *napi, rx_status->bw = RATE_INFO_BW_160; break; } - if (rate_n_flags & RATE_MCS_SGI_MSK) + if (!(rate_n_flags & RATE_MCS_CCK_MSK) && + rate_n_flags & RATE_MCS_SGI_MSK) rx_status->enc_flags |= RX_ENC_FLAG_SHORT_GI; if (rate_n_flags & RATE_HT_MCS_GF_MSK) rx_status->enc_flags |= RX_ENC_FLAG_HT_GF; diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c b/drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c index e2196dc35dc6..8ba8c70571fb 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c @@ -981,7 +981,9 @@ void iwl_mvm_rx_mpdu_mq(struct iwl_mvm *mvm, struct napi_struct *napi, rx_status->bw = RATE_INFO_BW_160; break; } - if (rate_n_flags & RATE_MCS_SGI_MSK) + + if (!(rate_n_flags & RATE_MCS_CCK_MSK) && + rate_n_flags & RATE_MCS_SGI_MSK) rx_status->enc_flags |= RX_ENC_FLAG_SHORT_GI; if (rate_n_flags & RATE_HT_MCS_GF_MSK) rx_status->enc_flags |= RX_ENC_FLAG_HT_GF; -- 2.17.1

6 years, 9 months

3
48
0 0

[PATCH] perf/stat: Handle different PMU names with common prefix

by Thomas Richter

On s390 the CPU Measurement Facility for counters now supports 2 PMUs named cpum_cf (CPU Measurement Facility for counters) and cpum_cf_diag (CPU Measurement Facility for diagnostic counters) for one and the same CPU. Running command [root@s35lp76 perf]# ./perf stat -e tx_c_tend \ -- ~/mytests/cf-tx-events 1 Measuring transactions TX_C_TABORT_NO_SPECIAL: 0 expected:0 TX_C_TABORT_SPECIAL: 0 expected:0 TX_C_TEND: 1 expected:1 TX_NC_TABORT: 11 expected:11 TX_NC_TEND: 1 expected:1 Performance counter stats for '/root/mytests/cf-tx-events 1': 2 tx_c_tend 0.002120091 seconds time elapsed 0.000121000 seconds user 0.002127000 seconds sys [root@s35lp76 perf]# displays output which is unexpected (and wrong): 2 tx_c_tend The test program definitely triggers only one transaction, as shown in line 'TX_C_TEND: 1 expected:1'. This is caused by the following call sequence: pmu_lookup() scans and installs a PMU. +--> pmu_aliases() parses all aliases in directory .../<pmu-name>/events/* which are file names. +--> pmu_aliases_parse() Read each file in directory and create an new alias entry. This is done with +--> perf_pmu__new_alias() and +--> __perf_pmu__new_alias() which also check for identical alias names. After pmu_aliases() returns, a complete list of event names for this pmu has been created. Now function pmu_add_cpu_aliases() is called to add the events listed in the json | files to the alias list of the cpu. +--> perf_pmu__find_map() Returns a pointer to the json events. Now function pmu_add_cpu_aliases() scans through all events listed in the JSON files for this CPU. Each json event pmu name is compared with the current PMU being built up and if they mismatch, the json event is added to the current PMUs alias list. To avoid duplicate entries the following comparison is done: if (!is_arm_pmu_core(name)) { pname = pe->pmu ? pe->pmu : "cpu"; if (strncmp(pname, name, strlen(pname))) continue; } The culprit is the strncmp() function. Using current s390 PMU naming, the first PMU is 'cpum_cf' and a long list of events is added, among them 'tx_c_tend' When the second PMU named 'cpum_cf_diag' is added, only one event named 'CF_DIAG' is added by the pmu_aliases() function. Now function pmu_add_cpu_aliases() is invoked for PMU 'cpum_cf_diag'. Since the CPUID string is the same for both PMUs, json file events for PMU named 'cpum_cf' are added to the PMU 'cpm_cf_diag' This happens because the strncmp() actually compares: strncmp("cpum_cf", "cpum_cf_diag", 6); The first parameter is the pmu name taken from the event in the json file. The second parameter is the pmu name of the PMU currently being built. They are different, but the length of the compare only tests the common prefix and this returns 0(true) when it should return false. Now all events for PMU cpum_cf are added to the alias list for pmu cpum_cf_diag. Later on in function parse_events_add_pmu() the event 'tx_c_end' is searched in all available PMUs and found twice, adding it two times to the evsel_list global variable which is the root of all events. This results in a counter value of 2 instead of 1. Output with this patch: [root@s35lp76 perf]# ./perf stat -e tx_c_tend \ -- ~/mytests/cf-tx-events 1 Measuring transactions TX_C_TABORT_NO_SPECIAL: 0 expected:0 TX_C_TABORT_SPECIAL: 0 expected:0 TX_C_TEND: 1 expected:1 TX_NC_TABORT: 11 expected:11 TX_NC_TEND: 1 expected:1 Performance counter stats for '/root/mytests/cf-tx-events 1': 1 tx_c_tend 0.001815365 seconds time elapsed 0.000123000 seconds user 0.001756000 seconds sys [root@s35lp76 perf]# Fixes: 292c34c10249 ("perf pmu: Fix core PMU alias list for X86 platform") Signed-off-by: Thomas Richter <tmricht(a)linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner(a)linux.ibm.com> Cc: Kan Liang <kan.liang(a)linux.intel.com> Cc: <stable(a)vger.kernel.org> # 4.18+ --- tools/perf/util/pmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c index 7799788f662f..7e49baad304d 100644 --- a/tools/perf/util/pmu.c +++ b/tools/perf/util/pmu.c @@ -773,7 +773,7 @@ static void pmu_add_cpu_aliases(struct list_head *head, struct perf_pmu *pmu) if (!is_arm_pmu_core(name)) { pname = pe->pmu ? pe->pmu : "cpu"; - if (strncmp(pname, name, strlen(pname))) + if (strcmp(pname, name)) continue; } -- 2.14.3

6 years, 9 months

4
3
0 0

[PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

by Michal Hocko

From: Andrea Arcangeli <aarcange(a)redhat.com> THP allocation might be really disruptive when allocated on NUMA system with the local node full or hard to reclaim. Stefan has posted an allocation stall report on 4.12 based SLES kernel which suggests the same issue: [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null) [245513.363983] kvm cpuset=/ mems_allowed=0-1 [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased) [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017 [245513.365905] Call Trace: [245513.366535] dump_stack+0x5c/0x84 [245513.367148] warn_alloc+0xe0/0x180 [245513.367769] __alloc_pages_slowpath+0x820/0xc90 [245513.368406] ? __slab_free+0xa9/0x2f0 [245513.369048] ? __slab_free+0xa9/0x2f0 [245513.369671] __alloc_pages_nodemask+0x1cc/0x210 [245513.370300] alloc_pages_vma+0x1e5/0x280 [245513.370921] do_huge_pmd_wp_page+0x83f/0xf00 [245513.371554] ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0 [245513.372184] ? do_huge_pmd_anonymous_page+0x631/0x6d0 [245513.372812] __handle_mm_fault+0x93d/0x1060 [245513.373439] handle_mm_fault+0xc6/0x1b0 [245513.374042] __do_page_fault+0x230/0x430 [245513.374679] ? get_vtime_delta+0x13/0xb0 [245513.375411] do_page_fault+0x2a/0x70 [245513.376145] ? page_fault+0x65/0x80 [245513.376882] page_fault+0x7b/0x80 [...] [245513.382056] Mem-Info: [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5 active_file:60183 inactive_file:245285 isolated_file:0 unevictable:15657 dirty:286 writeback:1 unstable:0 slab_reclaimable:75543 slab_unreclaimable:2509111 mapped:81814 shmem:31764 pagetables:370616 bounce:0 free:32294031 free_pcp:6233 free_cma:0 [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no The defrag mode is "madvise" and from the above report it is clear that the THP has been allocated for MADV_HUGEPAGA vma. Andrea has identified that the main source of the problem is __GFP_THISNODE usage: : The problem is that direct compaction combined with the NUMA : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very : hard the local node, instead of failing the allocation if there's no : THP available in the local node. : : Such logic was ok until __GFP_THISNODE was added to the THP allocation : path even with MPOL_DEFAULT. : : The idea behind the __GFP_THISNODE addition, is that it is better to : provide local memory in PAGE_SIZE units than to use remote NUMA THP : backed memory. That largely depends on the remote latency though, on : threadrippers for example the overhead is relatively low in my : experience. : : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in : extremely slow qemu startup with vfio, if the VM is larger than the : size of one host NUMA node. This is because it will try very hard to : unsuccessfully swapout get_user_pages pinned pages as result of the : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE : allocations and instead of trying to allocate THP on other nodes (it : would be even worse without vfio type1 GUP pins of course, except it'd : be swapping heavily instead). Fix this by removing __GFP_THISNODE for THP requests which are requesting the direct reclaim. This effectivelly reverts 5265047ac301 on the grounds that the zone/node reclaim was known to be disruptive due to premature reclaim when there was memory free. While it made sense at the time for HPC workloads without NUMA awareness on rare machines, it was ultimately harmful in the majority of cases. The existing behaviour is similiar, if not as widespare as it applies to a corner case but crucially, it cannot be tuned around like zone_reclaim_mode can. The default behaviour should always be to cause the least harm for the common case. If there are specialised use cases out there that want zone_reclaim_mode in specific cases, then it can be built on top. Longterm we should consider a memory policy which allows for the node reclaim like behavior for the specific memory ranges which would allow a [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com [mhocko(a)suse.com: rewrote the changelog based on the one from Andrea] Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") Cc: Zi Yan <zi.yan(a)cs.rutgers.edu> Cc: stable # 4.1+ Reported-by: Stefan Priebe <s.priebe(a)profihost.ag> Debugged-by: Andrea Arcangeli <aarcange(a)redhat.com> Reported-by: Alex Williamson <alex.williamson(a)redhat.com> Signed-off-by: Andrea Arcangeli <aarcange(a)redhat.com> Signed-off-by: Michal Hocko <mhocko(a)suse.com> --- mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index da858f794eb6..149b6f4cf023 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, nmask = policy_nodemask(gfp, pol); if (!nmask || node_isset(hpage_node, *nmask)) { mpol_cond_put(pol); - page = __alloc_pages_node(hpage_node, - gfp | __GFP_THISNODE, order); + /* + * We cannot invoke reclaim if __GFP_THISNODE + * is set. Invoking reclaim with + * __GFP_THISNODE set, would cause THP + * allocations to trigger heavy swapping + * despite there may be tons of free memory + * (including potentially plenty of THP + * already available in the buddy) on all the + * other NUMA nodes. + * + * At most we could invoke compaction when + * __GFP_THISNODE is set (but we would need to + * refrain from invoking reclaim even if + * compaction returned COMPACT_SKIPPED because + * there wasn't not enough memory to succeed + * compaction). For now just avoid + * __GFP_THISNODE instead of limiting the + * allocation path to a strict and single + * compaction invocation. + * + * Supposedly if direct reclaim was enabled by + * the caller, the app prefers THP regardless + * of the node it comes from so this would be + * more desiderable behavior than only + * providing THP originated from the local + * node in such case. + */ + if (!(gfp & __GFP_DIRECT_RECLAIM)) + gfp |= __GFP_THISNODE; + page = __alloc_pages_node(hpage_node, gfp, order); goto out; } } -- 2.18.0

6 years, 9 months

9
45
0 0

urgent

by Les Scadding

Hello Dear Do you have the passion for humanitarian welfare? Can you devote your time and be totally committed and devoted to run multi-million pounds humanitarian charity project sponsored totally by me; with an incentive/compensation accrual to you for your time and effort and at no cost to you. If interested, reply me for the full details Thanks Les Scadding

6 years, 9 months

1
0
0 0

[PATCH 0/3] block: make sure discard/writesame bio is aligned with logical block size

by Ming Lei

Hi, The 1st & 3rd patch fixes bio size alignment issue. The 2nd patch cleans up __blkdev_issue_discard() a bit. Thanks, Ming Lei (3): block: make sure discard bio is aligned with logical block size block: cleanup __blkdev_issue_discard() block: make sure writesame bio is aligned with logical block size block/blk-lib.c | 25 ++++++------------------- 1 file changed, 6 insertions(+), 19 deletions(-) Cc: Rui Salvaterra <rsalvaterra(a)gmail.com> Cc: stable(a)vger.kernel.org Cc: Mike Snitzer <snitzer(a)redhat.com> Cc: Christoph Hellwig <hch(a)lst.de> Cc: Xiao Ni <xni(a)redhat.com> Cc: Mariusz Dabrowski <mariusz.dabrowski(a)intel.com> -- 2.9.5

6 years, 9 months

2
6
0 0

[PATCH v3] libata: Apply NOLPM quirk for SAMSUNG MZ7TD256HAFV-000L9

by Diego Viola

med_power_with_dipm causes my T450 to freeze with a SAMSUNG MZ7TD256HAFV-000L9 SSD (firmware DXT02L5Q). Switching the LPM to max_performance fixes this issue. Signed-off-by: Diego Viola <diego.viola(a)gmail.com> --- drivers/ata/libata-core.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index a9dd4ea7467d..6e594644cb1d 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -4553,6 +4553,7 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = { /* These specific Samsung models/firmware-revs do not handle LPM well */ { "SAMSUNG MZMPC128HBFU-000MV", "CXM14M1Q", ATA_HORKAGE_NOLPM, }, { "SAMSUNG SSD PM830 mSATA *", "CXM13D1Q", ATA_HORKAGE_NOLPM, }, + { "SAMSUNG MZ7TD256HAFV-000L9", "DXT02L5Q", ATA_HORKAGE_NOLPM, }, /* devices that don't properly handle queued TRIM commands */ { "Micron_M500IT_*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM | -- 2.19.1

6 years, 9 months

3
6
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror