From: Ralph Campbell <rcampbell(a)nvidia.com>
Private ZONE_DEVICE pages use a special pte entry and thus are not
present. Handle this case properly in map_pte(). It is already handled
in check_pte(); the map_pte() part was most probably lost in some rebase.
Without this patch the slow migration path cannot migrate private
ZONE_DEVICE memory back to regular memory. This was found after stress
testing migration back to system memory. This can ultimately lead the
CPU to an infinite page fault loop on the special swap entry.
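A minimal userspace sketch of the accept/reject decision described above.
It is not the kernel code: the enum, the model_* names and the
want_migration parameter are hypothetical, and the condition guarding the
first branch is not visible in the hunk, so the sketch simply assumes it
selects walks that look for swap-type entries.

```c
#include <stdbool.h>

/* Hypothetical, simplified PTE states; the kernel encodes these in pte_t. */
enum pte_kind {
	PTE_NONE,		/* empty entry */
	PTE_PRESENT,		/* maps a regular, CPU-addressable page */
	PTE_SWAP,		/* ordinary swap entry */
	PTE_MIGRATION,		/* migration entry (a swap-type entry) */
	PTE_DEVICE_PRIVATE,	/* device-private entry (a swap-type entry) */
};

/* Stand-in for is_swap_pte(): not none and not present. */
static bool model_is_swap_pte(enum pte_kind k)
{
	return k == PTE_SWAP || k == PTE_MIGRATION || k == PTE_DEVICE_PRIVATE;
}

/*
 * Models the two branches shown in the hunk. want_migration is an
 * assumed name for whatever condition selects the first branch.
 */
static bool model_map_pte(enum pte_kind k, bool want_migration)
{
	if (want_migration) {
		if (!model_is_swap_pte(k))
			return false;
	} else {
		/* The added check: un-addressable ZONE_DEVICE memory. */
		if (k == PTE_DEVICE_PRIVATE)
			return true;

		if (k != PTE_PRESENT)
			return false;
	}
	return true;
}
```

In this model a device-private entry is accepted on the non-migration
path even though it is not present, while an ordinary swap entry is
still rejected there.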
Signed-off-by: Ralph Campbell <rcampbell(a)nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse(a)redhat.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
mm/page_vma_mapped.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae3c2a35d61b..1cf5b9bfb559 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -21,6 +21,15 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
if (!is_swap_pte(*pvmw->pte))
return false;
} else {
+ if (is_swap_pte(*pvmw->pte)) {
+ swp_entry_t entry;
+
+ /* Handle un-addressable ZONE_DEVICE memory */
+ entry = pte_to_swp_entry(*pvmw->pte);
+ if (is_device_private_entry(entry))
+ return true;
+ }
+
if (!pte_present(*pvmw->pte))
return false;
}
--
2.17.1
From: Jason Wang <jasowang(a)redhat.com>
commit b196d88aba8ac72b775137854121097f4c4c6862 upstream.
We used to initialize ptr_ring during TUNSETIFF because its
size depends on the tx_queue_len of the netdevice, and we tried to clean
it up when the socket was detached from the netdevice. A race was spotted
when trying to do uninit during a read, which leads to a use after free
of the pointer ring. Solve this by always initializing a zero-size
ptr_ring in open() and resizing it during TUNSETIFF, so that we can safely
do cleanup during close(). With this, there's no need for the workaround
that was introduced by commit 4df0bfc79904 ("tun: fix a memory leak
for tfile->tx_array").
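The lifecycle described above (init with size zero in open(), resize on
attach, cleanup on close()) can be sketched in plain userspace C. All
names here (struct ring, ring_*, model_*) are hypothetical stand-ins, not
the ptr_ring or skb_array API, and the resize does not migrate queued
entries as a real ring would.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the ring embedded in the per-file state. */
struct ring {
	void **queue;
	int size;
};

/* Always allocate at least one slot so a zero-size ring is still valid. */
static void **ring_alloc(int size)
{
	return calloc(size > 0 ? (size_t)size : 1, sizeof(void *));
}

static int ring_init(struct ring *r, int size)
{
	r->queue = ring_alloc(size);
	if (!r->queue)
		return -1;
	r->size = size;
	return 0;
}

static int ring_resize(struct ring *r, int size)
{
	void **q = ring_alloc(size);

	if (!q)
		return -1;
	free(r->queue);		/* a real resize would move queued entries */
	r->queue = q;
	r->size = size;
	return 0;
}

static void ring_cleanup(struct ring *r)
{
	free(r->queue);
	r->queue = NULL;
	r->size = 0;
}

struct model_file {
	struct ring tx;
};

/* open(): the ring exists from the start, with size zero. */
static int model_open(struct model_file *f)
{
	return ring_init(&f->tx, 0);
}

/* attach (TUNSETIFF in the text): only the size changes. */
static int model_attach(struct model_file *f, int tx_queue_len)
{
	return ring_resize(&f->tx, tx_queue_len);
}

/* close(): the single place where the ring is torn down. */
static void model_close(struct model_file *f)
{
	ring_cleanup(&f->tx);
}
```

Because the ring is valid for the whole open-to-close lifetime of the
file in this model, a reader never sees it disappear on detach.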
Backport note:
This is a backport of the following 2 upstream patches (the second fixes
the first).
b196d88aba ("tun: fix use after free for ptr_ring")
7063efd33b ("tuntap: fix use after free during release")
Comparison with the upstream patch:
[1] A "semantic revert" of the changes made in
4df0bfc799("tun: fix a memory leak for tfile->tx_array").
4df0bfc799 was applied upstream, and then skb array was changed
to use ptr_ring. The upstream fix then removes the changes introduced
by 4df0bfc799. This backport does the same; "revert" the changes
made by 4df0bfc799.
[2] xdp_rxq_info_unreg() being called in relevant locations
As xdp_rxq_info related patches are not present in 4.14, these
changes are not needed in the backport.
[3] An instance of ptr_ring_init needs to be replaced by skb_array_init.
[4] ptr_ring_cleanup needs to be replaced by skb_array_cleanup.
b196d88ab places the cleanup function in tun_chr_close() only to
later move it into __tun_detach in upstream commit
7063efd33bb("tuntap: fix use after free during release"). So
place skb_array_cleanup in __tun_detach.
Reported-by: syzbot+e8b902c3c3fadf0a9dba(a)syzkaller.appspotmail.com
Cc: Eric Dumazet <eric.dumazet(a)gmail.com>
Cc: Cong Wang <xiyou.wangcong(a)gmail.com>
Cc: Michael S. Tsirkin <mst(a)redhat.com>
Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
Signed-off-by: Jason Wang <jasowang(a)redhat.com>
Acked-by: Michael S. Tsirkin <mst(a)redhat.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Zubin Mithra <zsm(a)chromium.org>
---
drivers/net/tun.c | 21 +++++++--------------
1 file changed, 7 insertions(+), 14 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cb17ffadfc30..e0baea2dfd3c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -534,14 +534,6 @@ static void tun_queue_purge(struct tun_file *tfile)
skb_queue_purge(&tfile->sk.sk_error_queue);
}
-static void tun_cleanup_tx_array(struct tun_file *tfile)
-{
- if (tfile->tx_array.ring.queue) {
- skb_array_cleanup(&tfile->tx_array);
- memset(&tfile->tx_array, 0, sizeof(tfile->tx_array));
- }
-}
-
static void __tun_detach(struct tun_file *tfile, bool clean)
{
struct tun_file *ntfile;
@@ -583,7 +575,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
tun->dev->reg_state == NETREG_REGISTERED)
unregister_netdevice(tun->dev);
}
- tun_cleanup_tx_array(tfile);
+ skb_array_cleanup(&tfile->tx_array);
sock_put(&tfile->sk);
}
}
@@ -623,13 +615,11 @@ static void tun_detach_all(struct net_device *dev)
/* Drop read queue */
tun_queue_purge(tfile);
sock_put(&tfile->sk);
- tun_cleanup_tx_array(tfile);
}
list_for_each_entry_safe(tfile, tmp, &tun->disabled, next) {
tun_enable_queue(tfile);
tun_queue_purge(tfile);
sock_put(&tfile->sk);
- tun_cleanup_tx_array(tfile);
}
BUG_ON(tun->numdisabled != 0);
@@ -675,7 +665,7 @@ static int tun_attach(struct tun_struct *tun, struct file *file, bool skip_filte
}
if (!tfile->detached &&
- skb_array_init(&tfile->tx_array, dev->tx_queue_len, GFP_KERNEL)) {
+ skb_array_resize(&tfile->tx_array, dev->tx_queue_len, GFP_KERNEL)) {
err = -ENOMEM;
goto out;
}
@@ -2624,6 +2614,11 @@ static int tun_chr_open(struct inode *inode, struct file * file)
&tun_proto, 0);
if (!tfile)
return -ENOMEM;
+ if (skb_array_init(&tfile->tx_array, 0, GFP_KERNEL)) {
+ sk_free(&tfile->sk);
+ return -ENOMEM;
+ }
+
RCU_INIT_POINTER(tfile->tun, NULL);
tfile->flags = 0;
tfile->ifindex = 0;
@@ -2644,8 +2639,6 @@ static int tun_chr_open(struct inode *inode, struct file * file)
sock_set_flag(&tfile->sk, SOCK_ZEROCOPY);
- memset(&tfile->tx_array, 0, sizeof(tfile->tx_array));
-
return 0;
}
--
2.19.0.rc0.228.g281dcd1b4d0-goog
clang-7 has a new warning (-Wreturn-stack-address) that fires when a
function returns the address of a local variable. This is in general a
good warning, but the kernel has a few places where GNU statement
expressions return the address of a label in order to get the current
instruction pointer (see _THIS_IP_ and current_text_addr).
In order to disable a warning at a single call site, the kernel already
has __diag macros for inserting compiler and compiler-version specific
_Pragma's.
This series adds CLANG_VERSION macros necessary for proper __diag
support, and whitelists the case in _THIS_IP_. current_text_addr will be
consolidated in a follow-up series.
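A small sketch of the idea, assuming GNU C extensions (statement
expressions and labels as values). The DEMO_* macros are hypothetical
local names, not the kernel's __diag macros, and the pragmas are only
emitted when building with clang.

```c
/* Hypothetical local wrappers around clang's diagnostic pragmas. */
#ifdef __clang__
#define DEMO_DIAG_PUSH_IGNORE \
	_Pragma("clang diagnostic push") \
	_Pragma("clang diagnostic ignored \"-Wreturn-stack-address\"")
#define DEMO_DIAG_POP _Pragma("clang diagnostic pop")
#else
#define DEMO_DIAG_PUSH_IGNORE
#define DEMO_DIAG_POP
#endif

/*
 * Same shape as the construct the text describes: a statement
 * expression that yields the address of a local label.
 */
#define DEMO_THIS_IP ({ __label__ here; here: (unsigned long)&&here; })

static unsigned long demo_current_ip(void)
{
	unsigned long ip;

	/* Silence the warning for this one use only. */
	DEMO_DIAG_PUSH_IGNORE
	ip = DEMO_THIS_IP;
	DEMO_DIAG_POP

	return ip;
}
```

The push/pop pair keeps the warning enabled everywhere else in the
translation unit.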
Nick Desaulniers (2):
compiler-clang.h: Add CLANG_VERSION and __diag macros
kernel.h: Disable -Wreturn-stack-address for _THIS_IP_
include/linux/compiler-clang.h | 19 +++++++++++++++++++
include/linux/compiler_types.h | 4 ++++
include/linux/kernel.h | 10 +++++++++-
3 files changed, 32 insertions(+), 1 deletion(-)
--
2.18.0.345.g5c9ce644c3-goog
4.4.y, 4.9.y:
fs/cifs/cifsfs.c: In function 'cifs_statfs':
fs/cifs/cifsfs.c:198:27: error: 'struct cifs_tcon' has no member named 'vol_serial_number'
fs/cifs/cifsfs.c:200:45: error: 'struct cifs_tcon' has no member named 'vol_create_time'
4.14.y, 4.18.y:
kernel/printk/printk_safe.o: In function `vprintk_func':
kernel/printk/printk_safe.c:386: undefined reference to `vprintk_store'
kernel/printk/printk_safe.c:388: undefined reference to `defer_console_output'
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 86658b819cd0a9aa584cd84453ed268a6f013770 Mon Sep 17 00:00:00 2001
From: Punit Agrawal <punit.agrawal(a)arm.com>
Date: Mon, 13 Aug 2018 11:43:50 +0100
Subject: [PATCH] KVM: arm/arm64: Skip updating PMD entry if no change
Contention on updating a PMD entry by a large number of vcpus can lead
to duplicate work when handling stage 2 page faults. As the page table
update follows the break-before-make requirement of the architecture,
it can lead to repeated refaults due to clearing the entry and
flushing the tlbs.
This problem is more likely when:
* there are a large number of vcpus
* the mapping is a large block mapping,
such as when using PMD hugepages (512MB) with 64k pages.
Fix this by skipping the page table update if there is no change in
the entry being updated.
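A simplified userspace model of the fix, assuming a PMD can be treated
as a 64-bit value with a hypothetical "present" bit. The model_* names
and the counters are illustrative only; the counter increments stand in
for pmd_clear() and the TLB flush.

```c
#include <stdint.h>

#define MODEL_PMD_PRESENT 0x1ULL	/* hypothetical "valid" bit */

struct model_stats {
	unsigned int clears;		/* stand-in for pmd_clear() calls */
	unsigned int tlb_flushes;	/* stand-in for TLB flushes */
};

static void model_set_pmd_huge(uint64_t *pmd, uint64_t new_pmd,
			       struct model_stats *st)
{
	uint64_t old_pmd = *pmd;

	if (old_pmd & MODEL_PMD_PRESENT) {
		/* Unchanged entry: skip break-before-make entirely. */
		if (old_pmd == new_pmd)
			return;

		*pmd = 0;		/* break */
		st->clears++;
		st->tlb_flushes++;
	}

	*pmd = new_pmd;			/* make */
}
```

In this model, repeated faults that install the same value no longer
clear the entry or flush, so other vcpus are not forced to refault.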
Cc: stable(a)vger.kernel.org
Fixes: ad361f093c1e ("KVM: ARM: Support hugetlbfs backed huge pages")
Reviewed-by: Suzuki Poulose <suzuki.poulose(a)arm.com>
Acked-by: Christoffer Dall <christoffer.dall(a)arm.com>
Signed-off-by: Punit Agrawal <punit.agrawal(a)arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier(a)arm.com>
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 97d27cd9c654..13dfe36501aa 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1044,19 +1044,35 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
pmd = stage2_get_pmd(kvm, cache, addr);
VM_BUG_ON(!pmd);
- /*
- * Mapping in huge pages should only happen through a fault. If a
- * page is merged into a transparent huge page, the individual
- * subpages of that huge page should be unmapped through MMU
- * notifiers before we get here.
- *
- * Merging of CompoundPages is not supported; they should become
- * splitting first, unmapped, merged, and mapped back in on-demand.
- */
- VM_BUG_ON(pmd_present(*pmd) && pmd_pfn(*pmd) != pmd_pfn(*new_pmd));
-
old_pmd = *pmd;
if (pmd_present(old_pmd)) {
+ /*
+ * Multiple vcpus faulting on the same PMD entry, can
+ * lead to them sequentially updating the PMD with the
+ * same value. Following the break-before-make
+ * (pmd_clear() followed by tlb_flush()) process can
+ * hinder forward progress due to refaults generated
+ * on missing translations.
+ *
+ * Skip updating the page table if the entry is
+ * unchanged.
+ */
+ if (pmd_val(old_pmd) == pmd_val(*new_pmd))
+ return 0;
+
+ /*
+ * Mapping in huge pages should only happen through a
+ * fault. If a page is merged into a transparent huge
+ * page, the individual subpages of that huge page
+ * should be unmapped through MMU notifiers before we
+ * get here.
+ *
+ * Merging of CompoundPages is not supported; they
+ * should become splitting first, unmapped, merged,
+ * and mapped back in on-demand.
+ */
+ VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
+
pmd_clear(pmd);
kvm_tlb_flush_vmid_ipa(kvm, addr);
} else {
commit ID: f1ed3df20d2d223e0852cc4ac1f19bba869a7e3c
Please merge this patch into the stable tree (it already exists in Linus’ tree).
The initial patch submission lacked "Cc: stable(a)vger.kernel.org" by
mistake. The kernel versions that should get the patch:
4.19
4.18
4.14
From f1ed3df20d2d223e0852cc4ac1f19bba869a7e3c Mon Sep 17 00:00:00 2001
From: Michal Wnukowski <wnukowski(a)google.com>
Date: Wed, 15 Aug 2018 15:51:57 -0700
Subject: nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event
On many architectures loads may be reordered with older stores to
different locations. In the nvme driver the following two operations
could be reordered:
- Write shadow doorbell (dbbuf_db) into memory.
- Read EventIdx (dbbuf_ei) from memory.
This can result in a race condition between the driver and the VM host
processing requests (if the given virtual NVMe controller supports
shadow doorbells). If that occurs, the NVMe controller may decide to
wait for an MMIO doorbell from the guest operating system, while the guest
driver may decide not to issue an MMIO doorbell on any subsequent command.
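The ordering requirement can be sketched with C11 atomics. This is a
userspace model, not the driver code: the model_* names are
hypothetical, a sequentially consistent fence is assumed to be the
closest C11 analogue of a full barrier, and the wrap-around comparison
is written the way virtio-style event indexes are usually checked.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* True if event_idx lies in [old, new_idx), with 16-bit wrap-around. */
static bool model_need_event(uint16_t event_idx, uint16_t new_idx,
			     uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

/* Returns true when the caller should also ring the MMIO doorbell. */
static bool model_update_and_check(uint16_t value, _Atomic uint32_t *db,
				   _Atomic uint32_t *ei)
{
	uint16_t old, event_idx;

	old = (uint16_t)atomic_load_explicit(db, memory_order_relaxed);

	/* Step 1: write the shadow doorbell. */
	atomic_store_explicit(db, value, memory_order_relaxed);

	/*
	 * Full fence: without it, the load below may be satisfied
	 * before the store above becomes visible to the other side.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	/* Step 2: read the event index. */
	event_idx = (uint16_t)atomic_load_explicit(ei, memory_order_relaxed);

	return model_need_event(event_idx, value, old);
}
```

The other side must order its own accesses the opposite way (publish
the event index, then read the doorbell) for the handshake to be safe.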
This issue is purely timing-dependent, so there is no easy way to
reproduce it. Currently the easiest known approach is to run "Oracle IO
Numbers" (orion), which is shipped with Oracle DB:
orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
concat -write 40 -duration 120 -matrix row -testname nvme_test
Where nvme_test is a .lun file that contains a list of NVMe block
devices to run the test against. Limiting the number of vCPUs assigned to
a given VM instance seems to increase the chances of this bug occurring.
In a test environment with a VM that had 4 NVMe drives and 1 vCPU
assigned, the virtual NVMe controller hang could be observed within
10-20 minutes. That corresponds to about 400-500k IO operations
processed (or about 100GB of IO reads/writes).
The orion tool was used for validation and set to run in a loop for 36
hours (the equivalent of pushing 550M IO operations). No issues were
observed, which suggests that the patch fixes the issue.
Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices")
Signed-off-by: Michal Wnukowski <wnukowski(a)google.com>
Reviewed-by: Keith Busch <keith.busch(a)intel.com>
Reviewed-by: Sagi Grimberg <sagi(a)grimberg.me>
[hch: updated changelog and comment a bit]
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
---
drivers/nvme/host/pci.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1b9951d2067e..d668682f91df 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -316,6 +316,14 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
old_value = *dbbuf_db;
*dbbuf_db = value;
+ /*
+ * Ensure that the doorbell is updated before reading the event
+ * index from memory. The controller needs to provide similar
+ * ordering to ensure the event index is updated before reading
+ * the doorbell.
+ */
+ mb();
+
if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
return false;
}
--
cgit 1.2-0.3.lf.el7
From: Randy Dunlap <rdunlap(a)infradead.org>
When $DEPMOD is not found, only print a warning instead of exiting
with an error message and error status:
Warning: 'make modules_install' requires /sbin/depmod. Please install it.
This is probably in the kmod package.
Change the Error to a Warning because "not all build hosts for cross
compiling Linux are Linux systems and are able to provide a working
port of depmod, especially at the file patch /sbin/depmod."
I.e., "make modules_install" may be used to copy/install the
loadable modules files to a target directory on a build system and
then transferred to an embedded device where /sbin/depmod is run
instead of it being run on the build system.
Fixes: 934193a654c1 ("kbuild: verify that $DEPMOD is installed")
Signed-off-by: Randy Dunlap <rdunlap(a)infradead.org>
Reported-by: H. Nikolaus Schaller <hns(a)goldelico.com>
Cc: stable(a)vger.kernel.org
Cc: Lucas De Marchi <lucas.demarchi(a)profusion.mobi>
Cc: Lucas De Marchi <lucas.de.marchi(a)gmail.com>
Cc: Michal Marek <michal.lkml(a)markovi.net>
Cc: Jessica Yu <jeyu(a)kernel.org>
Cc: Chih-Wei Huang <cwhuang(a)linux.org.tw>
Cc: H. Nikolaus Schaller <hns(a)goldelico.com>
---
v2: add missing "exit 0" and update the commit message (no Error).
v3: add Fixes: and Cc: stable
v4: add Reported-by: and more explanation for the patch.
scripts/depmod.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- lnx-418.orig/scripts/depmod.sh
+++ lnx-418/scripts/depmod.sh
@@ -15,9 +15,9 @@ if ! test -r System.map ; then
fi
if [ -z $(command -v $DEPMOD) ]; then
- echo "'make modules_install' requires $DEPMOD. Please install it." >&2
+ echo "Warning: 'make modules_install' requires $DEPMOD. Please install it." >&2
echo "This is probably in the kmod package." >&2
- exit 1
+ exit 0
fi
# older versions of depmod require the version string to start with three
On 08/31/2018 10:38 AM, Kalle Valo wrote:
> Larry Finger <Larry.Finger(a)lwfinger.net> wrote:
>
>> In commit 66cffd6daab7 ("b43: fix transmit failure when VT is switched"),
>> a condition is noted where the network controller needs to be reset. Note
>> that this situation happens when running the open-source firmware
>> (http://netweb.ing.unibs.it/~openfwwf/), plus a number of other special
>> conditions.
>>
>> For a different card model, it is reported that this change breaks
>> operation running the proprietary firmware
>> (https://marc.info/?l=linux-wireless&m=153504546924558&w=2). Rather
>> than reverting the previous patch, the code is tweaked to avoid the
>> reset unless the open-source firmware is being used.
>>
>> Fixes: 66cffd6daab7 ("b43: fix transmit failure when VT is switched")
>> Cc: Stable <stable(a)vger.kernel.org> # 4.18+
>> Cc: Taketo Kabe <kabe(a)sra-tohoku.co.jp>
>> Reported-and-tested-by: D. Prabhu <d.praabhu(a)gmail.com>
>> Signed-off-by: Larry Finger <Larry.Finger(a)lwfinger.net>
>
> I'll change the title to something more descriptive:
>
> b43: fix DMA error related regression with proprietary firmware
>
> Does that make sense?
Yes, that is fine.
Larry