Hi,
As reported by Thorsten Knabe <linux(a)thorsten-knabe.de>, commit
4f4fd7c5798b ("Don't jump to compute_result state from check_result state")
was backported to v3.16+. However, this fix was wrong.
Please backport the following two commits to fix this issue:
commit a25d8c327bb4 ("Revert "Don't jump to compute_result state from check_result state"")
commit b2176a1dfb51 ("md/raid: raid5 preserve the writeback action after the parity check")
Thanks,
Song
Ensure that we flush any cache dirt out to main memory before the user
changes the cache-level, as they may elect to bypass the cache (even after
declaring their access cache-coherent) via use of unprivileged MOCS.
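Schematically, the hazard being closed looks like this (a sketch of the
sequence described above, not the exact driver flow):

    /*
     * 1. The object is written through a cacheable mapping,
     *    leaving dirty cachelines (obj->cache_dirty).
     * 2. set_cache_level() switches the coherency bookkeeping.
     * 3. Userspace reads the object via an uncached (unprivileged
     *    MOCS) access and sees stale data from main memory.
     *
     * Hence: clflush any non-coherent dirt before step 2.
     */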
Signed-off-by: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
drivers/gpu/drm/i915/gem/i915_gem_domain.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_domain.c b/drivers/gpu/drm/i915/gem/i915_gem_domain.c
index 2e3ce2a69653..5d41e769a428 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_domain.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_domain.c
@@ -277,6 +277,11 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
list_for_each_entry(vma, &obj->vma.list, obj_link)
vma->node.color = cache_level;
+
+ /* Flush any previous cache dirt in case of cache bypass */
+ if (obj->cache_dirty & ~obj->cache_coherent)
+ i915_gem_clflush_object(obj, I915_CLFLUSH_SYNC);
+
i915_gem_object_set_cache_coherency(obj, cache_level);
obj->cache_dirty = true; /* Always invalidate stale cachelines */
--
2.22.0
Commit de53fd7aedb1 ("sched/fair: Fix low cpu usage with high
throttling by removing expiration of cpu-local slices") fixes a major
performance issue for containerized clouds such as Kubernetes.
Commit de53fd7aedb1 fixes commit 512ac999d275 ("sched/fair: Fix
bandwidth timer clock drift condition").
This should be applied to all stable kernels that applied commit
512ac999d275, and should probably be applied to all others as well.
The issues introduced by these patches can be read about in the
Kubernetes GitHub issue below.
https://github.com/kubernetes/kubernetes/issues/67577
It may also be prudent to apply the not-yet-accepted patch that fixes
some compiler warnings introduced by these changes, discussed here:
https://lkml.org/lkml/2019/9/18/925
Thank you,
Dave Chiluk
The x86 version of get_user_pages_fast() relies on disabled interrupts to
synchronize gup_pte_range(), between the gup_get_pte(ptep) read and the
get_page() pin, against a parallel munmap. The munmap side nulls the pte, then
flushes TLBs, then releases the page. As the TLB flush is done synchronously
via IPI, disabling interrupts blocks the page release, and get_page(), which
assumes an existing reference on the page, is thus safe.
However, when the TLB flush is done by a hypercall, e.g. in a Xen PV guest,
disabled interrupts do not block it, and get_page() can succeed on a page
that was already freed or even reused.
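Schematically (a sketch of the race; exact call names vary by kernel
version):

    CPU 0 (gup_fast)                CPU 1 (munmap)
    ----------------                --------------
    local_irq_disable();
    pte = gup_get_pte(ptep);
                                    ptep_get_and_clear(ptep);
                                    flush_tlb_others();  /* hypercall,
                                                            returns without
                                                            waiting for IPI */
                                    put_page(page);      /* page freed */
    get_page(page);                 /* pin taken on a freed page */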
We have recently seen this happen with our 4.4 and 4.12 based kernels, with
userspace (java) that exits a thread, where mm_release() performs a futex_wake()
on tsk->clear_child_tid, and another thread in parallel unmaps the page where
tsk->clear_child_tid points to. The spurious get_page() succeeds, but futex code
immediately releases the page again, while it's already on a freelist. Symptoms
include a bad page state warning, general protection faults accessing a poisoned
list prev/next pointer in the freelist, or free page pcplists of two cpus joined
together in a single list. Oscar has also reproduced this scenario, with a
patch inserting delays before the get_page() to make the race window larger.
Fix this by removing the dependency on TLB flush interrupts the same way as the
generic get_user_pages_fast() code by using page_cache_add_speculative() and
revalidating the PTE contents after pinning the page. Mainline is safe since
4.13 where the x86 gup code was removed in favor of the common code. Accessing
the page table itself safely also relies on disabled interrupts and TLB flush
IPIs that don't happen with hypercalls, which was acknowledged in commit
9e52fc2b50de ("x86/mm: Enable RCU based page table freeing
(CONFIG_HAVE_RCU_TABLE_FREE=y)"). That commit with its follow-ups should also be
backported for full safety, although our reproducer didn't hit a problem
without that backport.
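In short, the hunks below replace the bare get_page() with a speculative
reference plus revalidation, roughly:

    pte_t pte = gup_get_pte(ptep);                  /* snapshot the PTE */
    head = try_get_compound_head(pte_page(pte), 1); /* speculative ref */
    if (!head)
            return 0;                               /* fall back to slow path */
    if (unlikely(pte_val(pte) != pte_val(*ptep))) {
            put_page(head);                         /* raced with munmap */
            return 0;
    }
    /* the ref is valid: the PTE still points at the pinned page */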
Reproduced-by: Oscar Salvador <osalvador(a)suse.de>
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Vitaly Kuznetsov <vkuznets(a)redhat.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
---
Hi, I'm sending this stable-only patch for consideration because it's probably
unrealistic to backport the 4.13 switch to generic GUP. I can look at 4.4 and
3.16 if accepted. The RCU page table freeing could be also considered.
Note the patch also includes page refcount protection. I found out that
8fde12ca79af ("mm: prevent get_user_pages() from overflowing page refcount")
backport to 4.9 missed the arch-specific gup implementations:
https://lore.kernel.org/lkml/6650323f-dbc9-f069-000b-f6b0f941a065@suse.cz/
arch/x86/mm/gup.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 1680768d392c..d7db45bdfb3b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -97,6 +97,20 @@ static inline int pte_allows_gup(unsigned long pteval, int write)
return 1;
}
+/*
+ * Return the compound head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+ struct page *head = compound_head(page);
+ if (WARN_ON_ONCE(page_ref_count(head) < 0))
+ return NULL;
+ if (unlikely(!page_cache_add_speculative(head, refs)))
+ return NULL;
+ return head;
+}
+
/*
* The performance critical leaf functions are made noinline otherwise gcc
* inlines everything into a single function which results in too much
@@ -112,7 +126,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
ptep = pte_offset_map(&pmd, addr);
do {
pte_t pte = gup_get_pte(ptep);
- struct page *page;
+ struct page *head, *page;
/* Similar to the PMD case, NUMA hinting must take slow path */
if (pte_protnone(pte)) {
@@ -138,7 +152,21 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
}
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
- get_page(page);
+
+ head = try_get_compound_head(page, 1);
+ if (!head) {
+ put_dev_pagemap(pgmap);
+ pte_unmap(ptep);
+ return 0;
+ }
+
+ if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+ put_page(head);
+ put_dev_pagemap(pgmap);
+ pte_unmap(ptep);
+ return 0;
+ }
+
put_dev_pagemap(pgmap);
SetPageReferenced(page);
pages[*nr] = page;
--
2.22.0
Following a try_to_unmap() we may want to remove the userptr and so call
put_pages(). However, try_to_unmap() acquires the page lock and so we
must avoid recursively locking the pages ourselves -- which means that
we cannot safely acquire the lock around set_page_dirty(). Since we
can't be sure of the lock, we have to either risk skipping the dirtying
of the page, or risk calling set_page_dirty() without a lock and so risk
fs corruption.
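The recursion being avoided, schematically (a sketch; the notifier path
is abbreviated):

    migrate_page()                 /* already holds the page lock */
      try_to_unmap()
        mmu-notifier invalidate
          i915_gem_userptr_put_pages()
            lock_page(page)        /* same page: deadlock */

With trylock_page() the inner lock attempt simply fails and we skip the
dirtying instead.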
Reported-by: Lionel Landwerlin <lionel.g.landwerlin(a)intel.com>
Fixes: cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around set_page_dirty()")
Signed-off-by: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Lionel Landwerlin <lionel.g.landwerlin(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Cc: stable(a)vger.kernel.org
---
drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index b9d2bb15e4a6..1ad2047a6dbd 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -672,7 +672,7 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,
obj->mm.dirty = false;
for_each_sgt_page(page, sgt_iter, pages) {
- if (obj->mm.dirty)
+ if (obj->mm.dirty && trylock_page(page)) {
/*
* As this may not be anonymous memory (e.g. shmem)
* but exist on a real mapping, we have to lock
@@ -680,8 +680,20 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,
* the page reference is not sufficient to
* prevent the inode from being truncated.
* Play safe and take the lock.
+ *
+ * However...!
+ *
+ * The mmu-notifier can be invalidated for a
+ * migrate_page, that is already holding the lock
+ * on the page. Such a try_to_unmap() will result
+ * in us calling put_pages() and so recursively trying
+ * to lock the page. We avoid that deadlock with
+ * a trylock_page() and in exchange we risk missing
+ * some page dirtying.
*/
- set_page_dirty_lock(page);
+ set_page_dirty(page);
+ unlock_page(page);
+ }
mark_page_accessed(page);
put_page(page);
--
2.22.0
Only the kernel random pool should be used for generating random numbers.
The TPM contributes to that pool among the other sources of entropy. Here
it is admittedly not absolutely critical, because the TPM is what is
trusted anyway, but in order to remove tpm_get_random() we first need to
remove all of its call sites.
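The replacement is mechanical; get_random_bytes() cannot fail, so the
error handling disappears as well:

    /* before: TPM-backed, may fail */
    ret = tpm_get_random(NULL, nonceodd, TPM_NONCE_SIZE);
    if (ret < 0)
            return ret;

    /* after: kernel RNG (which the TPM already feeds), void return */
    get_random_bytes(nonceodd, TPM_NONCE_SIZE);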
Cc: stable(a)vger.kernel.org
Fixes: 0c36264aa1d5 ("KEYS: asym_tpm: Add loadkey2 and flushspecific [ver #2]")
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
---
crypto/asymmetric_keys/asym_tpm.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/crypto/asymmetric_keys/asym_tpm.c b/crypto/asymmetric_keys/asym_tpm.c
index 76d2ce3a1b5b..c14b8d186e93 100644
--- a/crypto/asymmetric_keys/asym_tpm.c
+++ b/crypto/asymmetric_keys/asym_tpm.c
@@ -6,6 +6,7 @@
#include <linux/kernel.h>
#include <linux/seq_file.h>
#include <linux/scatterlist.h>
+#include <linux/random.h>
#include <linux/tpm.h>
#include <linux/tpm_command.h>
#include <crypto/akcipher.h>
@@ -54,11 +55,7 @@ static int tpm_loadkey2(struct tpm_buf *tb,
}
/* generate odd nonce */
- ret = tpm_get_random(NULL, nonceodd, TPM_NONCE_SIZE);
- if (ret < 0) {
- pr_info("tpm_get_random failed (%d)\n", ret);
- return ret;
- }
+ get_random_bytes(nonceodd, TPM_NONCE_SIZE);
/* calculate authorization HMAC value */
ret = TSS_authhmac(authdata, keyauth, SHA1_DIGEST_SIZE, enonce,
--
2.20.1
This brings some fixes and allows starting more VMs with an in-kernel
XIVE or XICS-on-XIVE device.
Changes since v1 (https://patchwork.ozlabs.org/cover/1166099/):
- drop a useless patch
- add a patch to show VP ids in debugfs
- update some changelogs
- fix buggy check in patch 5
- Cc: stable
--
Greg
---
Greg Kurz (6):
KVM: PPC: Book3S HV: XIVE: Set kvm->arch.xive when VPs are allocated
KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
KVM: PPC: Book3S HV: XIVE: Show VP id in debugfs
KVM: PPC: Book3S HV: XIVE: Compute the VP id in a common helper
KVM: PPC: Book3S HV: XIVE: Make VP block size configurable
KVM: PPC: Book3S HV: XIVE: Allow userspace to set the # of VPs
Documentation/virt/kvm/devices/xics.txt | 14 +++
Documentation/virt/kvm/devices/xive.txt | 8 ++
arch/powerpc/include/uapi/asm/kvm.h | 3 +
arch/powerpc/kvm/book3s_xive.c | 142 ++++++++++++++++++++++++-------
arch/powerpc/kvm/book3s_xive.h | 17 ++++
arch/powerpc/kvm/book3s_xive_native.c | 40 +++------
6 files changed, 167 insertions(+), 57 deletions(-)
We have a test case as follows:
mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
mdadm -S /dev/md1
mdadm -A /dev/md1 /dev/sd[b-c] --run --force
mdadm --zero /dev/sda
mdadm /dev/md1 -a /dev/sda
echo offline > /sys/block/sdc/device/state
echo offline > /sys/block/sdb/device/state
sleep 5
mdadm -S /dev/md1
echo running > /sys/block/sdb/device/state
echo running > /sys/block/sdc/device/state
mdadm -A /dev/md1 /dev/sd[a-c] --run --force
When /dev/sda is re-added to the array, it starts recovery. After the
other two disks in md1 are set offline, the recovery is interrupted and
superblock updates can no longer be written to the offline disks, while
the spare disk (/dev/sda) can still update its superblock.
After stopping the array and assembling it again, the array fails to
run, with the following kernel messages:
[ 172.986064] md: kicking non-fresh sdb from array!
[ 173.004210] md: kicking non-fresh sdc from array!
[ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[ 173.022406] md1: failed to create bitmap (-5)
[ 173.023466] md: md1 stopped.
Since both sdb and sdc have a smaller 'sb->events' value than sda, they
are kicked from the array. However, the only remaining disk, sda, was in
the 'spare' state before the stop, so it cannot be added to the
conf->mirrors[] array. In the end, the array fails to assemble and run.
For example, if sda reached events=100 as a spare while the offlined sdb
and sdc stopped at events=90, analyze_sbs() treats sda as the freshest
disk and kicks the two disks that actually hold valid data.
In fact, we can use the older disk sdb or sdc to assemble the array.
That means we should not choose a 'spare' disk as the fresh disk in
analyze_sbs().
To fix the problem, do not compare superblock events when the disk is a
spare, just as validate_super() does.
Signed-off-by: Yufen Yu <yuyufen(a)huawei.com>
v1->v2:
fix wrong return value in super_90_load
---
drivers/md/md.c | 44 ++++++++++++++++++++++++--------------------
1 file changed, 24 insertions(+), 20 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1be7abeb24fd..0a91c20071b3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1097,7 +1097,7 @@ static int super_90_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor
{
char b[BDEVNAME_SIZE], b2[BDEVNAME_SIZE];
mdp_super_t *sb;
- int ret;
+ int ret = 0;
/*
* Calculate the position of the superblock (512byte sectors),
@@ -1111,14 +1111,12 @@ static int super_90_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor
if (ret)
return ret;
- ret = -EINVAL;
-
bdevname(rdev->bdev, b);
sb = page_address(rdev->sb_page);
if (sb->md_magic != MD_SB_MAGIC) {
pr_warn("md: invalid raid superblock magic on %s\n", b);
- goto abort;
+ return -EINVAL;
}
if (sb->major_version != 0 ||
@@ -1126,15 +1124,15 @@ static int super_90_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor
sb->minor_version > 91) {
pr_warn("Bad version number %d.%d on %s\n",
sb->major_version, sb->minor_version, b);
- goto abort;
+ return -EINVAL;
}
if (sb->raid_disks <= 0)
- goto abort;
+ return -EINVAL;
if (md_csum_fold(calc_sb_csum(sb)) != md_csum_fold(sb->sb_csum)) {
pr_warn("md: invalid superblock checksum on %s\n", b);
- goto abort;
+ return -EINVAL;
}
rdev->preferred_minor = sb->md_minor;
@@ -1156,19 +1154,22 @@ static int super_90_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor
if (!md_uuid_equal(refsb, sb)) {
pr_warn("md: %s has different UUID to %s\n",
b, bdevname(refdev->bdev,b2));
- goto abort;
+ return -EINVAL;
}
if (!md_sb_equal(refsb, sb)) {
pr_warn("md: %s has same UUID but different superblock to %s\n",
b, bdevname(refdev->bdev, b2));
- goto abort;
+ return -EINVAL;
}
ev1 = md_event(sb);
ev2 = md_event(refsb);
- if (ev1 > ev2)
- ret = 1;
- else
- ret = 0;
+
+ /* Insist on good event counter while assembling, except
+ * for spares (which don't need an event count) */
+ if (sb->disks[rdev->desc_nr].state & (
+ (1<<MD_DISK_SYNC) | (1 << MD_DISK_ACTIVE)))
+ if (ev1 > ev2)
+ ret = 1;
}
rdev->sectors = rdev->sb_start;
/* Limit to 4TB as metadata cannot record more than that.
@@ -1180,9 +1181,8 @@ static int super_90_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor
if (rdev->sectors < ((sector_t)sb->size) * 2 && sb->level >= 1)
/* "this cannot possibly happen" ... */
- ret = -EINVAL;
+ return -EINVAL;
- abort:
return ret;
}
@@ -1520,7 +1520,7 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb)
static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_version)
{
struct mdp_superblock_1 *sb;
- int ret;
+ int ret = 0;
sector_t sb_start;
sector_t sectors;
char b[BDEVNAME_SIZE], b2[BDEVNAME_SIZE];
@@ -1676,10 +1676,14 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
ev1 = le64_to_cpu(sb->events);
ev2 = le64_to_cpu(refsb->events);
- if (ev1 > ev2)
- ret = 1;
- else
- ret = 0;
+ /* Insist on good event counter while assembling, except for
+ * spares (which don't need an event count) */
+ if (rdev->desc_nr >= 0 &&
+ rdev->desc_nr < le32_to_cpu(sb->max_dev) &&
+ (le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX ||
+ le16_to_cpu(sb->dev_roles[rdev->desc_nr]) == MD_DISK_ROLE_JOURNAL))
+ if (ev1 > ev2)
+ ret = 1;
}
if (minor_version) {
sectors = (i_size_read(rdev->bdev->bd_inode) >> 9);
--
2.17.2
From Tegra186 onwards, an OUTSTANDING_REQUESTS field is added to the channel
configuration register (bits 7:4), which defines the maximum number of reads
from the source and writes to the destination that may be outstanding at
any given point of time. This field must be programmed with a value
between 1 and 8. A value of 0 will prevent any transfers from happening.
Thus, add a 'has_outstanding_reqs' bool member to the chip data structure;
it is set to false for Tegra210, since the field is not applicable there.
For Tegra186 it is set to true and the channel configuration is updated
with the maximum number of outstanding requests.
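For reference, the maximum of 8 outstanding requests lands in bits 7:4 of
the channel CONFIG register via the new macro:

    /* TEGRA186_ADMA_CH_CONFIG_OUTSTANDING_REQS(8) == 8 << 4 == 0x80 */
    ch_regs->config |= TEGRA186_ADMA_CH_CONFIG_OUTSTANDING_REQS(8);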
Fixes: 433de642a76c ("dmaengine: tegra210-adma: add support for Tegra186/Tegra194")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sameer Pujar <spujar(a)nvidia.com>
---
drivers/dma/tegra210-adma.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/dma/tegra210-adma.c b/drivers/dma/tegra210-adma.c
index 5f8adf5..6e12685 100644
--- a/drivers/dma/tegra210-adma.c
+++ b/drivers/dma/tegra210-adma.c
@@ -40,6 +40,7 @@
#define ADMA_CH_CONFIG_MAX_BURST_SIZE 16
#define ADMA_CH_CONFIG_WEIGHT_FOR_WRR(val) ((val) & 0xf)
#define ADMA_CH_CONFIG_MAX_BUFS 8
+#define TEGRA186_ADMA_CH_CONFIG_OUTSTANDING_REQS(reqs) (reqs << 4)
#define ADMA_CH_FIFO_CTRL 0x2c
#define TEGRA210_ADMA_CH_FIFO_CTRL_TXSIZE(val) (((val) & 0xf) << 8)
@@ -77,6 +78,7 @@ struct tegra_adma;
* @ch_req_tx_shift: Register offset for AHUB transmit channel select.
* @ch_req_rx_shift: Register offset for AHUB receive channel select.
* @ch_base_offset: Register offset of DMA channel registers.
+ * @has_outstanding_reqs: If DMA channel can have outstanding requests.
* @ch_fifo_ctrl: Default value for channel FIFO CTRL register.
* @ch_req_mask: Mask for Tx or Rx channel select.
* @ch_req_max: Maximum number of Tx or Rx channels available.
@@ -95,6 +97,7 @@ struct tegra_adma_chip_data {
unsigned int ch_req_max;
unsigned int ch_reg_size;
unsigned int nr_channels;
+ bool has_outstanding_reqs;
};
/*
@@ -594,6 +597,8 @@ static int tegra_adma_set_xfer_params(struct tegra_adma_chan *tdc,
ADMA_CH_CTRL_FLOWCTRL_EN;
ch_regs->config |= cdata->adma_get_burst_config(burst_size);
ch_regs->config |= ADMA_CH_CONFIG_WEIGHT_FOR_WRR(1);
+ if (cdata->has_outstanding_reqs)
+ ch_regs->config |= TEGRA186_ADMA_CH_CONFIG_OUTSTANDING_REQS(8);
ch_regs->fifo_ctrl = cdata->ch_fifo_ctrl;
ch_regs->tc = desc->period_len & ADMA_CH_TC_COUNT_MASK;
@@ -778,6 +783,7 @@ static const struct tegra_adma_chip_data tegra210_chip_data = {
.ch_req_tx_shift = 28,
.ch_req_rx_shift = 24,
.ch_base_offset = 0,
+ .has_outstanding_reqs = false,
.ch_fifo_ctrl = TEGRA210_FIFO_CTRL_DEFAULT,
.ch_req_mask = 0xf,
.ch_req_max = 10,
@@ -792,6 +798,7 @@ static const struct tegra_adma_chip_data tegra186_chip_data = {
.ch_req_tx_shift = 27,
.ch_req_rx_shift = 22,
.ch_base_offset = 0x10000,
+ .has_outstanding_reqs = true,
.ch_fifo_ctrl = TEGRA186_FIFO_CTRL_DEFAULT,
.ch_req_mask = 0x1f,
.ch_req_max = 20,
--
2.7.4