The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e5a95c83ed1492c0f442b448b20c90c8faaf702b Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris.p.wilson(a)intel.com>
Date: Wed, 27 Jul 2022 14:29:54 +0200
Subject: [PATCH] drm/i915/gt: Skip TLB invalidations once wedged
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Skip all further TLB invalidations once the device is wedged and
had been reset, as, on such cases, it can no longer process instructions
on the GPU and the user no longer has access to the TLB's in each engine.
So, an attempt to do a TLB cache invalidation will produce a timeout.
That helps to reduce the performance regression introduced by TLB
invalidate logic.
Cc: stable(a)vger.kernel.org
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson <chris.p.wilson(a)intel.com>
Cc: Fei Yang <fei.yang(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/5aa86564b9ec5fe7fe605c1dd7de7…
(cherry picked from commit be0366f168033374a93e4c43fdaa1a90ab905184)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index 1d84418e8676..5c55a90672f4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -934,6 +934,9 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
if (I915_SELFTEST_ONLY(gt->awake == -ENODEV))
return;
+ if (intel_gt_is_wedged(gt))
+ return;
+
if (GRAPHICS_VER(i915) == 12) {
regs = gen12_regs;
num = ARRAY_SIZE(gen12_regs);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e5a95c83ed1492c0f442b448b20c90c8faaf702b Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris.p.wilson(a)intel.com>
Date: Wed, 27 Jul 2022 14:29:54 +0200
Subject: [PATCH] drm/i915/gt: Skip TLB invalidations once wedged
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Skip all further TLB invalidations once the device is wedged and
had been reset, as, on such cases, it can no longer process instructions
on the GPU and the user no longer has access to the TLB's in each engine.
So, an attempt to do a TLB cache invalidation will produce a timeout.
That helps to reduce the performance regression introduced by TLB
invalidate logic.
Cc: stable(a)vger.kernel.org
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson <chris.p.wilson(a)intel.com>
Cc: Fei Yang <fei.yang(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/5aa86564b9ec5fe7fe605c1dd7de7…
(cherry picked from commit be0366f168033374a93e4c43fdaa1a90ab905184)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index 1d84418e8676..5c55a90672f4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -934,6 +934,9 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
if (I915_SELFTEST_ONLY(gt->awake == -ENODEV))
return;
+ if (intel_gt_is_wedged(gt))
+ return;
+
if (GRAPHICS_VER(i915) == 12) {
regs = gen12_regs;
num = ARRAY_SIZE(gen12_regs);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e5a95c83ed1492c0f442b448b20c90c8faaf702b Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris.p.wilson(a)intel.com>
Date: Wed, 27 Jul 2022 14:29:54 +0200
Subject: [PATCH] drm/i915/gt: Skip TLB invalidations once wedged
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Skip all further TLB invalidations once the device is wedged and
had been reset, as, on such cases, it can no longer process instructions
on the GPU and the user no longer has access to the TLB's in each engine.
So, an attempt to do a TLB cache invalidation will produce a timeout.
That helps to reduce the performance regression introduced by TLB
invalidate logic.
Cc: stable(a)vger.kernel.org
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson <chris.p.wilson(a)intel.com>
Cc: Fei Yang <fei.yang(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/5aa86564b9ec5fe7fe605c1dd7de7…
(cherry picked from commit be0366f168033374a93e4c43fdaa1a90ab905184)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index 1d84418e8676..5c55a90672f4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -934,6 +934,9 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
if (I915_SELFTEST_ONLY(gt->awake == -ENODEV))
return;
+ if (intel_gt_is_wedged(gt))
+ return;
+
if (GRAPHICS_VER(i915) == 12) {
regs = gen12_regs;
num = ARRAY_SIZE(gen12_regs);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e5a95c83ed1492c0f442b448b20c90c8faaf702b Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris.p.wilson(a)intel.com>
Date: Wed, 27 Jul 2022 14:29:54 +0200
Subject: [PATCH] drm/i915/gt: Skip TLB invalidations once wedged
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Skip all further TLB invalidations once the device is wedged and
had been reset, as, on such cases, it can no longer process instructions
on the GPU and the user no longer has access to the TLB's in each engine.
So, an attempt to do a TLB cache invalidation will produce a timeout.
That helps to reduce the performance regression introduced by TLB
invalidate logic.
Cc: stable(a)vger.kernel.org
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson <chris.p.wilson(a)intel.com>
Cc: Fei Yang <fei.yang(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/5aa86564b9ec5fe7fe605c1dd7de7…
(cherry picked from commit be0366f168033374a93e4c43fdaa1a90ab905184)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index 1d84418e8676..5c55a90672f4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -934,6 +934,9 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
if (I915_SELFTEST_ONLY(gt->awake == -ENODEV))
return;
+ if (intel_gt_is_wedged(gt))
+ return;
+
if (GRAPHICS_VER(i915) == 12) {
regs = gen12_regs;
num = ARRAY_SIZE(gen12_regs);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e5a95c83ed1492c0f442b448b20c90c8faaf702b Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris.p.wilson(a)intel.com>
Date: Wed, 27 Jul 2022 14:29:54 +0200
Subject: [PATCH] drm/i915/gt: Skip TLB invalidations once wedged
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Skip all further TLB invalidations once the device is wedged and
had been reset, as, on such cases, it can no longer process instructions
on the GPU and the user no longer has access to the TLB's in each engine.
So, an attempt to do a TLB cache invalidation will produce a timeout.
That helps to reduce the performance regression introduced by TLB
invalidate logic.
Cc: stable(a)vger.kernel.org
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson <chris.p.wilson(a)intel.com>
Cc: Fei Yang <fei.yang(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/5aa86564b9ec5fe7fe605c1dd7de7…
(cherry picked from commit be0366f168033374a93e4c43fdaa1a90ab905184)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index 1d84418e8676..5c55a90672f4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -934,6 +934,9 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
if (I915_SELFTEST_ONLY(gt->awake == -ENODEV))
return;
+ if (intel_gt_is_wedged(gt))
+ return;
+
if (GRAPHICS_VER(i915) == 12) {
regs = gen12_regs;
num = ARRAY_SIZE(gen12_regs);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 180abeb2c5032704787151135b6a38c6b71295a6 Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris.p.wilson(a)intel.com>
Date: Wed, 27 Jul 2022 14:29:53 +0200
Subject: [PATCH] drm/i915/gt: Invalidate TLB of the OA unit at TLB
invalidations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Ensure that the TLB of the OA unit is also invalidated
on gen12 HW, as just invalidating the TLB of an engine is not
enough.
Cc: stable(a)vger.kernel.org
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson <chris.p.wilson(a)intel.com>
Cc: Fei Yang <fei.yang(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/59724d9f5cf1e93b1620d01b8332a…
(cherry picked from commit dfc83de118ff7930acc9a4c8dfdba7c153aa44d6)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index c4d43da84d8e..1d84418e8676 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -11,6 +11,7 @@
#include "pxp/intel_pxp.h"
#include "i915_drv.h"
+#include "i915_perf_oa_regs.h"
#include "intel_context.h"
#include "intel_engine_pm.h"
#include "intel_engine_regs.h"
@@ -969,6 +970,15 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
awake |= engine->mask;
}
+ /* Wa_2207587034:tgl,dg1,rkl,adl-s,adl-p */
+ if (awake &&
+ (IS_TIGERLAKE(i915) ||
+ IS_DG1(i915) ||
+ IS_ROCKETLAKE(i915) ||
+ IS_ALDERLAKE_S(i915) ||
+ IS_ALDERLAKE_P(i915)))
+ intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1);
+
spin_unlock_irq(&uncore->lock);
for_each_engine_masked(engine, gt, awake, tmp) {
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 180abeb2c5032704787151135b6a38c6b71295a6 Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris.p.wilson(a)intel.com>
Date: Wed, 27 Jul 2022 14:29:53 +0200
Subject: [PATCH] drm/i915/gt: Invalidate TLB of the OA unit at TLB
invalidations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Ensure that the TLB of the OA unit is also invalidated
on gen12 HW, as just invalidating the TLB of an engine is not
enough.
Cc: stable(a)vger.kernel.org
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson <chris.p.wilson(a)intel.com>
Cc: Fei Yang <fei.yang(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/59724d9f5cf1e93b1620d01b8332a…
(cherry picked from commit dfc83de118ff7930acc9a4c8dfdba7c153aa44d6)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index c4d43da84d8e..1d84418e8676 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -11,6 +11,7 @@
#include "pxp/intel_pxp.h"
#include "i915_drv.h"
+#include "i915_perf_oa_regs.h"
#include "intel_context.h"
#include "intel_engine_pm.h"
#include "intel_engine_regs.h"
@@ -969,6 +970,15 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
awake |= engine->mask;
}
+ /* Wa_2207587034:tgl,dg1,rkl,adl-s,adl-p */
+ if (awake &&
+ (IS_TIGERLAKE(i915) ||
+ IS_DG1(i915) ||
+ IS_ROCKETLAKE(i915) ||
+ IS_ALDERLAKE_S(i915) ||
+ IS_ALDERLAKE_P(i915)))
+ intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1);
+
spin_unlock_irq(&uncore->lock);
for_each_engine_masked(engine, gt, awake, tmp) {
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 180abeb2c5032704787151135b6a38c6b71295a6 Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris.p.wilson(a)intel.com>
Date: Wed, 27 Jul 2022 14:29:53 +0200
Subject: [PATCH] drm/i915/gt: Invalidate TLB of the OA unit at TLB
invalidations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Ensure that the TLB of the OA unit is also invalidated
on gen12 HW, as just invalidating the TLB of an engine is not
enough.
Cc: stable(a)vger.kernel.org
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson <chris.p.wilson(a)intel.com>
Cc: Fei Yang <fei.yang(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/59724d9f5cf1e93b1620d01b8332a…
(cherry picked from commit dfc83de118ff7930acc9a4c8dfdba7c153aa44d6)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index c4d43da84d8e..1d84418e8676 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -11,6 +11,7 @@
#include "pxp/intel_pxp.h"
#include "i915_drv.h"
+#include "i915_perf_oa_regs.h"
#include "intel_context.h"
#include "intel_engine_pm.h"
#include "intel_engine_regs.h"
@@ -969,6 +970,15 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
awake |= engine->mask;
}
+ /* Wa_2207587034:tgl,dg1,rkl,adl-s,adl-p */
+ if (awake &&
+ (IS_TIGERLAKE(i915) ||
+ IS_DG1(i915) ||
+ IS_ROCKETLAKE(i915) ||
+ IS_ALDERLAKE_S(i915) ||
+ IS_ALDERLAKE_P(i915)))
+ intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1);
+
spin_unlock_irq(&uncore->lock);
for_each_engine_masked(engine, gt, awake, tmp) {
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 180abeb2c5032704787151135b6a38c6b71295a6 Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris.p.wilson(a)intel.com>
Date: Wed, 27 Jul 2022 14:29:53 +0200
Subject: [PATCH] drm/i915/gt: Invalidate TLB of the OA unit at TLB
invalidations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Ensure that the TLB of the OA unit is also invalidated
on gen12 HW, as just invalidating the TLB of an engine is not
enough.
Cc: stable(a)vger.kernel.org
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson <chris.p.wilson(a)intel.com>
Cc: Fei Yang <fei.yang(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/59724d9f5cf1e93b1620d01b8332a…
(cherry picked from commit dfc83de118ff7930acc9a4c8dfdba7c153aa44d6)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index c4d43da84d8e..1d84418e8676 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -11,6 +11,7 @@
#include "pxp/intel_pxp.h"
#include "i915_drv.h"
+#include "i915_perf_oa_regs.h"
#include "intel_context.h"
#include "intel_engine_pm.h"
#include "intel_engine_regs.h"
@@ -969,6 +970,15 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
awake |= engine->mask;
}
+ /* Wa_2207587034:tgl,dg1,rkl,adl-s,adl-p */
+ if (awake &&
+ (IS_TIGERLAKE(i915) ||
+ IS_DG1(i915) ||
+ IS_ROCKETLAKE(i915) ||
+ IS_ALDERLAKE_S(i915) ||
+ IS_ALDERLAKE_P(i915)))
+ intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1);
+
spin_unlock_irq(&uncore->lock);
for_each_engine_masked(engine, gt, awake, tmp) {
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cc5d1692600613e72f32af60e27330fe0c79f4fe Mon Sep 17 00:00:00 2001
From: Wenbin Mei <wenbin.mei(a)mediatek.com>
Date: Thu, 28 Jul 2022 16:00:48 +0800
Subject: [PATCH] mmc: mtk-sd: Clear interrupts when cqe off/disable
Currently we don't clear MSDC interrupts when cqe off/disable, which led
to the data complete interrupt will be reserved for the next command.
If the next command with data transfer after cqe off/disable, we process
the CMD ready interrupt and trigger DMA start for data, but the data
complete interrupt is already exists, then SW assume that the data transfer
is complete, SW will trigger DMA stop, but the data may not be transmitted
yet or is transmitting, so we may encounter the following error:
mtk-msdc 11230000.mmc: CMD bus busy detected.
Signed-off-by: Wenbin Mei <wenbin.mei(a)mediatek.com>
Fixes: 88bd652b3c74 ("mmc: mediatek: command queue support")
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220728080048.21336-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/mtk-sd.c b/drivers/mmc/host/mtk-sd.c
index 4ff73d1883de..69d78604d1fc 100644
--- a/drivers/mmc/host/mtk-sd.c
+++ b/drivers/mmc/host/mtk-sd.c
@@ -2446,6 +2446,9 @@ static void msdc_cqe_disable(struct mmc_host *mmc, bool recovery)
/* disable busy check */
sdr_clr_bits(host->base + MSDC_PATCH_BIT1, MSDC_PB1_BUSY_CHECK_SEL);
+ val = readl(host->base + MSDC_INT);
+ writel(val, host->base + MSDC_INT);
+
if (recovery) {
sdr_set_field(host->base + MSDC_DMA_CTRL,
MSDC_DMA_CTRL_STOP, 1);
@@ -2932,11 +2935,14 @@ static int __maybe_unused msdc_suspend(struct device *dev)
struct mmc_host *mmc = dev_get_drvdata(dev);
struct msdc_host *host = mmc_priv(mmc);
int ret;
+ u32 val;
if (mmc->caps2 & MMC_CAP2_CQE) {
ret = cqhci_suspend(mmc);
if (ret)
return ret;
+ val = readl(host->base + MSDC_INT);
+ writel(val, host->base + MSDC_INT);
}
/*
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cc5d1692600613e72f32af60e27330fe0c79f4fe Mon Sep 17 00:00:00 2001
From: Wenbin Mei <wenbin.mei(a)mediatek.com>
Date: Thu, 28 Jul 2022 16:00:48 +0800
Subject: [PATCH] mmc: mtk-sd: Clear interrupts when cqe off/disable
Currently we don't clear MSDC interrupts when cqe off/disable, which led
to the data complete interrupt will be reserved for the next command.
If the next command with data transfer after cqe off/disable, we process
the CMD ready interrupt and trigger DMA start for data, but the data
complete interrupt is already exists, then SW assume that the data transfer
is complete, SW will trigger DMA stop, but the data may not be transmitted
yet or is transmitting, so we may encounter the following error:
mtk-msdc 11230000.mmc: CMD bus busy detected.
Signed-off-by: Wenbin Mei <wenbin.mei(a)mediatek.com>
Fixes: 88bd652b3c74 ("mmc: mediatek: command queue support")
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220728080048.21336-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/mtk-sd.c b/drivers/mmc/host/mtk-sd.c
index 4ff73d1883de..69d78604d1fc 100644
--- a/drivers/mmc/host/mtk-sd.c
+++ b/drivers/mmc/host/mtk-sd.c
@@ -2446,6 +2446,9 @@ static void msdc_cqe_disable(struct mmc_host *mmc, bool recovery)
/* disable busy check */
sdr_clr_bits(host->base + MSDC_PATCH_BIT1, MSDC_PB1_BUSY_CHECK_SEL);
+ val = readl(host->base + MSDC_INT);
+ writel(val, host->base + MSDC_INT);
+
if (recovery) {
sdr_set_field(host->base + MSDC_DMA_CTRL,
MSDC_DMA_CTRL_STOP, 1);
@@ -2932,11 +2935,14 @@ static int __maybe_unused msdc_suspend(struct device *dev)
struct mmc_host *mmc = dev_get_drvdata(dev);
struct msdc_host *host = mmc_priv(mmc);
int ret;
+ u32 val;
if (mmc->caps2 & MMC_CAP2_CQE) {
ret = cqhci_suspend(mmc);
if (ret)
return ret;
+ val = readl(host->base + MSDC_INT);
+ writel(val, host->base + MSDC_INT);
}
/*
The patch below does not apply to the 5.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cc5d1692600613e72f32af60e27330fe0c79f4fe Mon Sep 17 00:00:00 2001
From: Wenbin Mei <wenbin.mei(a)mediatek.com>
Date: Thu, 28 Jul 2022 16:00:48 +0800
Subject: [PATCH] mmc: mtk-sd: Clear interrupts when cqe off/disable
Currently we don't clear MSDC interrupts when cqe off/disable, which led
to the data complete interrupt will be reserved for the next command.
If the next command with data transfer after cqe off/disable, we process
the CMD ready interrupt and trigger DMA start for data, but the data
complete interrupt is already exists, then SW assume that the data transfer
is complete, SW will trigger DMA stop, but the data may not be transmitted
yet or is transmitting, so we may encounter the following error:
mtk-msdc 11230000.mmc: CMD bus busy detected.
Signed-off-by: Wenbin Mei <wenbin.mei(a)mediatek.com>
Fixes: 88bd652b3c74 ("mmc: mediatek: command queue support")
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220728080048.21336-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/mtk-sd.c b/drivers/mmc/host/mtk-sd.c
index 4ff73d1883de..69d78604d1fc 100644
--- a/drivers/mmc/host/mtk-sd.c
+++ b/drivers/mmc/host/mtk-sd.c
@@ -2446,6 +2446,9 @@ static void msdc_cqe_disable(struct mmc_host *mmc, bool recovery)
/* disable busy check */
sdr_clr_bits(host->base + MSDC_PATCH_BIT1, MSDC_PB1_BUSY_CHECK_SEL);
+ val = readl(host->base + MSDC_INT);
+ writel(val, host->base + MSDC_INT);
+
if (recovery) {
sdr_set_field(host->base + MSDC_DMA_CTRL,
MSDC_DMA_CTRL_STOP, 1);
@@ -2932,11 +2935,14 @@ static int __maybe_unused msdc_suspend(struct device *dev)
struct mmc_host *mmc = dev_get_drvdata(dev);
struct msdc_host *host = mmc_priv(mmc);
int ret;
+ u32 val;
if (mmc->caps2 & MMC_CAP2_CQE) {
ret = cqhci_suspend(mmc);
if (ret)
return ret;
+ val = readl(host->base + MSDC_INT);
+ writel(val, host->base + MSDC_INT);
}
/*
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 769030e11847c5412270c0726ff21d3a1f0a3131 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 1 Aug 2022 14:57:52 +0100
Subject: [PATCH] btrfs: fix warning during log replay when bumping inode link
count
During log replay, at add_link(), we may increment the link count of
another inode that has a reference that conflicts with a new reference
for the inode currently being processed.
During log replay, at add_link(), we may drop (unlink) a reference from
some inode in the subvolume tree if that reference conflicts with a new
reference found in the log for the inode we are currently processing.
After the unlink, If the link count has decreased from 1 to 0, then we
increment the link count to prevent the inode from being deleted if it's
evicted by an iput() call, because we may have references to add to that
inode later on (and we will fixup its link count later during log replay).
However incrementing the link count from 0 to 1 triggers a warning:
$ cat fs/inode.c
(...)
void inc_nlink(struct inode *inode)
{
if (unlikely(inode->i_nlink == 0)) {
WARN_ON(!(inode->i_state & I_LINKABLE));
atomic_long_dec(&inode->i_sb->s_remove_count);
}
(...)
The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
never set during log replay.
Most of the time, the warning isn't triggered even if we dropped the last
reference of the conflicting inode, and this is because:
1) The conflicting inode was previously marked for fixup, through a call
to link_to_fixup_dir(), which increments the inode's link count;
2) And the last iput() on the inode has not triggered eviction of the
inode, nor was eviction triggered after the iput(). So at add_link(),
even if we unlink the last reference of the inode, its link count ends
up being 1 and not 0.
So this means that if eviction is triggered after link_to_fixup_dir() is
called, at add_link() we will read the inode back from the subvolume tree
and have it with a correct link count, matching the number of references
it has on the subvolume tree. So if when we are at add_link() the inode
has exactly one reference only, its link count is 1, and after the unlink
its link count becomes 0.
So fix this by using set_nlink() instead of inc_nlink(), as the former
accepts a transition from 0 to 1 and it's what we use in other similar
contexts (like at link_to_fixup_dir().
Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
bump the link count from 0 to 1.
The warning is actually harmless, but it may scare users. Josef also ran
into it recently.
CC: stable(a)vger.kernel.org # 5.1+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c1fdd6ef3f4a..9205c4a5ca81 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1459,7 +1459,7 @@ static int add_link(struct btrfs_trans_handle *trans,
* on the inode will not free it. We will fixup the link count later.
*/
if (other_inode->i_nlink == 0)
- inc_nlink(other_inode);
+ set_nlink(other_inode, 1);
add_link:
ret = btrfs_add_link(trans, BTRFS_I(dir), BTRFS_I(inode),
name, namelen, 0, ref_index);
@@ -1602,7 +1602,7 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
* free it. We will fixup the link count later.
*/
if (!ret && inode->i_nlink == 0)
- inc_nlink(inode);
+ set_nlink(inode, 1);
}
if (ret < 0)
goto out;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 769030e11847c5412270c0726ff21d3a1f0a3131 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 1 Aug 2022 14:57:52 +0100
Subject: [PATCH] btrfs: fix warning during log replay when bumping inode link
count
During log replay, at add_link(), we may increment the link count of
another inode that has a reference that conflicts with a new reference
for the inode currently being processed.
During log replay, at add_link(), we may drop (unlink) a reference from
some inode in the subvolume tree if that reference conflicts with a new
reference found in the log for the inode we are currently processing.
After the unlink, If the link count has decreased from 1 to 0, then we
increment the link count to prevent the inode from being deleted if it's
evicted by an iput() call, because we may have references to add to that
inode later on (and we will fixup its link count later during log replay).
However incrementing the link count from 0 to 1 triggers a warning:
$ cat fs/inode.c
(...)
void inc_nlink(struct inode *inode)
{
if (unlikely(inode->i_nlink == 0)) {
WARN_ON(!(inode->i_state & I_LINKABLE));
atomic_long_dec(&inode->i_sb->s_remove_count);
}
(...)
The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
never set during log replay.
Most of the time, the warning isn't triggered even if we dropped the last
reference of the conflicting inode, and this is because:
1) The conflicting inode was previously marked for fixup, through a call
to link_to_fixup_dir(), which increments the inode's link count;
2) And the last iput() on the inode has not triggered eviction of the
inode, nor was eviction triggered after the iput(). So at add_link(),
even if we unlink the last reference of the inode, its link count ends
up being 1 and not 0.
So this means that if eviction is triggered after link_to_fixup_dir() is
called, at add_link() we will read the inode back from the subvolume tree
and have it with a correct link count, matching the number of references
it has on the subvolume tree. So if when we are at add_link() the inode
has exactly one reference only, its link count is 1, and after the unlink
its link count becomes 0.
So fix this by using set_nlink() instead of inc_nlink(), as the former
accepts a transition from 0 to 1 and it's what we use in other similar
contexts (like at link_to_fixup_dir().
Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
bump the link count from 0 to 1.
The warning is actually harmless, but it may scare users. Josef also ran
into it recently.
CC: stable(a)vger.kernel.org # 5.1+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c1fdd6ef3f4a..9205c4a5ca81 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1459,7 +1459,7 @@ static int add_link(struct btrfs_trans_handle *trans,
* on the inode will not free it. We will fixup the link count later.
*/
if (other_inode->i_nlink == 0)
- inc_nlink(other_inode);
+ set_nlink(other_inode, 1);
add_link:
ret = btrfs_add_link(trans, BTRFS_I(dir), BTRFS_I(inode),
name, namelen, 0, ref_index);
@@ -1602,7 +1602,7 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
* free it. We will fixup the link count later.
*/
if (!ret && inode->i_nlink == 0)
- inc_nlink(inode);
+ set_nlink(inode, 1);
}
if (ret < 0)
goto out;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 769030e11847c5412270c0726ff21d3a1f0a3131 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 1 Aug 2022 14:57:52 +0100
Subject: [PATCH] btrfs: fix warning during log replay when bumping inode link
count
During log replay, at add_link(), we may increment the link count of
another inode that has a reference that conflicts with a new reference
for the inode currently being processed.
During log replay, at add_link(), we may drop (unlink) a reference from
some inode in the subvolume tree if that reference conflicts with a new
reference found in the log for the inode we are currently processing.
After the unlink, If the link count has decreased from 1 to 0, then we
increment the link count to prevent the inode from being deleted if it's
evicted by an iput() call, because we may have references to add to that
inode later on (and we will fixup its link count later during log replay).
However incrementing the link count from 0 to 1 triggers a warning:
$ cat fs/inode.c
(...)
void inc_nlink(struct inode *inode)
{
if (unlikely(inode->i_nlink == 0)) {
WARN_ON(!(inode->i_state & I_LINKABLE));
atomic_long_dec(&inode->i_sb->s_remove_count);
}
(...)
The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
never set during log replay.
Most of the time, the warning isn't triggered even if we dropped the last
reference of the conflicting inode, and this is because:
1) The conflicting inode was previously marked for fixup, through a call
to link_to_fixup_dir(), which increments the inode's link count;
2) And the last iput() on the inode has not triggered eviction of the
inode, nor was eviction triggered after the iput(). So at add_link(),
even if we unlink the last reference of the inode, its link count ends
up being 1 and not 0.
So this means that if eviction is triggered after link_to_fixup_dir() is
called, at add_link() we will read the inode back from the subvolume tree
and have it with a correct link count, matching the number of references
it has on the subvolume tree. So if when we are at add_link() the inode
has exactly one reference only, its link count is 1, and after the unlink
its link count becomes 0.
So fix this by using set_nlink() instead of inc_nlink(), as the former
accepts a transition from 0 to 1 and it's what we use in other similar
contexts (like at link_to_fixup_dir().
Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
bump the link count from 0 to 1.
The warning is actually harmless, but it may scare users. Josef also ran
into it recently.
CC: stable(a)vger.kernel.org # 5.1+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c1fdd6ef3f4a..9205c4a5ca81 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1459,7 +1459,7 @@ static int add_link(struct btrfs_trans_handle *trans,
* on the inode will not free it. We will fixup the link count later.
*/
if (other_inode->i_nlink == 0)
- inc_nlink(other_inode);
+ set_nlink(other_inode, 1);
add_link:
ret = btrfs_add_link(trans, BTRFS_I(dir), BTRFS_I(inode),
name, namelen, 0, ref_index);
@@ -1602,7 +1602,7 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
* free it. We will fixup the link count later.
*/
if (!ret && inode->i_nlink == 0)
- inc_nlink(inode);
+ set_nlink(inode, 1);
}
if (ret < 0)
goto out;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a0753ef66c34c1739580219dca664eda648164b7 Mon Sep 17 00:00:00 2001
From: Liming Sun <limings(a)nvidia.com>
Date: Tue, 9 Aug 2022 13:37:42 -0400
Subject: [PATCH] mmc: sdhci-of-dwcmshc: Re-enable support for the BlueField-3
SoC
The commit 08f3dff799d4 (mmc: sdhci-of-dwcmshc: add rockchip platform
support") introduces the use of_device_get_match_data() to check for some
chips. Unfortunately, it also breaks the BlueField-3 FW, which uses ACPI.
To fix the problem, let's add the ACPI match data and the corresponding
quirks to re-enable the support for the BlueField-3 SoC.
Reviewed-by: David Woods <davwoods(a)nvidia.com>
Signed-off-by: Liming Sun <limings(a)nvidia.com>
Acked-by: Adrian Hunter <adrian.hunter(a)intel.com>
Fixes: 08f3dff799d4 ("mmc: sdhci-of-dwcmshc: add rockchip platform support")
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220809173742.178440-1-limings@nvidia.com
[Ulf: Clarified the commit message a bit]
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/sdhci-of-dwcmshc.c b/drivers/mmc/host/sdhci-of-dwcmshc.c
index 4e904850973c..a7343d4bc50e 100644
--- a/drivers/mmc/host/sdhci-of-dwcmshc.c
+++ b/drivers/mmc/host/sdhci-of-dwcmshc.c
@@ -349,6 +349,15 @@ static const struct sdhci_pltfm_data sdhci_dwcmshc_pdata = {
.quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN,
};
+#ifdef CONFIG_ACPI
+static const struct sdhci_pltfm_data sdhci_dwcmshc_bf3_pdata = {
+ .ops = &sdhci_dwcmshc_ops,
+ .quirks = SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN,
+ .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN |
+ SDHCI_QUIRK2_ACMD23_BROKEN,
+};
+#endif
+
static const struct sdhci_pltfm_data sdhci_dwcmshc_rk35xx_pdata = {
.ops = &sdhci_dwcmshc_rk35xx_ops,
.quirks = SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN |
@@ -431,7 +440,10 @@ MODULE_DEVICE_TABLE(of, sdhci_dwcmshc_dt_ids);
#ifdef CONFIG_ACPI
static const struct acpi_device_id sdhci_dwcmshc_acpi_ids[] = {
- { .id = "MLNXBF30" },
+ {
+ .id = "MLNXBF30",
+ .driver_data = (kernel_ulong_t)&sdhci_dwcmshc_bf3_pdata,
+ },
{}
};
#endif
@@ -447,7 +459,7 @@ static int dwcmshc_probe(struct platform_device *pdev)
int err;
u32 extra;
- pltfm_data = of_device_get_match_data(&pdev->dev);
+ pltfm_data = device_get_match_data(&pdev->dev);
if (!pltfm_data) {
dev_err(&pdev->dev, "Error: No device match data found\n");
return -ENODEV;
The patch below does not apply to the 5.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a0753ef66c34c1739580219dca664eda648164b7 Mon Sep 17 00:00:00 2001
From: Liming Sun <limings(a)nvidia.com>
Date: Tue, 9 Aug 2022 13:37:42 -0400
Subject: [PATCH] mmc: sdhci-of-dwcmshc: Re-enable support for the BlueField-3
SoC
The commit 08f3dff799d4 (mmc: sdhci-of-dwcmshc: add rockchip platform
support") introduces the use of_device_get_match_data() to check for some
chips. Unfortunately, it also breaks the BlueField-3 FW, which uses ACPI.
To fix the problem, let's add the ACPI match data and the corresponding
quirks to re-enable the support for the BlueField-3 SoC.
Reviewed-by: David Woods <davwoods(a)nvidia.com>
Signed-off-by: Liming Sun <limings(a)nvidia.com>
Acked-by: Adrian Hunter <adrian.hunter(a)intel.com>
Fixes: 08f3dff799d4 ("mmc: sdhci-of-dwcmshc: add rockchip platform support")
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220809173742.178440-1-limings@nvidia.com
[Ulf: Clarified the commit message a bit]
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/sdhci-of-dwcmshc.c b/drivers/mmc/host/sdhci-of-dwcmshc.c
index 4e904850973c..a7343d4bc50e 100644
--- a/drivers/mmc/host/sdhci-of-dwcmshc.c
+++ b/drivers/mmc/host/sdhci-of-dwcmshc.c
@@ -349,6 +349,15 @@ static const struct sdhci_pltfm_data sdhci_dwcmshc_pdata = {
.quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN,
};
+#ifdef CONFIG_ACPI
+static const struct sdhci_pltfm_data sdhci_dwcmshc_bf3_pdata = {
+ .ops = &sdhci_dwcmshc_ops,
+ .quirks = SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN,
+ .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN |
+ SDHCI_QUIRK2_ACMD23_BROKEN,
+};
+#endif
+
static const struct sdhci_pltfm_data sdhci_dwcmshc_rk35xx_pdata = {
.ops = &sdhci_dwcmshc_rk35xx_ops,
.quirks = SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN |
@@ -431,7 +440,10 @@ MODULE_DEVICE_TABLE(of, sdhci_dwcmshc_dt_ids);
#ifdef CONFIG_ACPI
static const struct acpi_device_id sdhci_dwcmshc_acpi_ids[] = {
- { .id = "MLNXBF30" },
+ {
+ .id = "MLNXBF30",
+ .driver_data = (kernel_ulong_t)&sdhci_dwcmshc_bf3_pdata,
+ },
{}
};
#endif
@@ -447,7 +459,7 @@ static int dwcmshc_probe(struct platform_device *pdev)
int err;
u32 extra;
- pltfm_data = of_device_get_match_data(&pdev->dev);
+ pltfm_data = device_get_match_data(&pdev->dev);
if (!pltfm_data) {
dev_err(&pdev->dev, "Error: No device match data found\n");
return -ENODEV;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b3e1cf31154136da855f3cb6117c17eb0b6bcfb4 Mon Sep 17 00:00:00 2001
From: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Date: Sun, 7 Aug 2022 08:56:38 +0200
Subject: [PATCH] mmc: meson-gx: Fix an error handling path in
meson_mmc_probe()
The commit in Fixes has introduced a new error handling which should goto
the existing error handling path.
Otherwise some resources leak.
Fixes: 19c6beaa064c ("mmc: meson-gx: add device reset")
Signed-off-by: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/be4b863bacf323521ba3a02efdc4fca9cdedd1a6.16598553…
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/meson-gx-mmc.c b/drivers/mmc/host/meson-gx-mmc.c
index 2f08d442e557..fc462995cf94 100644
--- a/drivers/mmc/host/meson-gx-mmc.c
+++ b/drivers/mmc/host/meson-gx-mmc.c
@@ -1172,8 +1172,10 @@ static int meson_mmc_probe(struct platform_device *pdev)
}
ret = device_reset_optional(&pdev->dev);
- if (ret)
- return dev_err_probe(&pdev->dev, ret, "device reset failed\n");
+ if (ret) {
+ dev_err_probe(&pdev->dev, ret, "device reset failed\n");
+ goto free_host;
+ }
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
host->regs = devm_ioremap_resource(&pdev->dev, res);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b3e1cf31154136da855f3cb6117c17eb0b6bcfb4 Mon Sep 17 00:00:00 2001
From: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Date: Sun, 7 Aug 2022 08:56:38 +0200
Subject: [PATCH] mmc: meson-gx: Fix an error handling path in
meson_mmc_probe()
The commit in Fixes has introduced a new error handling which should goto
the existing error handling path.
Otherwise some resources leak.
Fixes: 19c6beaa064c ("mmc: meson-gx: add device reset")
Signed-off-by: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/be4b863bacf323521ba3a02efdc4fca9cdedd1a6.16598553…
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/meson-gx-mmc.c b/drivers/mmc/host/meson-gx-mmc.c
index 2f08d442e557..fc462995cf94 100644
--- a/drivers/mmc/host/meson-gx-mmc.c
+++ b/drivers/mmc/host/meson-gx-mmc.c
@@ -1172,8 +1172,10 @@ static int meson_mmc_probe(struct platform_device *pdev)
}
ret = device_reset_optional(&pdev->dev);
- if (ret)
- return dev_err_probe(&pdev->dev, ret, "device reset failed\n");
+ if (ret) {
+ dev_err_probe(&pdev->dev, ret, "device reset failed\n");
+ goto free_host;
+ }
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
host->regs = devm_ioremap_resource(&pdev->dev, res);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 415d832497098030241605c52ea83d4e2cfa7879 Mon Sep 17 00:00:00 2001
From: Hector Martin <marcan(a)marcan.st>
Date: Tue, 16 Aug 2022 16:03:11 +0900
Subject: [PATCH] locking/atomic: Make test_and_*_bit() ordered on failure
These operations are documented as always ordered in
include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
type use cases where one side needs to ensure a flag is left pending
after some shared data was updated rely on this ordering, even in the
failure case.
This is the case with the workqueue code, which currently suffers from a
reproducible ordering violation on Apple M1 platforms (which are
notoriously out-of-order) that ends up causing the TTY layer to fail to
deliver data to userspace properly under the right conditions. This
change fixes that bug.
Change the documentation to restrict the "no order on failure" story to
the _lock() variant (for which it makes sense), and remove the
early-exit from the generic implementation, which is what causes the
missing barrier semantics in that case. Without this, the remaining
atomic op is fully ordered (including on ARM64 LSE, as of recent
versions of the architecture spec).
Suggested-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: stable(a)vger.kernel.org
Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
Signed-off-by: Hector Martin <marcan(a)marcan.st>
Acked-by: Will Deacon <will(a)kernel.org>
Reviewed-by: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt
index 093cdaefdb37..d8b101c97031 100644
--- a/Documentation/atomic_bitops.txt
+++ b/Documentation/atomic_bitops.txt
@@ -59,7 +59,7 @@ Like with atomic_t, the rule of thumb is:
- RMW operations that have a return value are fully ordered.
- RMW operations that are conditional are unordered on FAILURE,
- otherwise the above rules apply. In the case of test_and_{}_bit() operations,
+ otherwise the above rules apply. In the case of test_and_set_bit_lock(),
if the bit in memory is unchanged by the operation then it is deemed to have
failed.
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h
index 3096f086b5a3..71ab4ba9c25d 100644
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
unsigned long mask = BIT_MASK(nr);
p += BIT_WORD(nr);
- if (READ_ONCE(*p) & mask)
- return 1;
-
old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
return !!(old & mask);
}
@@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
unsigned long mask = BIT_MASK(nr);
p += BIT_WORD(nr);
- if (!(READ_ONCE(*p) & mask))
- return 0;
-
old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
return !!(old & mask);
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 415d832497098030241605c52ea83d4e2cfa7879 Mon Sep 17 00:00:00 2001
From: Hector Martin <marcan(a)marcan.st>
Date: Tue, 16 Aug 2022 16:03:11 +0900
Subject: [PATCH] locking/atomic: Make test_and_*_bit() ordered on failure
These operations are documented as always ordered in
include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
type use cases where one side needs to ensure a flag is left pending
after some shared data was updated rely on this ordering, even in the
failure case.
This is the case with the workqueue code, which currently suffers from a
reproducible ordering violation on Apple M1 platforms (which are
notoriously out-of-order) that ends up causing the TTY layer to fail to
deliver data to userspace properly under the right conditions. This
change fixes that bug.
Change the documentation to restrict the "no order on failure" story to
the _lock() variant (for which it makes sense), and remove the
early-exit from the generic implementation, which is what causes the
missing barrier semantics in that case. Without this, the remaining
atomic op is fully ordered (including on ARM64 LSE, as of recent
versions of the architecture spec).
Suggested-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: stable(a)vger.kernel.org
Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
Signed-off-by: Hector Martin <marcan(a)marcan.st>
Acked-by: Will Deacon <will(a)kernel.org>
Reviewed-by: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt
index 093cdaefdb37..d8b101c97031 100644
--- a/Documentation/atomic_bitops.txt
+++ b/Documentation/atomic_bitops.txt
@@ -59,7 +59,7 @@ Like with atomic_t, the rule of thumb is:
- RMW operations that have a return value are fully ordered.
- RMW operations that are conditional are unordered on FAILURE,
- otherwise the above rules apply. In the case of test_and_{}_bit() operations,
+ otherwise the above rules apply. In the case of test_and_set_bit_lock(),
if the bit in memory is unchanged by the operation then it is deemed to have
failed.
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h
index 3096f086b5a3..71ab4ba9c25d 100644
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
unsigned long mask = BIT_MASK(nr);
p += BIT_WORD(nr);
- if (READ_ONCE(*p) & mask)
- return 1;
-
old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
return !!(old & mask);
}
@@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
unsigned long mask = BIT_MASK(nr);
p += BIT_WORD(nr);
- if (!(READ_ONCE(*p) & mask))
- return 0;
-
old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
return !!(old & mask);
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 405294f29faee5de8c10cb9d4a90e229c2835279 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 16 Aug 2022 05:39:36 +0000
Subject: [PATCH] KVM: Unconditionally get a ref to /dev/kvm module when
creating a VM
Unconditionally get a reference to the /dev/kvm module when creating a VM
instead of using try_get_module(), which will fail if the module is in
the process of being forcefully unloaded. The error handling when
try_get_module() fails doesn't properly unwind all that has been done,
e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
from the global list. Not removing VMs from the global list tends to be
fatal, e.g. leads to use-after-free explosions.
The obvious alternative would be to add proper unwinding, but the
justification for using try_get_module(), "rmmod --wait", is completely
bogus as support for "rmmod --wait", i.e. delete_module() without
O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod
--wait option.") nearly a decade ago.
It's still possible for try_get_module() to fail due to the module dying
(more like being killed), as the module will be tagged MODULE_STATE_GOING
by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
with forced unloading is an exercise in futility and gives a falsea sense
of security. Using try_get_module() only prevents acquiring _new_
references, it doesn't magically put the references held by other VMs,
and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
guaranteed to cause spectacular fireworks; the window where KVM will fail
try_get_module() is tiny compared to the window where KVM is building and
running the VM with an elevated module refcount.
Addressing KVM's inability to play nice with "rmmod --force" is firmly
out-of-scope. Forcefully unloading any module taints kernel (for obvious
reasons) _and_ requires the kernel to be built with
CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
amusing disclaimer that it's "mainly for kernel developers and desperate
users". In other words, KVM is free to scoff at bug reports due to using
"rmmod --force" while VMs may be running.
Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
Cc: stable(a)vger.kernel.org
Cc: David Matlack <dmatlack(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20220816053937.2477106-3-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ee5f48cc100b..15e304e059d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1134,6 +1134,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (!kvm)
return ERR_PTR(-ENOMEM);
+ /* KVM is pinned via open("/dev/kvm"), the fd passed to this ioctl(). */
+ __module_get(kvm_chardev_ops.owner);
+
KVM_MMU_LOCK_INIT(kvm);
mmgrab(current->mm);
kvm->mm = current->mm;
@@ -1226,16 +1229,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
preempt_notifier_inc();
kvm_init_pm_notifier(kvm);
- /*
- * When the fd passed to this ioctl() is opened it pins the module,
- * but try_module_get() also prevents getting a reference if the module
- * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
- */
- if (!try_module_get(kvm_chardev_ops.owner)) {
- r = -ENODEV;
- goto out_err;
- }
-
return kvm;
out_err:
@@ -1259,6 +1252,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
out_err_no_srcu:
kvm_arch_free_vm(kvm);
mmdrop(current->mm);
+ module_put(kvm_chardev_ops.owner);
return ERR_PTR(r);
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 405294f29faee5de8c10cb9d4a90e229c2835279 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 16 Aug 2022 05:39:36 +0000
Subject: [PATCH] KVM: Unconditionally get a ref to /dev/kvm module when
creating a VM
Unconditionally get a reference to the /dev/kvm module when creating a VM
instead of using try_get_module(), which will fail if the module is in
the process of being forcefully unloaded. The error handling when
try_get_module() fails doesn't properly unwind all that has been done,
e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
from the global list. Not removing VMs from the global list tends to be
fatal, e.g. leads to use-after-free explosions.
The obvious alternative would be to add proper unwinding, but the
justification for using try_get_module(), "rmmod --wait", is completely
bogus as support for "rmmod --wait", i.e. delete_module() without
O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod
--wait option.") nearly a decade ago.
It's still possible for try_get_module() to fail due to the module dying
(more like being killed), as the module will be tagged MODULE_STATE_GOING
by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
with forced unloading is an exercise in futility and gives a falsea sense
of security. Using try_get_module() only prevents acquiring _new_
references, it doesn't magically put the references held by other VMs,
and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
guaranteed to cause spectacular fireworks; the window where KVM will fail
try_get_module() is tiny compared to the window where KVM is building and
running the VM with an elevated module refcount.
Addressing KVM's inability to play nice with "rmmod --force" is firmly
out-of-scope. Forcefully unloading any module taints kernel (for obvious
reasons) _and_ requires the kernel to be built with
CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
amusing disclaimer that it's "mainly for kernel developers and desperate
users". In other words, KVM is free to scoff at bug reports due to using
"rmmod --force" while VMs may be running.
Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
Cc: stable(a)vger.kernel.org
Cc: David Matlack <dmatlack(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20220816053937.2477106-3-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ee5f48cc100b..15e304e059d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1134,6 +1134,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (!kvm)
return ERR_PTR(-ENOMEM);
+ /* KVM is pinned via open("/dev/kvm"), the fd passed to this ioctl(). */
+ __module_get(kvm_chardev_ops.owner);
+
KVM_MMU_LOCK_INIT(kvm);
mmgrab(current->mm);
kvm->mm = current->mm;
@@ -1226,16 +1229,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
preempt_notifier_inc();
kvm_init_pm_notifier(kvm);
- /*
- * When the fd passed to this ioctl() is opened it pins the module,
- * but try_module_get() also prevents getting a reference if the module
- * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
- */
- if (!try_module_get(kvm_chardev_ops.owner)) {
- r = -ENODEV;
- goto out_err;
- }
-
return kvm;
out_err:
@@ -1259,6 +1252,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
out_err_no_srcu:
kvm_arch_free_vm(kvm);
mmdrop(current->mm);
+ module_put(kvm_chardev_ops.owner);
return ERR_PTR(r);
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 405294f29faee5de8c10cb9d4a90e229c2835279 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 16 Aug 2022 05:39:36 +0000
Subject: [PATCH] KVM: Unconditionally get a ref to /dev/kvm module when
creating a VM
Unconditionally get a reference to the /dev/kvm module when creating a VM
instead of using try_get_module(), which will fail if the module is in
the process of being forcefully unloaded. The error handling when
try_get_module() fails doesn't properly unwind all that has been done,
e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
from the global list. Not removing VMs from the global list tends to be
fatal, e.g. leads to use-after-free explosions.
The obvious alternative would be to add proper unwinding, but the
justification for using try_get_module(), "rmmod --wait", is completely
bogus as support for "rmmod --wait", i.e. delete_module() without
O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod
--wait option.") nearly a decade ago.
It's still possible for try_get_module() to fail due to the module dying
(more like being killed), as the module will be tagged MODULE_STATE_GOING
by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
with forced unloading is an exercise in futility and gives a falsea sense
of security. Using try_get_module() only prevents acquiring _new_
references, it doesn't magically put the references held by other VMs,
and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
guaranteed to cause spectacular fireworks; the window where KVM will fail
try_get_module() is tiny compared to the window where KVM is building and
running the VM with an elevated module refcount.
Addressing KVM's inability to play nice with "rmmod --force" is firmly
out-of-scope. Forcefully unloading any module taints kernel (for obvious
reasons) _and_ requires the kernel to be built with
CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
amusing disclaimer that it's "mainly for kernel developers and desperate
users". In other words, KVM is free to scoff at bug reports due to using
"rmmod --force" while VMs may be running.
Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
Cc: stable(a)vger.kernel.org
Cc: David Matlack <dmatlack(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20220816053937.2477106-3-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ee5f48cc100b..15e304e059d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1134,6 +1134,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (!kvm)
return ERR_PTR(-ENOMEM);
+ /* KVM is pinned via open("/dev/kvm"), the fd passed to this ioctl(). */
+ __module_get(kvm_chardev_ops.owner);
+
KVM_MMU_LOCK_INIT(kvm);
mmgrab(current->mm);
kvm->mm = current->mm;
@@ -1226,16 +1229,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
preempt_notifier_inc();
kvm_init_pm_notifier(kvm);
- /*
- * When the fd passed to this ioctl() is opened it pins the module,
- * but try_module_get() also prevents getting a reference if the module
- * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
- */
- if (!try_module_get(kvm_chardev_ops.owner)) {
- r = -ENODEV;
- goto out_err;
- }
-
return kvm;
out_err:
@@ -1259,6 +1252,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
out_err_no_srcu:
kvm_arch_free_vm(kvm);
mmdrop(current->mm);
+ module_put(kvm_chardev_ops.owner);
return ERR_PTR(r);
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 405294f29faee5de8c10cb9d4a90e229c2835279 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 16 Aug 2022 05:39:36 +0000
Subject: [PATCH] KVM: Unconditionally get a ref to /dev/kvm module when
creating a VM
Unconditionally get a reference to the /dev/kvm module when creating a VM
instead of using try_get_module(), which will fail if the module is in
the process of being forcefully unloaded. The error handling when
try_get_module() fails doesn't properly unwind all that has been done,
e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
from the global list. Not removing VMs from the global list tends to be
fatal, e.g. leads to use-after-free explosions.
The obvious alternative would be to add proper unwinding, but the
justification for using try_get_module(), "rmmod --wait", is completely
bogus as support for "rmmod --wait", i.e. delete_module() without
O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod
--wait option.") nearly a decade ago.
It's still possible for try_get_module() to fail due to the module dying
(more like being killed), as the module will be tagged MODULE_STATE_GOING
by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
with forced unloading is an exercise in futility and gives a falsea sense
of security. Using try_get_module() only prevents acquiring _new_
references, it doesn't magically put the references held by other VMs,
and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
guaranteed to cause spectacular fireworks; the window where KVM will fail
try_get_module() is tiny compared to the window where KVM is building and
running the VM with an elevated module refcount.
Addressing KVM's inability to play nice with "rmmod --force" is firmly
out-of-scope. Forcefully unloading any module taints kernel (for obvious
reasons) _and_ requires the kernel to be built with
CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
amusing disclaimer that it's "mainly for kernel developers and desperate
users". In other words, KVM is free to scoff at bug reports due to using
"rmmod --force" while VMs may be running.
Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
Cc: stable(a)vger.kernel.org
Cc: David Matlack <dmatlack(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20220816053937.2477106-3-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ee5f48cc100b..15e304e059d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1134,6 +1134,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (!kvm)
return ERR_PTR(-ENOMEM);
+ /* KVM is pinned via open("/dev/kvm"), the fd passed to this ioctl(). */
+ __module_get(kvm_chardev_ops.owner);
+
KVM_MMU_LOCK_INIT(kvm);
mmgrab(current->mm);
kvm->mm = current->mm;
@@ -1226,16 +1229,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
preempt_notifier_inc();
kvm_init_pm_notifier(kvm);
- /*
- * When the fd passed to this ioctl() is opened it pins the module,
- * but try_module_get() also prevents getting a reference if the module
- * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
- */
- if (!try_module_get(kvm_chardev_ops.owner)) {
- r = -ENODEV;
- goto out_err;
- }
-
return kvm;
out_err:
@@ -1259,6 +1252,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
out_err_no_srcu:
kvm_arch_free_vm(kvm);
mmdrop(current->mm);
+ module_put(kvm_chardev_ops.owner);
return ERR_PTR(r);
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 405294f29faee5de8c10cb9d4a90e229c2835279 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 16 Aug 2022 05:39:36 +0000
Subject: [PATCH] KVM: Unconditionally get a ref to /dev/kvm module when
creating a VM
Unconditionally get a reference to the /dev/kvm module when creating a VM
instead of using try_get_module(), which will fail if the module is in
the process of being forcefully unloaded. The error handling when
try_get_module() fails doesn't properly unwind all that has been done,
e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
from the global list. Not removing VMs from the global list tends to be
fatal, e.g. leads to use-after-free explosions.
The obvious alternative would be to add proper unwinding, but the
justification for using try_get_module(), "rmmod --wait", is completely
bogus as support for "rmmod --wait", i.e. delete_module() without
O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod
--wait option.") nearly a decade ago.
It's still possible for try_get_module() to fail due to the module dying
(more like being killed), as the module will be tagged MODULE_STATE_GOING
by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
with forced unloading is an exercise in futility and gives a falsea sense
of security. Using try_get_module() only prevents acquiring _new_
references, it doesn't magically put the references held by other VMs,
and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
guaranteed to cause spectacular fireworks; the window where KVM will fail
try_get_module() is tiny compared to the window where KVM is building and
running the VM with an elevated module refcount.
Addressing KVM's inability to play nice with "rmmod --force" is firmly
out-of-scope. Forcefully unloading any module taints kernel (for obvious
reasons) _and_ requires the kernel to be built with
CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
amusing disclaimer that it's "mainly for kernel developers and desperate
users". In other words, KVM is free to scoff at bug reports due to using
"rmmod --force" while VMs may be running.
Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
Cc: stable(a)vger.kernel.org
Cc: David Matlack <dmatlack(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20220816053937.2477106-3-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ee5f48cc100b..15e304e059d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1134,6 +1134,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (!kvm)
return ERR_PTR(-ENOMEM);
+ /* KVM is pinned via open("/dev/kvm"), the fd passed to this ioctl(). */
+ __module_get(kvm_chardev_ops.owner);
+
KVM_MMU_LOCK_INIT(kvm);
mmgrab(current->mm);
kvm->mm = current->mm;
@@ -1226,16 +1229,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
preempt_notifier_inc();
kvm_init_pm_notifier(kvm);
- /*
- * When the fd passed to this ioctl() is opened it pins the module,
- * but try_module_get() also prevents getting a reference if the module
- * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
- */
- if (!try_module_get(kvm_chardev_ops.owner)) {
- r = -ENODEV;
- goto out_err;
- }
-
return kvm;
out_err:
@@ -1259,6 +1252,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
out_err_no_srcu:
kvm_arch_free_vm(kvm);
mmdrop(current->mm);
+ module_put(kvm_chardev_ops.owner);
return ERR_PTR(r);
}
Note, this is the LAST 5.18.y kernel to be released. This branch is now
end-of-life. Please move to 5.19.y at this point in time.
I'm announcing the release of the 5.18.19 kernel.
All users of the 5.18 kernel series must upgrade.
The updated 5.18.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.18.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2 -
arch/arm64/kernel/kexec_image.c | 11 -----
arch/x86/kernel/kexec-bzimage64.c | 20 ----------
drivers/tee/tee_shm.c | 3 +
fs/btrfs/raid56.c | 74 +++++++++++++++++++++++++++++---------
include/linux/kexec.h | 7 +++
kernel/kexec_file.c | 17 ++++++++
net/sched/cls_route.c | 10 +++++
8 files changed, 97 insertions(+), 47 deletions(-)
Coiby Xu (2):
kexec, KEYS: make the code in bzImage64_verify_sig generic
arm64: kexec_file: use more system keyrings to verify kernel image signature
Greg Kroah-Hartman (1):
Linux 5.18.19
Jamal Hadi Salim (1):
net_sched: cls_route: disallow handle of 0
Jens Wiklander (1):
tee: add overflow check in register_shm_helper()
Qu Wenruo (2):
btrfs: only write the sectors in the vertical stripe which has data stripes
btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()
I'm announcing the release of the 5.19.3 kernel.
All users of the 5.19 kernel series must upgrade.
The updated 5.19.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.19.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2 -
arch/arm64/kernel/kexec_image.c | 11 ------
arch/x86/kernel/kexec-bzimage64.c | 20 -----------
drivers/tee/tee_shm.c | 3 +
fs/btrfs/raid56.c | 68 ++++++++++++++++++++++++++++++--------
include/linux/kexec.h | 7 +++
kernel/kexec_file.c | 17 +++++++++
mm/kfence/core.c | 18 +++++-----
net/sched/cls_route.c | 10 +++++
9 files changed, 104 insertions(+), 52 deletions(-)
Coiby Xu (2):
kexec, KEYS: make the code in bzImage64_verify_sig generic
arm64: kexec_file: use more system keyrings to verify kernel image signature
Greg Kroah-Hartman (1):
Linux 5.19.3
Jamal Hadi Salim (1):
net_sched: cls_route: disallow handle of 0
Jens Wiklander (1):
tee: add overflow check in register_shm_helper()
Marco Elver (1):
Revert "mm: kfence: apply kmemleak_ignore_phys on early allocated pool"
Qu Wenruo (2):
btrfs: only write the sectors in the vertical stripe which has data stripes
btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()
I'm announcing the release of the 5.15.62 kernel.
All users of the 5.15 kernel series must upgrade.
The updated 5.15.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.15.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/x86/kernel/ftrace.c | 7 --
arch/x86/kernel/ftrace_64.S | 19 +++++-
drivers/tee/tee_shm.c | 3 +
fs/btrfs/raid56.c | 74 +++++++++++++++++++------
fs/io_uring.c | 2
fs/ksmbd/smb2misc.c | 7 +-
fs/ksmbd/smb2pdu.c | 45 +++++++++------
fs/ksmbd/smbacl.c | 130 +++++++++++++++++++++++++++++---------------
fs/ksmbd/smbacl.h | 2
fs/ksmbd/vfs.c | 5 +
net/sched/cls_route.c | 10 +++
12 files changed, 215 insertions(+), 91 deletions(-)
Greg Kroah-Hartman (1):
Linux 5.15.62
Hyunchul Lee (1):
ksmbd: prevent out of bound read for SMB2_WRITE
Jamal Hadi Salim (1):
net_sched: cls_route: disallow handle of 0
Jens Axboe (1):
io_uring: use original request task for inflight tracking
Jens Wiklander (1):
tee: add overflow check in register_shm_helper()
Namjae Jeon (1):
ksmbd: fix heap-based overflow in set_ntacl_dacl()
Peter Zijlstra (2):
x86/ibt,ftrace: Make function-graph play nice
x86/ftrace: Use alternative RET encoding
Qu Wenruo (2):
btrfs: only write the sectors in the vertical stripe which has data stripes
btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()
Thadeu Lima de Souza Cascardo (1):
Revert "x86/ftrace: Use alternative RET encoding"
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0b9ba6135d7f18b82f3d8bebb55ded725ba88e0e Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Wed, 13 Jul 2022 01:12:21 +0200
Subject: [PATCH] um: seed rng using host OS rng
UML generally does not provide access to special CPU instructions like
RDRAND, and execution tends to be rather deterministic, with no real
hardware interrupts, making good randomness really very hard, if not
all together impossible. Not only is this a security eyebrow raiser, but
it's also quite annoying when trying to do various pieces of UML-based
automation that takes a long time to boot, if ever.
Fix this by trivially calling getrandom() in the host and using that
seed as "bootloader randomness", which initializes the rng immediately
at UML boot.
The old behavior can be restored the same way as on any other arch, by
way of CONFIG_TRUST_BOOTLOADER_RANDOMNESS=n or
random.trust_bootloader=0. So seen from that perspective, this just
makes UML act like other archs, which is positive in its own right.
Additionally, wire up arch_get_random_{int,long}() in the same way, so
that reseeds can also make use of the host RNG, controllable by
CONFIG_TRUST_CPU_RANDOMNESS and random.trust_cpu, per usual.
Cc: stable(a)vger.kernel.org
Acked-by: Johannes Berg <johannes(a)sipsolutions.net>
Acked-By: Anton Ivanov <anton.ivanov(a)cambridgegreys.com>
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/arch/um/include/asm/archrandom.h b/arch/um/include/asm/archrandom.h
new file mode 100644
index 000000000000..2f24cb96391d
--- /dev/null
+++ b/arch/um/include/asm/archrandom.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_UM_ARCHRANDOM_H__
+#define __ASM_UM_ARCHRANDOM_H__
+
+#include <linux/types.h>
+
+/* This is from <os.h>, but better not to #include that in a global header here. */
+ssize_t os_getrandom(void *buf, size_t len, unsigned int flags);
+
+static inline bool __must_check arch_get_random_long(unsigned long *v)
+{
+ return os_getrandom(v, sizeof(*v), 0) == sizeof(*v);
+}
+
+static inline bool __must_check arch_get_random_int(unsigned int *v)
+{
+ return os_getrandom(v, sizeof(*v), 0) == sizeof(*v);
+}
+
+static inline bool __must_check arch_get_random_seed_long(unsigned long *v)
+{
+ return false;
+}
+
+static inline bool __must_check arch_get_random_seed_int(unsigned int *v)
+{
+ return false;
+}
+
+#endif
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index fafde1d5416e..0df646c6651e 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -11,6 +11,12 @@
#include <irq_user.h>
#include <longjmp.h>
#include <mm_id.h>
+/* This is to get size_t */
+#ifndef __UM_HOST__
+#include <linux/types.h>
+#else
+#include <sys/types.h>
+#endif
#define CATCH_EINTR(expr) while ((errno = 0, ((expr) < 0)) && (errno == EINTR))
@@ -243,6 +249,7 @@ extern void stack_protections(unsigned long address);
extern int raw(int fd);
extern void setup_machinename(char *machine_out);
extern void setup_hostinfo(char *buf, int len);
+extern ssize_t os_getrandom(void *buf, size_t len, unsigned int flags);
extern void os_dump_core(void) __attribute__ ((noreturn));
extern void um_early_printk(const char *s, unsigned int n);
extern void os_fix_helper_signals(void);
diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index 0760e24f2eba..74f3efd96bd4 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -16,6 +16,7 @@
#include <linux/sched/task.h>
#include <linux/kmsg_dump.h>
#include <linux/suspend.h>
+#include <linux/random.h>
#include <asm/processor.h>
#include <asm/cpufeature.h>
@@ -406,6 +407,8 @@ int __init __weak read_initrd(void)
void __init setup_arch(char **cmdline_p)
{
+ u8 rng_seed[32];
+
stack_protections((unsigned long) &init_thread_info);
setup_physmem(uml_physmem, uml_reserved, physmem_size, highmem);
mem_total_pages(physmem_size, iomem_size, highmem);
@@ -416,6 +419,11 @@ void __init setup_arch(char **cmdline_p)
strlcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
*cmdline_p = command_line;
setup_hostinfo(host_info, sizeof host_info);
+
+ if (os_getrandom(rng_seed, sizeof(rng_seed), 0) == sizeof(rng_seed)) {
+ add_bootloader_randomness(rng_seed, sizeof(rng_seed));
+ memzero_explicit(rng_seed, sizeof(rng_seed));
+ }
}
void __init check_bugs(void)
diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c
index 41297ec404bf..fc0f2a9dee5a 100644
--- a/arch/um/os-Linux/util.c
+++ b/arch/um/os-Linux/util.c
@@ -14,6 +14,7 @@
#include <sys/wait.h>
#include <sys/mman.h>
#include <sys/utsname.h>
+#include <sys/random.h>
#include <init.h>
#include <os.h>
@@ -96,6 +97,11 @@ static inline void __attribute__ ((noreturn)) uml_abort(void)
exit(127);
}
+ssize_t os_getrandom(void *buf, size_t len, unsigned int flags)
+{
+ return getrandom(buf, len, flags);
+}
+
/*
* UML helper threads must not handle SIGWINCH/INT/TERM
*/
Syzkaller has reported a warning in iomap_iter(), which got
triggered due to a call to iomap_iter_done() which has:
WARN_ON_ONCE(iter->iomap.offset > iter->pos);
The warning was triggered because `pos` was being negative.
I was having offset = 0, pos = -2950420705881096192.
This ridiculously negative value smells of an overflow, and sure
it is.
The userspace can configure a loop using an ioctl call, wherein
a configuration of type loop_config is passed (see lo_ioctl()'s
case on line 1550 of drivers/block/loop.c). This proceeds to call
loop_configure() which in turn calls loop_set_status_from_info()
(see line 1050 of loop.c), passing &config->info which is of type
loop_info64*. This function then sets the appropriate values, like
the offset.
The problem here is loop_device has lo_offset of type loff_t
(see line 52 of loop.c), which is typdef-chained to long long,
whereas loop_info64 has lo_offset of type __u64 (see line 56 of
include/uapi/linux/loop.h).
The function directly copies offset from info to the device as
follows (See line 980 of loop.c):
lo->lo_offset = info->lo_offset;
This results in the encountered overflow (in my case, the RHS
was 15496323367828455424).
Thus, convert the type definitions in loop_info64 to their
signed counterparts in order to match definitions in loop_device,
and check for negative value during loop_set_status_from_info().
Bug report: https://syzkaller.appspot.com/bug?id=c620fe14aac810396d3c3edc9ad73848bf69a2…
Reported-and-tested-by: syzbot+a8e049cd3abd342936b6(a)syzkaller.appspotmail.com
Cc: stable(a)vger.kernel.org
Signed-off-by: Siddh Raman Pant <code(a)siddh.me>
---
Unless I am missing any other uses or quirks of UAPI loop_info64,
I think this won't introduce regression, since if something is
working, it will continue to work. If something does break, then
it was relying on overflows, which is anyways an incorrect way
to go about.
Also, it seems even the 32-bit compatibility structure uses the
signed types (compat_loop_info uses compat_int_t which is s32),
so this patch should be fine.
drivers/block/loop.c | 3 +++
include/uapi/linux/loop.h | 12 ++++++------
2 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index e3c0ba93c1a3..4ca20ce3158d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -977,6 +977,9 @@ loop_set_status_from_info(struct loop_device *lo,
return -EINVAL;
}
+ if (info->lo_offset < 0 || info->lo_sizelimit < 0)
+ return -EINVAL;
+
lo->lo_offset = info->lo_offset;
lo->lo_sizelimit = info->lo_sizelimit;
memcpy(lo->lo_file_name, info->lo_file_name, LO_NAME_SIZE);
diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h
index 6f63527dd2ed..973565f38f9d 100644
--- a/include/uapi/linux/loop.h
+++ b/include/uapi/linux/loop.h
@@ -53,12 +53,12 @@ struct loop_info64 {
__u64 lo_device; /* ioctl r/o */
__u64 lo_inode; /* ioctl r/o */
__u64 lo_rdevice; /* ioctl r/o */
- __u64 lo_offset;
- __u64 lo_sizelimit;/* bytes, 0 == max available */
- __u32 lo_number; /* ioctl r/o */
- __u32 lo_encrypt_type; /* obsolete, ignored */
- __u32 lo_encrypt_key_size; /* ioctl w/o */
- __u32 lo_flags;
+ __s64 lo_offset;
+ __s64 lo_sizelimit;/* bytes, 0 == max available */
+ __s32 lo_number; /* ioctl r/o */
+ __s32 lo_encrypt_type; /* obsolete, ignored */
+ __s32 lo_encrypt_key_size; /* ioctl w/o */
+ __s32 lo_flags;
__u8 lo_file_name[LO_NAME_SIZE];
__u8 lo_crypt_name[LO_NAME_SIZE];
__u8 lo_encrypt_key[LO_KEY_SIZE]; /* ioctl w/o */
--
2.35.1
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: fc2e426b1161761561624ebd43ce8c8d2fa058da
Gitweb: https://git.kernel.org/tip/fc2e426b1161761561624ebd43ce8c8d2fa058da
Author: Chen Zhongjin <chenzhongjin(a)huawei.com>
AuthorDate: Fri, 19 Aug 2022 16:43:34 +08:00
Committer: Ingo Molnar <mingo(a)kernel.org>
CommitterDate: Sun, 21 Aug 2022 12:19:32 +02:00
x86/unwind/orc: Unwind ftrace trampolines with correct ORC entry
When meeting ftrace trampolines in ORC unwinding, unwinder uses address
of ftrace_{regs_}call address to find the ORC entry, which gets next frame at
sp+176.
If there is an IRQ hitting at sub $0xa8,%rsp, the next frame should be
sp+8 instead of 176. It makes unwinder skip correct frame and throw
warnings such as "wrong direction" or "can't access registers", etc,
depending on the content of the incorrect frame address.
By adding the base address ftrace_{regs_}caller with the offset
*ip - ops->trampoline*, we can get the correct address to find the ORC entry.
Also change "caller" to "tramp_addr" to make variable name conform to
its content.
[ mingo: Clarified the changelog a bit. ]
Fixes: 6be7fa3c74d1 ("ftrace, orc, x86: Handle ftrace dynamically allocated trampolines")
Signed-off-by: Chen Zhongjin <chenzhongjin(a)huawei.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20220819084334.244016-1-chenzhongjin@huawei.com
---
arch/x86/kernel/unwind_orc.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 38185ae..0ea57da 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -93,22 +93,27 @@ static struct orc_entry *orc_find(unsigned long ip);
static struct orc_entry *orc_ftrace_find(unsigned long ip)
{
struct ftrace_ops *ops;
- unsigned long caller;
+ unsigned long tramp_addr, offset;
ops = ftrace_ops_trampoline(ip);
if (!ops)
return NULL;
+ /* Set tramp_addr to the start of the code copied by the trampoline */
if (ops->flags & FTRACE_OPS_FL_SAVE_REGS)
- caller = (unsigned long)ftrace_regs_call;
+ tramp_addr = (unsigned long)ftrace_regs_caller;
else
- caller = (unsigned long)ftrace_call;
+ tramp_addr = (unsigned long)ftrace_caller;
+
+ /* Now place tramp_addr to the location within the trampoline ip is at */
+ offset = ip - ops->trampoline;
+ tramp_addr += offset;
/* Prevent unlikely recursion */
- if (ip == caller)
+ if (ip == tramp_addr)
return NULL;
- return orc_find(caller);
+ return orc_find(tramp_addr);
}
#else
static struct orc_entry *orc_ftrace_find(unsigned long ip)
This is the start of the stable review cycle for the 5.19.3 release.
There are 7 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 21 Aug 2022 15:36:59 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.19.3-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.19.3-rc1
Coiby Xu <coxu(a)redhat.com>
arm64: kexec_file: use more system keyrings to verify kernel image signature
Coiby Xu <coxu(a)redhat.com>
kexec, KEYS: make the code in bzImage64_verify_sig generic
Qu Wenruo <wqu(a)suse.com>
btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()
Qu Wenruo <wqu(a)suse.com>
btrfs: only write the sectors in the vertical stripe which has data stripes
Jamal Hadi Salim <jhs(a)mojatatu.com>
net_sched: cls_route: disallow handle of 0
Jens Wiklander <jens.wiklander(a)linaro.org>
tee: add overflow check in register_shm_helper()
Marco Elver <elver(a)google.com>
Revert "mm: kfence: apply kmemleak_ignore_phys on early allocated pool"
-------------
Diffstat:
Makefile | 4 +--
arch/arm64/kernel/kexec_image.c | 11 +------
arch/x86/kernel/kexec-bzimage64.c | 20 +-----------
drivers/tee/tee_shm.c | 3 ++
fs/btrfs/raid56.c | 68 +++++++++++++++++++++++++++++++--------
include/linux/kexec.h | 7 ++++
kernel/kexec_file.c | 17 ++++++++++
mm/kfence/core.c | 18 +++++------
net/sched/cls_route.c | 10 ++++++
9 files changed, 105 insertions(+), 53 deletions(-)
--
Hello,
We the Board Directors believe you are in good health, doing great and
with the hope that this mail will meet you in good condition, We are
privileged and delighted to reach you via email" And we are urgently
waiting to hear from you. and again your number is not connecting.
My regards,
Dr. KAbore Aouessian
Sincerely,
Prof. Chin Guang
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Make filtering consistent with histograms. As "cpu" can be a field of an
event, allow for "common_cpu" to keep it from being confused with the
"cpu" field of the event.
Link: https://lore.kernel.org/all/20220820220920.e42fa32b70505b1904f0a0ad@kernel.…
Cc: stable(a)vger.kernel.org
Fixes: 1e3bac71c5053 ("tracing/histogram: Rename "cpu" to "common_cpu"")
Suggested-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_events.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 181f08186d32..0356cae0cf74 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -176,6 +176,7 @@ static int trace_define_generic_fields(void)
__generic_field(int, CPU, FILTER_CPU);
__generic_field(int, cpu, FILTER_CPU);
+ __generic_field(int, common_cpu, FILTER_CPU);
__generic_field(char *, COMM, FILTER_COMM);
__generic_field(char *, comm, FILTER_COMM);
--
2.35.1
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Both $comm and $COMM can be used to get current->comm in eprobes and the
filtering and histogram logic. Make kprobes and uprobes consistent in this
regard and allow both $comm and $COMM as well. Currently kprobes and
uprobes only handle $comm, which is inconsistent with the other utilities,
and can be confusing to users.
Link: https://lore.kernel.org/all/20220820220442.776e1ddaf8836e82edb34d01@kernel.…
Cc: stable(a)vger.kernel.org
Fixes: 533059281ee5 ("tracing: probeevent: Introduce new argument fetching code")
Suggested-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_probe.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 4daabbb8b772..36dff277de46 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -314,7 +314,7 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t,
}
} else
goto inval_var;
- } else if (strcmp(arg, "comm") == 0) {
+ } else if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
code->op = FETCH_OP_COMM;
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
} else if (((flags & TPARG_FL_MASK) ==
@@ -625,7 +625,8 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
* we can find those by strcmp. But ignore for eprobes.
*/
if (!(flags & TPARG_FL_TPOINT) &&
- (strcmp(arg, "$comm") == 0 || strncmp(arg, "\\\"", 2) == 0)) {
+ (strcmp(arg, "$comm") == 0 || strcmp(arg, "$COMM") == 0 ||
+ strncmp(arg, "\\\"", 2) == 0)) {
/* The type of $comm must be "string", and not an array. */
if (parg->count || (t && strcmp(t, "string")))
goto out;
--
2.35.1
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Currently, if a symbol "@" is attempted to be used with an event probe
(eprobes), it will cause a NULL pointer dereference crash.
Both kprobes and uprobes can reference data other than the main registers.
Such as immediate address, symbols and the current task name. Have eprobes
do the same thing.
For "comm", if "comm" is used and the event being attached to does not
have the "comm" field, then make it the "$comm" that kprobes has. This is
consistent to the way histograms and filters work.
Cc: stable(a)vger.kernel.org
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_eprobe.c | 70 +++++++++++++++++++++++++++++++++----
1 file changed, 64 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace_eprobe.c b/kernel/trace/trace_eprobe.c
index a1d3423ab74f..1783e3478912 100644
--- a/kernel/trace/trace_eprobe.c
+++ b/kernel/trace/trace_eprobe.c
@@ -227,6 +227,7 @@ static int trace_eprobe_tp_arg_update(struct trace_eprobe *ep, int i)
struct probe_arg *parg = &ep->tp.args[i];
struct ftrace_event_field *field;
struct list_head *head;
+ int ret = -ENOENT;
head = trace_get_fields(ep->event);
list_for_each_entry(field, head, link) {
@@ -236,9 +237,20 @@ static int trace_eprobe_tp_arg_update(struct trace_eprobe *ep, int i)
return 0;
}
}
+
+ /*
+ * Argument not found on event. But allow for comm and COMM
+ * to be used to get the current->comm.
+ */
+ if (strcmp(parg->code->data, "COMM") == 0 ||
+ strcmp(parg->code->data, "comm") == 0) {
+ parg->code->op = FETCH_OP_COMM;
+ ret = 0;
+ }
+
kfree(parg->code->data);
parg->code->data = NULL;
- return -ENOENT;
+ return ret;
}
static int eprobe_event_define_fields(struct trace_event_call *event_call)
@@ -363,16 +375,38 @@ static unsigned long get_event_field(struct fetch_insn *code, void *rec)
static int get_eprobe_size(struct trace_probe *tp, void *rec)
{
+ struct fetch_insn *code;
struct probe_arg *arg;
int i, len, ret = 0;
for (i = 0; i < tp->nr_args; i++) {
arg = tp->args + i;
- if (unlikely(arg->dynamic)) {
+ if (arg->dynamic) {
unsigned long val;
- val = get_event_field(arg->code, rec);
- len = process_fetch_insn_bottom(arg->code + 1, val, NULL, NULL);
+ code = arg->code;
+ retry:
+ switch (code->op) {
+ case FETCH_OP_TP_ARG:
+ val = get_event_field(code, rec);
+ break;
+ case FETCH_OP_IMM:
+ val = code->immediate;
+ break;
+ case FETCH_OP_COMM:
+ val = (unsigned long)current->comm;
+ break;
+ case FETCH_OP_DATA:
+ val = (unsigned long)code->data;
+ break;
+ case FETCH_NOP_SYMBOL: /* Ignore a place holder */
+ code++;
+ goto retry;
+ default:
+ continue;
+ }
+ code++;
+ len = process_fetch_insn_bottom(code, val, NULL, NULL);
if (len > 0)
ret += len;
}
@@ -390,8 +424,28 @@ process_fetch_insn(struct fetch_insn *code, void *rec, void *dest,
{
unsigned long val;
- val = get_event_field(code, rec);
- return process_fetch_insn_bottom(code + 1, val, dest, base);
+ retry:
+ switch (code->op) {
+ case FETCH_OP_TP_ARG:
+ val = get_event_field(code, rec);
+ break;
+ case FETCH_OP_IMM:
+ val = code->immediate;
+ break;
+ case FETCH_OP_COMM:
+ val = (unsigned long)current->comm;
+ break;
+ case FETCH_OP_DATA:
+ val = (unsigned long)code->data;
+ break;
+ case FETCH_NOP_SYMBOL: /* Ignore a place holder */
+ code++;
+ goto retry;
+ default:
+ return -EILSEQ;
+ }
+ code++;
+ return process_fetch_insn_bottom(code, val, dest, base);
}
NOKPROBE_SYMBOL(process_fetch_insn)
@@ -866,6 +920,10 @@ static int trace_eprobe_tp_update_arg(struct trace_eprobe *ep, const char *argv[
trace_probe_log_err(0, BAD_ATTACH_ARG);
}
+ /* Handle symbols "@" */
+ if (!ret)
+ ret = traceprobe_update_arg(&ep->tp.args[i]);
+
return ret;
}
--
2.35.1
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Currently when an event probe (eprobe) hooks to a string field, it does
not display it as a string, but instead as a number. This makes the field
rather useless. Handle the different kinds of strings, dynamic, static,
relational/dynamic etc.
Now when a string field is used, the ":string" type can be used to display
it:
echo "e:sw sched/sched_switch comm=$next_comm:string" > dynamic_events
Cc: stable(a)vger.kernel.org
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_eprobe.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/kernel/trace/trace_eprobe.c b/kernel/trace/trace_eprobe.c
index 550671985fd1..a1d3423ab74f 100644
--- a/kernel/trace/trace_eprobe.c
+++ b/kernel/trace/trace_eprobe.c
@@ -311,6 +311,27 @@ static unsigned long get_event_field(struct fetch_insn *code, void *rec)
addr = rec + field->offset;
+ if (is_string_field(field)) {
+ switch (field->filter_type) {
+ case FILTER_DYN_STRING:
+ val = (unsigned long)(rec + (*(unsigned int *)addr & 0xffff));
+ break;
+ case FILTER_RDYN_STRING:
+ val = (unsigned long)(addr + (*(unsigned int *)addr & 0xffff));
+ break;
+ case FILTER_STATIC_STRING:
+ val = (unsigned long)addr;
+ break;
+ case FILTER_PTR_STRING:
+ val = (unsigned long)(*(char *)addr);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return 0;
+ }
+ return val;
+ }
+
switch (field->size) {
case 1:
if (field->is_signed)
--
2.35.1
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
The variable $comm is hard coded as a string, which is true for both
kprobes and uprobes, but for event probes (eprobes) it is a field name. In
most cases the "comm" field would be a string, but there's no guarantee of
that fact.
Do not assume that comm is a string. Not to mention, it currently forces
comm fields to fault, as string processing for event probes is currently
broken.
Cc: stable(a)vger.kernel.org
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_probe.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index dec657af363c..4daabbb8b772 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -622,9 +622,10 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
/*
* Since $comm and immediate string can not be dereferenced,
- * we can find those by strcmp.
+ * we can find those by strcmp. But ignore for eprobes.
*/
- if (strcmp(arg, "$comm") == 0 || strncmp(arg, "\\\"", 2) == 0) {
+ if (!(flags & TPARG_FL_TPOINT) &&
+ (strcmp(arg, "$comm") == 0 || strncmp(arg, "\\\"", 2) == 0)) {
/* The type of $comm must be "string", and not an array. */
if (parg->count || (t && strcmp(t, "string")))
goto out;
--
2.35.1
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
If in perf_trace_event_init(), the perf_trace_event_open() fails, then it
will call perf_trace_event_unreg() which will not only unregister the perf
trace event, but will also call the put() function of the tp_event.
The problem here is that the trace_event_try_get_ref() is called by the
caller of perf_trace_event_init() and if perf_trace_event_init() returns a
failure, it will then call trace_event_put(). But since the
perf_trace_event_unreg() already called the trace_event_put() function, it
triggers a WARN_ON().
WARNING: CPU: 1 PID: 30309 at kernel/trace/trace_dynevent.c:46 trace_event_dyn_put_ref+0x15/0x20
If perf_trace_event_reg() does not call the trace_event_try_get_ref() then
the perf_trace_event_unreg() should not be calling trace_event_put(). This
breaks symmetry and causes bugs like these.
Pull out the trace_event_put() from perf_trace_event_unreg() and call it
in the locations that perf_trace_event_unreg() is called. This not only
fixes this bug, but also brings back the proper symmetry of the reg/unreg
vs get/put logic.
Link: https://lore.kernel.org/all/cover.1660347763.git.kjlx@templeofstupid.com/
Link: https://lkml.kernel.org/r/20220816192817.43d5e17f@gandalf.local.home
Cc: stable(a)vger.kernel.org
Fixes: 1d18538e6a092 ("tracing: Have dynamic events have a ref counter")
Reported-by: Krister Johansen <kjlx(a)templeofstupid.com>
Reviewed-by: Krister Johansen <kjlx(a)templeofstupid.com>
Tested-by: Krister Johansen <kjlx(a)templeofstupid.com>
Acked-by: Jiri Olsa <jolsa(a)kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_event_perf.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index a114549720d6..61e3a2620fa3 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -157,7 +157,7 @@ static void perf_trace_event_unreg(struct perf_event *p_event)
int i;
if (--tp_event->perf_refcount > 0)
- goto out;
+ return;
tp_event->class->reg(tp_event, TRACE_REG_PERF_UNREGISTER, NULL);
@@ -176,8 +176,6 @@ static void perf_trace_event_unreg(struct perf_event *p_event)
perf_trace_buf[i] = NULL;
}
}
-out:
- trace_event_put_ref(tp_event);
}
static int perf_trace_event_open(struct perf_event *p_event)
@@ -241,6 +239,7 @@ void perf_trace_destroy(struct perf_event *p_event)
mutex_lock(&event_mutex);
perf_trace_event_close(p_event);
perf_trace_event_unreg(p_event);
+ trace_event_put_ref(p_event->tp_event);
mutex_unlock(&event_mutex);
}
@@ -292,6 +291,7 @@ void perf_kprobe_destroy(struct perf_event *p_event)
mutex_lock(&event_mutex);
perf_trace_event_close(p_event);
perf_trace_event_unreg(p_event);
+ trace_event_put_ref(p_event->tp_event);
mutex_unlock(&event_mutex);
destroy_local_trace_kprobe(p_event->tp_event);
@@ -347,6 +347,7 @@ void perf_uprobe_destroy(struct perf_event *p_event)
mutex_lock(&event_mutex);
perf_trace_event_close(p_event);
perf_trace_event_unreg(p_event);
+ trace_event_put_ref(p_event->tp_event);
mutex_unlock(&event_mutex);
destroy_local_trace_uprobe(p_event->tp_event);
}
--
2.35.1
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7ef3d06f1bc4a5e62273726f3dc2bd258ae1c71f Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 28 Jul 2022 00:32:18 +1000
Subject: [PATCH] powerpc/powernv/kvm: Use darn for H_RANDOM on Power9
The existing logic in KVM to support guests calling H_RANDOM only works
on Power8, because it looks for an RNG in the device tree, but on Power9
we just use darn.
In addition the existing code needs to work in real mode, so we have the
special cased powernv_get_random_real_mode() to deal with that.
Instead just have KVM call ppc_md.get_random_seed(), and do the real
mode check inside of there, that way we use whatever RNG is available,
including darn on Power9.
Fixes: e928e9cb3601 ("KVM: PPC: Book3S HV: Add fast real-mode H_RANDOM implementation.")
Cc: stable(a)vger.kernel.org # v4.1+
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
Tested-by: Sachin Sant <sachinp(a)linux.ibm.com>
[mpe: Rebase on previous commit, update change log appropriately]
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20220727143219.2684192-2-mpe@ellerman.id.au
diff --git a/arch/powerpc/include/asm/archrandom.h b/arch/powerpc/include/asm/archrandom.h
index 9a53e29680f4..258174304904 100644
--- a/arch/powerpc/include/asm/archrandom.h
+++ b/arch/powerpc/include/asm/archrandom.h
@@ -38,12 +38,7 @@ static inline bool __must_check arch_get_random_seed_int(unsigned int *v)
#endif /* CONFIG_ARCH_RANDOM */
#ifdef CONFIG_PPC_POWERNV
-int powernv_hwrng_present(void);
int powernv_get_random_long(unsigned long *v);
-int powernv_get_random_real_mode(unsigned long *v);
-#else
-static inline int powernv_hwrng_present(void) { return 0; }
-static inline int powernv_get_random_real_mode(unsigned long *v) { return 0; }
#endif
#endif /* _ASM_POWERPC_ARCHRANDOM_H */
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index 88a8f6473c4e..3abaef5f9ac2 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -19,7 +19,7 @@
#include <asm/interrupt.h>
#include <asm/kvm_ppc.h>
#include <asm/kvm_book3s.h>
-#include <asm/archrandom.h>
+#include <asm/machdep.h>
#include <asm/xics.h>
#include <asm/xive.h>
#include <asm/dbell.h>
@@ -176,13 +176,14 @@ EXPORT_SYMBOL_GPL(kvmppc_hcall_impl_hv_realmode);
int kvmppc_hwrng_present(void)
{
- return powernv_hwrng_present();
+ return ppc_md.get_random_seed != NULL;
}
EXPORT_SYMBOL_GPL(kvmppc_hwrng_present);
long kvmppc_rm_h_random(struct kvm_vcpu *vcpu)
{
- if (powernv_get_random_real_mode(&vcpu->arch.regs.gpr[4]))
+ if (ppc_md.get_random_seed &&
+ ppc_md.get_random_seed(&vcpu->arch.regs.gpr[4]))
return H_SUCCESS;
return H_HARDWARE;
diff --git a/arch/powerpc/platforms/powernv/rng.c b/arch/powerpc/platforms/powernv/rng.c
index 2287c9cd0cd5..d19305292e1e 100644
--- a/arch/powerpc/platforms/powernv/rng.c
+++ b/arch/powerpc/platforms/powernv/rng.c
@@ -29,15 +29,6 @@ struct powernv_rng {
static DEFINE_PER_CPU(struct powernv_rng *, powernv_rng);
-int powernv_hwrng_present(void)
-{
- struct powernv_rng *rng;
-
- rng = get_cpu_var(powernv_rng);
- put_cpu_var(rng);
- return rng != NULL;
-}
-
static unsigned long rng_whiten(struct powernv_rng *rng, unsigned long val)
{
unsigned long parity;
@@ -58,19 +49,6 @@ static unsigned long rng_whiten(struct powernv_rng *rng, unsigned long val)
return val;
}
-int powernv_get_random_real_mode(unsigned long *v)
-{
- struct powernv_rng *rng;
-
- rng = raw_cpu_read(powernv_rng);
- if (!rng)
- return 0;
-
- *v = rng_whiten(rng, __raw_rm_readq(rng->regs_real));
-
- return 1;
-}
-
static int powernv_get_random_darn(unsigned long *v)
{
unsigned long val;
@@ -107,12 +85,14 @@ int powernv_get_random_long(unsigned long *v)
{
struct powernv_rng *rng;
- rng = get_cpu_var(powernv_rng);
-
- *v = rng_whiten(rng, in_be64(rng->regs));
-
- put_cpu_var(rng);
-
+ if (mfmsr() & MSR_DR) {
+ rng = get_cpu_var(powernv_rng);
+ *v = rng_whiten(rng, in_be64(rng->regs));
+ put_cpu_var(rng);
+ } else {
+ rng = raw_cpu_read(powernv_rng);
+ *v = rng_whiten(rng, __raw_rm_readq(rng->regs_real));
+ }
return 1;
}
EXPORT_SYMBOL_GPL(powernv_get_random_long);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0d519cadf75184a24313568e7f489a7fc9b1be3b Mon Sep 17 00:00:00 2001
From: Coiby Xu <coxu(a)redhat.com>
Date: Thu, 14 Jul 2022 21:40:26 +0800
Subject: [PATCH] arm64: kexec_file: use more system keyrings to verify kernel
image signature
Currently, when loading a kernel image via the kexec_file_load() system
call, arm64 can only use the .builtin_trusted_keys keyring to verify
a signature whereas x86 can use three more keyrings i.e.
.secondary_trusted_keys, .machine and .platform keyrings. For example,
one resulting problem is kexec'ing a kernel image would be rejected
with the error "Lockdown: kexec: kexec of unsigned images is restricted;
see man kernel_lockdown.7".
This patch set enables arm64 to make use of the same keyrings as x86 to
verify the signature kexec'ed kernel image.
Fixes: 732b7b93d849 ("arm64: kexec_file: add kernel signature verification support")
Cc: stable(a)vger.kernel.org # 105e10e2cf1c: kexec_file: drop weak attribute from functions
Cc: stable(a)vger.kernel.org # 34d5960af253: kexec: clean up arch_kexec_kernel_verify_sig
Cc: stable(a)vger.kernel.org # 83b7bb2d49ae: kexec, KEYS: make the code in bzImage64_verify_sig generic
Acked-by: Baoquan He <bhe(a)redhat.com>
Cc: kexec(a)lists.infradead.org
Cc: keyrings(a)vger.kernel.org
Cc: linux-security-module(a)vger.kernel.org
Co-developed-by: Michal Suchanek <msuchanek(a)suse.de>
Signed-off-by: Michal Suchanek <msuchanek(a)suse.de>
Acked-by: Will Deacon <will(a)kernel.org>
Signed-off-by: Coiby Xu <coxu(a)redhat.com>
Signed-off-by: Mimi Zohar <zohar(a)linux.ibm.com>
diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c
index 9ec34690e255..5ed6a585f21f 100644
--- a/arch/arm64/kernel/kexec_image.c
+++ b/arch/arm64/kernel/kexec_image.c
@@ -14,7 +14,6 @@
#include <linux/kexec.h>
#include <linux/pe.h>
#include <linux/string.h>
-#include <linux/verification.h>
#include <asm/byteorder.h>
#include <asm/cpufeature.h>
#include <asm/image.h>
@@ -130,18 +129,10 @@ static void *image_load(struct kimage *image,
return NULL;
}
-#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG
-static int image_verify_sig(const char *kernel, unsigned long kernel_len)
-{
- return verify_pefile_signature(kernel, kernel_len, NULL,
- VERIFYING_KEXEC_PE_SIGNATURE);
-}
-#endif
-
const struct kexec_file_ops kexec_image_ops = {
.probe = image_probe,
.load = image_load,
#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG
- .verify_sig = image_verify_sig,
+ .verify_sig = kexec_kernel_verify_pe_sig,
#endif
};
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 689a71493bd2f31c024f8c0395f85a1fd4b2138e Mon Sep 17 00:00:00 2001
From: Coiby Xu <coxu(a)redhat.com>
Date: Thu, 14 Jul 2022 21:40:24 +0800
Subject: [PATCH] kexec: clean up arch_kexec_kernel_verify_sig
Before commit 105e10e2cf1c ("kexec_file: drop weak attribute from
functions"), there was already no arch-specific implementation
of arch_kexec_kernel_verify_sig. With weak attribute dropped by that
commit, arch_kexec_kernel_verify_sig is completely useless. So clean it
up.
Note later patches are dependent on this patch so it should be backported
to the stable tree as well.
Cc: stable(a)vger.kernel.org
Suggested-by: Eric W. Biederman <ebiederm(a)xmission.com>
Reviewed-by: Michal Suchanek <msuchanek(a)suse.de>
Acked-by: Baoquan He <bhe(a)redhat.com>
Signed-off-by: Coiby Xu <coxu(a)redhat.com>
[zohar(a)linux.ibm.com: reworded patch description "Note"]
Link: https://lore.kernel.org/linux-integrity/20220714134027.394370-1-coxu@redhat…
Signed-off-by: Mimi Zohar <zohar(a)linux.ibm.com>
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 8107606ad1e8..7f710fb3712b 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -212,11 +212,6 @@ static inline void *arch_kexec_kernel_image_load(struct kimage *image)
}
#endif
-#ifdef CONFIG_KEXEC_SIG
-int arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
- unsigned long buf_len);
-#endif
-
extern int kexec_add_buffer(struct kexec_buf *kbuf);
int kexec_locate_mem_hole(struct kexec_buf *kbuf);
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 0c27c81351ee..6dc1294c90fc 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -81,24 +81,6 @@ int kexec_image_post_load_cleanup_default(struct kimage *image)
return image->fops->cleanup(image->image_loader_data);
}
-#ifdef CONFIG_KEXEC_SIG
-static int kexec_image_verify_sig_default(struct kimage *image, void *buf,
- unsigned long buf_len)
-{
- if (!image->fops || !image->fops->verify_sig) {
- pr_debug("kernel loader does not support signature verification.\n");
- return -EKEYREJECTED;
- }
-
- return image->fops->verify_sig(buf, buf_len);
-}
-
-int arch_kexec_kernel_verify_sig(struct kimage *image, void *buf, unsigned long buf_len)
-{
- return kexec_image_verify_sig_default(image, buf, buf_len);
-}
-#endif
-
/*
* Free up memory used by kernel, initrd, and command line. This is temporary
* memory allocation which is not needed any more after these buffers have
@@ -141,13 +123,24 @@ void kimage_file_post_load_cleanup(struct kimage *image)
}
#ifdef CONFIG_KEXEC_SIG
+static int kexec_image_verify_sig(struct kimage *image, void *buf,
+ unsigned long buf_len)
+{
+ if (!image->fops || !image->fops->verify_sig) {
+ pr_debug("kernel loader does not support signature verification.\n");
+ return -EKEYREJECTED;
+ }
+
+ return image->fops->verify_sig(buf, buf_len);
+}
+
static int
kimage_validate_signature(struct kimage *image)
{
int ret;
- ret = arch_kexec_kernel_verify_sig(image, image->kernel_buf,
- image->kernel_buf_len);
+ ret = kexec_image_verify_sig(image, image->kernel_buf,
+ image->kernel_buf_len);
if (ret) {
if (sig_enforce) {
This is the start of the stable review cycle for the 5.15.62 release.
There are 14 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 21 Aug 2022 15:36:59 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.62-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.15.62-rc1
Coiby Xu <coxu(a)redhat.com>
arm64: kexec_file: use more system keyrings to verify kernel image signature
Coiby Xu <coxu(a)redhat.com>
kexec, KEYS: make the code in bzImage64_verify_sig generic
Coiby Xu <coxu(a)redhat.com>
kexec: clean up arch_kexec_kernel_verify_sig
Naveen N. Rao <naveen.n.rao(a)linux.vnet.ibm.com>
kexec_file: drop weak attribute from functions
Qu Wenruo <wqu(a)suse.com>
btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()
Qu Wenruo <wqu(a)suse.com>
btrfs: only write the sectors in the vertical stripe which has data stripes
Peter Zijlstra <peterz(a)infradead.org>
x86/ftrace: Use alternative RET encoding
Peter Zijlstra <peterz(a)infradead.org>
x86/ibt,ftrace: Make function-graph play nice
Thadeu Lima de Souza Cascardo <cascardo(a)canonical.com>
Revert "x86/ftrace: Use alternative RET encoding"
Namjae Jeon <linkinjeon(a)kernel.org>
ksmbd: fix heap-based overflow in set_ntacl_dacl()
Hyunchul Lee <hyc.lee(a)gmail.com>
ksmbd: prevent out of bound read for SMB2_WRITE
Jamal Hadi Salim <jhs(a)mojatatu.com>
net_sched: cls_route: disallow handle of 0
Jens Wiklander <jens.wiklander(a)linaro.org>
tee: add overflow check in register_shm_helper()
Jens Axboe <axboe(a)kernel.dk>
io_uring: use original request task for inflight tracking
-------------
Diffstat:
Makefile | 4 +-
arch/arm64/include/asm/kexec.h | 4 +-
arch/arm64/kernel/kexec_image.c | 11 +---
arch/powerpc/include/asm/kexec.h | 9 +++
arch/s390/include/asm/kexec.h | 3 +
arch/x86/include/asm/kexec.h | 6 ++
arch/x86/kernel/ftrace.c | 7 +-
arch/x86/kernel/ftrace_64.S | 19 ++++--
arch/x86/kernel/kexec-bzimage64.c | 20 +-----
drivers/tee/tee_shm.c | 3 +
fs/btrfs/raid56.c | 74 +++++++++++++++++-----
fs/io_uring.c | 2 +-
fs/ksmbd/smb2misc.c | 7 +-
fs/ksmbd/smb2pdu.c | 45 +++++++------
fs/ksmbd/smbacl.c | 130 ++++++++++++++++++++++++++------------
fs/ksmbd/smbacl.h | 2 +-
fs/ksmbd/vfs.c | 5 ++
include/linux/kexec.h | 50 ++++++++++++---
kernel/kexec_file.c | 83 +++++++++---------------
net/sched/cls_route.c | 10 +++
20 files changed, 312 insertions(+), 182 deletions(-)
-------------------
NOTE, this is the LAST 5.18.y stable release. This tree will be
end-of-life after this one. Please move to 5.19.y at this point in time
or let us know why that is not possible.
-------------------
This is the start of the stable review cycle for the 5.18.19 release.
There are 6 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 21 Aug 2022 15:36:59 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.18.19-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.18.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.18.19-rc1
Coiby Xu <coxu(a)redhat.com>
arm64: kexec_file: use more system keyrings to verify kernel image signature
Coiby Xu <coxu(a)redhat.com>
kexec, KEYS: make the code in bzImage64_verify_sig generic
Qu Wenruo <wqu(a)suse.com>
btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()
Qu Wenruo <wqu(a)suse.com>
btrfs: only write the sectors in the vertical stripe which has data stripes
Jamal Hadi Salim <jhs(a)mojatatu.com>
net_sched: cls_route: disallow handle of 0
Jens Wiklander <jens.wiklander(a)linaro.org>
tee: add overflow check in register_shm_helper()
-------------
Diffstat:
Makefile | 4 +--
arch/arm64/kernel/kexec_image.c | 11 +-----
arch/x86/kernel/kexec-bzimage64.c | 20 +----------
drivers/tee/tee_shm.c | 3 ++
fs/btrfs/raid56.c | 74 ++++++++++++++++++++++++++++++---------
include/linux/kexec.h | 7 ++++
kernel/kexec_file.c | 17 +++++++++
net/sched/cls_route.c | 10 ++++++
8 files changed, 98 insertions(+), 48 deletions(-)
From: Gil Fine <gil.fine(a)intel.com>
[ Upstream commit 5fd6b9a5cbe63fea4c490fee8af34144a139a266 ]
In case of uni-directional time sync, TMU handshake is
initiated by upstream router. In case of bi-directional
time sync, TMU handshake is initiated by downstream router.
In order to handle correctly the case of uni-directional mode,
we avoid changing the upstream router's rate to off,
because it might have another downstream router plugged that is set to
uni-directional mode (and we don't want to change its mode).
Instead, we always change downstream router's rate.
Signed-off-by: Gil Fine <gil.fine(a)intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg(a)linux.intel.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/thunderbolt/tmu.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/thunderbolt/tmu.c b/drivers/thunderbolt/tmu.c
index e4a07a26f693..93ba1d00335b 100644
--- a/drivers/thunderbolt/tmu.c
+++ b/drivers/thunderbolt/tmu.c
@@ -359,13 +359,14 @@ int tb_switch_tmu_disable(struct tb_switch *sw)
* In case of uni-directional time sync, TMU handshake is
* initiated by upstream router. In case of bi-directional
* time sync, TMU handshake is initiated by downstream router.
- * Therefore, we change the rate to off in the respective
- * router.
+ * We change downstream router's rate to off for both uni/bidir
+ * cases although it is needed only for the bi-directional mode.
+ * We avoid changing upstream router's mode since it might
+ * have another downstream router plugged, that is set to
+ * uni-directional mode and we don't want to change it's TMU
+ * mode.
*/
- if (unidirectional)
- tb_switch_tmu_rate_write(parent, TB_SWITCH_TMU_RATE_OFF);
- else
- tb_switch_tmu_rate_write(sw, TB_SWITCH_TMU_RATE_OFF);
+ tb_switch_tmu_rate_write(sw, TB_SWITCH_TMU_RATE_OFF);
tb_port_tmu_time_sync_disable(up);
ret = tb_port_tmu_time_sync_disable(down);
--
2.35.1
From: Guenter Roeck <linux(a)roeck-us.net>
[ Upstream commit 0cc011c576aaa4de505046f7a6c90933d7c749a9 ]
In some circumstances, attempts are made to add entries to or to remove
entries from an uninitialized list. A prime example is
amdgpu_bo_vm_destroy(): It is indirectly called from
ttm_bo_init_reserved() if that function fails, and tries to remove an
entry from a list. However, that list is only initialized in
amdgpu_bo_create_vm() after the call to ttm_bo_init_reserved() returned
success. This results in crashes such as
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 1479 Comm: chrome Not tainted 5.10.110-15768-g29a72e65dae5
Hardware name: Google Grunt/Grunt, BIOS Google_Grunt.11031.149.0 07/15/2020
RIP: 0010:__list_del_entry_valid+0x26/0x7d
...
Call Trace:
amdgpu_bo_vm_destroy+0x48/0x8b
ttm_bo_init_reserved+0x1d7/0x1e0
amdgpu_bo_create+0x212/0x476
? amdgpu_bo_user_destroy+0x23/0x23
? kmem_cache_alloc+0x60/0x271
amdgpu_bo_create_vm+0x40/0x7d
amdgpu_vm_pt_create+0xe8/0x24b
...
Check if the list's prev and next pointers are NULL to catch such problems.
Link: https://lkml.kernel.org/r/20220531222951.92073-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux(a)roeck-us.net>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
lib/list_debug.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/lib/list_debug.c b/lib/list_debug.c
index 9daa3fb9d1cd..d98d43f80958 100644
--- a/lib/list_debug.c
+++ b/lib/list_debug.c
@@ -20,7 +20,11 @@
bool __list_add_valid(struct list_head *new, struct list_head *prev,
struct list_head *next)
{
- if (CHECK_DATA_CORRUPTION(next->prev != prev,
+ if (CHECK_DATA_CORRUPTION(prev == NULL,
+ "list_add corruption. prev is NULL.\n") ||
+ CHECK_DATA_CORRUPTION(next == NULL,
+ "list_add corruption. next is NULL.\n") ||
+ CHECK_DATA_CORRUPTION(next->prev != prev,
"list_add corruption. next->prev should be prev (%px), but was %px. (next=%px).\n",
prev, next->prev, next) ||
CHECK_DATA_CORRUPTION(prev->next != next,
@@ -42,7 +46,11 @@ bool __list_del_entry_valid(struct list_head *entry)
prev = entry->prev;
next = entry->next;
- if (CHECK_DATA_CORRUPTION(next == LIST_POISON1,
+ if (CHECK_DATA_CORRUPTION(next == NULL,
+ "list_del corruption, %px->next is NULL\n", entry) ||
+ CHECK_DATA_CORRUPTION(prev == NULL,
+ "list_del corruption, %px->prev is NULL\n", entry) ||
+ CHECK_DATA_CORRUPTION(next == LIST_POISON1,
"list_del corruption, %px->next is LIST_POISON1 (%px)\n",
entry, LIST_POISON1) ||
CHECK_DATA_CORRUPTION(prev == LIST_POISON2,
--
2.35.1
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Currently, if a symbol "@" is attempted to be used with an event probe
(eprobes), it will cause a NULL pointer dereference crash.
Both kprobes and uprobes can reference data other than the main registers.
Such as immediate address, symbols and the current task name. Have eprobes
do the same thing.
For "comm", if "comm" is used and the event being attached to does not
have the "comm" field, then make it the "$comm" that kprobes has. This is
consistent to the way histograms and filters work.
Cc: stable(a)vger.kernel.org
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_eprobe.c | 70 +++++++++++++++++++++++++++++++++----
1 file changed, 64 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace_eprobe.c b/kernel/trace/trace_eprobe.c
index a1d3423ab74f..1783e3478912 100644
--- a/kernel/trace/trace_eprobe.c
+++ b/kernel/trace/trace_eprobe.c
@@ -227,6 +227,7 @@ static int trace_eprobe_tp_arg_update(struct trace_eprobe *ep, int i)
struct probe_arg *parg = &ep->tp.args[i];
struct ftrace_event_field *field;
struct list_head *head;
+ int ret = -ENOENT;
head = trace_get_fields(ep->event);
list_for_each_entry(field, head, link) {
@@ -236,9 +237,20 @@ static int trace_eprobe_tp_arg_update(struct trace_eprobe *ep, int i)
return 0;
}
}
+
+ /*
+ * Argument not found on event. But allow for comm and COMM
+ * to be used to get the current->comm.
+ */
+ if (strcmp(parg->code->data, "COMM") == 0 ||
+ strcmp(parg->code->data, "comm") == 0) {
+ parg->code->op = FETCH_OP_COMM;
+ ret = 0;
+ }
+
kfree(parg->code->data);
parg->code->data = NULL;
- return -ENOENT;
+ return ret;
}
static int eprobe_event_define_fields(struct trace_event_call *event_call)
@@ -363,16 +375,38 @@ static unsigned long get_event_field(struct fetch_insn *code, void *rec)
static int get_eprobe_size(struct trace_probe *tp, void *rec)
{
+ struct fetch_insn *code;
struct probe_arg *arg;
int i, len, ret = 0;
for (i = 0; i < tp->nr_args; i++) {
arg = tp->args + i;
- if (unlikely(arg->dynamic)) {
+ if (arg->dynamic) {
unsigned long val;
- val = get_event_field(arg->code, rec);
- len = process_fetch_insn_bottom(arg->code + 1, val, NULL, NULL);
+ code = arg->code;
+ retry:
+ switch (code->op) {
+ case FETCH_OP_TP_ARG:
+ val = get_event_field(code, rec);
+ break;
+ case FETCH_OP_IMM:
+ val = code->immediate;
+ break;
+ case FETCH_OP_COMM:
+ val = (unsigned long)current->comm;
+ break;
+ case FETCH_OP_DATA:
+ val = (unsigned long)code->data;
+ break;
+ case FETCH_NOP_SYMBOL: /* Ignore a place holder */
+ code++;
+ goto retry;
+ default:
+ continue;
+ }
+ code++;
+ len = process_fetch_insn_bottom(code, val, NULL, NULL);
if (len > 0)
ret += len;
}
@@ -390,8 +424,28 @@ process_fetch_insn(struct fetch_insn *code, void *rec, void *dest,
{
unsigned long val;
- val = get_event_field(code, rec);
- return process_fetch_insn_bottom(code + 1, val, dest, base);
+ retry:
+ switch (code->op) {
+ case FETCH_OP_TP_ARG:
+ val = get_event_field(code, rec);
+ break;
+ case FETCH_OP_IMM:
+ val = code->immediate;
+ break;
+ case FETCH_OP_COMM:
+ val = (unsigned long)current->comm;
+ break;
+ case FETCH_OP_DATA:
+ val = (unsigned long)code->data;
+ break;
+ case FETCH_NOP_SYMBOL: /* Ignore a place holder */
+ code++;
+ goto retry;
+ default:
+ return -EILSEQ;
+ }
+ code++;
+ return process_fetch_insn_bottom(code, val, dest, base);
}
NOKPROBE_SYMBOL(process_fetch_insn)
@@ -866,6 +920,10 @@ static int trace_eprobe_tp_update_arg(struct trace_eprobe *ep, const char *argv[
trace_probe_log_err(0, BAD_ATTACH_ARG);
}
+ /* Handle symbols "@" */
+ if (!ret)
+ ret = traceprobe_update_arg(&ep->tp.args[i]);
+
return ret;
}
--
2.35.1
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Currently when an event probe (eprobe) hooks to a string field, it does
not display it as a string, but instead as a number. This makes the field
rather useless. Handle the different kinds of strings, dynamic, static,
relational/dynamic etc.
Now when a string field is used, the ":string" type can be used to display
it:
echo "e:sw sched/sched_switch comm=$next_comm:string" > dynamic_events
Cc: stable(a)vger.kernel.org
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_eprobe.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/kernel/trace/trace_eprobe.c b/kernel/trace/trace_eprobe.c
index 550671985fd1..a1d3423ab74f 100644
--- a/kernel/trace/trace_eprobe.c
+++ b/kernel/trace/trace_eprobe.c
@@ -311,6 +311,27 @@ static unsigned long get_event_field(struct fetch_insn *code, void *rec)
addr = rec + field->offset;
+ if (is_string_field(field)) {
+ switch (field->filter_type) {
+ case FILTER_DYN_STRING:
+ val = (unsigned long)(rec + (*(unsigned int *)addr & 0xffff));
+ break;
+ case FILTER_RDYN_STRING:
+ val = (unsigned long)(addr + (*(unsigned int *)addr & 0xffff));
+ break;
+ case FILTER_STATIC_STRING:
+ val = (unsigned long)addr;
+ break;
+ case FILTER_PTR_STRING:
+ val = (unsigned long)(*(char *)addr);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return 0;
+ }
+ return val;
+ }
+
switch (field->size) {
case 1:
if (field->is_signed)
--
2.35.1
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
The variable $comm is hard coded as a string, which is true for both
kprobes and uprobes, but for event probes (eprobes) it is a field name. In
most cases the "comm" field would be a string, but there's no guarantee of
that fact.
Do not assume that comm is a string. Not to mention, it currently forces
comm fields to fault, as string processing for event probes is currently
broken.
Cc: stable(a)vger.kernel.org
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_probe.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index dec657af363c..4daabbb8b772 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -622,9 +622,10 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
/*
* Since $comm and immediate string can not be dereferenced,
- * we can find those by strcmp.
+ * we can find those by strcmp. But ignore for eprobes.
*/
- if (strcmp(arg, "$comm") == 0 || strncmp(arg, "\\\"", 2) == 0) {
+ if (!(flags & TPARG_FL_TPOINT) &&
+ (strcmp(arg, "$comm") == 0 || strncmp(arg, "\\\"", 2) == 0)) {
/* The type of $comm must be "string", and not an array. */
if (parg->count || (t && strcmp(t, "string")))
goto out;
--
2.35.1
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
The variable $comm is hard coded as a string, which is true for both
kprobes and uprobes, but for event probes (eprobes) it is a field name. In
most cases the "comm" field would be a string, but there's no guarantee of
that fact.
Do not assume that comm is a string. Not to mention, it currently forces
comm fields to fault, as string processing for event probes is currently
broken.
Cc: stable(a)vger.kernel.org
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_probe.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index dec657af363c..23dcd52ad45c 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -622,9 +622,10 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
/*
* Since $comm and immediate string can not be dereferenced,
- * we can find those by strcmp.
+ * we can find those by strcmp. But ignore for eprobes.
*/
- if (strcmp(arg, "$comm") == 0 || strncmp(arg, "\\\"", 2) == 0) {
+ if (!(flags & TPARG_FL_TPOINT) &&
+ strcmp(arg, "$comm") == 0 || strncmp(arg, "\\\"", 2) == 0) {
/* The type of $comm must be "string", and not an array. */
if (parg->count || (t && strcmp(t, "string")))
goto out;
--
2.35.1
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Currently, if a symbol "@" is attempted to be used with an event probe
(eprobes), it will cause a NULL pointer dereference crash.
Both kprobes and uprobes can reference data other than the main registers.
Such as immediate address, symbols and the current task name. Have eprobes
do the same thing.
For "comm", if "comm" is used and the event being attached to does not
have the "comm" field, then make it the "$comm" that kprobes has. This is
consistent to the way histograms and filters work.
Cc: stable(a)vger.kernel.org
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_eprobe.c | 67 +++++++++++++++++++++++++++++++++----
1 file changed, 61 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace_eprobe.c b/kernel/trace/trace_eprobe.c
index a1d3423ab74f..63218a541217 100644
--- a/kernel/trace/trace_eprobe.c
+++ b/kernel/trace/trace_eprobe.c
@@ -227,6 +227,7 @@ static int trace_eprobe_tp_arg_update(struct trace_eprobe *ep, int i)
struct probe_arg *parg = &ep->tp.args[i];
struct ftrace_event_field *field;
struct list_head *head;
+ int ret = -ENOENT;
head = trace_get_fields(ep->event);
list_for_each_entry(field, head, link) {
@@ -236,9 +237,17 @@ static int trace_eprobe_tp_arg_update(struct trace_eprobe *ep, int i)
return 0;
}
}
+
+ /* Argument no found on event. But allow for comm and COMM to be used */
+ if (strcmp(parg->code->data, "COMM") == 0 ||
+ strcmp(parg->code->data, "comm") == 0) {
+ parg->code->op = FETCH_OP_COMM;
+ ret = 0;
+ }
+
kfree(parg->code->data);
parg->code->data = NULL;
- return -ENOENT;
+ return ret;
}
static int eprobe_event_define_fields(struct trace_event_call *event_call)
@@ -363,16 +372,38 @@ static unsigned long get_event_field(struct fetch_insn *code, void *rec)
static int get_eprobe_size(struct trace_probe *tp, void *rec)
{
+ struct fetch_insn *code;
struct probe_arg *arg;
int i, len, ret = 0;
for (i = 0; i < tp->nr_args; i++) {
arg = tp->args + i;
- if (unlikely(arg->dynamic)) {
+ if (arg->dynamic) {
unsigned long val;
- val = get_event_field(arg->code, rec);
- len = process_fetch_insn_bottom(arg->code + 1, val, NULL, NULL);
+ code = arg->code;
+ retry:
+ switch (code->op) {
+ case FETCH_OP_TP_ARG:
+ val = get_event_field(code, rec);
+ break;
+ case FETCH_OP_IMM:
+ val = code->immediate;
+ break;
+ case FETCH_OP_COMM:
+ val = (unsigned long)current->comm;
+ break;
+ case FETCH_OP_DATA:
+ val = (unsigned long)code->data;
+ break;
+ case FETCH_NOP_SYMBOL: /* Ignore a place holder */
+ code++;
+ goto retry;
+ default:
+ continue;
+ }
+ code++;
+ len = process_fetch_insn_bottom(code, val, NULL, NULL);
if (len > 0)
ret += len;
}
@@ -390,8 +421,28 @@ process_fetch_insn(struct fetch_insn *code, void *rec, void *dest,
{
unsigned long val;
- val = get_event_field(code, rec);
- return process_fetch_insn_bottom(code + 1, val, dest, base);
+ retry:
+ switch (code->op) {
+ case FETCH_OP_TP_ARG:
+ val = get_event_field(code, rec);
+ break;
+ case FETCH_OP_IMM:
+ val = code->immediate;
+ break;
+ case FETCH_OP_COMM:
+ val = (unsigned long)current->comm;
+ break;
+ case FETCH_OP_DATA:
+ val = (unsigned long)code->data;
+ break;
+ case FETCH_NOP_SYMBOL: /* Ignore a place holder */
+ code++;
+ goto retry;
+ default:
+ return -EILSEQ;
+ }
+ code++;
+ return process_fetch_insn_bottom(code, val, dest, base);
}
NOKPROBE_SYMBOL(process_fetch_insn)
@@ -866,6 +917,10 @@ static int trace_eprobe_tp_update_arg(struct trace_eprobe *ep, const char *argv[
trace_probe_log_err(0, BAD_ATTACH_ARG);
}
+ /* Handle symbols "@" */
+ if (!ret)
+ ret = traceprobe_update_arg(&ep->tp.args[i]);
+
return ret;
}
--
2.35.1
The iterator can not be greater than ATC_MAX_DSCR_TRIALS, as the for loop
will stop when i == ATC_MAX_DSCR_TRIALS. While here, use the common "i"
name for the iterator.
Fixes: 93dce3a6434f ("dmaengine: at_hdmac: fix residue computation")
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Cc: stable(a)vger.kernel.org
---
drivers/dma/at_hdmac.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index 16cea65a708d..c72c796d58bc 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -753,7 +753,8 @@ static int atc_get_bytes_left(struct dma_chan *chan, dma_cookie_t cookie)
struct at_desc *desc_first = atc_first_active(atchan);
struct at_desc *desc;
int ret;
- u32 ctrla, dscr, trials;
+ u32 ctrla, dscr;
+ unsigned int i;
/*
* If the cookie doesn't match to the currently running transfer then
@@ -823,7 +824,7 @@ static int atc_get_bytes_left(struct dma_chan *chan, dma_cookie_t cookie)
dscr = channel_readl(atchan, DSCR);
rmb(); /* ensure DSCR is read before CTRLA */
ctrla = channel_readl(atchan, CTRLA);
- for (trials = 0; trials < ATC_MAX_DSCR_TRIALS; ++trials) {
+ for (i = 0; i < ATC_MAX_DSCR_TRIALS; ++i) {
u32 new_dscr;
rmb(); /* ensure DSCR is read after CTRLA */
@@ -849,7 +850,7 @@ static int atc_get_bytes_left(struct dma_chan *chan, dma_cookie_t cookie)
rmb(); /* ensure DSCR is read before CTRLA */
ctrla = channel_readl(atchan, CTRLA);
}
- if (unlikely(trials >= ATC_MAX_DSCR_TRIALS))
+ if (unlikely(i == ATC_MAX_DSCR_TRIALS))
return -ETIMEDOUT;
/* for the first descriptor we can be more accurate */
--
2.25.1
at_hdmac uses __raw_writel for register writes. In the absence of a
barrier, the CPU may reorder the register operations.
Introduce a write memory barrier so that the CPU does not reorder the
channel enable, thus the start of the transfer, without making sure that
all the pre-required register fields are already written.
Fixes: dc78baa2b90b ("dmaengine: at_hdmac: new driver for the Atmel AHB DMA Controller")
Reported-by: Peter Rosin <peda(a)axentia.se>
Cc: stable(a)vger.kernel.org
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Link: https://lore.kernel.org/lkml/13c6c9a2-6db5-c3bf-349b-4c127ad3496a@axentia.s…
---
drivers/dma/at_hdmac.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index 825a29ede35e..1cb0d26d30ed 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -691,6 +691,8 @@ static void atc_dostart(struct at_dma_chan *atchan, struct at_desc *first)
channel_writel(atchan, DPIP, FIELD_PREP(ATC_DPIP_HOLE,
first->dst_hole) |
FIELD_PREP(ATC_DPIP_BOUNDARY, first->boundary));
+ /* Don't allow CPU to reorder channel enable. */
+ wmb();
dma_writel(atdma, CHER, atchan->mask);
vdbg_dump_regs(atchan);
--
2.25.1
In case the controller detected an error, the code took the chance to move
all the queued (submitted) descriptors to the active (issued) list. This
was wrong as if there were any descriptors in the submitted list they were
moved to the issued list without actually issuing them to the controller,
thus a completion could be raised without even fireing the descriptor.
Fixes: dc78baa2b90b ("dmaengine: at_hdmac: new driver for the Atmel AHB DMA Controller")
Reported-by: Peter Rosin <peda(a)axentia.se>
Cc: stable(a)vger.kernel.org
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Link: https://lore.kernel.org/lkml/13c6c9a2-6db5-c3bf-349b-4c127ad3496a@axentia.s…
---
drivers/dma/at_hdmac.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index e5ac73768d13..825a29ede35e 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -974,10 +974,6 @@ static void atc_handle_error(struct at_dma_chan *atchan)
bad_desc = atc_first_active(atchan);
list_del_init(&bad_desc->desc_node);
- /* As we are stopped, take advantage to push queued descriptors
- * in active_list */
- list_splice_init(&atchan->queue, atchan->active_list.prev);
-
/* Try to restart the controller */
if (!list_empty(&atchan->active_list)) {
desc = atc_first_queued(atchan);
--
2.25.1
As it was before, the descriptor was issued to the hardware without adding
it to the active (issued) list. This could result in a completion of other
descriptor, or/and in the descriptor never being completed.
Fixes: dc78baa2b90b ("dmaengine: at_hdmac: new driver for the Atmel AHB DMA Controller")
Reported-by: Peter Rosin <peda(a)axentia.se>
Cc: stable(a)vger.kernel.org
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Link: https://lore.kernel.org/lkml/13c6c9a2-6db5-c3bf-349b-4c127ad3496a@axentia.s…
---
drivers/dma/at_hdmac.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index 635c3be74399..e5ac73768d13 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -945,8 +945,11 @@ static void atc_advance_work(struct at_dma_chan *atchan)
/* advance work */
spin_lock_irqsave(&atchan->lock, flags);
- if (!list_empty(&atchan->active_list))
- atc_dostart(atchan, atc_first_active(atchan));
+ if (!list_empty(&atchan->active_list)) {
+ desc = atc_first_queued(atchan);
+ list_move_tail(&desc->desc_node, &atchan->active_list);
+ atc_dostart(atchan, desc);
+ }
spin_unlock_irqrestore(&atchan->lock, flags);
}
@@ -958,6 +961,7 @@ static void atc_advance_work(struct at_dma_chan *atchan)
static void atc_handle_error(struct at_dma_chan *atchan)
{
struct at_desc *bad_desc;
+ struct at_desc *desc;
struct at_desc *child;
unsigned long flags;
@@ -975,8 +979,11 @@ static void atc_handle_error(struct at_dma_chan *atchan)
list_splice_init(&atchan->queue, atchan->active_list.prev);
/* Try to restart the controller */
- if (!list_empty(&atchan->active_list))
- atc_dostart(atchan, atc_first_active(atchan));
+ if (!list_empty(&atchan->active_list)) {
+ desc = atc_first_queued(atchan);
+ list_move_tail(&desc->desc_node, &atchan->active_list);
+ atc_dostart(atchan, desc);
+ }
/*
* KERN_CRITICAL may seem harsh, but since this only happens
--
2.25.1
The tasklet did not held the channel lock when retrieving the first active
descriptor, causing concurrency problems if issue_pending() was called in
between. If issue_pending() was called exactly after the lock was released
in the tasklet, atc_chain_complete() could complete a descriptor for which
the controller has not yet raised an interrupt.
Fixes: dc78baa2b90b ("dmaengine: at_hdmac: new driver for the Atmel AHB DMA Controller")
Reported-by: Peter Rosin <peda(a)axentia.se>
Cc: stable(a)vger.kernel.org
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Link: https://lore.kernel.org/lkml/13c6c9a2-6db5-c3bf-349b-4c127ad3496a@axentia.s…
---
drivers/dma/at_hdmac.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index c2b3d7b63920..635c3be74399 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -897,8 +897,6 @@ atc_chain_complete(struct at_dma_chan *atchan, struct at_desc *desc)
if (!atc_chan_is_cyclic(atchan))
dma_cookie_complete(txd);
- /* Remove transfer node from the active list. */
- list_del_init(&desc->desc_node);
spin_unlock_irqrestore(&atchan->lock, flags);
dma_descriptor_unmap(txd);
@@ -930,6 +928,7 @@ atc_chain_complete(struct at_dma_chan *atchan, struct at_desc *desc)
*/
static void atc_advance_work(struct at_dma_chan *atchan)
{
+ struct at_desc *desc;
unsigned long flags;
dev_vdbg(chan2dev(&atchan->dma_chan), "advance_work\n");
@@ -937,9 +936,12 @@ static void atc_advance_work(struct at_dma_chan *atchan)
spin_lock_irqsave(&atchan->lock, flags);
if (atc_chan_is_enabled(atchan) || list_empty(&atchan->active_list))
return spin_unlock_irqrestore(&atchan->lock, flags);
- spin_unlock_irqrestore(&atchan->lock, flags);
- atc_chain_complete(atchan, atc_first_active(atchan));
+ desc = atc_first_active(atchan);
+ /* Remove the transfer node from the active list. */
+ list_del_init(&desc->desc_node);
+ spin_unlock_irqrestore(&atchan->lock, flags);
+ atc_chain_complete(atchan, desc);
/* advance work */
spin_lock_irqsave(&atchan->lock, flags);
--
2.25.1
The descriptor was added to the free_list before calling the callback,
which could result in reissuing of the same descriptor and calling of a
single callback for both. Move the decriptor to the free list after the
callback is invoked.
Fixes: dc78baa2b90b ("dmaengine: at_hdmac: new driver for the Atmel AHB DMA Controller")
Reported-by: Peter Rosin <peda(a)axentia.se>
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/lkml/13c6c9a2-6db5-c3bf-349b-4c127ad3496a@axentia.s…
---
drivers/dma/at_hdmac.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index 10424a6fcbbf..b3184da7ced4 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -904,11 +904,8 @@ atc_chain_complete(struct at_dma_chan *atchan, struct at_desc *desc)
desc->memset_buffer = false;
}
- /* move children to free_list */
- list_splice_init(&desc->tx_list, &atchan->free_list);
- /* move myself to free_list */
- list_move(&desc->desc_node, &atchan->free_list);
-
+ /* Remove transfer node from the active list. */
+ list_del_init(&desc->desc_node);
spin_unlock_irqrestore(&atchan->lock, flags);
dma_descriptor_unmap(txd);
@@ -918,6 +915,13 @@ atc_chain_complete(struct at_dma_chan *atchan, struct at_desc *desc)
dmaengine_desc_get_callback_invoke(txd, NULL);
dma_run_dependencies(txd);
+
+ spin_lock_irqsave(&atchan->lock, flags);
+ /* move children to free_list */
+ list_splice_init(&desc->tx_list, &atchan->free_list);
+ /* add myself to free_list */
+ list_add(&desc->desc_node, &atchan->free_list);
+ spin_unlock_irqrestore(&atchan->lock, flags);
}
/**
--
2.25.1
atc_complete_all() had concurrency bugs, thus remove it:
1/ atc_complete_all() in its entirety was buggy, as when the atchan->queue
list (the one that contains descriptors that are not yet issued to the
hardware) contained descriptors, it fired just the first from the
atchan->queue, but moved all the desc from atchan->queue to
atchan->active_list and considered them all as fired. This could result in
calling the completion of a descriptor that was not yet issued to the
hardware.
2/ when in tasklet at atc_advance_work() time, atchan->active_list was
queried without holding the lock of the chan. This can result in
atchan->active_list concurrency problems between the tasklet and
issue_pending().
Fixes: dc78baa2b90b ("dmaengine: at_hdmac: new driver for the Atmel AHB DMA Controller")
Reported-by: Peter Rosin <peda(a)axentia.se>
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/lkml/13c6c9a2-6db5-c3bf-349b-4c127ad3496a@axentia.s…
---
drivers/dma/at_hdmac.c | 49 ++++--------------------------------------
1 file changed, 4 insertions(+), 45 deletions(-)
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index aeb241832c52..10424a6fcbbf 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -920,42 +920,6 @@ atc_chain_complete(struct at_dma_chan *atchan, struct at_desc *desc)
dma_run_dependencies(txd);
}
-/**
- * atc_complete_all - finish work for all transactions
- * @atchan: channel to complete transactions for
- *
- * Eventually submit queued descriptors if any
- *
- * Assume channel is idle while calling this function
- * Called with atchan->lock held and bh disabled
- */
-static void atc_complete_all(struct at_dma_chan *atchan)
-{
- struct at_desc *desc, *_desc;
- LIST_HEAD(list);
- unsigned long flags;
-
- dev_vdbg(chan2dev(&atchan->dma_chan), "complete all\n");
-
- spin_lock_irqsave(&atchan->lock, flags);
-
- /*
- * Submit queued descriptors ASAP, i.e. before we go through
- * the completed ones.
- */
- if (!list_empty(&atchan->queue))
- atc_dostart(atchan, atc_first_queued(atchan));
- /* empty active_list now it is completed */
- list_splice_init(&atchan->active_list, &list);
- /* empty queue list by moving descriptors (if any) to active_list */
- list_splice_init(&atchan->queue, &atchan->active_list);
-
- spin_unlock_irqrestore(&atchan->lock, flags);
-
- list_for_each_entry_safe(desc, _desc, &list, desc_node)
- atc_chain_complete(atchan, desc);
-}
-
/**
* atc_advance_work - at the end of a transaction, move forward
* @atchan: channel where the transaction ended
@@ -963,25 +927,20 @@ static void atc_complete_all(struct at_dma_chan *atchan)
static void atc_advance_work(struct at_dma_chan *atchan)
{
unsigned long flags;
- int ret;
dev_vdbg(chan2dev(&atchan->dma_chan), "advance_work\n");
spin_lock_irqsave(&atchan->lock, flags);
- ret = atc_chan_is_enabled(atchan);
+ if (atc_chan_is_enabled(atchan) || list_empty(&atchan->active_list))
+ return spin_unlock_irqrestore(&atchan->lock, flags);
spin_unlock_irqrestore(&atchan->lock, flags);
- if (ret)
- return;
-
- if (list_empty(&atchan->active_list) ||
- list_is_singular(&atchan->active_list))
- return atc_complete_all(atchan);
atc_chain_complete(atchan, atc_first_active(atchan));
/* advance work */
spin_lock_irqsave(&atchan->lock, flags);
- atc_dostart(atchan, atc_first_active(atchan));
+ if (!list_empty(&atchan->active_list))
+ atc_dostart(atchan, atc_first_active(atchan));
spin_unlock_irqrestore(&atchan->lock, flags);
}
--
2.25.1
Now that the complete callback call was removed from
device_terminate_all(), we can protect the atchan->status with the channel
lock. The atomic bitops on atchan->status do not substitute proper locking
on the status, as one could still modify the status after the lock was
dropped in atc_terminate_all() but before the atomic bitops were executed.
Fixes: 078a6506141a ("dmaengine: at_hdmac: Fix deadlocks")
Reported-by: Peter Rosin <peda(a)axentia.se>
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/lkml/13c6c9a2-6db5-c3bf-349b-4c127ad3496a@axentia.s…
---
drivers/dma/at_hdmac.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index b3895e5d2ae9..aeb241832c52 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -1902,12 +1902,12 @@ static int atc_terminate_all(struct dma_chan *chan)
list_splice_tail_init(&atchan->queue, &atchan->free_list);
list_splice_tail_init(&atchan->active_list, &atchan->free_list);
- spin_unlock_irqrestore(&atchan->lock, flags);
-
clear_bit(ATC_IS_PAUSED, &atchan->status);
/* if channel dedicated to cyclic operations, free it */
clear_bit(ATC_IS_CYCLIC, &atchan->status);
+ spin_unlock_irqrestore(&atchan->lock, flags);
+
return 0;
}
--
2.25.1
Multiple calls to atc_issue_pending() could result in a premature
completion of a descriptor from the atchan->active list, as the method
always completed the first active descriptor from the list. Instead,
issue_pending() should just take the first transaction descriptor from the
pending queue, move it to active_list and start the transfer.
Fixes: dc78baa2b90b ("dmaengine: at_hdmac: new driver for the Atmel AHB DMA Controller")
Reported-by: Peter Rosin <peda(a)axentia.se>
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/lkml/13c6c9a2-6db5-c3bf-349b-4c127ad3496a@axentia.s…
---
drivers/dma/at_hdmac.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index d9f492081e76..c445575f8646 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -1963,16 +1963,26 @@ atc_tx_status(struct dma_chan *chan,
}
/**
- * atc_issue_pending - try to finish work
+ * atc_issue_pending - takes the first transaction descriptor in the pending
+ * queue and starts the transfer.
* @chan: target DMA channel
*/
static void atc_issue_pending(struct dma_chan *chan)
{
- struct at_dma_chan *atchan = to_at_dma_chan(chan);
+ struct at_dma_chan *atchan = to_at_dma_chan(chan);
+ struct at_desc *desc;
+ unsigned long flags;
dev_vdbg(chan2dev(chan), "issue_pending\n");
- atc_advance_work(atchan);
+ spin_lock_irqsave(&atchan->lock, flags);
+ if (atc_chan_is_enabled(atchan) || list_empty(&atchan->queue))
+ return spin_unlock_irqrestore(&atchan->lock, flags);
+
+ desc = atc_first_queued(atchan);
+ list_move_tail(&desc->desc_node, &atchan->active_list);
+ atc_dostart(atchan, desc);
+ spin_unlock_irqrestore(&atchan->lock, flags);
}
/**
--
2.25.1
Cyclic channels must too call issue_pending in order to start a transfer.
Start the transfer in issue_pending regardless of the type of channel.
This wrongly worked before, because in the past the transfer was started
at tx_submit level when only a desc in the transfer list.
Fixes: 53830cc75974 ("dmaengine: at_hdmac: add cyclic DMA operation support")
Reported-by: Peter Rosin <peda(a)axentia.se>
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/lkml/13c6c9a2-6db5-c3bf-349b-4c127ad3496a@axentia.s…
---
drivers/dma/at_hdmac.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index b6b1d2dcfc4c..d9f492081e76 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -1972,10 +1972,6 @@ static void atc_issue_pending(struct dma_chan *chan)
dev_vdbg(chan2dev(chan), "issue_pending\n");
- /* Not needed for cyclic transfers */
- if (atc_chan_is_cyclic(atchan))
- return;
-
atc_advance_work(atchan);
}
--
2.25.1
tx_submit is supposed to push the current transaction descriptor to a
pending queue, waiting for issue_pending() to be called. issue_pending()
must start the transfer, not tx_submit(), thus remove atc_dostart() from
atc_tx_submit(). Clients of at_xdmac that assume that tx_submit() starts
the transfer must be updated and call dma_async_issue_pending() if they
miss to call it.
The vdbg print was moved to after the lock is released. It is desirable to
do the prints without the lock held if possible, and because the if
statement disappears there's no reason why to do the print while holding
the lock.
Fixes: dc78baa2b90b ("dmaengine: at_hdmac: new driver for the Atmel AHB DMA Controller")
Reported-by: Peter Rosin <peda(a)axentia.se>
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/lkml/13c6c9a2-6db5-c3bf-349b-4c127ad3496a@axentia.s…
---
drivers/dma/at_hdmac.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index e89facf14fab..b6b1d2dcfc4c 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -1126,19 +1126,11 @@ static dma_cookie_t atc_tx_submit(struct dma_async_tx_descriptor *tx)
spin_lock_irqsave(&atchan->lock, flags);
cookie = dma_cookie_assign(tx);
- if (list_empty(&atchan->active_list)) {
- dev_vdbg(chan2dev(tx->chan), "tx_submit: started %u\n",
- desc->txd.cookie);
- atc_dostart(atchan, desc);
- list_add_tail(&desc->desc_node, &atchan->active_list);
- } else {
- dev_vdbg(chan2dev(tx->chan), "tx_submit: queued %u\n",
- desc->txd.cookie);
- list_add_tail(&desc->desc_node, &atchan->queue);
- }
-
+ list_add_tail(&desc->desc_node, &atchan->queue);
spin_unlock_irqrestore(&atchan->lock, flags);
+ dev_vdbg(chan2dev(tx->chan), "tx_submit: queued %u\n",
+ desc->txd.cookie);
return cookie;
}
--
2.25.1
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Currently when an event probe (eprobe) hooks to a string field, it does
not display it as a string, but instead as a number. This makes the field
rather useless. Handle the different kinds of strings, dynamic, static,
relational/dynamic etc.
Now when a string field is used, the ":string" type can be used to display
it:
echo "e:sw sched/sched_switch comm=$next_comm:string" > dynamic_events
Cc: stable(a)vger.kernel.org
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_eprobe.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/kernel/trace/trace_eprobe.c b/kernel/trace/trace_eprobe.c
index 550671985fd1..a1d3423ab74f 100644
--- a/kernel/trace/trace_eprobe.c
+++ b/kernel/trace/trace_eprobe.c
@@ -311,6 +311,27 @@ static unsigned long get_event_field(struct fetch_insn *code, void *rec)
addr = rec + field->offset;
+ if (is_string_field(field)) {
+ switch (field->filter_type) {
+ case FILTER_DYN_STRING:
+ val = (unsigned long)(rec + (*(unsigned int *)addr & 0xffff));
+ break;
+ case FILTER_RDYN_STRING:
+ val = (unsigned long)(addr + (*(unsigned int *)addr & 0xffff));
+ break;
+ case FILTER_STATIC_STRING:
+ val = (unsigned long)addr;
+ break;
+ case FILTER_PTR_STRING:
+ val = (unsigned long)(*(char *)addr);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return 0;
+ }
+ return val;
+ }
+
switch (field->size) {
case 1:
if (field->is_signed)
--
2.35.1
Driver registration fails on SOC imx8mn as its supplier, the clock
control module, is probed later than subsys initcall level. This driver
uses platform_driver_probe which is not compatible with deferred probing
and won't be probed again later if probe function fails due to clock not
being available at that time.
This patch replaces the use of platform_driver_probe with
platform_driver_register which will allow probing the driver later again
when the clock control module will be available.
Fixes: a580b8c5429a ("dmaengine: mxs-dma: add dma support for i.MX23/28")
Co-developed-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Dario Binacchi <dario.binacchi(a)amarulasolutions.com>
Cc: stable(a)vger.kernel.org
---
Changes in v5:
- Update the commit message.
- Add the patch "dmaengine: mxs: fix section mismatch" to remove the
warning raised by this patch.
Changes in v4:
- Restore __init in front of mxs_dma_probe() definition.
- Rename the mxs_dma_driver variable to mxs_dma_driver_probe.
- Update the commit message.
- Use builtin_platform_driver() instead of module_platform_driver().
Changes in v3:
- Restore __init in front of mxs_dma_init() definition.
Changes in v2:
- Add the tag "Cc: stable(a)vger.kernel.org" in the sign-off area.
drivers/dma/mxs-dma.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/drivers/dma/mxs-dma.c b/drivers/dma/mxs-dma.c
index 994fc4d2aca4..18f8154b859b 100644
--- a/drivers/dma/mxs-dma.c
+++ b/drivers/dma/mxs-dma.c
@@ -839,10 +839,6 @@ static struct platform_driver mxs_dma_driver = {
.name = "mxs-dma",
.of_match_table = mxs_dma_dt_ids,
},
+ .probe = mxs_dma_probe,
};
-
-static int __init mxs_dma_module_init(void)
-{
- return platform_driver_probe(&mxs_dma_driver, mxs_dma_probe);
-}
-subsys_initcall(mxs_dma_module_init);
+builtin_platform_driver(mxs_dma_driver);
--
2.32.0
This is the backport for v4.19.x stable branch.
The full explananation can be found here:
https://lore.kernel.org/linux-btrfs/cover.1660891713.git.wqu@suse.com/
No code change between v5.4.x and v4.19.x, at least nothing git can not
auto-resolve.
Testing wise, there are more test failures, as their coressponding fixes
are not backported.
Despite that no new regression so far.
Qu Wenruo (2):
btrfs: only write the sectors in the vertical stripe which has data
stripes
btrfs: raid56: don't trust any cached sector in
__raid56_parity_recover()
fs/btrfs/raid56.c | 74 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 57 insertions(+), 17 deletions(-)
--
2.37.1
This is the backport for v5.4.x stable branch.
The full explananation can be found here:
https://lore.kernel.org/linux-btrfs/cover.1660891713.git.wqu@suse.com/
These backport have no change compared to v5.10.x backports (at least
nothing git can not auto-resolve).
While the testing part shows some extra warning in btrfs/162, it's the
existing show_dev_name related RCU string accesss problem, not something
new regression caused by these backports.
Qu Wenruo (2):
btrfs: only write the sectors in the vertical stripe which has data
stripes
btrfs: raid56: don't trust any cached sector in
__raid56_parity_recover()
fs/btrfs/raid56.c | 74 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 57 insertions(+), 17 deletions(-)
--
2.37.1
The patch titled
Subject: bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
bootmem-remove-the-vmemmap-pages-from-kmemleak-in-put_page_bootmem.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Liu Shixin <liushixin2(a)huawei.com>
Subject: bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem
Date: Fri, 19 Aug 2022 17:40:05 +0800
The vmemmap pages is marked by kmemleak when allocated from memblock.
Remove it from kmemleak when freeing the page. Otherwise, when we reuse
the page, kmemleak may report such an error and then stop working.
kmemleak: Cannot insert 0xffff98fb6eab3d40 into the object search tree (overlaps existing)
kmemleak: Kernel memory leak detector disabled
kmemleak: Object 0xffff98fb6be00000 (size 335544320):
kmemleak: comm "swapper", pid 0, jiffies 4294892296
kmemleak: min_count = 0
kmemleak: count = 0
kmemleak: flags = 0x1
kmemleak: checksum = 0
kmemleak: backtrace:
Link: https://lkml.kernel.org/r/20220819094005.2928241-1-liushixin2@huawei.com
Fixes: f41f2ed43ca5 (mm: hugetlb: free the vmemmap pages associated with each HugeTLB page)
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Reviewed-by: Muchun Song <songmuchun(a)bytedance.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/bootmem_info.c | 2 ++
1 file changed, 2 insertions(+)
--- a/mm/bootmem_info.c~bootmem-remove-the-vmemmap-pages-from-kmemleak-in-put_page_bootmem
+++ a/mm/bootmem_info.c
@@ -12,6 +12,7 @@
#include <linux/memblock.h>
#include <linux/bootmem_info.h>
#include <linux/memory_hotplug.h>
+#include <linux/kmemleak.h>
void get_page_bootmem(unsigned long info, struct page *page, unsigned long type)
{
@@ -33,6 +34,7 @@ void put_page_bootmem(struct page *page)
ClearPagePrivate(page);
set_page_private(page, 0);
INIT_LIST_HEAD(&page->lru);
+ kmemleak_free_part(page_to_virt(page), PAGE_SIZE);
free_reserved_page(page);
}
}
_
Patches currently in -mm which might be from liushixin2(a)huawei.com are
bootmem-remove-the-vmemmap-pages-from-kmemleak-in-free_bootmem_page.patch
bootmem-remove-the-vmemmap-pages-from-kmemleak-in-put_page_bootmem.patch
Syzkaller reports using smp_processor_id() in preemptible code at
radix_tree_node_alloc() in 5.10 stable releases. The problem has
been fixed by the following patch which can be cleanly applied to
the 5.10 branch.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
This is an early RFC to not rewrite stuff one more time later on if
the implementation is not acceptable or any major design changes are
required. For the TODO list, please scroll to the end.
Make kallsyms independent of symbols positions in vmlinux (or module)
by including relative filepath in each symbol's kallsyms value. I.e.
dev_gro_receive -> net/core/gro.c:dev_gro_receive
For the implementation details, please look at the patch 3/3.
Patch 2/3 is just a stub, I plan to reuse kallsyms enhancement from
the Rust series for it.
Patch 1/3 is a fix of one modpost macro straight from 2.6.12-rc2.
A nice side effect is that it's now easier to debug the kernel, as
stacktraces will now tell every call's place in the file tree:
[ 0.733191] Call Trace:
[ 0.733668] <TASK>
[ 0.733980] lib/dump_stack.c:dump_stack_lvl+0x48/0x68
[ 0.734689] kernel/panic.c:panic+0xf8/0x2ae
[ 0.735291] init/do_mounts.c:mount_block_root+0x143/0x1ea
[ 0.736046] init/do_mounts.c:prepare_namespace+0x13f/0x16e
[ 0.736798] init/main.c:kernel_init_freeable+0x240/0x24f
[ 0.737549] ? init/main.c:rest_init+0xc0/0xc0
[ 0.738171] init/main.c:kernel_init+0x1a/0x140
[ 0.738765] arch/x86/entry/entry_64.S:ret_from_fork+0x1f/0x30
[ 0.739529] </TASK>
Here are some stats:
Despite running a small utility on each object file and a script on
each built-in.a plus one at the kallsyms generation process, it adds
only 3 seconds to the whole clean build time:
make -j$(($(nproc) + 1)) all compile_commands.json 19071.12s user 3481.97s system 4587% cpu 8:11.64 total
make -j$(($(nproc) + 1)) all compile_commands.json 19202.88s user 3598.80s system 4607% cpu 8:14.85 total
Compressed kallsyms become bigger by 1.4 Mbytes on x86_64 standard
distroconfig - 60% of the kallsyms and 2.4% of the decompressed
vmlinux in the memory:
ffffffff8259ab30 D kallsyms_offsets
ffffffff82624ed0 D kallsyms_relative_base
ffffffff82624ed8 D kallsyms_num_syms
ffffffff82624ee0 D kallsyms_names
ffffffff827e9c68 D kallsyms_markers
ffffffff827ea510 D kallsyms_token_table
ffffffff827ea8c0 D kallsyms_token_index
ffffffff827eaac0 d .LC1
->
ffffffff8259ac30 D kallsyms_offsets
ffffffff82624fb8 D kallsyms_relative_base
ffffffff82624fc0 D kallsyms_num_syms
ffffffff82624fc8 D kallsyms_names
ffffffff8294de50 D kallsyms_markers
ffffffff8294e6f8 D kallsyms_token_table
ffffffff8294eac8 D kallsyms_token_index
ffffffff8294ecc8 d .LC1
TODO:
* ELF rel and MIPS relocation support (only rela currently, just
to test on x86_64);
* modules support. Currently, the kernel reuses standard ELF strtab
for module kallsyms. My plan is to create a new section which will
have the same symbol order as symtab, but point to new complete
strings with filepaths, and use this section solely for kallsyms
(leaving symtab alone);
* LTO support (now relies on that object files are ELFs);
* the actual kallsyms/livepatching/probes code which will use
filepaths instead of indexes/positions.
Have fun!
Alexander Lobakin (3):
modpost: fix TO_NATIVE() with expressions and consts
[STUB] increase kallsyms length limit
kallsyms: add option to include relative filepaths into kallsyms
.gitignore | 1 +
Makefile | 2 +-
include/linux/kallsyms.h | 2 +-
init/Kconfig | 26 ++-
kernel/livepatch/core.c | 4 +-
scripts/Makefile.build | 7 +-
scripts/Makefile.lib | 10 +-
scripts/Makefile.modfinal | 3 +-
scripts/gen_sympaths.pl | 270 ++++++++++++++++++++++++++
scripts/kallsyms.c | 89 +++++++--
scripts/link-vmlinux.sh | 14 +-
scripts/mod/.gitignore | 1 +
scripts/mod/Makefile | 9 +
scripts/mod/modpost.h | 7 +-
scripts/mod/sympath.c | 385 ++++++++++++++++++++++++++++++++++++++
15 files changed, 796 insertions(+), 34 deletions(-)
create mode 100755 scripts/gen_sympaths.pl
create mode 100644 scripts/mod/sympath.c
--
2.37.2
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0d519cadf75184a24313568e7f489a7fc9b1be3b Mon Sep 17 00:00:00 2001
From: Coiby Xu <coxu(a)redhat.com>
Date: Thu, 14 Jul 2022 21:40:26 +0800
Subject: [PATCH] arm64: kexec_file: use more system keyrings to verify kernel
image signature
Currently, when loading a kernel image via the kexec_file_load() system
call, arm64 can only use the .builtin_trusted_keys keyring to verify
a signature whereas x86 can use three more keyrings i.e.
.secondary_trusted_keys, .machine and .platform keyrings. For example,
one resulting problem is kexec'ing a kernel image would be rejected
with the error "Lockdown: kexec: kexec of unsigned images is restricted;
see man kernel_lockdown.7".
This patch set enables arm64 to make use of the same keyrings as x86 to
verify the signature kexec'ed kernel image.
Fixes: 732b7b93d849 ("arm64: kexec_file: add kernel signature verification support")
Cc: stable(a)vger.kernel.org # 105e10e2cf1c: kexec_file: drop weak attribute from functions
Cc: stable(a)vger.kernel.org # 34d5960af253: kexec: clean up arch_kexec_kernel_verify_sig
Cc: stable(a)vger.kernel.org # 83b7bb2d49ae: kexec, KEYS: make the code in bzImage64_verify_sig generic
Acked-by: Baoquan He <bhe(a)redhat.com>
Cc: kexec(a)lists.infradead.org
Cc: keyrings(a)vger.kernel.org
Cc: linux-security-module(a)vger.kernel.org
Co-developed-by: Michal Suchanek <msuchanek(a)suse.de>
Signed-off-by: Michal Suchanek <msuchanek(a)suse.de>
Acked-by: Will Deacon <will(a)kernel.org>
Signed-off-by: Coiby Xu <coxu(a)redhat.com>
Signed-off-by: Mimi Zohar <zohar(a)linux.ibm.com>
diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c
index 9ec34690e255..5ed6a585f21f 100644
--- a/arch/arm64/kernel/kexec_image.c
+++ b/arch/arm64/kernel/kexec_image.c
@@ -14,7 +14,6 @@
#include <linux/kexec.h>
#include <linux/pe.h>
#include <linux/string.h>
-#include <linux/verification.h>
#include <asm/byteorder.h>
#include <asm/cpufeature.h>
#include <asm/image.h>
@@ -130,18 +129,10 @@ static void *image_load(struct kimage *image,
return NULL;
}
-#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG
-static int image_verify_sig(const char *kernel, unsigned long kernel_len)
-{
- return verify_pefile_signature(kernel, kernel_len, NULL,
- VERIFYING_KEXEC_PE_SIGNATURE);
-}
-#endif
-
const struct kexec_file_ops kexec_image_ops = {
.probe = image_probe,
.load = image_load,
#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG
- .verify_sig = image_verify_sig,
+ .verify_sig = kexec_kernel_verify_pe_sig,
#endif
};
The patch below does not apply to the 5.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0d519cadf75184a24313568e7f489a7fc9b1be3b Mon Sep 17 00:00:00 2001
From: Coiby Xu <coxu(a)redhat.com>
Date: Thu, 14 Jul 2022 21:40:26 +0800
Subject: [PATCH] arm64: kexec_file: use more system keyrings to verify kernel
image signature
Currently, when loading a kernel image via the kexec_file_load() system
call, arm64 can only use the .builtin_trusted_keys keyring to verify
a signature whereas x86 can use three more keyrings i.e.
.secondary_trusted_keys, .machine and .platform keyrings. For example,
one resulting problem is kexec'ing a kernel image would be rejected
with the error "Lockdown: kexec: kexec of unsigned images is restricted;
see man kernel_lockdown.7".
This patch set enables arm64 to make use of the same keyrings as x86 to
verify the signature kexec'ed kernel image.
Fixes: 732b7b93d849 ("arm64: kexec_file: add kernel signature verification support")
Cc: stable(a)vger.kernel.org # 105e10e2cf1c: kexec_file: drop weak attribute from functions
Cc: stable(a)vger.kernel.org # 34d5960af253: kexec: clean up arch_kexec_kernel_verify_sig
Cc: stable(a)vger.kernel.org # 83b7bb2d49ae: kexec, KEYS: make the code in bzImage64_verify_sig generic
Acked-by: Baoquan He <bhe(a)redhat.com>
Cc: kexec(a)lists.infradead.org
Cc: keyrings(a)vger.kernel.org
Cc: linux-security-module(a)vger.kernel.org
Co-developed-by: Michal Suchanek <msuchanek(a)suse.de>
Signed-off-by: Michal Suchanek <msuchanek(a)suse.de>
Acked-by: Will Deacon <will(a)kernel.org>
Signed-off-by: Coiby Xu <coxu(a)redhat.com>
Signed-off-by: Mimi Zohar <zohar(a)linux.ibm.com>
diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c
index 9ec34690e255..5ed6a585f21f 100644
--- a/arch/arm64/kernel/kexec_image.c
+++ b/arch/arm64/kernel/kexec_image.c
@@ -14,7 +14,6 @@
#include <linux/kexec.h>
#include <linux/pe.h>
#include <linux/string.h>
-#include <linux/verification.h>
#include <asm/byteorder.h>
#include <asm/cpufeature.h>
#include <asm/image.h>
@@ -130,18 +129,10 @@ static void *image_load(struct kimage *image,
return NULL;
}
-#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG
-static int image_verify_sig(const char *kernel, unsigned long kernel_len)
-{
- return verify_pefile_signature(kernel, kernel_len, NULL,
- VERIFYING_KEXEC_PE_SIGNATURE);
-}
-#endif
-
const struct kexec_file_ops kexec_image_ops = {
.probe = image_probe,
.load = image_load,
#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG
- .verify_sig = image_verify_sig,
+ .verify_sig = kexec_kernel_verify_pe_sig,
#endif
};
The patch below does not apply to the 5.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0d519cadf75184a24313568e7f489a7fc9b1be3b Mon Sep 17 00:00:00 2001
From: Coiby Xu <coxu(a)redhat.com>
Date: Thu, 14 Jul 2022 21:40:26 +0800
Subject: [PATCH] arm64: kexec_file: use more system keyrings to verify kernel
image signature
Currently, when loading a kernel image via the kexec_file_load() system
call, arm64 can only use the .builtin_trusted_keys keyring to verify
a signature whereas x86 can use three more keyrings i.e.
.secondary_trusted_keys, .machine and .platform keyrings. For example,
one resulting problem is kexec'ing a kernel image would be rejected
with the error "Lockdown: kexec: kexec of unsigned images is restricted;
see man kernel_lockdown.7".
This patch set enables arm64 to make use of the same keyrings as x86 to
verify the signature kexec'ed kernel image.
Fixes: 732b7b93d849 ("arm64: kexec_file: add kernel signature verification support")
Cc: stable(a)vger.kernel.org # 105e10e2cf1c: kexec_file: drop weak attribute from functions
Cc: stable(a)vger.kernel.org # 34d5960af253: kexec: clean up arch_kexec_kernel_verify_sig
Cc: stable(a)vger.kernel.org # 83b7bb2d49ae: kexec, KEYS: make the code in bzImage64_verify_sig generic
Acked-by: Baoquan He <bhe(a)redhat.com>
Cc: kexec(a)lists.infradead.org
Cc: keyrings(a)vger.kernel.org
Cc: linux-security-module(a)vger.kernel.org
Co-developed-by: Michal Suchanek <msuchanek(a)suse.de>
Signed-off-by: Michal Suchanek <msuchanek(a)suse.de>
Acked-by: Will Deacon <will(a)kernel.org>
Signed-off-by: Coiby Xu <coxu(a)redhat.com>
Signed-off-by: Mimi Zohar <zohar(a)linux.ibm.com>
diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c
index 9ec34690e255..5ed6a585f21f 100644
--- a/arch/arm64/kernel/kexec_image.c
+++ b/arch/arm64/kernel/kexec_image.c
@@ -14,7 +14,6 @@
#include <linux/kexec.h>
#include <linux/pe.h>
#include <linux/string.h>
-#include <linux/verification.h>
#include <asm/byteorder.h>
#include <asm/cpufeature.h>
#include <asm/image.h>
@@ -130,18 +129,10 @@ static void *image_load(struct kimage *image,
return NULL;
}
-#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG
-static int image_verify_sig(const char *kernel, unsigned long kernel_len)
-{
- return verify_pefile_signature(kernel, kernel_len, NULL,
- VERIFYING_KEXEC_PE_SIGNATURE);
-}
-#endif
-
const struct kexec_file_ops kexec_image_ops = {
.probe = image_probe,
.load = image_load,
#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG
- .verify_sig = image_verify_sig,
+ .verify_sig = kexec_kernel_verify_pe_sig,
#endif
};
This is the backport for v5.10.x stable branch.
The full explananation can be found here:
https://lore.kernel.org/linux-btrfs/cover.1660891713.git.wqu@suse.com/
Difference between v5.10.x and v5.15.x backports:
- Naming change in btrfs_io_contrl
In v5.15, we don't have the btrfs_io_contrl rename, thus only
btrfs_bio.
- Missing btrfs_fs_info::sectorsize_bits
Since RAID56 doesn't support anything but PAGE_SIZE == sectorsize
(until v5.19+), here we just use PAGE_SHIFT.
Another thing related to v5.10.x testing is, there are some lockdep
assert triggered related to uuid_mutex.
I'm not 100% sure, but at least RAID56 code is not touching that mutex,
thus I guess it's some other problems.
Qu Wenruo (2):
btrfs: only write the sectors in the vertical stripe which has data
stripes
btrfs: raid56: don't trust any cached sector in
__raid56_parity_recover()
fs/btrfs/raid56.c | 74 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 57 insertions(+), 17 deletions(-)
--
2.37.1
This is the start of the stable review cycle for the 5.19.1 release.
There are 21 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 11 Aug 2022 17:55:02 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.19.1-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.19.1-rc1
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/speculation: Add LFENCE to RSB fill sequence
Daniel Sneddon <daniel.sneddon(a)linux.intel.com>
x86/speculation: Add RSB VM Exit protections
Ning Qiang <sohu0106(a)126.com>
macintosh/adb: fix oob read in do_adb_query() function
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x13D3:0x3586
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x13D3:0x3587
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x0CB8:0xC558
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x04C5:0x1675
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x04CA:0x4007
Aaron Ma <aaron.ma(a)canonical.com>
Bluetooth: btusb: Add support of IMC Networks PID 0x3568
Ahmad Fatoum <a.fatoum(a)pengutronix.de>
dt-bindings: bluetooth: broadcom: Add BCM4349B1 DT binding
Hakan Jansson <hakan.jansson(a)infineon.com>
Bluetooth: hci_bcm: Add DT compatible for CYW55572
Ahmad Fatoum <a.fatoum(a)pengutronix.de>
Bluetooth: hci_bcm: Add BCM4349B1 variant
Sai Teja Aluvala <quic_saluvala(a)quicinc.com>
Bluetooth: hci_qca: Return wakeup for qca_wakeup
Peter Collingbourne <pcc(a)google.com>
arm64: set UXN on swapper page tables
Andrew Lunn <andrew(a)lunn.ch>
ata: sata_mv: Fixes expected number of resources now IRQs are gone
GUO Zihua <guozihua(a)huawei.com>
crypto: arm64/poly1305 - fix a read out-of-bound
Tony Luck <tony.luck(a)intel.com>
ACPI: APEI: Better fix to avoid spamming the console with old error logs
Werner Sembach <wse(a)tuxedocomputers.com>
ACPI: video: Shortening quirk list by identifying Clevo by board_name only
Werner Sembach <wse(a)tuxedocomputers.com>
ACPI: video: Force backlight native for some TongFang devices
Stéphane Graber <stgraber(a)ubuntu.com>
tools/vm/slabinfo: Handle files in debugfs
Jan Kara <jack(a)suse.cz>
block: fix default IO priority handling again
-------------
Diffstat:
Documentation/admin-guide/hw-vuln/spectre.rst | 8 ++
.../bindings/net/broadcom-bluetooth.yaml | 1 +
Makefile | 4 +-
arch/arm64/crypto/poly1305-glue.c | 2 +-
arch/arm64/include/asm/kernel-pgtable.h | 4 +-
arch/arm64/kernel/head.S | 2 +-
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/msr-index.h | 4 +
arch/x86/include/asm/nospec-branch.h | 21 +++++-
arch/x86/kernel/cpu/bugs.c | 86 ++++++++++++++++------
arch/x86/kernel/cpu/common.c | 12 ++-
arch/x86/kvm/vmx/vmenter.S | 8 +-
block/blk-ioc.c | 2 +
block/ioprio.c | 4 +-
drivers/acpi/apei/bert.c | 31 ++++++--
drivers/acpi/video_detect.c | 55 +++++++++-----
drivers/ata/sata_mv.c | 2 +-
drivers/bluetooth/btbcm.c | 2 +
drivers/bluetooth/btusb.c | 15 ++++
drivers/bluetooth/hci_bcm.c | 2 +
drivers/bluetooth/hci_qca.c | 2 +-
drivers/macintosh/adb.c | 2 +-
include/linux/ioprio.h | 2 +-
tools/arch/x86/include/asm/cpufeatures.h | 1 +
tools/arch/x86/include/asm/msr-index.h | 4 +
tools/vm/slabinfo.c | 26 ++++++-
26 files changed, 232 insertions(+), 72 deletions(-)
Syzkaller reports general protection fault at reweight_entity() in 5.10
stable releases. The problem has been fixed by the following patch which
can be applied with minor changes to the 5.10 branch.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
This is the backport for v5.15.x stable branch.
The full explananation can be found here:
https://lore.kernel.org/linux-btrfs/Yv85PTBsDhrQITZp@kroah.com/T/#t
Difference between v5.15.x and v5.18.x backports:
- btrfs_io_contrl::fs_info and btrfs_raid_bio::fs_info refactor
In v5.15 and older kernel, btrfs_raid_bio and btrfs_io_control both
have fs_info member, and the new rbio_add_bio() function utilizes
btrfs_io_ctrl::fs_info.
Unfortunately that rbio->bioc->fs_info is not always initialized in
v5.15 and older kernels.
Thus has to use rbio->fs_info instead.
If not properly handled, can lead to btrfs/158 crash.
Qu Wenruo (2):
btrfs: only write the sectors in the vertical stripe which has data
stripes
btrfs: raid56: don't trust any cached sector in
__raid56_parity_recover()
fs/btrfs/raid56.c | 74 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 57 insertions(+), 17 deletions(-)
--
2.37.1
The panic comes from the sanity test code, but after trying to boil down the
.config differences between the kitchen sink our test team uses, and a
"defconfig", it seems there are at least a couple extra dependencies for
creating a reproducer:
make defconfig
echo CONFIG_FUNCTION_TRACER=y >> .config
echo CONFIG_KPROBES_SANITY_TEST=y >> .config
echo CONFIG_UNWINDER_FRAME_POINTER=y >> .config
yes "" | make oldconfig
Note that ftrace is probably just opening the door to CONFIG_KPROBES_ON_FTRACE=y
The report I got was with gcc-11 on an Atom; I was able to reproduce it
with the default gcc-7 found on Ubuntu 18.04 and booting on a Xeon v2 -
so it seems to not be specific to gcc options or processor features.
I don't know if the v5.15 backports were specifically tested to be fully
bisectable, but if we assume they are, a bisect between 56 and 57 says:
commit 1d61a2988612ac0632134454d5407c63ae0b9d42 (refs/bisect/bad)
Author: Peter Zijlstra <peterz(a)infradead.org>
Date: Tue Jun 14 23:15:45 2022 +0200
x86: Use return-thunk in asm code
commit aa3d480315ba6c3025a60958e1981072ea37c3df upstream.
Use the return thunk in asm code. If the thunk isn't needed, it will
get patched into a RET instruction during boot by apply_returns().
Splat follows:
rcu: Hierarchical SRCU implementation.
Kprobe smoke test: started
BUG: unable to handle page fault for address: ffffffffc110f3e7
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD b2c60f067 P4D b2c60f067 PUD b2c611067 PMD 0
Oops: 0010 [#1] SMP NOPTI
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.57 #33
Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.06.E006.013120181511 01/31/2018
RIP: 0010:0xffffffffc110f3e7
Code: Unable to access opcode bytes at RIP 0xffffffffc110f3bd.
RSP: 0000:ffffae4bc006be38 EFLAGS: 00010246
RAX: ffffffffb973f310 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000005856e7bd
RBP: ffffae4bc006be60 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffffbae38560 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8c92df800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc110f3bd CR3: 0000000b2c60c001 CR4: 00000000001706f0
Call Trace:
<TASK>
? kprobe_target+0x5/0x20
? init_test_probes+0x78/0x420
init_kprobes+0x16c/0x18e
? init_optprobes+0x27/0x27
do_one_initcall+0x43/0x1d0
kernel_init_freeable+0xf1/0x240
? rest_init+0xd0/0xd0
kernel_init+0x1a/0x120
ret_from_fork+0x1f/0x30
</TASK>
Modules linked in:
CR2: ffffffffc110f3e7
---[ end trace 759f040622219261 ]---
[ Upstream commit dd8de84b57b02ba9c1fe530a6d916c0853f136bd ]
On FSL_BOOK3E, _PAGE_RW is defined with two bits, one for user and one
for supervisor. As soon as one of the two bits is set, the page has
to be display as RW. But the way it is implemented today requires both
bits to be set in order to display it as RW.
Instead of display RW when _PAGE_RW bits are set and R otherwise,
reverse the logic and display R when _PAGE_RW bits are all 0 and
RW otherwise.
This change has no impact on other platforms as _PAGE_RW is a single
bit on all of them.
Fixes: 8eb07b187000 ("powerpc/mm: Dump linux pagetables")
Cc: stable(a)vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/0c33b96317811edf691e81698aaee8fa45ec3449.16564273…
---
arch/powerpc/mm/dump_linuxpagetables.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
index 0bbaf7344872..07541d322b56 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -123,15 +123,10 @@ static const struct flag_info flag_array[] = {
.set = "user",
.clear = " ",
}, {
-#if _PAGE_RO == 0
- .mask = _PAGE_RW,
- .val = _PAGE_RW,
-#else
- .mask = _PAGE_RO,
- .val = 0,
-#endif
- .set = "rw",
- .clear = "ro",
+ .mask = _PAGE_RW | _PAGE_RO,
+ .val = _PAGE_RO,
+ .set = "ro",
+ .clear = "rw",
}, {
.mask = _PAGE_EXEC,
.val = _PAGE_EXEC,
--
2.37.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dd8de84b57b02ba9c1fe530a6d916c0853f136bd Mon Sep 17 00:00:00 2001
From: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Date: Tue, 28 Jun 2022 16:43:35 +0200
Subject: [PATCH] powerpc/ptdump: Fix display of RW pages on FSL_BOOK3E
On FSL_BOOK3E, _PAGE_RW is defined with two bits, one for user and one
for supervisor. As soon as one of the two bits is set, the page has
to be display as RW. But the way it is implemented today requires both
bits to be set in order to display it as RW.
Instead of display RW when _PAGE_RW bits are set and R otherwise,
reverse the logic and display R when _PAGE_RW bits are all 0 and
RW otherwise.
This change has no impact on other platforms as _PAGE_RW is a single
bit on all of them.
Fixes: 8eb07b187000 ("powerpc/mm: Dump linux pagetables")
Cc: stable(a)vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/0c33b96317811edf691e81698aaee8fa45ec3449.16564273…
diff --git a/arch/powerpc/mm/ptdump/shared.c b/arch/powerpc/mm/ptdump/shared.c
index 03607ab90c66..f884760ca5cf 100644
--- a/arch/powerpc/mm/ptdump/shared.c
+++ b/arch/powerpc/mm/ptdump/shared.c
@@ -17,9 +17,9 @@ static const struct flag_info flag_array[] = {
.clear = " ",
}, {
.mask = _PAGE_RW,
- .val = _PAGE_RW,
- .set = "rw",
- .clear = "r ",
+ .val = 0,
+ .set = "r ",
+ .clear = "rw",
}, {
.mask = _PAGE_EXEC,
.val = _PAGE_EXEC,
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From aa7aeee169480e98cf41d83c01290a37e569be6d Mon Sep 17 00:00:00 2001
From: Tyler Hicks <tyhicks(a)linux.microsoft.com>
Date: Sun, 10 Jul 2022 09:14:02 -0500
Subject: [PATCH] net/9p: Initialize the iounit field during fid creation
Ensure that the fid's iounit field is set to zero when a new fid is
created. Certain 9P operations, such as OPEN and CREATE, allow the
server to reply with an iounit size which the client code assigns to the
p9_fid struct shortly after the fid is created by p9_fid_create(). On
the other hand, an XATTRWALK operation doesn't allow for the server to
specify an iounit value. The iounit field of the newly allocated p9_fid
struct remained uninitialized in that case. Depending on allocation
patterns, the iounit value could have been something reasonable that was
carried over from previously freed fids or, in the worst case, could
have been arbitrary values from non-fid related usages of the memory
location.
The bug was detected in the Windows Subsystem for Linux 2 (WSL2) kernel
after the uninitialized iounit field resulted in the typical sequence of
two getxattr(2) syscalls, one to get the size of an xattr and another
after allocating a sufficiently sized buffer to fit the xattr value, to
hit an unexpected ERANGE error in the second call to getxattr(2). An
uninitialized iounit field would sometimes force rsize to be smaller
than the xattr value size in p9_client_read_once() and the 9P server in
WSL refused to chunk up the READ on the attr_fid and, instead, returned
ERANGE to the client. The virtfs server in QEMU seems happy to chunk up
the READ and this problem goes undetected there.
Link: https://lkml.kernel.org/r/20220710141402.803295-1-tyhicks@linux.microsoft.c…
Fixes: ebf46264a004 ("fs/9p: Add support user. xattr")
Cc: stable(a)vger.kernel.org
Signed-off-by: Tyler Hicks <tyhicks(a)linux.microsoft.com>
Reviewed-by: Christian Schoenebeck <linux_oss(a)crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus(a)codewreck.org>
diff --git a/net/9p/client.c b/net/9p/client.c
index 89c5aeb00076..3c145a64dc2b 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -887,16 +887,13 @@ static struct p9_fid *p9_fid_create(struct p9_client *clnt)
struct p9_fid *fid;
p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt);
- fid = kmalloc(sizeof(*fid), GFP_KERNEL);
+ fid = kzalloc(sizeof(*fid), GFP_KERNEL);
if (!fid)
return NULL;
- memset(&fid->qid, 0, sizeof(fid->qid));
fid->mode = -1;
fid->uid = current_fsuid();
fid->clnt = clnt;
- fid->rdir = NULL;
- fid->fid = 0;
refcount_set(&fid->count, 1);
idr_preload(GFP_KERNEL);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 573ae4f13f630d6660008f1974c0a8a29c30e18a Mon Sep 17 00:00:00 2001
From: Jens Wiklander <jens.wiklander(a)linaro.org>
Date: Thu, 18 Aug 2022 13:08:59 +0200
Subject: [PATCH] tee: add overflow check in register_shm_helper()
With special lengths supplied by user space, register_shm_helper() has
an integer overflow when calculating the number of pages covered by a
supplied user space memory region.
This causes internal_get_user_pages_fast() a helper function of
pin_user_pages_fast() to do a NULL pointer dereference:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
Modules linked in:
CPU: 1 PID: 173 Comm: optee_example_a Not tainted 5.19.0 #11
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
pc : internal_get_user_pages_fast+0x474/0xa80
Call trace:
internal_get_user_pages_fast+0x474/0xa80
pin_user_pages_fast+0x24/0x4c
register_shm_helper+0x194/0x330
tee_shm_register_user_buf+0x78/0x120
tee_ioctl+0xd0/0x11a0
__arm64_sys_ioctl+0xa8/0xec
invoke_syscall+0x48/0x114
Fix this by adding an an explicit call to access_ok() in
tee_shm_register_user_buf() to catch an invalid user space address
early.
Fixes: 033ddf12bcf5 ("tee: add register user memory")
Cc: stable(a)vger.kernel.org
Reported-by: Nimish Mishra <neelam.nimish(a)gmail.com>
Reported-by: Anirban Chakraborty <ch.anirban00727(a)gmail.com>
Reported-by: Debdeep Mukhopadhyay <debdeep.mukhopadhyay(a)gmail.com>
Suggested-by: Jerome Forissier <jerome.forissier(a)linaro.org>
Signed-off-by: Jens Wiklander <jens.wiklander(a)linaro.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
index f2b1bcefcadd..1175f3a46859 100644
--- a/drivers/tee/tee_shm.c
+++ b/drivers/tee/tee_shm.c
@@ -326,6 +326,9 @@ struct tee_shm *tee_shm_register_user_buf(struct tee_context *ctx,
void *ret;
int id;
+ if (!access_ok((void __user *)addr, length))
+ return ERR_PTR(-EFAULT);
+
mutex_lock(&teedev->mutex);
id = idr_alloc(&teedev->idr, NULL, 1, 0, GFP_KERNEL);
mutex_unlock(&teedev->mutex);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 573ae4f13f630d6660008f1974c0a8a29c30e18a Mon Sep 17 00:00:00 2001
From: Jens Wiklander <jens.wiklander(a)linaro.org>
Date: Thu, 18 Aug 2022 13:08:59 +0200
Subject: [PATCH] tee: add overflow check in register_shm_helper()
With special lengths supplied by user space, register_shm_helper() has
an integer overflow when calculating the number of pages covered by a
supplied user space memory region.
This causes internal_get_user_pages_fast() a helper function of
pin_user_pages_fast() to do a NULL pointer dereference:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
Modules linked in:
CPU: 1 PID: 173 Comm: optee_example_a Not tainted 5.19.0 #11
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
pc : internal_get_user_pages_fast+0x474/0xa80
Call trace:
internal_get_user_pages_fast+0x474/0xa80
pin_user_pages_fast+0x24/0x4c
register_shm_helper+0x194/0x330
tee_shm_register_user_buf+0x78/0x120
tee_ioctl+0xd0/0x11a0
__arm64_sys_ioctl+0xa8/0xec
invoke_syscall+0x48/0x114
Fix this by adding an an explicit call to access_ok() in
tee_shm_register_user_buf() to catch an invalid user space address
early.
Fixes: 033ddf12bcf5 ("tee: add register user memory")
Cc: stable(a)vger.kernel.org
Reported-by: Nimish Mishra <neelam.nimish(a)gmail.com>
Reported-by: Anirban Chakraborty <ch.anirban00727(a)gmail.com>
Reported-by: Debdeep Mukhopadhyay <debdeep.mukhopadhyay(a)gmail.com>
Suggested-by: Jerome Forissier <jerome.forissier(a)linaro.org>
Signed-off-by: Jens Wiklander <jens.wiklander(a)linaro.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
index f2b1bcefcadd..1175f3a46859 100644
--- a/drivers/tee/tee_shm.c
+++ b/drivers/tee/tee_shm.c
@@ -326,6 +326,9 @@ struct tee_shm *tee_shm_register_user_buf(struct tee_context *ctx,
void *ret;
int id;
+ if (!access_ok((void __user *)addr, length))
+ return ERR_PTR(-EFAULT);
+
mutex_lock(&teedev->mutex);
id = idr_alloc(&teedev->idr, NULL, 1, 0, GFP_KERNEL);
mutex_unlock(&teedev->mutex);
This is the backport for v5.18.x stable branch.
The full explananation can be found here:
https://lore.kernel.org/linux-btrfs/Yv85PTBsDhrQITZp@kroah.com/T/#t
Difference between v5.18.x and v5.19.x backports:
- Subpage sector::uptodate related changes
Since v5.18 and older branches doesn't support subpage RAID56, these
uptodate checks are now in their older PageUptodate form.
- Various member naming changes
Like btrfs_raid_bio::stripe_npages -> stripe_nsector.
Qu Wenruo (2):
btrfs: only write the sectors in the vertical stripe which has data
stripes
btrfs: raid56: don't trust any cached sector in
__raid56_parity_recover()
fs/btrfs/raid56.c | 74 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 57 insertions(+), 17 deletions(-)
--
2.37.1
This backport is going to address the following possible corruption for
btrfs RAID56:
- RMW always writes the full P/Q stripe
This happens even only some vertical stripes are dirty.
If some data sectors are corrupted in the clean vertical stripes,
then such unnecessary P/Q updates will wipe the chance or recovery,
as the P/Q will be calculated using corrupted data.
This will be addressed by the first backport patch.
- Don't trust any cached sector when doing recovery
Although above patch will avoid unnecessary P/Q writes, the full P/Q
stripe will still be updated in the in-memory only RAID56 cache.
To properly recovery data stripes, we should not trust any cached
sector, and always read data from disk, which will avoid corrupted
P/Q sectors.
This will be addressed by the second backport patch.
This would make some previously always-fail test cases, like btrfs/125,
to pass, and end users have a lower chance to corrupt their RAID56 data.
Since this is a data corruption related fix, these backport patches are
needed for all stable branches.
Unfortunately due to the new cleanups in v6.0-rc, these backport patches
have quite some conflicts (even for 5.19), and have to be manually resolved.
Almost every stable branch will need their own backports.
Acked-by: David Sterba <dsterba(a)suse.com>
Qu Wenruo (2):
btrfs: only write the sectors in the vertical stripe which has data
stripes
btrfs: raid56: don't trust any cached sector in
__raid56_parity_recover()
fs/btrfs/raid56.c | 68 ++++++++++++++++++++++++++++++++++++++---------
1 file changed, 55 insertions(+), 13 deletions(-)
--
2.37.1
The mitigation for PBRSB includes adding LFENCE instructions to the
RSB filling sequence. However, RSB filling is done on some older CPUs
that don't support the LFENCE instruction.
Define and use a BARRIER_NOSPEC macro which makes the LFENCE
conditional on X86_FEATURE_LFENCE_RDTSC, like the barrier_nospec()
macro defined for C code in <asm/barrier.h>.
Reported-by: Martin-Éric Racine <martin-eric.racine(a)iki.fi>
References: https://bugs.debian.org/1017425
Cc: stable(a)vger.kernel.org
Cc: regressions(a)lists.linux.dev
Cc: Daniel Sneddon <daniel.sneddon(a)linux.intel.com>
Cc: Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
Fixes: 2b1299322016 ("x86/speculation: Add RSB VM Exit protections")
Fixes: ba6e31af2be9 ("x86/speculation: Add LFENCE to RSB fill sequence")
Signed-off-by: Ben Hutchings <benh(a)debian.org>
---
arch/x86/include/asm/nospec-branch.h | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index e64fd20778b6..b1029fd88474 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -34,6 +34,11 @@
#define RSB_CLEAR_LOOPS 32 /* To forcibly overwrite all entries */
+#ifdef __ASSEMBLY__
+
+/* Prevent speculative execution past this barrier. */
+#define BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
+
/*
* Google experimented with loop-unrolling and this turned out to be
* the optimal version - two calls, each with their own speculation
@@ -62,9 +67,7 @@
dec reg; \
jnz 771b; \
/* barrier for jnz misprediction */ \
- lfence;
-
-#ifdef __ASSEMBLY__
+ BARRIER_NOSPEC;
/*
* This should be used immediately before an indirect jump/call. It tells
@@ -138,7 +141,7 @@
int3
.Lunbalanced_ret_guard_\@:
add $(BITS_PER_LONG/8), %_ASM_SP
- lfence
+ BARRIER_NOSPEC
.endm
/*
Tired of trying Negativer plans but that have only partial effect that last
only several weeks?
Try this complex strategy and get the negative SEO effect to come much
faster and last a lot longer than the traditional Negative strategies
More info here
https://www.creative-digital.co/product/derank-seo-service/
Unsubscribe:
in the footer of our site
On Mon, Aug 01, 2022 at 06:25:11PM +0000, Carlos Llamas wrote:
> A transaction of type BINDER_TYPE_WEAK_HANDLE can fail to increment the
> reference for a node. In this case, the target proc normally releases
> the failed reference upon close as expected. However, if the target is
> dying in parallel the call will race with binder_deferred_release(), so
> the target could have released all of its references by now leaving the
> cleanup of the new failed reference unhandled.
>
> The transaction then ends and the target proc gets released making the
> ref->proc now a dangling pointer. Later on, ref->node is closed and we
> attempt to take spin_lock(&ref->proc->inner_lock), which leads to the
> use-after-free bug reported below. Let's fix this by cleaning up the
> failed reference on the spot instead of relying on the target to do so.
>
> ==================================================================
> BUG: KASAN: use-after-free in _raw_spin_lock+0xa8/0x150
> Write of size 4 at addr ffff5ca207094238 by task kworker/1:0/590
>
> CPU: 1 PID: 590 Comm: kworker/1:0 Not tainted 5.19.0-rc8 #10
> Hardware name: linux,dummy-virt (DT)
> Workqueue: events binder_deferred_func
> Call trace:
> dump_backtrace.part.0+0x1d0/0x1e0
> show_stack+0x18/0x70
> dump_stack_lvl+0x68/0x84
> print_report+0x2e4/0x61c
> kasan_report+0xa4/0x110
> kasan_check_range+0xfc/0x1a4
> __kasan_check_write+0x3c/0x50
> _raw_spin_lock+0xa8/0x150
> binder_deferred_func+0x5e0/0x9b0
> process_one_work+0x38c/0x5f0
> worker_thread+0x9c/0x694
> kthread+0x188/0x190
> ret_from_fork+0x10/0x20
>
> Signed-off-by: Carlos Llamas <cmllamas(a)google.com>
> ---
> drivers/android/binder.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> index 362c0deb65f1..9d42afe60180 100644
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -1361,6 +1361,18 @@ static int binder_inc_ref_for_node(struct binder_proc *proc,
> }
> ret = binder_inc_ref_olocked(ref, strong, target_list);
> *rdata = ref->data;
> + if (ret && ref == new_ref) {
> + /*
> + * Cleanup the failed reference here as the target
> + * could now be dead and have already released its
> + * references by now. Calling on the new reference
> + * with strong=0 and a tmp_refs will not decrement
> + * the node. The new_ref gets kfree'd below.
> + */
> + binder_cleanup_ref_olocked(new_ref);
> + ref = NULL;
> + }
> +
> binder_proc_unlock(proc);
> if (new_ref && ref != new_ref)
> /*
> --
> 2.37.1.455.g008518b4e5-goog
>
Sorry, I forgot to CC stable. This patch should be applied to all stable
kernels starting with 4.14 and higher.
Cc: stable(a)vger.kernel.org # 4.14+
From: Cameron Gutman <aicommander(a)gmail.com>
Suspending and resuming the system can sometimes cause the out
URB to get hung after a reset_resume. This causes LED setting
and force feedback to break on resume. To avoid this, just drop
the reset_resume callback so the USB core rebinds xpad to the
wireless pads on resume if a reset happened.
A nice side effect of this change is the LED ring on wireless
controllers is now set correctly on system resume.
Cc: stable(a)vger.kernel.org
Fixes: 4220f7db1e42 ("Input: xpad - workaround dead irq_out after suspend/ resume")
Signed-off-by: Cameron Gutman <aicommander(a)gmail.com>
Signed-off-by: Pavel Rojtberg <rojtberg(a)gmail.com>
---
drivers/input/joystick/xpad.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/input/joystick/xpad.c b/drivers/input/joystick/xpad.c
index 629646b..4e01056 100644
--- a/drivers/input/joystick/xpad.c
+++ b/drivers/input/joystick/xpad.c
@@ -1991,7 +1991,6 @@ static struct usb_driver xpad_driver = {
.disconnect = xpad_disconnect,
.suspend = xpad_suspend,
.resume = xpad_resume,
- .reset_resume = xpad_resume,
.id_table = xpad_table,
};
--
2.34.1
Older CPUs beyond its Servicing period are not listed in the affected
processor list for MMIO Stale Data vulnerabilities. These CPUs currently
report "Not affected" in sysfs, which may not be correct.
Add support for "Unknown" reporting for such CPUs. Mitigation is not
deployed when the status is "Unknown".
"CPU is beyond its Servicing period" means these CPUs are beyond their
Servicing [1] period and have reached End of Servicing Updates (ESU) [2].
[1] Servicing: The process of providing functional and security
updates to Intel processors or platforms, utilizing the Intel Platform
Update (IPU) process or other similar mechanisms.
[2] End of Servicing Updates (ESU): ESU is the date at which Intel
will no longer provide Servicing, such as through IPU or other similar
update processes. ESU dates will typically be aligned to end of
quarter.
Suggested-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Suggested-by: Tony Luck <tony.luck(a)intel.com>
Fixes: 8d50cdf8b834 ("x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data")
Cc: stable(a)vger.kernel.org
Signed-off-by: Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
---
CPU vulnerability is unknown if, hardware doesn't set the immunity bits
and CPU is not in the known-affected-list.
In order to report the unknown status, this patch sets the MMIO bug
for all Intel CPUs that don't have the hardware immunity bits set.
Based on the known-affected-list of CPUs, mitigation selection then
deploys the mitigation or sets the "Unknown" status; which is ugly.
I will appreciate suggestions to improve this.
Thanks,
Pawan
.../hw-vuln/processor_mmio_stale_data.rst | 3 +++
arch/x86/kernel/cpu/bugs.c | 11 +++++++-
arch/x86/kernel/cpu/common.c | 26 +++++++++++++------
arch/x86/kernel/cpu/cpu.h | 1 +
4 files changed, 32 insertions(+), 9 deletions(-)
diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
index 9393c50b5afc..55524e0798da 100644
--- a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
+++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
@@ -230,6 +230,9 @@ The possible values in this file are:
* - 'Mitigation: Clear CPU buffers'
- The processor is vulnerable and the CPU buffer clearing mitigation is
enabled.
+ * - 'Unknown: CPU is beyond its Servicing period'
+ - The processor vulnerability status is unknown because it is
+ out of Servicing period. Mitigation is not attempted.
If the processor is vulnerable then the following information is appended to
the above information:
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0dd04713434b..dd6e78d370bc 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -416,6 +416,7 @@ enum mmio_mitigations {
MMIO_MITIGATION_OFF,
MMIO_MITIGATION_UCODE_NEEDED,
MMIO_MITIGATION_VERW,
+ MMIO_MITIGATION_UNKNOWN,
};
/* Default mitigation for Processor MMIO Stale Data vulnerabilities */
@@ -426,12 +427,18 @@ static const char * const mmio_strings[] = {
[MMIO_MITIGATION_OFF] = "Vulnerable",
[MMIO_MITIGATION_UCODE_NEEDED] = "Vulnerable: Clear CPU buffers attempted, no microcode",
[MMIO_MITIGATION_VERW] = "Mitigation: Clear CPU buffers",
+ [MMIO_MITIGATION_UNKNOWN] = "Unknown: CPU is beyond its servicing period",
};
static void __init mmio_select_mitigation(void)
{
u64 ia32_cap;
+ if (mmio_stale_data_unknown()) {
+ mmio_mitigation = MMIO_MITIGATION_UNKNOWN;
+ return;
+ }
+
if (!boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA) ||
cpu_mitigations_off()) {
mmio_mitigation = MMIO_MITIGATION_OFF;
@@ -1638,6 +1645,7 @@ void cpu_bugs_smt_update(void)
pr_warn_once(MMIO_MSG_SMT);
break;
case MMIO_MITIGATION_OFF:
+ case MMIO_MITIGATION_UNKNOWN:
break;
}
@@ -2235,7 +2243,8 @@ static ssize_t tsx_async_abort_show_state(char *buf)
static ssize_t mmio_stale_data_show_state(char *buf)
{
- if (mmio_mitigation == MMIO_MITIGATION_OFF)
+ if (mmio_mitigation == MMIO_MITIGATION_OFF ||
+ mmio_mitigation == MMIO_MITIGATION_UNKNOWN)
return sysfs_emit(buf, "%s\n", mmio_strings[mmio_mitigation]);
if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 736262a76a12..82088410870e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1286,6 +1286,22 @@ static bool arch_cap_mmio_immune(u64 ia32_cap)
ia32_cap & ARCH_CAP_SBDR_SSDP_NO);
}
+bool __init mmio_stale_data_unknown(void)
+{
+ u64 ia32_cap = x86_read_arch_cap_msr();
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ return false;
+ /*
+ * CPU vulnerability is unknown when, hardware doesn't set the
+ * immunity bits and CPU is not in the known affected list.
+ */
+ if (!cpu_matches(cpu_vuln_blacklist, MMIO) &&
+ !arch_cap_mmio_immune(ia32_cap))
+ return true;
+ return false;
+}
+
static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
{
u64 ia32_cap = x86_read_arch_cap_msr();
@@ -1349,14 +1365,8 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
cpu_matches(cpu_vuln_blacklist, SRBDS | MMIO_SBDS))
setup_force_cpu_bug(X86_BUG_SRBDS);
- /*
- * Processor MMIO Stale Data bug enumeration
- *
- * Affected CPU list is generally enough to enumerate the vulnerability,
- * but for virtualization case check for ARCH_CAP MSR bits also, VMM may
- * not want the guest to enumerate the bug.
- */
- if (cpu_matches(cpu_vuln_blacklist, MMIO) &&
+ /* Processor MMIO Stale Data bug enumeration */
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
!arch_cap_mmio_immune(ia32_cap))
setup_force_cpu_bug(X86_BUG_MMIO_STALE_DATA);
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index 7c9b5893c30a..a2dbfc1bbc49 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -82,6 +82,7 @@ unsigned int aperfmperf_get_khz(int cpu);
extern void x86_spec_ctrl_setup_ap(void);
extern void update_srbds_msr(void);
+extern bool mmio_stale_data_unknown(void);
extern u64 x86_read_arch_cap_msr(void);
base-commit: 4a57a8400075bc5287c5c877702c68aeae2a033d
--
2.35.3
The vfio_ap_mdev_unlink_adapter and vfio_ap_mdev_unlink_domain functions
add the associated vfio_ap_queue objects to the hashtable that links them
to the matrix mdev to which their APQN is assigned. In order to unlink
them, they must be deleted from the hashtable; if not, they will continue
to be reset whenever userspace closes the mdev fd or removes the mdev.
This patch fixes that issue.
Cc: stable(a)vger.kernel.org
Fixes: 2838ba5bdcd6 ("s390/vfio-ap: reset queues after adapter/domain unassignment")
Reported-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
---
drivers/s390/crypto/vfio_ap_ops.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index ee82207b4e60..2493926b5dfb 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1049,8 +1049,7 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
if (q && qtable) {
if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
- hash_add(qtable->queues, &q->mdev_qnode,
- q->apqn);
+ vfio_ap_unlink_queue_fr_mdev(q);
}
}
}
@@ -1236,8 +1235,7 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
if (q && qtable) {
if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
- hash_add(qtable->queues, &q->mdev_qnode,
- q->apqn);
+ vfio_ap_unlink_queue_fr_mdev(q);
}
}
}
--
2.31.1
This reverts commit 96e51ccf1af33e82f429a0d6baebba29c6448d0f.
Recently we started running the kernel with rstat infrastructure on
production traffic and begin to see negative memcg stats values.
Particularly the 'sock' stat is the one which we observed having
negative value.
$ grep "sock " /mnt/memory/job/memory.stat
sock 253952
total_sock 18446744073708724224
Re-run after couple of seconds
$ grep "sock " /mnt/memory/job/memory.stat
sock 253952
total_sock 53248
For now we are only seeing this issue on large machines (256 CPUs) and
only with 'sock' stat. I think the networking stack increase the stat on
one cpu and decrease it on another cpu much more often. So, this
negative sock is due to rstat flusher flushing the stats on the CPU that
has seen the decrement of sock but missed the CPU that has increments. A
typical race condition.
For easy stable backport, revert is the most simple solution. For long
term solution, I am thinking of two directions. First is just reduce the
race window by optimizing the rstat flusher. Second is if the reader
sees a negative stat value, force flush and restart the stat collection.
Basically retry but limited.
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Cc: stable(a)vger.kernel.org # 5.15
---
include/linux/memcontrol.h | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d31ce55b1c0..6257867fbf95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -987,19 +987,30 @@ static inline void mod_memcg_page_state(struct page *page,
static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
{
- return READ_ONCE(memcg->vmstats.state[idx]);
+ long x = READ_ONCE(memcg->vmstats.state[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
}
static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
+ long x;
if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- return READ_ONCE(pn->lruvec_stats.state[idx]);
+ x = READ_ONCE(pn->lruvec_stats.state[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
}
static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
--
2.37.1.595.g718a3a8f04-goog
When the port does not support USB PD, prevent transition to PD
only states when power supply property is written. In this case,
TCPM transitions to SNK_NEGOTIATE_CAPABILITIES
which should not be the case given that the port is not pd_capable.
[ 84.308251] state change SNK_READY -> SNK_NEGOTIATE_CAPABILITIES [rev3 NONE_AMS]
[ 84.308335] Setting usb_comm capable false
[ 84.323367] set_auto_vbus_discharge_threshold mode:3 pps_active:n vbus:5000 ret:0
[ 84.323376] state change SNK_NEGOTIATE_CAPABILITIES -> SNK_WAIT_CAPABILITIES [rev3 NONE_AMS]
Fixes: e9e6e164ed8f6 ("usb: typec: tcpm: Support non-PD mode")
Cc: stable(a)vger.kernel.org
Signed-off-by: Badhri Jagan Sridharan <badhri(a)google.com>
---
Changes since v1:
- Add Fixes tag.
Changes since v2:
- CCed stable
---
drivers/usb/typec/tcpm/tcpm.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/usb/typec/tcpm/tcpm.c b/drivers/usb/typec/tcpm/tcpm.c
index ea5a917c51b1..904c7b4ce2f0 100644
--- a/drivers/usb/typec/tcpm/tcpm.c
+++ b/drivers/usb/typec/tcpm/tcpm.c
@@ -6320,6 +6320,13 @@ static int tcpm_psy_set_prop(struct power_supply *psy,
struct tcpm_port *port = power_supply_get_drvdata(psy);
int ret;
+ /*
+ * All the properties below are related to USB PD. The check needs to be
+ * property specific when a non-pd related property is added.
+ */
+ if (!port->pd_supported)
+ return -EOPNOTSUPP;
+
switch (psp) {
case POWER_SUPPLY_PROP_ONLINE:
ret = tcpm_psy_set_online(port, val);
--
2.37.1.595.g718a3a8f04-goog
The patch below was submitted to be applied to the 5.19-stable tree.
I fail to see how this patch meets the stable kernel rules as found at
Documentation/process/stable-kernel-rules.rst.
I could be totally wrong, and if so, please respond to
<stable(a)vger.kernel.org> and let me know why this patch should be
applied. Otherwise, it is now dropped from my patch queues, never to be
seen again.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b1f707146923335849fb70237eec27d4d1ae7d62 Mon Sep 17 00:00:00 2001
From: Arun Easi <aeasi(a)marvell.com>
Date: Tue, 12 Jul 2022 22:20:39 -0700
Subject: [PATCH] scsi: qla2xxx: Fix response queue handler reading stale
packets
On some platforms, the current logic of relying on finding new packet
solely based on signature pattern can lead to driver reading stale
packets. Though this is a bug in those platforms, reduce such exposures by
limiting reading packets until the IN pointer.
Two module parameters are introduced:
ql2xrspq_follow_inptr:
When set, on newer adapters that has queue pointer shadowing, look for
response packets only until response queue in pointer.
When reset, response packets are read based on a signature pattern
logic (old way).
ql2xrspq_follow_inptr_legacy:
Like ql2xrspq_follow_inptr, but for those adapters where there is no
queue pointer shadowing.
Link: https://lore.kernel.org/r/20220713052045.10683-5-njavali@marvell.com
Cc: stable(a)vger.kernel.org
Reviewed-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
Signed-off-by: Arun Easi <aeasi(a)marvell.com>
Signed-off-by: Nilesh Javali <njavali(a)marvell.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/qla2xxx/qla_gbl.h b/drivers/scsi/qla2xxx/qla_gbl.h
index 3674b35196b0..96147ca40126 100644
--- a/drivers/scsi/qla2xxx/qla_gbl.h
+++ b/drivers/scsi/qla2xxx/qla_gbl.h
@@ -193,6 +193,8 @@ extern int ql2xsecenable;
extern int ql2xenforce_iocb_limit;
extern int ql2xabts_wait_nvme;
extern u32 ql2xnvme_queues;
+extern int ql2xrspq_follow_inptr;
+extern int ql2xrspq_follow_inptr_legacy;
extern int qla2x00_loop_reset(scsi_qla_host_t *);
extern void qla2x00_abort_all_cmds(scsi_qla_host_t *, int);
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 4fa24d318f14..35b425c446b9 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -3780,6 +3780,8 @@ void qla24xx_process_response_queue(struct scsi_qla_host *vha,
struct qla_hw_data *ha = vha->hw;
struct purex_entry_24xx *purex_entry;
struct purex_item *pure_item;
+ u16 rsp_in = 0;
+ int follow_inptr, is_shadow_hba;
if (!ha->flags.fw_started)
return;
@@ -3789,7 +3791,25 @@ void qla24xx_process_response_queue(struct scsi_qla_host *vha,
qla_cpu_update(rsp->qpair, smp_processor_id());
}
- while (rsp->ring_ptr->signature != RESPONSE_PROCESSED) {
+#define __update_rsp_in(_update, _is_shadow_hba, _rsp, _rsp_in) \
+ do { \
+ if (_update) { \
+ _rsp_in = _is_shadow_hba ? *(_rsp)->in_ptr : \
+ rd_reg_dword_relaxed((_rsp)->rsp_q_in); \
+ } \
+ } while (0)
+
+ is_shadow_hba = IS_SHADOW_REG_CAPABLE(ha);
+ follow_inptr = is_shadow_hba ? ql2xrspq_follow_inptr :
+ ql2xrspq_follow_inptr_legacy;
+
+ __update_rsp_in(follow_inptr, is_shadow_hba, rsp, rsp_in);
+
+ while ((likely(follow_inptr &&
+ rsp->ring_index != rsp_in &&
+ rsp->ring_ptr->signature != RESPONSE_PROCESSED)) ||
+ (!follow_inptr &&
+ rsp->ring_ptr->signature != RESPONSE_PROCESSED)) {
pkt = (struct sts_entry_24xx *)rsp->ring_ptr;
rsp->ring_index++;
@@ -3902,6 +3922,8 @@ void qla24xx_process_response_queue(struct scsi_qla_host *vha,
}
pure_item = qla27xx_copy_fpin_pkt(vha,
(void **)&pkt, &rsp);
+ __update_rsp_in(follow_inptr, is_shadow_hba,
+ rsp, rsp_in);
if (!pure_item)
break;
qla24xx_queue_purex_item(vha, pure_item,
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 1c7fb6484db2..0bd0fd1042df 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -338,6 +338,16 @@ module_param(ql2xdelay_before_pci_error_handling, uint, 0644);
MODULE_PARM_DESC(ql2xdelay_before_pci_error_handling,
"Number of seconds delayed before qla begin PCI error self-handling (default: 5).\n");
+int ql2xrspq_follow_inptr = 1;
+module_param(ql2xrspq_follow_inptr, int, 0644);
+MODULE_PARM_DESC(ql2xrspq_follow_inptr,
+ "Follow RSP IN pointer for RSP updates for HBAs 27xx and newer (default: 1).");
+
+int ql2xrspq_follow_inptr_legacy = 1;
+module_param(ql2xrspq_follow_inptr_legacy, int, 0644);
+MODULE_PARM_DESC(ql2xrspq_follow_inptr_legacy,
+ "Follow RSP IN pointer for RSP updates for HBAs older than 27XX. (default: 1).");
+
static void qla2x00_clear_drv_active(struct qla_hw_data *);
static void qla2x00_free_device(scsi_qla_host_t *);
static int qla2xxx_map_queues(struct Scsi_Host *shost);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 689a71493bd2f31c024f8c0395f85a1fd4b2138e Mon Sep 17 00:00:00 2001
From: Coiby Xu <coxu(a)redhat.com>
Date: Thu, 14 Jul 2022 21:40:24 +0800
Subject: [PATCH] kexec: clean up arch_kexec_kernel_verify_sig
Before commit 105e10e2cf1c ("kexec_file: drop weak attribute from
functions"), there was already no arch-specific implementation
of arch_kexec_kernel_verify_sig. With weak attribute dropped by that
commit, arch_kexec_kernel_verify_sig is completely useless. So clean it
up.
Note later patches are dependent on this patch so it should be backported
to the stable tree as well.
Cc: stable(a)vger.kernel.org
Suggested-by: Eric W. Biederman <ebiederm(a)xmission.com>
Reviewed-by: Michal Suchanek <msuchanek(a)suse.de>
Acked-by: Baoquan He <bhe(a)redhat.com>
Signed-off-by: Coiby Xu <coxu(a)redhat.com>
[zohar(a)linux.ibm.com: reworded patch description "Note"]
Link: https://lore.kernel.org/linux-integrity/20220714134027.394370-1-coxu@redhat…
Signed-off-by: Mimi Zohar <zohar(a)linux.ibm.com>
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 8107606ad1e8..7f710fb3712b 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -212,11 +212,6 @@ static inline void *arch_kexec_kernel_image_load(struct kimage *image)
}
#endif
-#ifdef CONFIG_KEXEC_SIG
-int arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
- unsigned long buf_len);
-#endif
-
extern int kexec_add_buffer(struct kexec_buf *kbuf);
int kexec_locate_mem_hole(struct kexec_buf *kbuf);
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 0c27c81351ee..6dc1294c90fc 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -81,24 +81,6 @@ int kexec_image_post_load_cleanup_default(struct kimage *image)
return image->fops->cleanup(image->image_loader_data);
}
-#ifdef CONFIG_KEXEC_SIG
-static int kexec_image_verify_sig_default(struct kimage *image, void *buf,
- unsigned long buf_len)
-{
- if (!image->fops || !image->fops->verify_sig) {
- pr_debug("kernel loader does not support signature verification.\n");
- return -EKEYREJECTED;
- }
-
- return image->fops->verify_sig(buf, buf_len);
-}
-
-int arch_kexec_kernel_verify_sig(struct kimage *image, void *buf, unsigned long buf_len)
-{
- return kexec_image_verify_sig_default(image, buf, buf_len);
-}
-#endif
-
/*
* Free up memory used by kernel, initrd, and command line. This is temporary
* memory allocation which is not needed any more after these buffers have
@@ -141,13 +123,24 @@ void kimage_file_post_load_cleanup(struct kimage *image)
}
#ifdef CONFIG_KEXEC_SIG
+static int kexec_image_verify_sig(struct kimage *image, void *buf,
+ unsigned long buf_len)
+{
+ if (!image->fops || !image->fops->verify_sig) {
+ pr_debug("kernel loader does not support signature verification.\n");
+ return -EKEYREJECTED;
+ }
+
+ return image->fops->verify_sig(buf, buf_len);
+}
+
static int
kimage_validate_signature(struct kimage *image)
{
int ret;
- ret = arch_kexec_kernel_verify_sig(image, image->kernel_buf,
- image->kernel_buf_len);
+ ret = kexec_image_verify_sig(image, image->kernel_buf,
+ image->kernel_buf_len);
if (ret) {
if (sig_enforce) {
The quilt patch titled
Subject: mm/hugetlb: support write-faults in shared mappings
has been removed from the -mm tree. Its filename was
mm-hugetlb-support-write-faults-in-shared-mappings.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/hugetlb: support write-faults in shared mappings
Date: Thu, 11 Aug 2022 12:34:35 +0200
If we ever get a write-fault on a write-protected page in a shared
mapping, we'd be in trouble (again). Instead, we can simply map the page
writable.
And in fact, there is even a way right now to trigger that code via
uffd-wp ever since we stared to support it for shmem in 5.19:
--------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>
#define HUGETLB_SIZE (2 * 1024 * 1024u)
static char *map;
int uffd;
static int temp_setup_uffd(void)
{
struct uffdio_api uffdio_api;
struct uffdio_register uffdio_register;
struct uffdio_writeprotect uffd_writeprotect;
struct uffdio_range uffd_range;
uffd = syscall(__NR_userfaultfd,
O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
if (uffd < 0) {
fprintf(stderr, "syscall() failed: %d\n", errno);
return -errno;
}
uffdio_api.api = UFFD_API;
uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
return -errno;
}
if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n");
return -ENOSYS;
}
/* Register UFFD-WP */
uffdio_register.range.start = (unsigned long) map;
uffdio_register.range.len = HUGETLB_SIZE;
uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
return -errno;
}
/* Writeprotect a single page. */
uffd_writeprotect.range.start = (unsigned long) map;
uffd_writeprotect.range.len = HUGETLB_SIZE;
uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
return -errno;
}
/* Unregister UFFD-WP without prior writeunprotection. */
uffd_range.start = (unsigned long) map;
uffd_range.len = HUGETLB_SIZE;
if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) {
fprintf(stderr, "UFFDIO_UNREGISTER failed: %d\n", errno);
return -errno;
}
return 0;
}
int main(int argc, char **argv)
{
int fd;
fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT);
if (!fd) {
fprintf(stderr, "open() failed\n");
return -errno;
}
if (ftruncate(fd, HUGETLB_SIZE)) {
fprintf(stderr, "ftruncate() failed\n");
return -errno;
}
map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {
fprintf(stderr, "mmap() failed\n");
return -errno;
}
*map = 0;
if (temp_setup_uffd())
return 1;
*map = 0;
return 0;
}
--------------------------------------------------------------------------
Above test fails with SIGBUS when there is only a single free hugetlb page.
# echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
Bus error (core dumped)
And worse, with sufficient free hugetlb pages it will map an anonymous page
into a shared mapping, for example, messing up accounting during unmap
and breaking MAP_SHARED semantics:
# echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
# cat /proc/meminfo | grep HugePages_
HugePages_Total: 2
HugePages_Free: 1
HugePages_Rsvd: 18446744073709551615
HugePages_Surp: 0
Reason is that uffd-wp doesn't clear the uffd-wp PTE bit when
unregistering and consequently keeps the PTE writeprotected. Reason for
this is to avoid the additional overhead when unregistering. Note that
this is the case also for !hugetlb and that we will end up with writable
PTEs that still have the uffd-wp PTE bit set once we return from
hugetlb_wp(). I'm not touching the uffd-wp PTE bit for now, because it
seems to be a generic thing -- wp_page_reuse() also doesn't clear it.
VM_MAYSHARE handling in hugetlb_fault() for FAULT_FLAG_WRITE indicates
that MAP_SHARED handling was at least envisioned, but could never have
worked as expected.
While at it, make sure that we never end up in hugetlb_wp() on write
faults without VM_WRITE, because we don't support maybe_mkwrite()
semantics as commonly used in the !hugetlb case -- for example, in
wp_page_reuse().
Note that there is no need to do any kind of reservation in
hugetlb_fault() in this case ... because we already have a hugetlb page
mapped R/O that we will simply map writable and we are not dealing with
COW/unsharing.
Link: https://lkml.kernel.org/r/20220811103435.188481-3-david@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Bjorn Helgaas <bhelgaas(a)google.com>
Cc: Cyrill Gorcunov <gorcunov(a)openvz.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jamie Liu <jamieliu(a)google.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Pavel Emelyanov <xemul(a)parallels.com>
Cc: Peter Feiner <pfeiner(a)google.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [5.19]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
--- a/mm/hugetlb.c~mm-hugetlb-support-write-faults-in-shared-mappings
+++ a/mm/hugetlb.c
@@ -5241,6 +5241,21 @@ static vm_fault_t hugetlb_wp(struct mm_s
VM_BUG_ON(unshare && (flags & FOLL_WRITE));
VM_BUG_ON(!unshare && !(flags & FOLL_WRITE));
+ /*
+ * hugetlb does not support FOLL_FORCE-style write faults that keep the
+ * PTE mapped R/O such as maybe_mkwrite() would do.
+ */
+ if (WARN_ON_ONCE(!unshare && !(vma->vm_flags & VM_WRITE)))
+ return VM_FAULT_SIGSEGV;
+
+ /* Let's take out MAP_SHARED mappings first. */
+ if (vma->vm_flags & VM_MAYSHARE) {
+ if (unlikely(unshare))
+ return 0;
+ set_huge_ptep_writable(vma, haddr, ptep);
+ return 0;
+ }
+
pte = huge_ptep_get(ptep);
old_page = pte_page(pte);
@@ -5781,12 +5796,11 @@ vm_fault_t hugetlb_fault(struct mm_struc
* If we are going to COW/unshare the mapping later, we examine the
* pending reservations for this page now. This will ensure that any
* allocations necessary to record that reservation occur outside the
- * spinlock. For private mappings, we also lookup the pagecache
- * page now as it is used to determine if a reservation has been
- * consumed.
+ * spinlock. Also lookup the pagecache page now as it is used to
+ * determine if a reservation has been consumed.
*/
if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
- !huge_pte_write(entry)) {
+ !(vma->vm_flags & VM_MAYSHARE) && !huge_pte_write(entry)) {
if (vma_needs_reservation(h, vma, haddr) < 0) {
ret = VM_FAULT_OOM;
goto out_mutex;
@@ -5794,9 +5808,7 @@ vm_fault_t hugetlb_fault(struct mm_struc
/* Just decrements count, does not deallocate */
vma_end_reservation(h, vma, haddr);
- if (!(vma->vm_flags & VM_MAYSHARE))
- pagecache_page = hugetlbfs_pagecache_page(h,
- vma, haddr);
+ pagecache_page = hugetlbfs_pagecache_page(h, vma, haddr);
}
ptl = huge_pte_lock(h, mm, ptep);
_
Patches currently in -mm which might be from david(a)redhat.com are
The quilt patch titled
Subject: mm/hugetlb: fix hugetlb not supporting softdirty tracking
has been removed from the -mm tree. Its filename was
mm-hugetlb-fix-hugetlb-not-supporting-softdirty-tracking.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/hugetlb: fix hugetlb not supporting softdirty tracking
Date: Thu, 11 Aug 2022 12:34:34 +0200
Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2.
I observed that hugetlb does not support/expect write-faults in shared
mappings that would have to map the R/O-mapped page writable -- and I
found two case where we could currently get such faults and would
erroneously map an anon page into a shared mapping.
Reproducers part of the patches.
I propose to backport both fixes to stable trees. The first fix needs a
small adjustment.
This patch (of 2):
Staring at hugetlb_wp(), one might wonder where all the logic for shared
mappings is when stumbling over a write-protected page in a shared
mapping. In fact, there is none, and so far we thought we could get away
with that because e.g., mprotect() should always do the right thing and
map all pages directly writable.
Looks like we were wrong:
--------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>
#define HUGETLB_SIZE (2 * 1024 * 1024u)
static void clear_softdirty(void)
{
int fd = open("/proc/self/clear_refs", O_WRONLY);
const char *ctrl = "4";
int ret;
if (fd < 0) {
fprintf(stderr, "open(clear_refs) failed\n");
exit(1);
}
ret = write(fd, ctrl, strlen(ctrl));
if (ret != strlen(ctrl)) {
fprintf(stderr, "write(clear_refs) failed\n");
exit(1);
}
close(fd);
}
int main(int argc, char **argv)
{
char *map;
int fd;
fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT);
if (!fd) {
fprintf(stderr, "open() failed\n");
return -errno;
}
if (ftruncate(fd, HUGETLB_SIZE)) {
fprintf(stderr, "ftruncate() failed\n");
return -errno;
}
map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {
fprintf(stderr, "mmap() failed\n");
return -errno;
}
*map = 0;
if (mprotect(map, HUGETLB_SIZE, PROT_READ)) {
fprintf(stderr, "mmprotect() failed\n");
return -errno;
}
clear_softdirty();
if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) {
fprintf(stderr, "mmprotect() failed\n");
return -errno;
}
*map = 0;
return 0;
}
--------------------------------------------------------------------------
Above test fails with SIGBUS when there is only a single free hugetlb page.
# echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
Bus error (core dumped)
And worse, with sufficient free hugetlb pages it will map an anonymous page
into a shared mapping, for example, messing up accounting during unmap
and breaking MAP_SHARED semantics:
# echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
# cat /proc/meminfo | grep HugePages_
HugePages_Total: 2
HugePages_Free: 1
HugePages_Rsvd: 18446744073709551615
HugePages_Surp: 0
Reason in this particular case is that vma_wants_writenotify() will
return "true", removing VM_SHARED in vma_set_page_prot() to map pages
write-protected. Let's teach vma_wants_writenotify() that hugetlb does not
support softdirty tracking.
Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com
Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com
Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Peter Feiner <pfeiner(a)google.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov(a)openvz.org>
Cc: Pavel Emelyanov <xemul(a)parallels.com>
Cc: Jamie Liu <jamieliu(a)google.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas(a)google.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [3.18+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmap.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
--- a/mm/mmap.c~mm-hugetlb-fix-hugetlb-not-supporting-softdirty-tracking
+++ a/mm/mmap.c
@@ -1646,8 +1646,11 @@ int vma_wants_writenotify(struct vm_area
pgprot_val(vm_pgprot_modify(vm_page_prot, vm_flags)))
return 0;
- /* Do we need to track softdirty? */
- if (vma_soft_dirty_enabled(vma))
+ /*
+ * Do we need to track softdirty? hugetlb does not support softdirty
+ * tracking yet.
+ */
+ if (vma_soft_dirty_enabled(vma) && !is_vm_hugetlb_page(vma))
return 1;
/* Specialty mapping? */
_
Patches currently in -mm which might be from david(a)redhat.com are
The quilt patch titled
Subject: mm/uffd: reset write protection when unregister with wp-mode
has been removed from the -mm tree. Its filename was
mm-uffd-reset-write-protection-when-unregister-with-wp-mode.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Peter Xu <peterx(a)redhat.com>
Subject: mm/uffd: reset write protection when unregister with wp-mode
Date: Thu, 11 Aug 2022 16:13:40 -0400
The motivation of this patch comes from a recent report and patchfix from
David Hildenbrand on hugetlb shared handling of wr-protected page [1].
With the reproducer provided in commit message of [1], one can leverage
the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect
not only the attacker process, but also the whole system.
The lazy-reset mechanism of uffd-wp was used to make unregister faster,
meanwhile it has an assumption that any leftover pgtable entries should
only affect the process on its own, so not only the user should be aware
of anything it does, but also it should not affect outside of the process.
But it seems that this is not true, and it can also be utilized to make
some exploit easier.
So far there's no clue showing that the lazy-reset is important to any
userfaultfd users because normally the unregister will only happen once
for a specific range of memory of the lifecycle of the process.
Considering all above, what this patch proposes is to do explicit pte
resets when unregister an uffd region with wr-protect mode enabled.
It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false)
right before ioctl(UFFDIO_UNREGISTER) for the user. So potentially it'll
make the unregister slower. From that pov it's a very slight abi change,
but hopefully nothing should break with this change either.
Regarding to the change itself - core of uffd write [un]protect operation
is moved into a separate function (uffd_wp_range()) and it is reused in
the unregister code path.
Note that the new function will not check for anything, e.g. ranges or
memory types, because they should have been checked during the previous
UFFDIO_REGISTER or it should have failed already. It also doesn't check
mmap_changing because we're with mmap write lock held anyway.
I added a Fixes upon introducing of uffd-wp shmem+hugetlbfs because that's
the only issue reported so far and that's the commit David's reproducer
will start working (v5.19+). But the whole idea actually applies to not
only file memories but also anonymous. It's just that we don't need to
fix anonymous prior to v5.19- because there's no known way to exploit.
IOW, this patch can also fix the issue reported in [1] as the patch 2 does.
[1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/
Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/userfaultfd.c | 4 ++++
include/linux/userfaultfd_k.h | 2 ++
mm/userfaultfd.c | 29 ++++++++++++++++++-----------
3 files changed, 24 insertions(+), 11 deletions(-)
--- a/fs/userfaultfd.c~mm-uffd-reset-write-protection-when-unregister-with-wp-mode
+++ a/fs/userfaultfd.c
@@ -1601,6 +1601,10 @@ static int userfaultfd_unregister(struct
wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
}
+ /* Reset ptes for the whole vma range if wr-protected */
+ if (userfaultfd_wp(vma))
+ uffd_wp_range(mm, vma, start, vma_end - start, false);
+
new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
--- a/include/linux/userfaultfd_k.h~mm-uffd-reset-write-protection-when-unregister-with-wp-mode
+++ a/include/linux/userfaultfd_k.h
@@ -73,6 +73,8 @@ extern ssize_t mcopy_continue(struct mm_
extern int mwriteprotect_range(struct mm_struct *dst_mm,
unsigned long start, unsigned long len,
bool enable_wp, atomic_t *mmap_changing);
+extern void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *vma,
+ unsigned long start, unsigned long len, bool enable_wp);
/* mm helpers */
static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
--- a/mm/userfaultfd.c~mm-uffd-reset-write-protection-when-unregister-with-wp-mode
+++ a/mm/userfaultfd.c
@@ -703,14 +703,29 @@ ssize_t mcopy_continue(struct mm_struct
mmap_changing, 0);
}
+void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
+ unsigned long start, unsigned long len, bool enable_wp)
+{
+ struct mmu_gather tlb;
+ pgprot_t newprot;
+
+ if (enable_wp)
+ newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
+ else
+ newprot = vm_get_page_prot(dst_vma->vm_flags);
+
+ tlb_gather_mmu(&tlb, dst_mm);
+ change_protection(&tlb, dst_vma, start, start + len, newprot,
+ enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+ tlb_finish_mmu(&tlb);
+}
+
int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, bool enable_wp,
atomic_t *mmap_changing)
{
struct vm_area_struct *dst_vma;
unsigned long page_mask;
- struct mmu_gather tlb;
- pgprot_t newprot;
int err;
/*
@@ -750,15 +765,7 @@ int mwriteprotect_range(struct mm_struct
goto out_unlock;
}
- if (enable_wp)
- newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
- else
- newprot = vm_get_page_prot(dst_vma->vm_flags);
-
- tlb_gather_mmu(&tlb, dst_mm);
- change_protection(&tlb, dst_vma, start, start + len, newprot,
- enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
- tlb_finish_mmu(&tlb);
+ uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp);
err = 0;
out_unlock:
_
Patches currently in -mm which might be from peterx(a)redhat.com are
mm-x86-use-swp_type_bits-in-3-level-swap-macros.patch
mm-swap-comment-all-the-ifdef-in-swapopsh.patch
mm-swap-add-swp_offset_pfn-to-fetch-pfn-from-swap-entry.patch
mm-thp-carry-over-dirty-bit-when-thp-splits-on-pmd.patch
mm-remember-young-dirty-bit-for-page-migrations.patch
mm-swap-cache-maximum-swapfile-size-when-init-swap.patch
mm-swap-cache-swap-migration-a-d-bits-support.patch
The quilt patch titled
Subject: kernel/sys_ni: add compat entry for fadvise64_64
has been removed from the -mm tree. Its filename was
kernel-sys_ni-add-compat-entry-for-fadvise64_64.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Randy Dunlap <rdunlap(a)infradead.org>
Subject: kernel/sys_ni: add compat entry for fadvise64_64
Date: Sun, 7 Aug 2022 15:09:34 -0700
When CONFIG_ADVISE_SYSCALLS is not set/enabled and CONFIG_COMPAT is
set/enabled, the riscv compat_syscall_table references
'compat_sys_fadvise64_64', which is not defined:
riscv64-linux-ld: arch/riscv/kernel/compat_syscall_table.o:(.rodata+0x6f8):
undefined reference to `compat_sys_fadvise64_64'
Add 'fadvise64_64' to kernel/sys_ni.c as a conditional COMPAT function so
that when CONFIG_ADVISE_SYSCALLS is not set, there is a fallback function
available.
Link: https://lkml.kernel.org/r/20220807220934.5689-1-rdunlap@infradead.org
Fixes: d3ac21cacc24 ("mm: Support compiling out madvise and fadvise")
Signed-off-by: Randy Dunlap <rdunlap(a)infradead.org>
Suggested-by: Arnd Bergmann <arnd(a)arndb.de>
Reviewed-by: Arnd Bergmann <arnd(a)arndb.de>
Cc: Josh Triplett <josh(a)joshtriplett.org>
Cc: Paul Walmsley <paul.walmsley(a)sifive.com>
Cc: Palmer Dabbelt <palmer(a)dabbelt.com>
Cc: Albert Ou <aou(a)eecs.berkeley.edu>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/sys_ni.c | 1 +
1 file changed, 1 insertion(+)
--- a/kernel/sys_ni.c~kernel-sys_ni-add-compat-entry-for-fadvise64_64
+++ a/kernel/sys_ni.c
@@ -277,6 +277,7 @@ COND_SYSCALL(landlock_restrict_self);
/* mm/fadvise.c */
COND_SYSCALL(fadvise64_64);
+COND_SYSCALL_COMPAT(fadvise64_64);
/* mm/, CONFIG_MMU only */
COND_SYSCALL(swapon);
_
Patches currently in -mm which might be from rdunlap(a)infradead.org are
The quilt patch titled
Subject: mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
has been removed from the -mm tree. Its filename was
mm-gup-fix-foll_force-cow-security-issue-and-remove-foll_cow.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
Date: Tue, 9 Aug 2022 22:56:40 +0200
Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
that FOLL_FORCE can be possibly dangerous, especially if there are races
that can be exploited by user space.
Right now, it would be sufficient to have some code that sets a PTE of a
R/O-mapped shared page dirty, in order for it to erroneously become
writable by FOLL_FORCE. The implications of setting a write-protected PTE
dirty might not be immediately obvious to everyone.
And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set
pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
a shmem page R/O while marking the pte dirty. This can be used by
unprivileged user space to modify tmpfs/shmem file content even if the
user does not have write permissions to the file, and to bypass memfd
write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
To fix such security issues for good, the insight is that we really only
need that fancy retry logic (FOLL_COW) for COW mappings that are not
writable (!VM_WRITE). And in a COW mapping, we really only broke COW if
we have an exclusive anonymous page mapped. If we have something else
mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
we have to trigger a write fault to break COW. If we don't find an
exclusive anonymous page when we retry, we have to trigger COW breaking
once again because something intervened.
Let's move away from this mandatory-retry + dirty handling and rely on our
PageAnonExclusive() flag for making a similar decision, to use the same
COW logic as in other kernel parts here as well. In case we stumble over
a PTE in a COW mapping that does not map an exclusive anonymous page, COW
was not properly broken and we have to trigger a fake write-fault to break
COW.
Just like we do in can_change_pte_writable() added via commit 64fe24a3e05e
("mm/mprotect: try avoiding write faults for exclusive anonymous pages
when changing protection") and commit 76aefad628aa ("mm/mprotect: fix
soft-dirty check in can_change_pte_writable()"), take care of softdirty
and uffd-wp manually.
For example, a write() via /proc/self/mem to a uffd-wp-protected range has
to fail instead of silently granting write access and bypassing the
userspace fault handler. Note that FOLL_FORCE is not only used for debug
access, but also triggered by applications without debug intentions, for
example, when pinning pages via RDMA.
This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
let's just get rid of it.
Thanks to Nadav Amit for pointing out that the pte_dirty() check in
FOLL_FORCE code is problematic and might be exploitable.
Note 1: We don't check for the PTE being dirty because it doesn't matter
for making a "was COWed" decision anymore, and whoever modifies the
page has to set the page dirty either way.
Note 2: Kernels before extended uffd-wp support and before
PageAnonExclusive (< 5.19) can simply revert the problematic
commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
v5.19 requires minor adjustments due to lack of
vma_soft_dirty_enabled().
Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: David Laight <David.Laight(a)ACULAB.COM>
Cc: <stable(a)vger.kernel.org> [5.16]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 1
mm/gup.c | 68 +++++++++++++++++++++++++++++--------------
mm/huge_memory.c | 64 +++++++++++++++++++++++++++-------------
3 files changed, 89 insertions(+), 44 deletions(-)
--- a/include/linux/mm.h~mm-gup-fix-foll_force-cow-security-issue-and-remove-foll_cow
+++ a/include/linux/mm.h
@@ -2884,7 +2884,6 @@ struct page *follow_page(struct vm_area_
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
-#define FOLL_COW 0x4000 /* internal GUP flag */
#define FOLL_ANON 0x8000 /* don't do file mappings */
#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
--- a/mm/gup.c~mm-gup-fix-foll_force-cow-security-issue-and-remove-foll_cow
+++ a/mm/gup.c
@@ -478,14 +478,42 @@ static int follow_pfn_pte(struct vm_area
return -EEXIST;
}
-/*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+/* FOLL_FORCE can write to even unwritable PTEs in COW mappings. */
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
+ struct vm_area_struct *vma,
+ unsigned int flags)
{
- return pte_write(pte) ||
- ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+ /* If the pte is writable, we can write to the page. */
+ if (pte_write(pte))
+ return true;
+
+ /* Maybe FOLL_FORCE is set to override it? */
+ if (!(flags & FOLL_FORCE))
+ return false;
+
+ /* But FOLL_FORCE has no effect on shared mappings */
+ if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+ return false;
+
+ /* ... or read-only private ones */
+ if (!(vma->vm_flags & VM_MAYWRITE))
+ return false;
+
+ /* ... or already writable ones that just need to take a write fault */
+ if (vma->vm_flags & VM_WRITE)
+ return false;
+
+ /*
+ * See can_change_pte_writable(): we broke COW and could map the page
+ * writable if we have an exclusive anonymous page ...
+ */
+ if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+ return false;
+
+ /* ... and a write-fault isn't required for other reasons. */
+ if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
+ return false;
+ return !userfaultfd_pte_wp(vma, pte);
}
static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -528,12 +556,19 @@ retry:
}
if ((flags & FOLL_NUMA) && pte_protnone(pte))
goto no_page;
- if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
- pte_unmap_unlock(ptep, ptl);
- return NULL;
- }
page = vm_normal_page(vma, address, pte);
+
+ /*
+ * We only care about anon pages in can_follow_write_pte() and don't
+ * have to worry about pte_devmap() because they are never anon.
+ */
+ if ((flags & FOLL_WRITE) &&
+ !can_follow_write_pte(pte, page, vma, flags)) {
+ page = NULL;
+ goto out;
+ }
+
if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
/*
* Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -986,17 +1021,6 @@ static int faultin_page(struct vm_area_s
return -EBUSY;
}
- /*
- * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
- * necessary, even if maybe_mkwrite decided not to set pte_write. We
- * can thus safely do subsequent page lookups as if they were reads.
- * But only do so when looping for pte_write is futile: in some cases
- * userspace may also be wanting to write to the gotten user page,
- * which a read fault here might prevent (a readonly page might get
- * reCOWed by userspace write).
- */
- if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
- *flags |= FOLL_COW;
return 0;
}
--- a/mm/huge_memory.c~mm-gup-fix-foll_force-cow-security-issue-and-remove-foll_cow
+++ a/mm/huge_memory.c
@@ -1040,12 +1040,6 @@ struct page *follow_devmap_pmd(struct vm
assert_spin_locked(pmd_lockptr(mm, pmd));
- /*
- * When we COW a devmap PMD entry, we split it into PTEs, so we should
- * not be in this function with `flags & FOLL_COW` set.
- */
- WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
-
/* FOLL_GET and FOLL_PIN are mutually exclusive. */
if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
@@ -1395,14 +1389,42 @@ fallback:
return VM_FAULT_FALLBACK;
}
-/*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+/* FOLL_FORCE can write to even unwritable PMDs in COW mappings. */
+static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
+ struct vm_area_struct *vma,
+ unsigned int flags)
{
- return pmd_write(pmd) ||
- ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+ /* If the pmd is writable, we can write to the page. */
+ if (pmd_write(pmd))
+ return true;
+
+ /* Maybe FOLL_FORCE is set to override it? */
+ if (!(flags & FOLL_FORCE))
+ return false;
+
+ /* But FOLL_FORCE has no effect on shared mappings */
+ if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+ return false;
+
+ /* ... or read-only private ones */
+ if (!(vma->vm_flags & VM_MAYWRITE))
+ return false;
+
+ /* ... or already writable ones that just need to take a write fault */
+ if (vma->vm_flags & VM_WRITE)
+ return false;
+
+ /*
+ * See can_change_pte_writable(): we broke COW and could map the page
+ * writable if we have an exclusive anonymous page ...
+ */
+ if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+ return false;
+
+ /* ... and a write-fault isn't required for other reasons. */
+ if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
+ return false;
+ return !userfaultfd_huge_pmd_wp(vma, pmd);
}
struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1411,12 +1433,16 @@ struct page *follow_trans_huge_pmd(struc
unsigned int flags)
{
struct mm_struct *mm = vma->vm_mm;
- struct page *page = NULL;
+ struct page *page;
assert_spin_locked(pmd_lockptr(mm, pmd));
- if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
- goto out;
+ page = pmd_page(*pmd);
+ VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+ if ((flags & FOLL_WRITE) &&
+ !can_follow_write_pmd(*pmd, page, vma, flags))
+ return NULL;
/* Avoid dumping huge zero page */
if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
@@ -1424,10 +1450,7 @@ struct page *follow_trans_huge_pmd(struc
/* Full NUMA hinting faults to serialise migration in fault paths */
if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
- goto out;
-
- page = pmd_page(*pmd);
- VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+ return NULL;
if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
return ERR_PTR(-EMLINK);
@@ -1444,7 +1467,6 @@ struct page *follow_trans_huge_pmd(struc
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
-out:
return page;
}
_
Patches currently in -mm which might be from david(a)redhat.com are
The quilt patch titled
Subject: Revert "zram: remove double compression logic"
has been removed from the -mm tree. Its filename was
revert-zram-remove-double-compression-logic.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Jiri Slaby <jslaby(a)suse.cz>
Subject: Revert "zram: remove double compression logic"
Date: Wed, 10 Aug 2022 09:06:09 +0200
This reverts commit e7be8d1dd983156b ("zram: remove double compression
logic") as it causes zram failures. It does not revert cleanly, PTR_ERR
handling was introduced in the meantime. This is handled by appropriate
IS_ERR.
When under memory pressure, zs_malloc() can fail. Before the above
commit, the allocation was retried with direct reclaim enabled (GFP_NOIO).
After the commit, it is not -- only __GFP_KSWAPD_RECLAIM is tried.
So when the failure occurs under memory pressure, the overlaying
filesystem such as ext2 (mounted by ext4 module in this case) can emit
failures, making the (file)system unusable:
EXT4-fs warning (device zram0): ext4_end_bio:343: I/O error 10 writing to inode 16386 starting block 159744)
Buffer I/O error on device zram0, logical block 159744
With direct reclaim, memory is really reclaimed and allocation succeeds,
eventually. In the worst case, the oom killer is invoked, which is proper
outcome if user sets up zram too large (in comparison to available RAM).
This very diff doesn't apply to 5.19 (stable) cleanly (see PTR_ERR note
above). Use revert of e7be8d1dd983 directly.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1202203
Link: https://lkml.kernel.org/r/20220810070609.14402-1-jslaby@suse.cz
Fixes: e7be8d1dd983 ("zram: remove double compression logic")
Signed-off-by: Jiri Slaby <jslaby(a)suse.cz>
Reviewed-by: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Nitin Gupta <ngupta(a)vflare.org>
Cc: Alexey Romanov <avromanov(a)sberdevices.ru>
Cc: Dmitry Rokosov <ddrokosov(a)sberdevices.ru>
Cc: Lukas Czerner <lczerner(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [5.19]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/block/zram/zram_drv.c | 42 ++++++++++++++++++++++++--------
drivers/block/zram/zram_drv.h | 1
2 files changed, 33 insertions(+), 10 deletions(-)
--- a/drivers/block/zram/zram_drv.c~revert-zram-remove-double-compression-logic
+++ a/drivers/block/zram/zram_drv.c
@@ -1146,14 +1146,15 @@ static ssize_t bd_stat_show(struct devic
static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
- int version = 2;
+ int version = 1;
struct zram *zram = dev_to_zram(dev);
ssize_t ret;
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
- "version: %d\n%8llu\n",
+ "version: %d\n%8llu %8llu\n",
version,
+ (u64)atomic64_read(&zram->stats.writestall),
(u64)atomic64_read(&zram->stats.miss_free));
up_read(&zram->init_lock);
@@ -1351,7 +1352,7 @@ static int __zram_bvec_write(struct zram
{
int ret = 0;
unsigned long alloced_pages;
- unsigned long handle = 0;
+ unsigned long handle = -ENOMEM;
unsigned int comp_len = 0;
void *src, *dst, *mem;
struct zcomp_strm *zstrm;
@@ -1369,6 +1370,7 @@ static int __zram_bvec_write(struct zram
}
kunmap_atomic(mem);
+compress_again:
zstrm = zcomp_stream_get(zram->comp);
src = kmap_atomic(page);
ret = zcomp_compress(zstrm, src, &comp_len);
@@ -1377,20 +1379,39 @@ static int __zram_bvec_write(struct zram
if (unlikely(ret)) {
zcomp_stream_put(zram->comp);
pr_err("Compression failed! err=%d\n", ret);
+ zs_free(zram->mem_pool, handle);
return ret;
}
if (comp_len >= huge_class_size)
comp_len = PAGE_SIZE;
-
- handle = zs_malloc(zram->mem_pool, comp_len,
- __GFP_KSWAPD_RECLAIM |
- __GFP_NOWARN |
- __GFP_HIGHMEM |
- __GFP_MOVABLE);
-
+ /*
+ * handle allocation has 2 paths:
+ * a) fast path is executed with preemption disabled (for
+ * per-cpu streams) and has __GFP_DIRECT_RECLAIM bit clear,
+ * since we can't sleep;
+ * b) slow path enables preemption and attempts to allocate
+ * the page with __GFP_DIRECT_RECLAIM bit set. we have to
+ * put per-cpu compression stream and, thus, to re-do
+ * the compression once handle is allocated.
+ *
+ * if we have a 'non-null' handle here then we are coming
+ * from the slow path and handle has already been allocated.
+ */
+ if (IS_ERR((void *)handle))
+ handle = zs_malloc(zram->mem_pool, comp_len,
+ __GFP_KSWAPD_RECLAIM |
+ __GFP_NOWARN |
+ __GFP_HIGHMEM |
+ __GFP_MOVABLE);
if (IS_ERR((void *)handle)) {
zcomp_stream_put(zram->comp);
+ atomic64_inc(&zram->stats.writestall);
+ handle = zs_malloc(zram->mem_pool, comp_len,
+ GFP_NOIO | __GFP_HIGHMEM |
+ __GFP_MOVABLE);
+ if (!IS_ERR((void *)handle))
+ goto compress_again;
return PTR_ERR((void *)handle);
}
@@ -1948,6 +1969,7 @@ static int zram_add(void)
if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
+ blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
ret = device_add_disk(NULL, zram->disk, zram_disk_groups);
if (ret)
goto out_cleanup_disk;
--- a/drivers/block/zram/zram_drv.h~revert-zram-remove-double-compression-logic
+++ a/drivers/block/zram/zram_drv.h
@@ -81,6 +81,7 @@ struct zram_stats {
atomic64_t huge_pages_since; /* no. of huge pages since zram set up */
atomic64_t pages_stored; /* no. of pages currently stored */
atomic_long_t max_used_pages; /* no. of maximum pages stored */
+ atomic64_t writestall; /* no. of write slow paths */
atomic64_t miss_free; /* no. of missed free */
#ifdef CONFIG_ZRAM_WRITEBACK
atomic64_t bd_count; /* no. of pages in backing device */
_
Patches currently in -mm which might be from jslaby(a)suse.cz are
The patch titled
Subject: ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
ocfs2-fix-freeing-uninitialized-resource-on-ocfs2_dlm_shutdown.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Heming Zhao via Ocfs2-devel <ocfs2-devel(a)oss.oracle.com>
Subject: ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
Date: Mon, 15 Aug 2022 16:57:54 +0800
After commit 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job
before return error"), any procedure after ocfs2_dlm_init() fails will
trigger crash when calling ocfs2_dlm_shutdown().
ie: On local mount mode, no dlm resource is initialized. If
ocfs2_mount_volume() fails in ocfs2_find_slot(), error handling will call
ocfs2_dlm_shutdown(), then does dlm resource cleanup job, which will
trigger kernel crash.
This solution should bypass uninitialized resources in
ocfs2_dlm_shutdown().
Link: https://lkml.kernel.org/r/20220815085754.20417-1-heming.zhao@suse.com
Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before return error")
Signed-off-by: Heming Zhao <heming.zhao(a)suse.com>
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Gang He <ghe(a)suse.com>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/dlmglue.c | 8 +++++---
fs/ocfs2/super.c | 3 +--
2 files changed, 6 insertions(+), 5 deletions(-)
--- a/fs/ocfs2/dlmglue.c~ocfs2-fix-freeing-uninitialized-resource-on-ocfs2_dlm_shutdown
+++ a/fs/ocfs2/dlmglue.c
@@ -3403,10 +3403,12 @@ void ocfs2_dlm_shutdown(struct ocfs2_sup
ocfs2_lock_res_free(&osb->osb_nfs_sync_lockres);
ocfs2_lock_res_free(&osb->osb_orphan_scan.os_lockres);
- ocfs2_cluster_disconnect(osb->cconn, hangup_pending);
- osb->cconn = NULL;
+ if (osb->cconn) {
+ ocfs2_cluster_disconnect(osb->cconn, hangup_pending);
+ osb->cconn = NULL;
- ocfs2_dlm_shutdown_debug(osb);
+ ocfs2_dlm_shutdown_debug(osb);
+ }
}
static int ocfs2_drop_lock(struct ocfs2_super *osb,
--- a/fs/ocfs2/super.c~ocfs2-fix-freeing-uninitialized-resource-on-ocfs2_dlm_shutdown
+++ a/fs/ocfs2/super.c
@@ -1914,8 +1914,7 @@ static void ocfs2_dismount_volume(struct
!ocfs2_is_hard_readonly(osb))
hangup_needed = 1;
- if (osb->cconn)
- ocfs2_dlm_shutdown(osb, hangup_needed);
+ ocfs2_dlm_shutdown(osb, hangup_needed);
ocfs2_blockcheck_stats_debugfs_remove(&osb->osb_ecc_stats);
debugfs_remove_recursive(osb->osb_debug_root);
_
Patches currently in -mm which might be from ocfs2-devel(a)oss.oracle.com are
ocfs2-fix-freeing-uninitialized-resource-on-ocfs2_dlm_shutdown.patch
This fixes broken atomic checks which cause a race between the
release-timer and processing of hid input.
I noticed that contacts were sometimes sticking, even with the "sticky
fingers" quirk enabled. This fixes that problem.
Cc: stable(a)vger.kernel.org
Fixes: 9609827458c3 ("HID: multitouch: optimize the sticky fingers timer")
Signed-off-by: Andri Yngvason <andri(a)yngvason.is>
---
V1 -> V2: Clarified where the race is and added Fixes tag as suggested
by Greg KH
V2 -> V3: Fix formatting of "Fixes" tag
drivers/hid/hid-multitouch.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/hid/hid-multitouch.c b/drivers/hid/hid-multitouch.c
index 2e72922e36f5..91a4d3fc30e0 100644
--- a/drivers/hid/hid-multitouch.c
+++ b/drivers/hid/hid-multitouch.c
@@ -1186,7 +1186,7 @@ static void mt_touch_report(struct hid_device *hid,
int contact_count = -1;
/* sticky fingers release in progress, abort */
- if (test_and_set_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
+ if (test_and_set_bit_lock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
return;
scantime = *app->scantime;
@@ -1267,7 +1267,7 @@ static void mt_touch_report(struct hid_device *hid,
del_timer(&td->release_timer);
}
- clear_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
+ clear_bit_unlock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
}
static int mt_touch_input_configured(struct hid_device *hdev,
@@ -1699,11 +1699,11 @@ static void mt_expired_timeout(struct timer_list *t)
* An input report came in just before we release the sticky fingers,
* it will take care of the sticky fingers.
*/
- if (test_and_set_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
+ if (test_and_set_bit_lock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
return;
if (test_bit(MT_IO_FLAGS_PENDING_SLOTS, &td->mt_io_flags))
mt_release_contacts(hdev);
- clear_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
+ clear_bit_unlock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
}
static int mt_probe(struct hid_device *hdev, const struct hid_device_id *id)
--
2.37.1
This fixes broken atomic checks which cause a race between the
release-timer and processing of hid input.
I noticed that contacts were sometimes sticking, even with the "sticky
fingers" quirk enabled. This fixes that problem.
Cc: stable(a)vger.kernel.org
Fixes: 9609827458c37d7b2c37f2a9255631c603a5004c
Signed-off-by: Andri Yngvason <andri(a)yngvason.is>
---
V1 -> V2: Clarified where the race is and added Fixes tag as suggested
by Greg KH
drivers/hid/hid-multitouch.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/hid/hid-multitouch.c b/drivers/hid/hid-multitouch.c
index 2e72922e36f5..91a4d3fc30e0 100644
--- a/drivers/hid/hid-multitouch.c
+++ b/drivers/hid/hid-multitouch.c
@@ -1186,7 +1186,7 @@ static void mt_touch_report(struct hid_device *hid,
int contact_count = -1;
/* sticky fingers release in progress, abort */
- if (test_and_set_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
+ if (test_and_set_bit_lock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
return;
scantime = *app->scantime;
@@ -1267,7 +1267,7 @@ static void mt_touch_report(struct hid_device *hid,
del_timer(&td->release_timer);
}
- clear_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
+ clear_bit_unlock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
}
static int mt_touch_input_configured(struct hid_device *hdev,
@@ -1699,11 +1699,11 @@ static void mt_expired_timeout(struct timer_list *t)
* An input report came in just before we release the sticky fingers,
* it will take care of the sticky fingers.
*/
- if (test_and_set_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
+ if (test_and_set_bit_lock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
return;
if (test_bit(MT_IO_FLAGS_PENDING_SLOTS, &td->mt_io_flags))
mt_release_contacts(hdev);
- clear_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
+ clear_bit_unlock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
}
static int mt_probe(struct hid_device *hdev, const struct hid_device_id *id)
--
2.37.1
The patch titled
Subject: Revert "memcg: cleanup racy sum avoidance code"
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
revert-memcg-cleanup-racy-sum-avoidance-code.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Shakeel Butt <shakeelb(a)google.com>
Subject: Revert "memcg: cleanup racy sum avoidance code"
Date: Wed, 17 Aug 2022 17:21:39 +0000
This reverts commit 96e51ccf1af33e82f429a0d6baebba29c6448d0f.
Recently we started running the kernel with rstat infrastructure on
production traffic and begin to see negative memcg stats values.
Particularly the 'sock' stat is the one which we observed having negative
value.
$ grep "sock " /mnt/memory/job/memory.stat
sock 253952
total_sock 18446744073708724224
Re-run after couple of seconds
$ grep "sock " /mnt/memory/job/memory.stat
sock 253952
total_sock 53248
For now we are only seeing this issue on large machines (256 CPUs) and
only with 'sock' stat. I think the networking stack increase the stat on
one cpu and decrease it on another cpu much more often. So, this negative
sock is due to rstat flusher flushing the stats on the CPU that has seen
the decrement of sock but missed the CPU that has increments. A typical
race condition.
For easy stable backport, revert is the most simple solution. For long
term solution, I am thinking of two directions. First is just reduce the
race window by optimizing the rstat flusher. Second is if the reader sees
a negative stat value, force flush and restart the stat collection.
Basically retry but limited.
Link: https://lkml.kernel.org/r/20220817172139.3141101-1-shakeelb@google.com
Fixes: 96e51ccf1af33e8 ("memcg: cleanup racy sum avoidance code")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Cc: "Michal Koutn��" <mkoutny(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Yosry Ahmed <yosryahmed(a)google.com>
Cc: Greg Thelen <gthelen(a)google.com>
Cc: <stable(a)vger.kernel.org> [5.15]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/memcontrol.h | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
--- a/include/linux/memcontrol.h~revert-memcg-cleanup-racy-sum-avoidance-code
+++ a/include/linux/memcontrol.h
@@ -987,19 +987,30 @@ static inline void mod_memcg_page_state(
static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
{
- return READ_ONCE(memcg->vmstats.state[idx]);
+ long x = READ_ONCE(memcg->vmstats.state[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
}
static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
+ long x;
if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- return READ_ONCE(pn->lruvec_stats.state[idx]);
+ x = READ_ONCE(pn->lruvec_stats.state[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
}
static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
_
Patches currently in -mm which might be from shakeelb(a)google.com are
revert-memcg-cleanup-racy-sum-avoidance-code.patch
This reverts commit 07313a2b29ed1079eaa7722624544b97b3ead84b.
Commit 0c24e061196c21d5 ("mm: kmemleak: add rbtree and store physical
address for objects allocated with PA") is not yet in 5.19 (but appears
in 6.0). Without 0c24e061196c21d5, kmemleak still stores phys objects
and non-phys objects in the same tree, and ignoring (instead of freeing)
will cause insertions into the kmemleak object tree by the slab
post-alloc hook to conflict with the pool object (see comment).
Reports such as the following would appear on boot, and effectively
disable kmemleak:
| kmemleak: Cannot insert 0xffffff806e24f000 into the object search tree (overlaps existing)
| CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-v8-0815+ #5
| Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
| Call trace:
| dump_backtrace.part.0+0x1dc/0x1ec
| show_stack+0x24/0x80
| dump_stack_lvl+0x8c/0xb8
| dump_stack+0x1c/0x38
| create_object.isra.0+0x490/0x4b0
| kmemleak_alloc+0x3c/0x50
| kmem_cache_alloc+0x2f8/0x450
| __proc_create+0x18c/0x400
| proc_create_reg+0x54/0xd0
| proc_create_seq_private+0x94/0x120
| init_mm_internals+0x1d8/0x248
| kernel_init_freeable+0x188/0x388
| kernel_init+0x30/0x150
| ret_from_fork+0x10/0x20
| kmemleak: Kernel memory leak detector disabled
| kmemleak: Object 0xffffff806e24d000 (size 2097152):
| kmemleak: comm "swapper", pid 0, jiffies 4294892296
| kmemleak: min_count = -1
| kmemleak: count = 0
| kmemleak: flags = 0x5
| kmemleak: checksum = 0
| kmemleak: backtrace:
| kmemleak_alloc_phys+0x94/0xb0
| memblock_alloc_range_nid+0x1c0/0x20c
| memblock_alloc_internal+0x88/0x100
| memblock_alloc_try_nid+0x148/0x1ac
| kfence_alloc_pool+0x44/0x6c
| mm_init+0x28/0x98
| start_kernel+0x178/0x3e8
| __primary_switched+0xc4/0xcc
Reported-by: Max Schulze <max.schulze(a)online.de>
Signed-off-by: Marco Elver <elver(a)google.com>
---
mm/kfence/core.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 6aff49f6b79e..4b5e5a3d3a63 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -603,6 +603,14 @@ static unsigned long kfence_init_pool(void)
addr += 2 * PAGE_SIZE;
}
+ /*
+ * The pool is live and will never be deallocated from this point on.
+ * Remove the pool object from the kmemleak object tree, as it would
+ * otherwise overlap with allocations returned by kfence_alloc(), which
+ * are registered with kmemleak through the slab post-alloc hook.
+ */
+ kmemleak_free(__kfence_pool);
+
return 0;
}
@@ -615,16 +623,8 @@ static bool __init kfence_init_pool_early(void)
addr = kfence_init_pool();
- if (!addr) {
- /*
- * The pool is live and will never be deallocated from this point on.
- * Ignore the pool object from the kmemleak phys object tree, as it would
- * otherwise overlap with allocations returned by kfence_alloc(), which
- * are registered with kmemleak through the slab post-alloc hook.
- */
- kmemleak_ignore_phys(__pa(__kfence_pool));
+ if (!addr)
return true;
- }
/*
* Only release unprotected pages, and do not try to go back and change
--
2.37.1.595.g718a3a8f04-goog
This fixes a race with the release-timer by adding acquire/release
barrier semantics.
I noticed that contacts were sometimes sticking, even with the "sticky
fingers" quirk enabled. This fixes that problem.
Cc: stable(a)vger.kernel.org
Signed-off-by: Andri Yngvason <andri(a)yngvason.is>
---
drivers/hid/hid-multitouch.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/hid/hid-multitouch.c b/drivers/hid/hid-multitouch.c
index 2e72922e36f5..91a4d3fc30e0 100644
--- a/drivers/hid/hid-multitouch.c
+++ b/drivers/hid/hid-multitouch.c
@@ -1186,7 +1186,7 @@ static void mt_touch_report(struct hid_device *hid,
int contact_count = -1;
/* sticky fingers release in progress, abort */
- if (test_and_set_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
+ if (test_and_set_bit_lock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
return;
scantime = *app->scantime;
@@ -1267,7 +1267,7 @@ static void mt_touch_report(struct hid_device *hid,
del_timer(&td->release_timer);
}
- clear_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
+ clear_bit_unlock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
}
static int mt_touch_input_configured(struct hid_device *hdev,
@@ -1699,11 +1699,11 @@ static void mt_expired_timeout(struct timer_list *t)
* An input report came in just before we release the sticky fingers,
* it will take care of the sticky fingers.
*/
- if (test_and_set_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
+ if (test_and_set_bit_lock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags))
return;
if (test_bit(MT_IO_FLAGS_PENDING_SLOTS, &td->mt_io_flags))
mt_release_contacts(hdev);
- clear_bit(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
+ clear_bit_unlock(MT_IO_FLAGS_RUNNING, &td->mt_io_flags);
}
static int mt_probe(struct hid_device *hdev, const struct hid_device_id *id)
--
2.37.1
The commit 99c13b8c8896d7bcb92753bf
("x86/mm/pat: Don't report PAT on CPUs that don't support it")
incorrectly failed to account for the case in init_cache_modes() when
CPUs do support PAT and falsely reported PAT to be disabled when in
fact PAT is enabled. In some environments, notably in Xen PV domains,
MTRR is disabled but PAT is still enabled, and that is the case
that the aforementioned commit failed to account for.
As an unfortunate consequnce, the pat_enabled() function currently does
not correctly report that PAT is enabled in such environments. The fix
is implemented in init_cache_modes() by setting pat_bp_enabled to true
in init_cache_modes() for the case that commit 99c13b8c8896d7bcb92753bf
("x86/mm/pat: Don't report PAT on CPUs that don't support it") failed
to account for.
This approach arranges for pat_enabled() to return true in the Xen PV
environment without undermining the rest of PAT MSR management logic
that considers PAT to be disabled: Specifically, no writes to the PAT
MSR should occur.
This patch fixes a regression that some users are experiencing with
Linux as a Xen Dom0 driving particular Intel graphics devices by
correctly reporting to the Intel i915 driver that PAT is enabled where
previously it was falsely reporting that PAT is disabled. Some users
are experiencing system hangs in Xen PV Dom0 and all users on Xen PV
Dom0 are experiencing reduced graphics performance because the keying of
the use of WC mappings to pat_enabled() (see arch_can_pci_mmap_wc())
means that in particular graphics frame buffer accesses are quite a bit
less performant than possible without this patch.
Also, with the current code, in the Xen PV environment, PAT will not be
disabled if the administrator sets the "nopat" boot option. Introduce
a new boolean variable, pat_force_disable, to forcibly disable PAT
when the administrator sets the "nopat" option to override the default
behavior of using the PAT configuration that Xen has provided.
For the new boolean to live in .init.data, init_cache_modes() also needs
moving to .init.text (where it could/should have lived already before).
Fixes: 99c13b8c8896d7bcb92753bf ("x86/mm/pat: Don't report PAT on CPUs that don't support it")
Co-developed-by: Jan Beulich <jbeulich(a)suse.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Chuck Zmudzinski <brchuckz(a)aol.com>
---
v2: *Add force_pat_disabled variable to fix "nopat" on Xen PV (Jan Beulich)
*Add the necessary code to incorporate the "nopat" fix
*void init_cache_modes(void) -> void __init init_cache_modes(void)
*Add Jan Beulich as Co-developer (Jan has not signed off yet)
*Expand the commit message to include relevant parts of the commit
message of Jan Beulich's proposed patch for this problem
*Fix 'else if ... {' placement and indentation
*Remove indication the backport to stable branches is only back to 5.17.y
I think these changes address all the comments on the original patch
I added Jan Beulich as a Co-developer because Juergen Gross asked me to
include Jan's idea for fixing "nopat" that was missing from the first
version of the patch.
The patch has been tested, it works as expected with and without nopat
in the Xen PV Dom0 environment. That is, "nopat" causes the system to
exhibit the effects and problems that lack of PAT support causes.
arch/x86/mm/pat/memtype.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index d5ef64ddd35e..10a37d309d23 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -62,6 +62,7 @@
static bool __read_mostly pat_bp_initialized;
static bool __read_mostly pat_disabled = !IS_ENABLED(CONFIG_X86_PAT);
+static bool __initdata pat_force_disabled = !IS_ENABLED(CONFIG_X86_PAT);
static bool __read_mostly pat_bp_enabled;
static bool __read_mostly pat_cm_initialized;
@@ -86,6 +87,7 @@ void pat_disable(const char *msg_reason)
static int __init nopat(char *str)
{
pat_disable("PAT support disabled via boot option.");
+ pat_force_disabled = true;
return 0;
}
early_param("nopat", nopat);
@@ -272,7 +274,7 @@ static void pat_ap_init(u64 pat)
wrmsrl(MSR_IA32_CR_PAT, pat);
}
-void init_cache_modes(void)
+void __init init_cache_modes(void)
{
u64 pat = 0;
@@ -292,7 +294,7 @@ void init_cache_modes(void)
rdmsrl(MSR_IA32_CR_PAT, pat);
}
- if (!pat) {
+ if (!pat || pat_force_disabled) {
/*
* No PAT. Emulate the PAT table that corresponds to the two
* cache bits, PWT (Write Through) and PCD (Cache Disable).
@@ -313,6 +315,16 @@ void init_cache_modes(void)
*/
pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
PAT(4, WB) | PAT(5, WT) | PAT(6, UC_MINUS) | PAT(7, UC);
+ } else if (!pat_bp_enabled) {
+ /*
+ * In some environments, specifically Xen PV, PAT
+ * initialization is skipped because MTRRs are disabled even
+ * though PAT is available. In such environments, set PAT to
+ * enabled to correctly indicate to callers of pat_enabled()
+ * that CPU support for PAT is available.
+ */
+ pat_bp_enabled = true;
+ pr_info("x86/PAT: PAT enabled by init_cache_modes\n");
}
__init_cache_modes(pat);
--
2.36.1
Hi Greg and Sasha,
This two patches are backports for v5.18 branch.
These two patches are reducing the chance of destructive RMW cycle,
where btrfs can use corrupted data to generate new P/Q, thus making some
repairable data unrepairable.
Those patches are more important than what I initially thought, thus
unfortunately they are not CCed to stable by themselves.
Furthermore due to recent refactors/renames, there are quite some member
change related to those patches, thus have to be manually backported.
(The v5.18 backport is more like the v5.15 backport, with small tweaks
due to member naming change).
One of the fastest way to verify the behavior is the existing btrfs/125
test case from fstests. (not in auto group AFAIK).
Qu Wenruo (2):
btrfs: only write the sectors in the vertical stripe which has data
stripes
btrfs: raid56: don't trust any cached sector in
__raid56_parity_recover()
fs/btrfs/raid56.c | 74 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 57 insertions(+), 17 deletions(-)
--
2.37.0