When I converted rk808 to device managed resources, I switched the
rk808-specific pm_power_off handler to devm_register_sys_off_handler()
using SYS_OFF_MODE_POWER_OFF_PREPARE, which is allowed to sleep. I did
this because the driver's poweroff function makes use of regmap, and the
regmap backend might sleep.
But the PMIC poweroff function kills the board power, and the kernel
does some extra steps after the prepare handler has run. Thus the prepare
handler should not be used for the PMIC's poweroff routine; the normal
SYS_OFF_MODE_POWER_OFF phase should be used instead. The old pm_power_off
method is also called from there, so this would have been the cleaner
conversion anyway.
But it still makes sense to investigate the sleep handling and check for
any issues. Apparently the Rockchip and Meson I2C drivers (covering the
only platforms using the PMICs handled by this driver) both support
atomic transfers and can thus be called from the proper poweroff context.
Things are different on the SPI side. So far that is only used by the
rk806, which in turn is only used on the Rockchip RK3588. Unfortunately
the Rockchip SPI driver does not support atomic transfers, so using the
normal POWER_OFF handler would introduce the following error splat during
shutdown on all RK3588 boards currently supported upstream:
[ 13.761353] ------------[ cut here ]------------
[ 13.761764] Voluntary context switch within RCU read-side critical section!
[ 13.761776] WARNING: CPU: 0 PID: 1 at kernel/rcu/tree_plugin.h:330 rcu_note_context_switch+0x3ac/0x404
[ 13.763219] Modules linked in:
[ 13.763498] CPU: 0 UID: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.10.0-12284-g2818a9a19514 #1499
[ 13.764297] Hardware name: Rockchip RK3588 EVB1 V10 Board (DT)
[ 13.764812] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 13.765427] pc : rcu_note_context_switch+0x3ac/0x404
[ 13.765871] lr : rcu_note_context_switch+0x3ac/0x404
[ 13.766314] sp : ffff800084f4b5b0
[ 13.766609] x29: ffff800084f4b5b0 x28: ffff00040139b800 x27: 00007dfb4439ae80
[ 13.767245] x26: ffff00040139bc80 x25: 0000000000000000 x24: ffff800082118470
[ 13.767880] x23: 0000000000000000 x22: ffff000400300000 x21: ffff000400300000
[ 13.768515] x20: ffff800083a9d600 x19: ffff0004fee48600 x18: fffffffffffed448
[ 13.769151] x17: 000000040044ffff x16: 005000f2b5503510 x15: 0000000000000048
[ 13.769787] x14: fffffffffffed490 x13: ffff80008473b3c0 x12: 0000000000000900
[ 13.770421] x11: 0000000000000300 x10: ffff800084797bc0 x9 : ffff80008473b3c0
[ 13.771057] x8 : 00000000ffffefff x7 : ffff8000847933c0 x6 : 0000000000000300
[ 13.771692] x5 : 0000000000000301 x4 : 40000000fffff300 x3 : 0000000000000000
[ 13.772328] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000400300000
[ 13.772964] Call trace:
[ 13.773184] rcu_note_context_switch+0x3ac/0x404
[ 13.773598] __schedule+0x94/0xb0c
[ 13.773907] schedule+0x34/0x104
[ 13.774198] schedule_timeout+0x84/0xfc
[ 13.774544] wait_for_completion_timeout+0x78/0x14c
[ 13.774980] spi_transfer_one_message+0x588/0x690
[ 13.775403] __spi_pump_transfer_message+0x19c/0x4ec
[ 13.775846] __spi_sync+0x2a8/0x3c4
[ 13.776161] spi_write_then_read+0x120/0x208
[ 13.776543] rk806_spi_bus_read+0x54/0x88
[ 13.776905] _regmap_raw_read+0xec/0x16c
[ 13.777257] _regmap_bus_read+0x44/0x7c
[ 13.777601] _regmap_read+0x60/0xd8
[ 13.777915] _regmap_update_bits+0xf4/0x13c
[ 13.778289] regmap_update_bits_base+0x64/0x98
[ 13.778686] rk808_power_off+0x70/0xfc
[ 13.779024] sys_off_notify+0x40/0x6c
[ 13.779356] atomic_notifier_call_chain+0x60/0x90
[ 13.779776] do_kernel_power_off+0x54/0x6c
[ 13.780146] machine_power_off+0x18/0x24
[ 13.780499] kernel_power_off+0x70/0x7c
[ 13.780845] __do_sys_reboot+0x210/0x270
[ 13.781198] __arm64_sys_reboot+0x24/0x30
[ 13.781558] invoke_syscall+0x48/0x10c
[ 13.781897] el0_svc_common+0x3c/0xe8
[ 13.782228] do_el0_svc+0x20/0x2c
[ 13.782528] el0_svc+0x34/0xd8
[ 13.782806] el0t_64_sync_handler+0x120/0x12c
[ 13.783197] el0t_64_sync+0x190/0x194
[ 13.783527] ---[ end trace 0000000000000000 ]---
To avoid this we keep the SYS_OFF_MODE_POWER_OFF_PREPARE handler for the
SPI backend. This is not great, but it at least avoids regressions, and
the fix should be small enough to allow backporting.
As a side effect, this also works around a shutdown problem on the Asus
C201, which for unknown reasons skips calling the prepare handler and
directly calls the final shutdown handler.
Fixes: 4fec8a5a85c49 ("mfd: rk808: Convert to device managed resources")
Cc: stable(a)vger.kernel.org
Reported-by: Urja <urja(a)urja.dev>
Signed-off-by: Sebastian Reichel <sebastian.reichel(a)collabora.com>
---
drivers/mfd/rk8xx-core.c | 15 +++++++++++++--
drivers/mfd/rk8xx-i2c.c | 2 +-
drivers/mfd/rk8xx-spi.c | 2 +-
include/linux/mfd/rk808.h | 2 +-
4 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/drivers/mfd/rk8xx-core.c b/drivers/mfd/rk8xx-core.c
index 5eda3c0dbbdf..757ef8181328 100644
--- a/drivers/mfd/rk8xx-core.c
+++ b/drivers/mfd/rk8xx-core.c
@@ -692,10 +692,11 @@ void rk8xx_shutdown(struct device *dev)
}
EXPORT_SYMBOL_GPL(rk8xx_shutdown);
-int rk8xx_probe(struct device *dev, int variant, unsigned int irq, struct regmap *regmap)
+int rk8xx_probe(struct device *dev, int variant, unsigned int irq, struct regmap *regmap, bool is_spi)
{
struct rk808 *rk808;
const struct rk808_reg_data *pre_init_reg;
+ enum sys_off_mode pwr_off_mode = SYS_OFF_MODE_POWER_OFF;
const struct mfd_cell *cells;
int dual_support = 0;
int nr_pre_init_regs;
@@ -785,10 +786,20 @@ int rk8xx_probe(struct device *dev, int variant, unsigned int irq, struct regmap
if (ret)
return dev_err_probe(dev, ret, "failed to add MFD devices\n");
+ /*
+ * Currently the Rockchip SPI driver always sleeps when doing SPI
+ * transfers. This is not allowed in the SYS_OFF_MODE_POWER_OFF
+ * handler, so we are using the prepare handler as a workaround.
+ * This should be removed once the Rockchip SPI driver has been
+ * adapted.
+ */
+ if (is_spi)
+ pwr_off_mode = SYS_OFF_MODE_POWER_OFF_PREPARE;
+
if (device_property_read_bool(dev, "rockchip,system-power-controller") ||
device_property_read_bool(dev, "system-power-controller")) {
ret = devm_register_sys_off_handler(dev,
- SYS_OFF_MODE_POWER_OFF_PREPARE, SYS_OFF_PRIO_HIGH,
+ pwr_off_mode, SYS_OFF_PRIO_HIGH,
&rk808_power_off, rk808);
if (ret)
return dev_err_probe(dev, ret,
diff --git a/drivers/mfd/rk8xx-i2c.c b/drivers/mfd/rk8xx-i2c.c
index 69a6b297d723..a2029decd654 100644
--- a/drivers/mfd/rk8xx-i2c.c
+++ b/drivers/mfd/rk8xx-i2c.c
@@ -189,7 +189,7 @@ static int rk8xx_i2c_probe(struct i2c_client *client)
return dev_err_probe(&client->dev, PTR_ERR(regmap),
"regmap initialization failed\n");
- return rk8xx_probe(&client->dev, data->variant, client->irq, regmap);
+ return rk8xx_probe(&client->dev, data->variant, client->irq, regmap, false);
}
static void rk8xx_i2c_shutdown(struct i2c_client *client)
diff --git a/drivers/mfd/rk8xx-spi.c b/drivers/mfd/rk8xx-spi.c
index 3405fb82ff9f..20f9428f94bb 100644
--- a/drivers/mfd/rk8xx-spi.c
+++ b/drivers/mfd/rk8xx-spi.c
@@ -94,7 +94,7 @@ static int rk8xx_spi_probe(struct spi_device *spi)
return dev_err_probe(&spi->dev, PTR_ERR(regmap),
"Failed to init regmap\n");
- return rk8xx_probe(&spi->dev, RK806_ID, spi->irq, regmap);
+ return rk8xx_probe(&spi->dev, RK806_ID, spi->irq, regmap, true);
}
static const struct of_device_id rk8xx_spi_of_match[] = {
diff --git a/include/linux/mfd/rk808.h b/include/linux/mfd/rk808.h
index 69cbea78b430..be15b84cff9e 100644
--- a/include/linux/mfd/rk808.h
+++ b/include/linux/mfd/rk808.h
@@ -1349,7 +1349,7 @@ struct rk808 {
};
void rk8xx_shutdown(struct device *dev);
-int rk8xx_probe(struct device *dev, int variant, unsigned int irq, struct regmap *regmap);
+int rk8xx_probe(struct device *dev, int variant, unsigned int irq, struct regmap *regmap, bool is_spi);
int rk8xx_suspend(struct device *dev);
int rk8xx_resume(struct device *dev);
--
2.43.0
From: yangge <yangge1116(a)126.com>
If a large amount of CMA memory is configured in the system (for example,
CMA memory accounts for 50% of the system memory), starting a virtual
machine with device passthrough will call
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory.
Normally, if a page is present and in a CMA area, pin_user_pages_remote()
will migrate the page from the CMA area to a non-CMA area because of the
FOLL_LONGTERM flag. But the current code causes the migration to fail due
to unexpected page refcounts, which eventually causes the virtual machine
to fail to start.
If a page is added to an LRU batch, its refcount increases by one;
removing the page from the LRU batch decreases it by one. Page migration
requires that the page not be referenced by anything other than its page
mapping. Before migrating a page, we should try to drain it from the LRU
batch in case it is there; however, folio_test_lru() is not sufficient to
tell whether the page is in an LRU batch, and if it is, the migration will
fail.
To solve the problem above, we modify the logic of adding to an LRU
batch: before adding a page to an LRU batch, we clear its LRU flag, so
that folio_test_lru() can be used to check whether the page is in an LRU
batch. This is quite valuable, because we likely don't want to blindly
drain the LRU batch simply because there is some unexpected reference on
a page, as described above.
This change makes the LRU flag of a page invisible for longer, which may
impact some programs. For example, as long as a page is in an LRU batch,
we cannot isolate it, and we cannot check whether it is an LRU page.
Further, a page can now only be in exactly one LRU batch. This doesn't
seem to matter much, because when a new page is allocated from the buddy
allocator and added to an LRU batch, or is isolated, its LRU flag may
also be invisible for a long time.
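Condensed, the pattern this patch applies at each batching call site looks
like this (a sketch using the lru_deactivate batch as the example; see the
diff below for the real call sites):
```c
struct folio_batch *fbatch;

folio_get(folio);
if (!folio_test_clear_lru(folio)) {
	/* already in an LRU batch (or not on an LRU at all): back off */
	folio_put(folio);
	return;
}
/* the flag stays cleared while the folio sits in the batch, so
 * folio_test_lru() == false now reliably means "do not touch" */
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate);
folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn);
local_unlock(&cpu_fbatches.lock);
```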
Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: yangge <yangge1116(a)126.com>
---
mm/swap.c | 43 +++++++++++++++++++++++++++++++------------
1 file changed, 31 insertions(+), 12 deletions(-)
V4:
Adjust commit message according to David's comments
V3:
Add fixes tag
V2:
Adjust code and commit message according to David's comments
diff --git a/mm/swap.c b/mm/swap.c
index dc205bd..9caf6b0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -211,10 +211,6 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
for (i = 0; i < folio_batch_count(fbatch); i++) {
struct folio *folio = fbatch->folios[i];
- /* block memcg migration while the folio moves between lru */
- if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
- continue;
-
folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
move_fn(lruvec, folio);
@@ -255,11 +251,16 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
void folio_rotate_reclaimable(struct folio *folio)
{
if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
- !folio_test_unevictable(folio) && folio_test_lru(folio)) {
+ !folio_test_unevictable(folio)) {
struct folio_batch *fbatch;
unsigned long flags;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock_irqsave(&lru_rotate.lock, flags);
fbatch = this_cpu_ptr(&lru_rotate.fbatch);
folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn);
@@ -352,11 +353,15 @@ static void folio_activate_drain(int cpu)
void folio_activate(struct folio *folio)
{
- if (folio_test_lru(folio) && !folio_test_active(folio) &&
- !folio_test_unevictable(folio)) {
+ if (!folio_test_active(folio) && !folio_test_unevictable(folio)) {
struct folio_batch *fbatch;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.activate);
folio_batch_add_and_move(fbatch, folio, folio_activate_fn);
@@ -700,6 +705,11 @@ void deactivate_file_folio(struct folio *folio)
return;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file);
folio_batch_add_and_move(fbatch, folio, lru_deactivate_file_fn);
@@ -716,11 +726,16 @@ void deactivate_file_folio(struct folio *folio)
*/
void folio_deactivate(struct folio *folio)
{
- if (folio_test_lru(folio) && !folio_test_unevictable(folio) &&
- (folio_test_active(folio) || lru_gen_enabled())) {
+ if (!folio_test_unevictable(folio) && (folio_test_active(folio) ||
+ lru_gen_enabled())) {
struct folio_batch *fbatch;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate);
folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn);
@@ -737,12 +752,16 @@ void folio_deactivate(struct folio *folio)
*/
void folio_mark_lazyfree(struct folio *folio)
{
- if (folio_test_lru(folio) && folio_test_anon(folio) &&
- folio_test_swapbacked(folio) && !folio_test_swapcache(folio) &&
- !folio_test_unevictable(folio)) {
+ if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio) && !folio_test_unevictable(folio)) {
struct folio_batch *fbatch;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree);
folio_batch_add_and_move(fbatch, folio, lru_lazyfree_fn);
--
2.7.4
From: Kairui Song <kasong(a)tencent.com>
This series fixes the page cache corruption issue reported by Christian
Theune [1]. The issue was reported to affect kernels back to 5.19.
The currently maintained affected branches are 6.1 and 6.6; the fix is
already included in 6.10.
This series can be applied to both 6.1 and 6.6.
Patch 3/3 is the fixing patch. It was initially submitted and merged as
an optimization, but it was found to have fixed the corruption by
handling the race correctly.
Patches 1/3 and 2/3 are required for 3/3.
Patch 3/3 includes some unit test code, making the LOC of the backport a
bit higher, but it should be fine to keep, since it is just test code.
Note that there still seems to be some unresolved problem in Link [1],
but that should be a different issue; the commits being backported have
been well tested and fix the corruption issue just fine.
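For context on patch 2/3: xas_get_order() reports the order of the entry
at the xa_state's current index. A rough usage sketch (assuming a mapping
and index, with the state walked under the lock; see the patch itself for
the real call sites):
```c
XA_STATE(xas, &mapping->i_pages, index);
unsigned int order;

xas_lock_irq(&xas);
xas_load(&xas);               /* position the state at @index */
order = xas_get_order(&xas);  /* order of the entry found there */
xas_unlock_irq(&xas);
```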
Link: https://lore.kernel.org/linux-mm/A5A976CB-DB57-4513-A700-656580488AB6@flyin… [1]
Kairui Song (3):
mm/filemap: return early if failed to allocate memory for split
lib/xarray: introduce a new helper xas_get_order
mm/filemap: optimize filemap folio adding
include/linux/xarray.h | 6 +++
lib/test_xarray.c | 93 ++++++++++++++++++++++++++++++++++++++++++
lib/xarray.c | 49 ++++++++++++++--------
mm/filemap.c | 50 ++++++++++++++++++-----
4 files changed, 169 insertions(+), 29 deletions(-)
--
2.46.1
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 2003ee8a9aa14d766b06088156978d53c2e9be3d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024120323-snowiness-subway-3844@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2003ee8a9aa14d766b06088156978d53c2e9be3d Mon Sep 17 00:00:00 2001
From: Muchun Song <muchun.song(a)linux.dev>
Date: Mon, 14 Oct 2024 17:29:32 +0800
Subject: [PATCH] block: fix missing dispatching request when queue is started
or unquiesced
Supposing the following scenario with a virtio_blk driver.
CPU0                            CPU1                            CPU2

blk_mq_try_issue_directly()
  __blk_mq_issue_directly()
    q->mq_ops->queue_rq()
      virtio_queue_rq()
        blk_mq_stop_hw_queue()
                                                                virtblk_done()
                                blk_mq_try_issue_directly()
                                  if (blk_mq_hctx_stopped())
                                    blk_mq_request_bypass_insert()  blk_mq_run_hw_queue()
                                    blk_mq_run_hw_queue()           blk_mq_run_hw_queue()
                                    blk_mq_insert_request()
                                    return
After CPU0 has marked the queue as stopped, CPU1 will see that the queue
is stopped. But before CPU1 puts the request on the dispatch list, CPU2
receives the completion interrupt for the request, so it runs the
hardware queue and marks the queue as non-stopped. Meanwhile, CPU1 also
runs the same hardware queue. After both CPU1 and CPU2 complete
blk_mq_run_hw_queue(), CPU1 just puts the request on the same hardware
queue and returns, missing the dispatch of a request. Fix it by running
the hardware queue explicitly. blk_mq_request_issue_directly() has to
handle a similar situation; fix it as well.
Fixes: d964f04a8fde ("blk-mq: fix direct issue")
Cc: stable(a)vger.kernel.org
Cc: Muchun Song <muchun.song(a)linux.dev>
Signed-off-by: Muchun Song <songmuchun(a)bytedance.com>
Reviewed-by: Ming Lei <ming.lei(a)redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-2-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d05a56e3639..5deb9dffca0a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2647,6 +2647,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
blk_mq_insert_request(rq, 0);
+ blk_mq_run_hw_queue(hctx, false);
return;
}
@@ -2677,6 +2678,7 @@ static blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
blk_mq_insert_request(rq, 0);
+ blk_mq_run_hw_queue(hctx, false);
return BLK_STS_OK;
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 6bda857bcbb86fb9d0e54fbef93a093d51172acc
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024120342-monsoon-wildcat-d0a1@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 6bda857bcbb86fb9d0e54fbef93a093d51172acc Mon Sep 17 00:00:00 2001
From: Muchun Song <muchun.song(a)linux.dev>
Date: Mon, 14 Oct 2024 17:29:33 +0800
Subject: [PATCH] block: fix ordering between checking QUEUE_FLAG_QUIESCED
request adding
Supposing the following scenario.
CPU0                                      CPU1

blk_mq_insert_request()  1) store
                                          blk_mq_unquiesce_queue()
                                            blk_queue_flag_clear()  3) store
                                            blk_mq_run_hw_queues()
                                              blk_mq_run_hw_queue()
                                                if (!blk_mq_hctx_has_pending())  4) load
                                                  return
blk_mq_run_hw_queue()
  if (blk_queue_quiesced())  2) load
    return
blk_mq_sched_dispatch_requests()
The full memory barrier should be inserted between 1) and 2), as well as
between 3) and 4), to make sure that either CPU0 sees that
QUEUE_FLAG_QUIESCED is cleared or CPU1 sees the dispatch list or the set
bit in the software queue bitmap. Otherwise it is possible that neither
CPU reruns the hardware queue, causing starvation.
So the first solution is to 1) add a pair of memory barriers to fix the
problem, and the other is to 2) use hctx->queue->queue_lock to synchronize
QUEUE_FLAG_QUIESCED. Here we chose 2) to fix it, since memory barriers are
not easy to maintain.
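To make option 1) concrete, the barriers would pair roughly like this
(illustrative only, since the patch below implements option 2) instead):
```c
/* CPU0, insert side (hypothetical barrier placement) */
blk_mq_insert_request(rq, 0);                  /* 1) store */
smp_mb();                                      /* pairs with CPU1's barrier */
if (blk_queue_quiesced(q))                     /* 2) load */
	return;

/* CPU1, unquiesce side (hypothetical barrier placement) */
blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);  /* 3) store */
smp_mb();                                      /* pairs with CPU0's barrier */
if (!blk_mq_hctx_has_pending(hctx))            /* 4) load */
	return;
```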
Fixes: f4560ffe8cec ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
Cc: stable(a)vger.kernel.org
Cc: Muchun Song <muchun.song(a)linux.dev>
Signed-off-by: Muchun Song <songmuchun(a)bytedance.com>
Reviewed-by: Ming Lei <ming.lei(a)redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-3-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5deb9dffca0a..bb4ee2380dce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2227,6 +2227,24 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
}
EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
+static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
+{
+ bool need_run;
+
+ /*
+ * When queue is quiesced, we may be switching io scheduler, or
+ * updating nr_hw_queues, or other things, and we can't run queue
+ * any more, even blk_mq_hctx_has_pending() can't be called safely.
+ *
+ * And queue will be rerun in blk_mq_unquiesce_queue() if it is
+ * quiesced.
+ */
+ __blk_mq_run_dispatch_ops(hctx->queue, false,
+ need_run = !blk_queue_quiesced(hctx->queue) &&
+ blk_mq_hctx_has_pending(hctx));
+ return need_run;
+}
+
/**
* blk_mq_run_hw_queue - Start to run a hardware queue.
* @hctx: Pointer to the hardware queue to run.
@@ -2247,20 +2265,23 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
- /*
- * When queue is quiesced, we may be switching io scheduler, or
- * updating nr_hw_queues, or other things, and we can't run queue
- * any more, even __blk_mq_hctx_has_pending() can't be called safely.
- *
- * And queue will be rerun in blk_mq_unquiesce_queue() if it is
- * quiesced.
- */
- __blk_mq_run_dispatch_ops(hctx->queue, false,
- need_run = !blk_queue_quiesced(hctx->queue) &&
- blk_mq_hctx_has_pending(hctx));
+ need_run = blk_mq_hw_queue_need_run(hctx);
+ if (!need_run) {
+ unsigned long flags;
- if (!need_run)
- return;
+ /*
+ * Synchronize with blk_mq_unquiesce_queue(), because we check
+ * if hw queue is quiesced locklessly above, we need the use
+ * ->queue_lock to make sure we see the up-to-date status to
+ * not miss rerunning the hw queue.
+ */
+ spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+ need_run = blk_mq_hw_queue_need_run(hctx);
+ spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
+
+ if (!need_run)
+ return;
+ }
if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
blk_mq_delay_run_hw_queue(hctx, 0);
Commit 60e3318e3e900 ("cifs: use fs_context for automounts") was
released in v6.1.54 and broke failover when one of the servers inside
the DFS becomes unavailable. We reproduced the problem on EC2 instances
of different types. Reverting the aforementioned commit on top of the
latest stable version v6.1.94 resolves the problem.
The earliest working version is v6.2-rc1. There were two big merges of
CIFS fixes: [1] and [2]. We would like to ask for help investigating this
problem and determining whether some of those patches need to be
backported. Also, is it safe to simply revert the problematic commit
until proper fixes/backports are available?
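For reference, the revert we tested was simply:
```
git checkout v6.1.94
git revert 60e3318e3e900   # cifs: use fs_context for automounts
```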
We will help with testing and confirm whether a fix works, but let me also
list the steps we used to reproduce the problem, in case it helps to
identify the issue:
1. Create an Active Directory domain, e.g. 'corp.fsxtest.local', in AWS
Directory Service with:
- three AWS FSx file systems, filesystem1..filesystem3
- three Windows servers; they have DFS installed as per
https://learn.microsoft.com/en-us/windows-server/storage/dfs-namespaces/dfs…:
  - dfs-srv1: EC2AMAZ-2EGTM59
  - dfs-srv2: EC2AMAZ-1N36PRD
  - dfs-srv3: EC2AMAZ-0PAUH2U
2. Create a DFS namespace, e.g. 'dfs-namespace', in Windows Server 2008
mode with three folder targets in it:
- referral-a mapped to filesystem1.corp.local
- referral-b mapped to filesystem2.corp.local
- referral-c mapped to filesystem3.corp.local
- local folders dfs-srv1..dfs-srv3 in C:\DFSRoots\dfs-namespace on every
Windows server. This helps to quickly determine the underlying server
when the DFS is mounted.
3. Enable cifs debug logs:
```
echo 'module cifs +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file fs/cifs/* +p' > /sys/kernel/debug/dynamic_debug/control
echo 7 > /proc/fs/cifs/cifsFYI
```
4. Mount the DFS namespace on an Amazon Linux 2023 instance running any
vanilla kernel v6.1.54+:
```
dmesg -c &>/dev/null
cd /mnt
mount -t cifs -o cred=/mnt/creds,echo_interval=5 \
//corp.fsxtest.local/dfs-namespace \
./dfs-namespace
```
5. List the DFS root; this is also required to avoid the recursive mounts
that happen during a regular 'ls' run:
```
sh -c 'ls dfs-namespace'
dfs-srv2 referral-a referral-b
```
The DFS server is EC2AMAZ-1N36PRD; it's also listed in the mount output:
```
[root@ip-172-31-2-82 mnt]# mount | grep dfs
//corp.fsxtest.local/dfs-namespace on /mnt/dfs-namespace type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.11.26,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
//EC2AMAZ-1N36PRD.corp.fsxtest.local/dfs-namespace/referral-a on /mnt/dfs-namespace/referral-a type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.12.80,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
```
List files in the first folder:
```
sh -c 'ls dfs-namespace/referral-a'
filea.txt.txt
```
6. Shut down DFS server-2.
List the DFS root again; the server changed from dfs-srv2 to dfs-srv1
(EC2AMAZ-2EGTM59):
```
sh -c 'ls dfs-namespace'
dfs-srv1 referral-a referral-b
```
7. Try to list files in another folder; this causes ls to fail with an
error:
```
sh -c 'ls dfs-namespace/referral-b'
ls: cannot access 'dfs-namespace/referral-b': No route to host
```
Sometimes it's also an 'Operation now in progress' error.
mount shows the same output:
```
//corp.fsxtest.local/dfs-namespace on /mnt/dfs-namespace type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.11.26,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
//EC2AMAZ-1N36PRD.corp.fsxtest.local/dfs-namespace/referral-a on /mnt/dfs-namespace/referral-a type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.12.80,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
```
I also attached kernel debug logs from this test.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
Reported-by: Andrei Paniakin <apanyaki(a)amazon.com>
Bisected-by: Simba Bonga <simbarb(a)amazon.com>
---
#regzbot introduced: v6.1.54..v6.2-rc1
From: Ard Biesheuvel <ardb(a)kernel.org>
GCC and Clang both implement stack protector support based on Thread
Local Storage (TLS) variables, and this is used in the kernel to
implement per-task stack cookies, by copying a task's stack cookie into
a per-CPU variable every time it is scheduled in.
Both now also implement -mstack-protector-guard-symbol=, which permits
the TLS variable to be specified directly. This is useful because it
will allow us to move away from using a fixed offset of 40 bytes into
the per-CPU area on x86_64, which requires a lot of special handling in
the per-CPU code and the runtime relocation code.
However, while GCC is rather lax in its implementation of this command
line option, Clang actually requires that the provided symbol name
refers to a TLS variable (i.e., one declared with __thread), although it
also permits the variable to be undeclared entirely, in which case it
will use an implicit declaration of the right type.
The upshot of this is that Clang will emit the correct references to the
stack cookie variable in most cases, e.g.,
10d: 64 a1 00 00 00 00 mov %fs:0x0,%eax
10f: R_386_32 __stack_chk_guard
However, if a non-TLS definition of the symbol in question is visible in
the same compilation unit (which amounts to the whole of vmlinux if LTO
is enabled), it will drop the per-CPU prefix and emit a load from a
bogus address.
Work around this by using a symbol name that never occurs in C code, and
emit it as an alias in the linker script.
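To illustrate the distinction (a sketch with a placeholder symbol name;
the two declarations are mutually exclusive alternatives, not one file):
```c
/* Accepted for -mstack-protector-guard-symbol=guard: a proper TLS
 * declaration (or none at all) -- Clang emits segment-prefixed loads. */
extern __thread unsigned long guard;

/* A visible non-TLS definition of the same symbol -- effectively what
 * the per-CPU machinery provides once LTO merges compilation units --
 * makes Clang drop the segment prefix and load from a bogus address. */
unsigned long guard;
```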
Fixes: 3fb0fdb3bbe7 ("x86/stackprotector/32: Make the canary into a regular percpu variable")
Cc: <stable(a)vger.kernel.org>
Cc: Fangrui Song <i(a)maskray.me>
Cc: Uros Bizjak <ubizjak(a)gmail.com>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Andy Lutomirski <luto(a)kernel.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/1854
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
Signed-off-by: Brian Gerst <brgerst(a)gmail.com>
---
arch/x86/Makefile | 5 +++--
arch/x86/entry/entry.S | 16 ++++++++++++++++
arch/x86/include/asm/asm-prototypes.h | 3 +++
arch/x86/kernel/cpu/common.c | 2 ++
arch/x86/kernel/vmlinux.lds.S | 3 +++
5 files changed, 27 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index cd75e78a06c1..5b773b34768d 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -142,9 +142,10 @@ ifeq ($(CONFIG_X86_32),y)
ifeq ($(CONFIG_STACKPROTECTOR),y)
ifeq ($(CONFIG_SMP),y)
- KBUILD_CFLAGS += -mstack-protector-guard-reg=fs -mstack-protector-guard-symbol=__stack_chk_guard
+ KBUILD_CFLAGS += -mstack-protector-guard-reg=fs \
+ -mstack-protector-guard-symbol=__ref_stack_chk_guard
else
- KBUILD_CFLAGS += -mstack-protector-guard=global
+ KBUILD_CFLAGS += -mstack-protector-guard=global
endif
endif
else
diff --git a/arch/x86/entry/entry.S b/arch/x86/entry/entry.S
index 324686bca368..b7ea3e8e9ecc 100644
--- a/arch/x86/entry/entry.S
+++ b/arch/x86/entry/entry.S
@@ -51,3 +51,19 @@ EXPORT_SYMBOL_GPL(mds_verw_sel);
.popsection
THUNK warn_thunk_thunk, __warn_thunk
+
+#ifndef CONFIG_X86_64
+/*
+ * Clang's implementation of TLS stack cookies requires the variable in
+ * question to be a TLS variable. If the variable happens to be defined as an
+ * ordinary variable with external linkage in the same compilation unit (which
+ * amounts to the whole of vmlinux with LTO enabled), Clang will drop the
+ * segment register prefix from the references, resulting in broken code. Work
+ * around this by avoiding the symbol used in -mstack-protector-guard-symbol=
+ * entirely in the C code, and use an alias emitted by the linker script
+ * instead.
+ */
+#ifdef CONFIG_STACKPROTECTOR
+EXPORT_SYMBOL(__ref_stack_chk_guard);
+#endif
+#endif
diff --git a/arch/x86/include/asm/asm-prototypes.h b/arch/x86/include/asm/asm-prototypes.h
index 25466c4d2134..3674006e3974 100644
--- a/arch/x86/include/asm/asm-prototypes.h
+++ b/arch/x86/include/asm/asm-prototypes.h
@@ -20,3 +20,6 @@
extern void cmpxchg8b_emu(void);
#endif
+#if defined(__GENKSYMS__) && defined(CONFIG_STACKPROTECTOR)
+extern unsigned long __ref_stack_chk_guard;
+#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8f41ab219cf1..9d42bd15e06c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2091,8 +2091,10 @@ void syscall_init(void)
#ifdef CONFIG_STACKPROTECTOR
DEFINE_PER_CPU(unsigned long, __stack_chk_guard);
+#ifndef CONFIG_SMP
EXPORT_PER_CPU_SYMBOL(__stack_chk_guard);
#endif
+#endif
#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 410546bacc0f..d61c3584f3e6 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -468,6 +468,9 @@ SECTIONS
. = ASSERT((_end - LOAD_OFFSET <= KERNEL_IMAGE_SIZE),
"kernel image bigger than KERNEL_IMAGE_SIZE");
+/* needed for Clang - see arch/x86/entry/entry.S */
+PROVIDE(__ref_stack_chk_guard = __stack_chk_guard);
+
#ifdef CONFIG_X86_64
/*
* Per-cpu symbols which need to be offset from __per_cpu_load
--
2.47.0