While searching for possible refactor of napi_schedule_prep and
__napi_schedule it was notice that the mtk eth driver disable the
interrupt for rx and tx AFTER napi is scheduled.
While this is a very hard to repro case it might happen to have
situation where the interrupt is disabled and never enabled again as the
napi completes and the interrupt is enabled before.
This is caused by the fact that a napi driven by interrupt expect a
logic with:
1. interrupt received. napi prepared -> interrupt disabled -> napi
scheduled
2. napi triggered. ring cleared -> interrupt enabled -> wait for new
interrupt
To prevent this case, disable the interrupt BEFORE the napi is
scheduled.
Fixes: 656e705243fd ("net-next: mediatek: add support for MT7623 ethernet")
Cc: stable(a)vger.kernel.org
Signed-off-by: Christian Marangi <ansuelsmth(a)gmail.com>
---
drivers/net/ethernet/mediatek/mtk_eth_soc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 82b51072aad8..7669b446915a 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -3172,8 +3172,8 @@ static irqreturn_t mtk_handle_irq_rx(int irq, void *_eth)
eth->rx_events++;
if (likely(napi_schedule_prep(ð->rx_napi))) {
- __napi_schedule(ð->rx_napi);
mtk_rx_irq_disable(eth, eth->soc->txrx.rx_irq_done_mask);
+ __napi_schedule(ð->rx_napi);
}
return IRQ_HANDLED;
@@ -3185,8 +3185,8 @@ static irqreturn_t mtk_handle_irq_tx(int irq, void *_eth)
eth->tx_events++;
if (likely(napi_schedule_prep(ð->tx_napi))) {
- __napi_schedule(ð->tx_napi);
mtk_tx_irq_disable(eth, MTK_TX_DONE_INT);
+ __napi_schedule(ð->tx_napi);
}
return IRQ_HANDLED;
--
2.40.1
From: Miquel Raynal <miquel.raynal(a)bootlin.com>
Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with
a soft reset on Renesas SoCs") fixes an issue with Renesas own SJA1000
CAN controller reception: the Rx buffer is only 5 messages long, so when
the bus loaded (eg. a message every 50us), overrun may easily
happen. Upon an overrun situation, due to a possible internal crosstalk
situation, the controller enters a frozen state which only can be
unlocked with a soft reset (experimentally). The solution was to offload
a call to sja1000_start() in a threaded handler. This needs to happen in
process context as this operation requires to sleep. sja1000_start()
basically enters "reset mode", performs a proper software reset and
returns back into "normal mode".
Since this fix was introduced, we no longer observe any stalls in
reception. However it was sporadically observed that the transmit path
would now freeze. Further investigation blamed the fix mentioned above,
and especially the reset operation. Reproducing the reset in a loop
helped identifying what could possibly go wrong. The sja1000 is a single
Tx queue device, which leverages the netdev helpers to process one Tx
message at a time. The logic is: the queue is stopped, the message sent
to the transceiver, once properly transmitted the controller sets a
status bit which triggers an interrupt, in the interrupt handler the
transmission status is checked and the queue woken up. Unfortunately, if
an overrun happens, we might perform the soft reset precisely between
the transmission of the buffer to the transceiver and the advent of the
transmission status bit. We would then stop the transmission operation
without re-enabling the queue, leading to all further transmissions to
be ignored.
The reset interrupt can only happen while the device is "open", and
after a reset we anyway want to resume normal operations, no matter if a
packet to transmit got dropped in the process, so we shall wake up the
queue. Restarting the device and waking-up the queue is exactly what
sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about
the queue state, we must acquire a lock both in the reset handler and in
the transmit path to ensure serialization of both operations. It turns
out, a lock is already held when entering the transmit path, so we can
just acquire/release it as well with the regular net helpers inside the
threaded interrupt handler and this way we should be safe. As the
reset handler might still be called after the transmission of a frame to
the transceiver but before it actually gets transmitted, we must ensure
we don't leak the skb, so we free it (the behavior is consistent, no
matter if there was an skb on the stack or not).
Fixes: 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs")
Cc: stable(a)vger.kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
Link: https://lore.kernel.org/all/20231002160206.190953-1-miquel.raynal@bootlin.c…
[mkl: fixed call to can_free_echo_skb()]
Signed-off-by: Marc Kleine-Budde <mkl(a)pengutronix.de>
---
drivers/net/can/sja1000/sja1000.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
index 0ada0e160e93..743c2eb62b87 100644
--- a/drivers/net/can/sja1000/sja1000.c
+++ b/drivers/net/can/sja1000/sja1000.c
@@ -392,7 +392,13 @@ static irqreturn_t sja1000_reset_interrupt(int irq, void *dev_id)
struct net_device *dev = (struct net_device *)dev_id;
netdev_dbg(dev, "performing a soft reset upon overrun\n");
- sja1000_start(dev);
+
+ netif_tx_lock(dev);
+
+ can_free_echo_skb(dev, 0, NULL);
+ sja1000_set_mode(dev, CAN_MODE_START);
+
+ netif_tx_unlock(dev);
return IRQ_HANDLED;
}
--
2.40.1
Fix three issues with resctrl selftests.
The signal handling fix became necessary after the mount/umount fixes.
The other two came up when I ran resctrl selftests across the server
fleet in our lab to validate the upcoming CAT test rewrite (the rewrite
is not part of this series).
These are developed and should apply cleanly at least on top the
benchmark cleanup series (might apply cleanly also w/o the benchmark
series, I didn't test).
v2:
- Include patch to move _GNU_SOURCE to Makefile to allow normal #include
placement
- Rework the signal register/unregister into patch to use helpers
- Fixed incorrect function parameter description
- Use return !!res to avoid confusing implicit boolean conversion
- Improve MBA/MBM success bound patch's changelog
- Tweak Cc: stable dependencies (make it a chain).
Ilpo Järvinen (6):
selftests/resctrl: Extend signal handler coverage to unmount on
receiving signal
selftests/resctrl: Remove duplicate feature check from CMT test
selftests/resctrl: Move _GNU_SOURCE define into Makefile
selftests/resctrl: Refactor feature check to use resource and feature
name
selftests/resctrl: Fix feature checks
selftests/resctrl: Reduce failures due to outliers in MBA/MBM tests
tools/testing/selftests/resctrl/Makefile | 2 +-
tools/testing/selftests/resctrl/cat_test.c | 8 --
tools/testing/selftests/resctrl/cmt_test.c | 3 -
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 2 +-
tools/testing/selftests/resctrl/resctrl.h | 7 +-
.../testing/selftests/resctrl/resctrl_tests.c | 78 +++++++++++--------
tools/testing/selftests/resctrl/resctrl_val.c | 22 +++---
tools/testing/selftests/resctrl/resctrlfs.c | 69 +++++++---------
9 files changed, 88 insertions(+), 105 deletions(-)
--
2.30.2
The patch below does not apply to the 6.5-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.5.y
git checkout FETCH_HEAD
git cherry-pick -x 8b4d9469d0b0e553208ee6f62f2807111fde18b9
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100443-freewill-shanty-0f05@gregkh' --subject-prefix 'PATCH 6.5.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8b4d9469d0b0e553208ee6f62f2807111fde18b9 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <dlemoal(a)kernel.org>
Date: Tue, 5 Sep 2023 09:06:23 +0900
Subject: [PATCH] ata: libata-scsi: Fix delayed scsi_rescan_device() execution
Commit 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after
device resume") modified ata_scsi_dev_rescan() to check the scsi device
"is_suspended" power field to ensure that the scsi device associated
with an ATA device is fully resumed when scsi_rescan_device() is
executed. However, this fix is problematic as:
1) It relies on a PM internal field that should not be used without PM
device locking protection.
2) The check for is_suspended and the call to scsi_rescan_device() are
not atomic and a suspend PM event may be triggered between them,
casuing scsi_rescan_device() to be called on a suspended device and
in that function blocking while holding the scsi device lock. This
would deadlock a following resume operation.
These problems can trigger PM deadlocks on resume, especially with
resume operations triggered quickly after or during suspend operations.
E.g., a simple bash script like:
for (( i=0; i<10; i++ )); do
echo "+2 > /sys/class/rtc/rtc0/wakealarm
echo mem > /sys/power/state
done
that triggers a resume 2 seconds after starting suspending a system can
quickly lead to a PM deadlock preventing the system from correctly
resuming.
Fix this by replacing the check on is_suspended with a check on the
return value given by scsi_rescan_device() as that function will fail if
called against a suspended device. Also make sure rescan tasks already
scheduled are first cancelled before suspending an ata port.
Fixes: 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume")
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal(a)kernel.org>
Reviewed-by: Hannes Reinecke <hare(a)suse.de>
Reviewed-by: Niklas Cassel <niklas.cassel(a)wdc.com>
Tested-by: Geert Uytterhoeven <geert+renesas(a)glider.be>
Reviewed-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index a0bc01606b30..092372334e92 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5168,11 +5168,27 @@ static const unsigned int ata_port_suspend_ehi = ATA_EHI_QUIET
static void ata_port_suspend(struct ata_port *ap, pm_message_t mesg)
{
+ /*
+ * We are about to suspend the port, so we do not care about
+ * scsi_rescan_device() calls scheduled by previous resume operations.
+ * The next resume will schedule the rescan again. So cancel any rescan
+ * that is not done yet.
+ */
+ cancel_delayed_work_sync(&ap->scsi_rescan_task);
+
ata_port_request_pm(ap, mesg, 0, ata_port_suspend_ehi, false);
}
static void ata_port_suspend_async(struct ata_port *ap, pm_message_t mesg)
{
+ /*
+ * We are about to suspend the port, so we do not care about
+ * scsi_rescan_device() calls scheduled by previous resume operations.
+ * The next resume will schedule the rescan again. So cancel any rescan
+ * that is not done yet.
+ */
+ cancel_delayed_work_sync(&ap->scsi_rescan_task);
+
ata_port_request_pm(ap, mesg, 0, ata_port_suspend_ehi, true);
}
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index a0e58d22d222..6850cac803c1 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -4756,7 +4756,7 @@ void ata_scsi_dev_rescan(struct work_struct *work)
struct ata_link *link;
struct ata_device *dev;
unsigned long flags;
- bool delay_rescan = false;
+ int ret = 0;
mutex_lock(&ap->scsi_scan_mutex);
spin_lock_irqsave(ap->lock, flags);
@@ -4765,37 +4765,34 @@ void ata_scsi_dev_rescan(struct work_struct *work)
ata_for_each_dev(dev, link, ENABLED) {
struct scsi_device *sdev = dev->sdev;
+ /*
+ * If the port was suspended before this was scheduled,
+ * bail out.
+ */
+ if (ap->pflags & ATA_PFLAG_SUSPENDED)
+ goto unlock;
+
if (!sdev)
continue;
if (scsi_device_get(sdev))
continue;
- /*
- * If the rescan work was scheduled because of a resume
- * event, the port is already fully resumed, but the
- * SCSI device may not yet be fully resumed. In such
- * case, executing scsi_rescan_device() may cause a
- * deadlock with the PM code on device_lock(). Prevent
- * this by giving up and retrying rescan after a short
- * delay.
- */
- delay_rescan = sdev->sdev_gendev.power.is_suspended;
- if (delay_rescan) {
- scsi_device_put(sdev);
- break;
- }
-
spin_unlock_irqrestore(ap->lock, flags);
- scsi_rescan_device(sdev);
+ ret = scsi_rescan_device(sdev);
scsi_device_put(sdev);
spin_lock_irqsave(ap->lock, flags);
+
+ if (ret)
+ goto unlock;
}
}
+unlock:
spin_unlock_irqrestore(ap->lock, flags);
mutex_unlock(&ap->scsi_scan_mutex);
- if (delay_rescan)
+ /* Reschedule with a delay if scsi_rescan_device() returned an error */
+ if (ret)
schedule_delayed_work(&ap->scsi_rescan_task,
msecs_to_jiffies(5));
}
The patch below does not apply to the 6.5-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.5.y
git checkout FETCH_HEAD
git cherry-pick -x ff48b37802e5c134e2dfc4d091f10b2eb5065a72
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100454-hamper-falsify-595d@gregkh' --subject-prefix 'PATCH 6.5.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ff48b37802e5c134e2dfc4d091f10b2eb5065a72 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <dlemoal(a)kernel.org>
Date: Fri, 15 Sep 2023 15:00:13 +0900
Subject: [PATCH] scsi: Do not attempt to rescan suspended devices
scsi_rescan_device() takes a scsi device lock before executing a device
handler and device driver rescan methods. Waiting for the completion of
any command issued to the device by these methods will thus be done with
the device lock held. As a result, there is a risk of deadlocking within
the power management code if scsi_rescan_device() is called to handle a
device resume with the associated scsi device not yet resumed.
Avoid such situation by checking that the target scsi device is in the
running state, that is, fully capable of executing commands, before
proceeding with the rescan and bailout returning -EWOULDBLOCK otherwise.
With this error return, the caller can retry rescaning the device after
a delay.
The state check is done with the device lock held and is thus safe
against incoming suspend power management operations.
Fixes: 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume")
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal(a)kernel.org>
Reviewed-by: Hannes Reinecke <hare(a)suse.de>
Reviewed-by: Niklas Cassel <niklas.cassel(a)wdc.com>
Tested-by: Geert Uytterhoeven <geert+renesas(a)glider.be>
Reviewed-by: Martin K. Petersen <martin.petersen(a)oracle.com>
Reviewed-by: Bart Van Assche <bvanassche(a)acm.org>
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 52014b2d39e1..3db4d31a03a1 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1619,12 +1619,24 @@ int scsi_add_device(struct Scsi_Host *host, uint channel,
}
EXPORT_SYMBOL(scsi_add_device);
-void scsi_rescan_device(struct scsi_device *sdev)
+int scsi_rescan_device(struct scsi_device *sdev)
{
struct device *dev = &sdev->sdev_gendev;
+ int ret = 0;
device_lock(dev);
+ /*
+ * Bail out if the device is not running. Otherwise, the rescan may
+ * block waiting for commands to be executed, with us holding the
+ * device lock. This can result in a potential deadlock in the power
+ * management core code when system resume is on-going.
+ */
+ if (sdev->sdev_state != SDEV_RUNNING) {
+ ret = -EWOULDBLOCK;
+ goto unlock;
+ }
+
scsi_attach_vpd(sdev);
scsi_cdl_check(sdev);
@@ -1638,7 +1650,11 @@ void scsi_rescan_device(struct scsi_device *sdev)
drv->rescan(dev);
module_put(dev->driver->owner);
}
+
+unlock:
device_unlock(dev);
+
+ return ret;
}
EXPORT_SYMBOL(scsi_rescan_device);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 49f768d0ff37..4c2dc8150c6d 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -764,7 +764,7 @@ scsi_template_proc_dir(const struct scsi_host_template *sht);
#define scsi_template_proc_dir(sht) NULL
#endif
extern void scsi_scan_host(struct Scsi_Host *);
-extern void scsi_rescan_device(struct scsi_device *);
+extern int scsi_rescan_device(struct scsi_device *sdev);
extern void scsi_remove_host(struct Scsi_Host *);
extern struct Scsi_Host *scsi_host_get(struct Scsi_Host *);
extern int scsi_host_busy(struct Scsi_Host *shost);
The quilt patch titled
Subject: mips: use nth_page() in place of direct struct page manipulation
has been removed from the -mm tree. Its filename was
mips-use-nth_page-in-place-of-direct-struct-page-manipulation.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Zi Yan <ziy(a)nvidia.com>
Subject: mips: use nth_page() in place of direct struct page manipulation
Date: Wed, 13 Sep 2023 16:12:48 -0400
__flush_dcache_pages() is called during hugetlb migration via
migrate_pages() -> migrate_hugetlbs() -> unmap_and_move_huge_page() ->
move_to_new_folio() -> flush_dcache_folio(). And with hugetlb and without
sparsemem vmemmap, struct page is not guaranteed to be contiguous beyond a
section. Use nth_page() instead.
Without the fix, a wrong address might be used for data cache page flush.
No bug is reported. The fix comes from code inspection.
Link: https://lkml.kernel.org/r/20230913201248.452081-6-zi.yan@sent.com
Fixes: 15fa3e8e3269 ("mips: implement the new page table range API")
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Mike Rapoport (IBM) <rppt(a)kernel.org>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Thomas Bogendoerfer <tsbogend(a)alpha.franken.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/mips/mm/cache.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/mips/mm/cache.c~mips-use-nth_page-in-place-of-direct-struct-page-manipulation
+++ a/arch/mips/mm/cache.c
@@ -117,7 +117,7 @@ void __flush_dcache_pages(struct page *p
* get faulted into the tlb (and thus flushed) anyways.
*/
for (i = 0; i < nr; i++) {
- addr = (unsigned long)kmap_local_page(page + i);
+ addr = (unsigned long)kmap_local_page(nth_page(page, i));
flush_data_cache_page(addr);
kunmap_local((void *)addr);
}
_
Patches currently in -mm which might be from ziy(a)nvidia.com are
The quilt patch titled
Subject: mm/cma: use nth_page() in place of direct struct page manipulation
has been removed from the -mm tree. Its filename was
mm-cma-use-nth_page-in-place-of-direct-struct-page-manipulation.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Zi Yan <ziy(a)nvidia.com>
Subject: mm/cma: use nth_page() in place of direct struct page manipulation
Date: Wed, 13 Sep 2023 16:12:44 -0400
Patch series "Use nth_page() in place of direct struct page manipulation",
v3.
On SPARSEMEM without VMEMMAP, struct page is not guaranteed to be
contiguous, since each memory section's memmap might be allocated
independently. hugetlb pages can go beyond a memory section size, thus
direct struct page manipulation on hugetlb pages/subpages might give wrong
struct page. Kernel provides nth_page() to do the manipulation properly.
Use that whenever code can see hugetlb pages.
This patch (of 5):
When dealing with hugetlb pages, manipulating struct page pointers
directly can get to wrong struct page, since struct page is not guaranteed
to be contiguous on SPARSEMEM without VMEMMAP. Use nth_page() to handle
it properly.
Without the fix, page_kasan_tag_reset() could reset wrong page tags,
causing a wrong kasan result. No related bug is reported. The fix
comes from code inspection.
Link: https://lkml.kernel.org/r/20230913201248.452081-1-zi.yan@sent.com
Link: https://lkml.kernel.org/r/20230913201248.452081-2-zi.yan@sent.com
Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
Reviewed-by: Muchun Song <songmuchun(a)bytedance.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Mike Rapoport (IBM) <rppt(a)kernel.org>
Cc: Thomas Bogendoerfer <tsbogend(a)alpha.franken.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/cma.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/cma.c~mm-cma-use-nth_page-in-place-of-direct-struct-page-manipulation
+++ a/mm/cma.c
@@ -505,7 +505,7 @@ struct page *cma_alloc(struct cma *cma,
*/
if (page) {
for (i = 0; i < count; i++)
- page_kasan_tag_reset(page + i);
+ page_kasan_tag_reset(nth_page(page, i));
}
if (ret && !no_warn) {
_
Patches currently in -mm which might be from ziy(a)nvidia.com are
The quilt patch titled
Subject: fs: use nth_page() in place of direct struct page manipulation
has been removed from the -mm tree. Its filename was
fs-use-nth_page-in-place-of-direct-struct-page-manipulation.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Zi Yan <ziy(a)nvidia.com>
Subject: fs: use nth_page() in place of direct struct page manipulation
Date: Wed, 13 Sep 2023 16:12:47 -0400
When dealing with hugetlb pages, struct page is not guaranteed to be
contiguous on SPARSEMEM without VMEMMAP. Use nth_page() to handle it
properly.
Without the fix, a wrong subpage might be checked for HWPoison, causing wrong
number of bytes of a page copied to user space. No bug is reported. The fix
comes from code inspection.
Link: https://lkml.kernel.org/r/20230913201248.452081-5-zi.yan@sent.com
Fixes: 38c1ddbde6c6 ("hugetlbfs: improve read HWPOISON hugepage")
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
Reviewed-by: Muchun Song <songmuchun(a)bytedance.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Mike Rapoport (IBM) <rppt(a)kernel.org>
Cc: Thomas Bogendoerfer <tsbogend(a)alpha.franken.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hugetlbfs/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/fs/hugetlbfs/inode.c~fs-use-nth_page-in-place-of-direct-struct-page-manipulation
+++ a/fs/hugetlbfs/inode.c
@@ -295,7 +295,7 @@ static size_t adjust_range_hwpoison(stru
size_t res = 0;
/* First subpage to start the loop. */
- page += offset / PAGE_SIZE;
+ page = nth_page(page, offset / PAGE_SIZE);
offset %= PAGE_SIZE;
while (1) {
if (is_raw_hwpoison_page_in_hugepage(page))
@@ -309,7 +309,7 @@ static size_t adjust_range_hwpoison(stru
break;
offset += n;
if (offset == PAGE_SIZE) {
- page++;
+ page = nth_page(page, 1);
offset = 0;
}
}
_
Patches currently in -mm which might be from ziy(a)nvidia.com are
The quilt patch titled
Subject: mm/memory_hotplug: use pfn math in place of direct struct page manipulation
has been removed from the -mm tree. Its filename was
mm-memory_hotplug-use-pfn-math-in-place-of-direct-struct-page-manipulation.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Zi Yan <ziy(a)nvidia.com>
Subject: mm/memory_hotplug: use pfn math in place of direct struct page manipulation
Date: Wed, 13 Sep 2023 16:12:46 -0400
When dealing with hugetlb pages, manipulating struct page pointers
directly can get to wrong struct page, since struct page is not guaranteed
to be contiguous on SPARSEMEM without VMEMMAP. Use pfn calculation to
handle it properly.
Without the fix, a wrong number of page might be skipped. Since skip cannot be
negative, scan_movable_page() will end early and might miss a movable page with
-ENOENT. This might fail offline_pages(). No bug is reported. The fix comes
from code inspection.
Link: https://lkml.kernel.org/r/20230913201248.452081-4-zi.yan@sent.com
Fixes: eeb0efd071d8 ("mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages")
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
Reviewed-by: Muchun Song <songmuchun(a)bytedance.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Mike Rapoport (IBM) <rppt(a)kernel.org>
Cc: Thomas Bogendoerfer <tsbogend(a)alpha.franken.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory_hotplug.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/memory_hotplug.c~mm-memory_hotplug-use-pfn-math-in-place-of-direct-struct-page-manipulation
+++ a/mm/memory_hotplug.c
@@ -1689,7 +1689,7 @@ static int scan_movable_pages(unsigned l
*/
if (HPageMigratable(head))
goto found;
- skip = compound_nr(head) - (page - head);
+ skip = compound_nr(head) - (pfn - page_to_pfn(head));
pfn += skip - 1;
}
return -ENOENT;
_
Patches currently in -mm which might be from ziy(a)nvidia.com are
The quilt patch titled
Subject: mm/hugetlb: use nth_page() in place of direct struct page manipulation
has been removed from the -mm tree. Its filename was
mm-hugetlb-use-nth_page-in-place-of-direct-struct-page-manipulation.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Zi Yan <ziy(a)nvidia.com>
Subject: mm/hugetlb: use nth_page() in place of direct struct page manipulation
Date: Wed, 13 Sep 2023 16:12:45 -0400
When dealing with hugetlb pages, manipulating struct page pointers
directly can get to wrong struct page, since struct page is not guaranteed
to be contiguous on SPARSEMEM without VMEMMAP. Use nth_page() to handle
it properly.
A wrong or non-existing page might be tried to be grabbed, either
leading to a non freeable page or kernel memory access errors. No bug
is reported. It comes from code inspection.
Link: https://lkml.kernel.org/r/20230913201248.452081-3-zi.yan@sent.com
Fixes: 57a196a58421 ("hugetlb: simplify hugetlb handling in follow_page_mask")
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
Reviewed-by: Muchun Song <songmuchun(a)bytedance.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Mike Rapoport (IBM) <rppt(a)kernel.org>
Cc: Thomas Bogendoerfer <tsbogend(a)alpha.franken.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/hugetlb.c~mm-hugetlb-use-nth_page-in-place-of-direct-struct-page-manipulation
+++ a/mm/hugetlb.c
@@ -6493,7 +6493,7 @@ struct page *hugetlb_follow_page_mask(st
}
}
- page += ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
+ page = nth_page(page, ((address & ~huge_page_mask(h)) >> PAGE_SHIFT));
/*
* Note that page may be a sub-page, and with vmemmap
_
Patches currently in -mm which might be from ziy(a)nvidia.com are
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 7b086755fb8cdbb6b3e45a1bbddc00e7f9b1dc03
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100429-postnasal-hassle-02be@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7b086755fb8cdbb6b3e45a1bbddc00e7f9b1dc03 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes(a)cmpxchg.org>
Date: Mon, 11 Sep 2023 14:11:08 -0400
Subject: [PATCH] mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong
buddy list
Commit 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock")
bypasses the pcplist on lock contention and returns the page directly to
the buddy list of the page's migratetype.
For pages that don't have their own pcplist, such as CMA and HIGHATOMIC,
the migratetype is temporarily updated such that the page can hitch a ride
on the MOVABLE pcplist. Their true type is later reassessed when flushing
in free_pcppages_bulk(). However, when lock contention is detected after
the type was already overridden, the bypass will then put the page on the
wrong buddy list.
Once on the MOVABLE buddy list, the page becomes eligible for fallbacks
and even stealing. In the case of HIGHATOMIC, otherwise ineligible
allocations can dip into the highatomic reserves. In the case of CMA, the
page can be lost from the CMA region permanently.
Use a separate pcpmigratetype variable for the pcplist override. Use the
original migratetype when going directly to the buddy. This fixes the bug
and should make the intentions more obvious in the code.
Originally sent here to address the HIGHATOMIC case:
https://lore.kernel.org/lkml/20230821183733.106619-4-hannes@cmpxchg.org/
Changelog updated in response to the CMA-specific bug report.
[mgorman(a)techsingularity.net: updated changelog]
Link: https://lkml.kernel.org/r/20230911181108.GA104295@cmpxchg.org
Fixes: 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock")
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Reported-by: Joe Liu <joe.liu(a)mediatek.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c5be12f9336..95546f376302 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2400,7 +2400,7 @@ void free_unref_page(struct page *page, unsigned int order)
struct per_cpu_pages *pcp;
struct zone *zone;
unsigned long pfn = page_to_pfn(page);
- int migratetype;
+ int migratetype, pcpmigratetype;
if (!free_unref_page_prepare(page, pfn, order))
return;
@@ -2408,24 +2408,24 @@ void free_unref_page(struct page *page, unsigned int order)
/*
* We only track unmovable, reclaimable and movable on pcp lists.
* Place ISOLATE pages on the isolated list because they are being
- * offlined but treat HIGHATOMIC as movable pages so we can get those
- * areas back if necessary. Otherwise, we may have to free
+ * offlined but treat HIGHATOMIC and CMA as movable pages so we can
+ * get those areas back if necessary. Otherwise, we may have to free
* excessively into the page allocator
*/
- migratetype = get_pcppage_migratetype(page);
+ migratetype = pcpmigratetype = get_pcppage_migratetype(page);
if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
if (unlikely(is_migrate_isolate(migratetype))) {
free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
return;
}
- migratetype = MIGRATE_MOVABLE;
+ pcpmigratetype = MIGRATE_MOVABLE;
}
zone = page_zone(page);
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (pcp) {
- free_unref_page_commit(zone, pcp, page, migratetype, order);
+ free_unref_page_commit(zone, pcp, page, pcpmigratetype, order);
pcp_spin_unlock(pcp);
} else {
free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
Recent changes to kernel_connect() and kernel_bind() ensure that
callers are insulated from changes to the address parameter made by BPF
SOCK_ADDR hooks. This patch wraps direct calls to ops->connect() and
ops->bind() with kernel_connect() and kernel_bind() to ensure that SMB
mounts do not see their mount address overwritten in such cases.
Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.cam…
Cc: <stable(a)vger.kernel.org> # 6.x.y
Signed-off-by: Jordan Rife <jrife(a)google.com>
---
fs/smb/client/connect.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c
index 3902e90dca6b0..ce11165446cfb 100644
--- a/fs/smb/client/connect.c
+++ b/fs/smb/client/connect.c
@@ -2895,9 +2895,9 @@ bind_socket(struct TCP_Server_Info *server)
if (server->srcaddr.ss_family != AF_UNSPEC) {
/* Bind to the specified local IP address */
struct socket *socket = server->ssocket;
- rc = socket->ops->bind(socket,
- (struct sockaddr *) &server->srcaddr,
- sizeof(server->srcaddr));
+ rc = kernel_bind(socket,
+ (struct sockaddr *) &server->srcaddr,
+ sizeof(server->srcaddr));
if (rc < 0) {
struct sockaddr_in *saddr4;
struct sockaddr_in6 *saddr6;
@@ -3046,8 +3046,8 @@ generic_ip_connect(struct TCP_Server_Info *server)
socket->sk->sk_sndbuf,
socket->sk->sk_rcvbuf, socket->sk->sk_rcvtimeo);
- rc = socket->ops->connect(socket, saddr, slen,
- server->noblockcnt ? O_NONBLOCK : 0);
+ rc = kernel_connect(socket, saddr, slen,
+ server->noblockcnt ? O_NONBLOCK : 0);
/*
* When mounting SMB root file systems, we do not want to block in
* connect. Otherwise bail out and then let cifs_reconnect() perform
--
2.42.0.582.g8ccd20d70d-goog
Hi Greg,
> On 4 Oct 2023, at 16:58, gregkh(a)linuxfoundation.org wrote:
>
>
> This is a note to let you know that I've just added the patch titled
>
> ASoC: amd: yc: Fix non-functional mic on Lenovo 82QF and 82UG
>
> to the 6.1-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> asoc-amd-yc-fix-non-functional-mic-on-lenovo-82qf-and-82ug.patch
> and it can be found in the queue-6.1 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
>
> From 1263cc0f414d212129c0f1289b49b7df77f92084 Mon Sep 17 00:00:00 2001
> From: August Wikerfors <git(a)augustwikerfors.se>
> Date: Mon, 11 Sep 2023 23:34:09 +0200
> Subject: ASoC: amd: yc: Fix non-functional mic on Lenovo 82QF and 82UG
>
> From: August Wikerfors <git(a)augustwikerfors.se>
>
> commit 1263cc0f414d212129c0f1289b49b7df77f92084 upstream.
>
> Like the Lenovo 82TL and 82V2, the Lenovo 82QF (Yoga 7 14ARB7) and 82UG
> (Legion S7 16ARHA7) both need a quirk entry for the internal microphone to
> function. Commit c008323fe361 ("ASoC: amd: yc: Fix a non-functional mic on
> Lenovo 82SJ") restricted the quirk that previously matched "82" to "82V2",
> breaking microphone functionality on these devices. Fix this by adding
> specific quirks for these models, as was done for the Lenovo 82TL.
>
> Fixes: c008323fe361 ("ASoC: amd: yc: Fix a non-functional mic on Lenovo 82SJ")
> Closes: https://github.com/tomsom/yoga-linux/issues/51
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=208555#c780
> Cc: stable(a)vger.kernel.org
> Signed-off-by: August Wikerfors <git(a)augustwikerfors.se>
> Link: https://lore.kernel.org/r/20230911213409.6106-1-git@augustwikerfors.se
> Signed-off-by: Mark Brown <broonie(a)kernel.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Please also add commit cfff2a7794d2 ("ASoC: amd: yc: Fix a non-functional mic on Lenovo 82TL") to 6.1+ as it fixed the same regression for another model (but wasn’t tagged for stable).
Regards,
August Wikerfors
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 0b207d02bd9ab8dcc31b262ca9f60dbc1822500d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100407-ruby-cancel-cdd0@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0b207d02bd9ab8dcc31b262ca9f60dbc1822500d Mon Sep 17 00:00:00 2001
From: Ilya Dryomov <idryomov(a)gmail.com>
Date: Wed, 20 Sep 2023 19:01:03 +0200
Subject: [PATCH] rbd: take header_rwsem in rbd_dev_refresh() only when
updating
rbd_dev_refresh() has been holding header_rwsem across header and
parent info read-in unnecessarily for ages. With commit 870611e4877e
("rbd: get snapshot context after exclusive lock is ensured to be
held"), the potential for deadlocks became much more real owning to
a) header_rwsem now nesting inside lock_rwsem and b) rw_semaphores
not allowing new readers after a writer is registered.
For example, assuming that I/O request 1, I/O request 2 and header
read-in request all target the same OSD:
1. I/O request 1 comes in and gets submitted
2. watch error occurs
3. rbd_watch_errcb() takes lock_rwsem for write, clears owner_cid and
releases lock_rwsem
4. after reestablishing the watch, rbd_reregister_watch() calls
rbd_dev_refresh() which takes header_rwsem for write and submits
a header read-in request
5. I/O request 2 comes in: after taking lock_rwsem for read in
__rbd_img_handle_request(), it blocks trying to take header_rwsem
for read in rbd_img_object_requests()
6. another watch error occurs
7. rbd_watch_errcb() blocks trying to take lock_rwsem for write
8. I/O request 1 completion is received by the messenger but can't be
processed because lock_rwsem won't be granted anymore
9. header read-in request completion can't be received, let alone
processed, because the messenger is stranded
Change rbd_dev_refresh() to take header_rwsem only for actually
updating rbd_dev->header. Header and parent info read-in don't need
any locking.
Cc: stable(a)vger.kernel.org # 0b035401c570: rbd: move rbd_dev_refresh() definition
Cc: stable(a)vger.kernel.org # 510a7330c82a: rbd: decouple header read-in from updating rbd_dev->header
Cc: stable(a)vger.kernel.org # c10311776f0a: rbd: decouple parent info read-in from updating rbd_dev
Cc: stable(a)vger.kernel.org
Fixes: 870611e4877e ("rbd: get snapshot context after exclusive lock is ensured to be held")
Signed-off-by: Ilya Dryomov <idryomov(a)gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang(a)easystack.cn>
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index d62a0298c890..a999b698b131 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -6986,7 +6986,14 @@ static void rbd_dev_update_header(struct rbd_device *rbd_dev,
rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
rbd_assert(rbd_dev->header.object_prefix); /* !first_time */
- rbd_dev->header.image_size = header->image_size;
+ if (rbd_dev->header.image_size != header->image_size) {
+ rbd_dev->header.image_size = header->image_size;
+
+ if (!rbd_is_snap(rbd_dev)) {
+ rbd_dev->mapping.size = header->image_size;
+ rbd_dev_update_size(rbd_dev);
+ }
+ }
ceph_put_snap_context(rbd_dev->header.snapc);
rbd_dev->header.snapc = header->snapc;
@@ -7044,11 +7051,9 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
{
struct rbd_image_header header = { 0 };
struct parent_image_info pii = { 0 };
- u64 mapping_size;
int ret;
- down_write(&rbd_dev->header_rwsem);
- mapping_size = rbd_dev->mapping.size;
+ dout("%s rbd_dev %p\n", __func__, rbd_dev);
ret = rbd_dev_header_info(rbd_dev, &header, false);
if (ret)
@@ -7064,18 +7069,13 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
goto out;
}
+ down_write(&rbd_dev->header_rwsem);
rbd_dev_update_header(rbd_dev, &header);
if (rbd_dev->parent)
rbd_dev_update_parent(rbd_dev, &pii);
-
- rbd_assert(!rbd_is_snap(rbd_dev));
- rbd_dev->mapping.size = rbd_dev->header.image_size;
+ up_write(&rbd_dev->header_rwsem);
out:
- up_write(&rbd_dev->header_rwsem);
- if (!ret && mapping_size != rbd_dev->mapping.size)
- rbd_dev_update_size(rbd_dev);
-
rbd_parent_info_cleanup(&pii);
rbd_image_header_cleanup(&header);
return ret;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 0b207d02bd9ab8dcc31b262ca9f60dbc1822500d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100406-routine-tackle-b16a@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0b207d02bd9ab8dcc31b262ca9f60dbc1822500d Mon Sep 17 00:00:00 2001
From: Ilya Dryomov <idryomov(a)gmail.com>
Date: Wed, 20 Sep 2023 19:01:03 +0200
Subject: [PATCH] rbd: take header_rwsem in rbd_dev_refresh() only when
updating
rbd_dev_refresh() has been holding header_rwsem across header and
parent info read-in unnecessarily for ages. With commit 870611e4877e
("rbd: get snapshot context after exclusive lock is ensured to be
held"), the potential for deadlocks became much more real owning to
a) header_rwsem now nesting inside lock_rwsem and b) rw_semaphores
not allowing new readers after a writer is registered.
For example, assuming that I/O request 1, I/O request 2 and header
read-in request all target the same OSD:
1. I/O request 1 comes in and gets submitted
2. watch error occurs
3. rbd_watch_errcb() takes lock_rwsem for write, clears owner_cid and
releases lock_rwsem
4. after reestablishing the watch, rbd_reregister_watch() calls
rbd_dev_refresh() which takes header_rwsem for write and submits
a header read-in request
5. I/O request 2 comes in: after taking lock_rwsem for read in
__rbd_img_handle_request(), it blocks trying to take header_rwsem
for read in rbd_img_object_requests()
6. another watch error occurs
7. rbd_watch_errcb() blocks trying to take lock_rwsem for write
8. I/O request 1 completion is received by the messenger but can't be
processed because lock_rwsem won't be granted anymore
9. header read-in request completion can't be received, let alone
processed, because the messenger is stranded
Change rbd_dev_refresh() to take header_rwsem only for actually
updating rbd_dev->header. Header and parent info read-in don't need
any locking.
Cc: stable(a)vger.kernel.org # 0b035401c570: rbd: move rbd_dev_refresh() definition
Cc: stable(a)vger.kernel.org # 510a7330c82a: rbd: decouple header read-in from updating rbd_dev->header
Cc: stable(a)vger.kernel.org # c10311776f0a: rbd: decouple parent info read-in from updating rbd_dev
Cc: stable(a)vger.kernel.org
Fixes: 870611e4877e ("rbd: get snapshot context after exclusive lock is ensured to be held")
Signed-off-by: Ilya Dryomov <idryomov(a)gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang(a)easystack.cn>
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index d62a0298c890..a999b698b131 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -6986,7 +6986,14 @@ static void rbd_dev_update_header(struct rbd_device *rbd_dev,
rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
rbd_assert(rbd_dev->header.object_prefix); /* !first_time */
- rbd_dev->header.image_size = header->image_size;
+ if (rbd_dev->header.image_size != header->image_size) {
+ rbd_dev->header.image_size = header->image_size;
+
+ if (!rbd_is_snap(rbd_dev)) {
+ rbd_dev->mapping.size = header->image_size;
+ rbd_dev_update_size(rbd_dev);
+ }
+ }
ceph_put_snap_context(rbd_dev->header.snapc);
rbd_dev->header.snapc = header->snapc;
@@ -7044,11 +7051,9 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
{
struct rbd_image_header header = { 0 };
struct parent_image_info pii = { 0 };
- u64 mapping_size;
int ret;
- down_write(&rbd_dev->header_rwsem);
- mapping_size = rbd_dev->mapping.size;
+ dout("%s rbd_dev %p\n", __func__, rbd_dev);
ret = rbd_dev_header_info(rbd_dev, &header, false);
if (ret)
@@ -7064,18 +7069,13 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
goto out;
}
+ down_write(&rbd_dev->header_rwsem);
rbd_dev_update_header(rbd_dev, &header);
if (rbd_dev->parent)
rbd_dev_update_parent(rbd_dev, &pii);
-
- rbd_assert(!rbd_is_snap(rbd_dev));
- rbd_dev->mapping.size = rbd_dev->header.image_size;
+ up_write(&rbd_dev->header_rwsem);
out:
- up_write(&rbd_dev->header_rwsem);
- if (!ret && mapping_size != rbd_dev->mapping.size)
- rbd_dev_update_size(rbd_dev);
-
rbd_parent_info_cleanup(&pii);
rbd_image_header_cleanup(&header);
return ret;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 0b207d02bd9ab8dcc31b262ca9f60dbc1822500d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100404-duffel-hungrily-2206@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0b207d02bd9ab8dcc31b262ca9f60dbc1822500d Mon Sep 17 00:00:00 2001
From: Ilya Dryomov <idryomov(a)gmail.com>
Date: Wed, 20 Sep 2023 19:01:03 +0200
Subject: [PATCH] rbd: take header_rwsem in rbd_dev_refresh() only when
updating
rbd_dev_refresh() has been holding header_rwsem across header and
parent info read-in unnecessarily for ages. With commit 870611e4877e
("rbd: get snapshot context after exclusive lock is ensured to be
held"), the potential for deadlocks became much more real owning to
a) header_rwsem now nesting inside lock_rwsem and b) rw_semaphores
not allowing new readers after a writer is registered.
For example, assuming that I/O request 1, I/O request 2 and header
read-in request all target the same OSD:
1. I/O request 1 comes in and gets submitted
2. watch error occurs
3. rbd_watch_errcb() takes lock_rwsem for write, clears owner_cid and
releases lock_rwsem
4. after reestablishing the watch, rbd_reregister_watch() calls
rbd_dev_refresh() which takes header_rwsem for write and submits
a header read-in request
5. I/O request 2 comes in: after taking lock_rwsem for read in
__rbd_img_handle_request(), it blocks trying to take header_rwsem
for read in rbd_img_object_requests()
6. another watch error occurs
7. rbd_watch_errcb() blocks trying to take lock_rwsem for write
8. I/O request 1 completion is received by the messenger but can't be
processed because lock_rwsem won't be granted anymore
9. header read-in request completion can't be received, let alone
processed, because the messenger is stranded
Change rbd_dev_refresh() to take header_rwsem only for actually
updating rbd_dev->header. Header and parent info read-in don't need
any locking.
Cc: stable(a)vger.kernel.org # 0b035401c570: rbd: move rbd_dev_refresh() definition
Cc: stable(a)vger.kernel.org # 510a7330c82a: rbd: decouple header read-in from updating rbd_dev->header
Cc: stable(a)vger.kernel.org # c10311776f0a: rbd: decouple parent info read-in from updating rbd_dev
Cc: stable(a)vger.kernel.org
Fixes: 870611e4877e ("rbd: get snapshot context after exclusive lock is ensured to be held")
Signed-off-by: Ilya Dryomov <idryomov(a)gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang(a)easystack.cn>
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index d62a0298c890..a999b698b131 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -6986,7 +6986,14 @@ static void rbd_dev_update_header(struct rbd_device *rbd_dev,
rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
rbd_assert(rbd_dev->header.object_prefix); /* !first_time */
- rbd_dev->header.image_size = header->image_size;
+ if (rbd_dev->header.image_size != header->image_size) {
+ rbd_dev->header.image_size = header->image_size;
+
+ if (!rbd_is_snap(rbd_dev)) {
+ rbd_dev->mapping.size = header->image_size;
+ rbd_dev_update_size(rbd_dev);
+ }
+ }
ceph_put_snap_context(rbd_dev->header.snapc);
rbd_dev->header.snapc = header->snapc;
@@ -7044,11 +7051,9 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
{
struct rbd_image_header header = { 0 };
struct parent_image_info pii = { 0 };
- u64 mapping_size;
int ret;
- down_write(&rbd_dev->header_rwsem);
- mapping_size = rbd_dev->mapping.size;
+ dout("%s rbd_dev %p\n", __func__, rbd_dev);
ret = rbd_dev_header_info(rbd_dev, &header, false);
if (ret)
@@ -7064,18 +7069,13 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
goto out;
}
+ down_write(&rbd_dev->header_rwsem);
rbd_dev_update_header(rbd_dev, &header);
if (rbd_dev->parent)
rbd_dev_update_parent(rbd_dev, &pii);
-
- rbd_assert(!rbd_is_snap(rbd_dev));
- rbd_dev->mapping.size = rbd_dev->header.image_size;
+ up_write(&rbd_dev->header_rwsem);
out:
- up_write(&rbd_dev->header_rwsem);
- if (!ret && mapping_size != rbd_dev->mapping.size)
- rbd_dev_update_size(rbd_dev);
-
rbd_parent_info_cleanup(&pii);
rbd_image_header_cleanup(&header);
return ret;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 0b207d02bd9ab8dcc31b262ca9f60dbc1822500d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100402-stomp-wrist-122b@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0b207d02bd9ab8dcc31b262ca9f60dbc1822500d Mon Sep 17 00:00:00 2001
From: Ilya Dryomov <idryomov(a)gmail.com>
Date: Wed, 20 Sep 2023 19:01:03 +0200
Subject: [PATCH] rbd: take header_rwsem in rbd_dev_refresh() only when
updating
rbd_dev_refresh() has been holding header_rwsem across header and
parent info read-in unnecessarily for ages. With commit 870611e4877e
("rbd: get snapshot context after exclusive lock is ensured to be
held"), the potential for deadlocks became much more real owning to
a) header_rwsem now nesting inside lock_rwsem and b) rw_semaphores
not allowing new readers after a writer is registered.
For example, assuming that I/O request 1, I/O request 2 and header
read-in request all target the same OSD:
1. I/O request 1 comes in and gets submitted
2. watch error occurs
3. rbd_watch_errcb() takes lock_rwsem for write, clears owner_cid and
releases lock_rwsem
4. after reestablishing the watch, rbd_reregister_watch() calls
rbd_dev_refresh() which takes header_rwsem for write and submits
a header read-in request
5. I/O request 2 comes in: after taking lock_rwsem for read in
__rbd_img_handle_request(), it blocks trying to take header_rwsem
for read in rbd_img_object_requests()
6. another watch error occurs
7. rbd_watch_errcb() blocks trying to take lock_rwsem for write
8. I/O request 1 completion is received by the messenger but can't be
processed because lock_rwsem won't be granted anymore
9. header read-in request completion can't be received, let alone
processed, because the messenger is stranded
Change rbd_dev_refresh() to take header_rwsem only for actually
updating rbd_dev->header. Header and parent info read-in don't need
any locking.
Cc: stable(a)vger.kernel.org # 0b035401c570: rbd: move rbd_dev_refresh() definition
Cc: stable(a)vger.kernel.org # 510a7330c82a: rbd: decouple header read-in from updating rbd_dev->header
Cc: stable(a)vger.kernel.org # c10311776f0a: rbd: decouple parent info read-in from updating rbd_dev
Cc: stable(a)vger.kernel.org
Fixes: 870611e4877e ("rbd: get snapshot context after exclusive lock is ensured to be held")
Signed-off-by: Ilya Dryomov <idryomov(a)gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang(a)easystack.cn>
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index d62a0298c890..a999b698b131 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -6986,7 +6986,14 @@ static void rbd_dev_update_header(struct rbd_device *rbd_dev,
rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
rbd_assert(rbd_dev->header.object_prefix); /* !first_time */
- rbd_dev->header.image_size = header->image_size;
+ if (rbd_dev->header.image_size != header->image_size) {
+ rbd_dev->header.image_size = header->image_size;
+
+ if (!rbd_is_snap(rbd_dev)) {
+ rbd_dev->mapping.size = header->image_size;
+ rbd_dev_update_size(rbd_dev);
+ }
+ }
ceph_put_snap_context(rbd_dev->header.snapc);
rbd_dev->header.snapc = header->snapc;
@@ -7044,11 +7051,9 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
{
struct rbd_image_header header = { 0 };
struct parent_image_info pii = { 0 };
- u64 mapping_size;
int ret;
- down_write(&rbd_dev->header_rwsem);
- mapping_size = rbd_dev->mapping.size;
+ dout("%s rbd_dev %p\n", __func__, rbd_dev);
ret = rbd_dev_header_info(rbd_dev, &header, false);
if (ret)
@@ -7064,18 +7069,13 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
goto out;
}
+ down_write(&rbd_dev->header_rwsem);
rbd_dev_update_header(rbd_dev, &header);
if (rbd_dev->parent)
rbd_dev_update_parent(rbd_dev, &pii);
-
- rbd_assert(!rbd_is_snap(rbd_dev));
- rbd_dev->mapping.size = rbd_dev->header.image_size;
+ up_write(&rbd_dev->header_rwsem);
out:
- up_write(&rbd_dev->header_rwsem);
- if (!ret && mapping_size != rbd_dev->mapping.size)
- rbd_dev_update_size(rbd_dev);
-
rbd_parent_info_cleanup(&pii);
rbd_image_header_cleanup(&header);
return ret;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x c10311776f0a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100450-headrest-ecologist-205f@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c10311776f0a8ddea2276df96e255625b07045a8 Mon Sep 17 00:00:00 2001
From: Ilya Dryomov <idryomov(a)gmail.com>
Date: Wed, 20 Sep 2023 18:38:26 +0200
Subject: [PATCH] rbd: decouple parent info read-in from updating rbd_dev
Unlike header read-in, parent info read-in is already decoupled in
get_parent_info(), but it's buried in rbd_dev_v2_parent_info() along
with the processing logic.
Separate the initial read-in and update read-in logic into
rbd_dev_setup_parent() and rbd_dev_update_parent() respectively and
have rbd_dev_v2_parent_info() just populate struct parent_image_info
(i.e. what get_parent_info() did). Some existing QoI issues, like
flatten of a standalone clone being disregarded on refresh, remain.
Signed-off-by: Ilya Dryomov <idryomov(a)gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang(a)easystack.cn>
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6ed5520ef303..d62a0298c890 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5594,6 +5594,14 @@ struct parent_image_info {
u64 overlap;
};
+static void rbd_parent_info_cleanup(struct parent_image_info *pii)
+{
+ kfree(pii->pool_ns);
+ kfree(pii->image_id);
+
+ memset(pii, 0, sizeof(*pii));
+}
+
/*
* The caller is responsible for @pii.
*/
@@ -5663,6 +5671,9 @@ static int __get_parent_info(struct rbd_device *rbd_dev,
if (pii->has_overlap)
ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
+ dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
+ __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
+ pii->has_overlap, pii->overlap);
return 0;
e_inval:
@@ -5701,14 +5712,17 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
pii->has_overlap = true;
ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
+ dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
+ __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
+ pii->has_overlap, pii->overlap);
return 0;
e_inval:
return -EINVAL;
}
-static int get_parent_info(struct rbd_device *rbd_dev,
- struct parent_image_info *pii)
+static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev,
+ struct parent_image_info *pii)
{
struct page *req_page, *reply_page;
void *p;
@@ -5736,7 +5750,7 @@ static int get_parent_info(struct rbd_device *rbd_dev,
return ret;
}
-static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
+static int rbd_dev_setup_parent(struct rbd_device *rbd_dev)
{
struct rbd_spec *parent_spec;
struct parent_image_info pii = { 0 };
@@ -5746,37 +5760,12 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
if (!parent_spec)
return -ENOMEM;
- ret = get_parent_info(rbd_dev, &pii);
+ ret = rbd_dev_v2_parent_info(rbd_dev, &pii);
if (ret)
goto out_err;
- dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
- __func__, pii.pool_id, pii.pool_ns, pii.image_id, pii.snap_id,
- pii.has_overlap, pii.overlap);
-
- if (pii.pool_id == CEPH_NOPOOL || !pii.has_overlap) {
- /*
- * Either the parent never existed, or we have
- * record of it but the image got flattened so it no
- * longer has a parent. When the parent of a
- * layered image disappears we immediately set the
- * overlap to 0. The effect of this is that all new
- * requests will be treated as if the image had no
- * parent.
- *
- * If !pii.has_overlap, the parent image spec is not
- * applicable. It's there to avoid duplication in each
- * snapshot record.
- */
- if (rbd_dev->parent_overlap) {
- rbd_dev->parent_overlap = 0;
- rbd_dev_parent_put(rbd_dev);
- pr_info("%s: clone image has been flattened\n",
- rbd_dev->disk->disk_name);
- }
-
+ if (pii.pool_id == CEPH_NOPOOL || !pii.has_overlap)
goto out; /* No parent? No problem. */
- }
/* The ceph file layout needs to fit pool id in 32 bits */
@@ -5788,46 +5777,34 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
}
/*
- * The parent won't change (except when the clone is
- * flattened, already handled that). So we only need to
- * record the parent spec we have not already done so.
+ * The parent won't change except when the clone is flattened,
+ * so we only need to record the parent image spec once.
*/
- if (!rbd_dev->parent_spec) {
- parent_spec->pool_id = pii.pool_id;
- if (pii.pool_ns && *pii.pool_ns) {
- parent_spec->pool_ns = pii.pool_ns;
- pii.pool_ns = NULL;
- }
- parent_spec->image_id = pii.image_id;
- pii.image_id = NULL;
- parent_spec->snap_id = pii.snap_id;
-
- rbd_dev->parent_spec = parent_spec;
- parent_spec = NULL; /* rbd_dev now owns this */
+ parent_spec->pool_id = pii.pool_id;
+ if (pii.pool_ns && *pii.pool_ns) {
+ parent_spec->pool_ns = pii.pool_ns;
+ pii.pool_ns = NULL;
}
+ parent_spec->image_id = pii.image_id;
+ pii.image_id = NULL;
+ parent_spec->snap_id = pii.snap_id;
+
+ rbd_assert(!rbd_dev->parent_spec);
+ rbd_dev->parent_spec = parent_spec;
+ parent_spec = NULL; /* rbd_dev now owns this */
/*
- * We always update the parent overlap. If it's zero we issue
- * a warning, as we will proceed as if there was no parent.
+ * Record the parent overlap. If it's zero, issue a warning as
+ * we will proceed as if there is no parent.
*/
- if (!pii.overlap) {
- if (parent_spec) {
- /* refresh, careful to warn just once */
- if (rbd_dev->parent_overlap)
- rbd_warn(rbd_dev,
- "clone now standalone (overlap became 0)");
- } else {
- /* initial probe */
- rbd_warn(rbd_dev, "clone is standalone (overlap 0)");
- }
- }
+ if (!pii.overlap)
+ rbd_warn(rbd_dev, "clone is standalone (overlap 0)");
rbd_dev->parent_overlap = pii.overlap;
out:
ret = 0;
out_err:
- kfree(pii.pool_ns);
- kfree(pii.image_id);
+ rbd_parent_info_cleanup(&pii);
rbd_spec_put(parent_spec);
return ret;
}
@@ -6977,7 +6954,7 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth)
}
if (rbd_dev->header.features & RBD_FEATURE_LAYERING) {
- ret = rbd_dev_v2_parent_info(rbd_dev);
+ ret = rbd_dev_setup_parent(rbd_dev);
if (ret)
goto err_out_probe;
}
@@ -7026,9 +7003,47 @@ static void rbd_dev_update_header(struct rbd_device *rbd_dev,
}
}
+static void rbd_dev_update_parent(struct rbd_device *rbd_dev,
+ struct parent_image_info *pii)
+{
+ if (pii->pool_id == CEPH_NOPOOL || !pii->has_overlap) {
+ /*
+ * Either the parent never existed, or we have
+ * record of it but the image got flattened so it no
+ * longer has a parent. When the parent of a
+ * layered image disappears we immediately set the
+ * overlap to 0. The effect of this is that all new
+ * requests will be treated as if the image had no
+ * parent.
+ *
+ * If !pii.has_overlap, the parent image spec is not
+ * applicable. It's there to avoid duplication in each
+ * snapshot record.
+ */
+ if (rbd_dev->parent_overlap) {
+ rbd_dev->parent_overlap = 0;
+ rbd_dev_parent_put(rbd_dev);
+ pr_info("%s: clone has been flattened\n",
+ rbd_dev->disk->disk_name);
+ }
+ } else {
+ rbd_assert(rbd_dev->parent_spec);
+
+ /*
+ * Update the parent overlap. If it became zero, issue
+ * a warning as we will proceed as if there is no parent.
+ */
+ if (!pii->overlap && rbd_dev->parent_overlap)
+ rbd_warn(rbd_dev,
+ "clone has become standalone (overlap 0)");
+ rbd_dev->parent_overlap = pii->overlap;
+ }
+}
+
static int rbd_dev_refresh(struct rbd_device *rbd_dev)
{
struct rbd_image_header header = { 0 };
+ struct parent_image_info pii = { 0 };
u64 mapping_size;
int ret;
@@ -7044,12 +7059,14 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
* mapped image getting flattened.
*/
if (rbd_dev->parent) {
- ret = rbd_dev_v2_parent_info(rbd_dev);
+ ret = rbd_dev_v2_parent_info(rbd_dev, &pii);
if (ret)
goto out;
}
rbd_dev_update_header(rbd_dev, &header);
+ if (rbd_dev->parent)
+ rbd_dev_update_parent(rbd_dev, &pii);
rbd_assert(!rbd_is_snap(rbd_dev));
rbd_dev->mapping.size = rbd_dev->header.image_size;
@@ -7059,6 +7076,7 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
if (!ret && mapping_size != rbd_dev->mapping.size)
rbd_dev_update_size(rbd_dev);
+ rbd_parent_info_cleanup(&pii);
rbd_image_header_cleanup(&header);
return ret;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x c10311776f0a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100449-sulk-spray-e75b@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c10311776f0a8ddea2276df96e255625b07045a8 Mon Sep 17 00:00:00 2001
From: Ilya Dryomov <idryomov(a)gmail.com>
Date: Wed, 20 Sep 2023 18:38:26 +0200
Subject: [PATCH] rbd: decouple parent info read-in from updating rbd_dev
Unlike header read-in, parent info read-in is already decoupled in
get_parent_info(), but it's buried in rbd_dev_v2_parent_info() along
with the processing logic.
Separate the initial read-in and update read-in logic into
rbd_dev_setup_parent() and rbd_dev_update_parent() respectively and
have rbd_dev_v2_parent_info() just populate struct parent_image_info
(i.e. what get_parent_info() did). Some existing QoI issues, like
flatten of a standalone clone being disregarded on refresh, remain.
Signed-off-by: Ilya Dryomov <idryomov(a)gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang(a)easystack.cn>
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6ed5520ef303..d62a0298c890 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5594,6 +5594,14 @@ struct parent_image_info {
u64 overlap;
};
+static void rbd_parent_info_cleanup(struct parent_image_info *pii)
+{
+ kfree(pii->pool_ns);
+ kfree(pii->image_id);
+
+ memset(pii, 0, sizeof(*pii));
+}
+
/*
* The caller is responsible for @pii.
*/
@@ -5663,6 +5671,9 @@ static int __get_parent_info(struct rbd_device *rbd_dev,
if (pii->has_overlap)
ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
+ dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
+ __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
+ pii->has_overlap, pii->overlap);
return 0;
e_inval:
@@ -5701,14 +5712,17 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
pii->has_overlap = true;
ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
+ dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
+ __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
+ pii->has_overlap, pii->overlap);
return 0;
e_inval:
return -EINVAL;
}
-static int get_parent_info(struct rbd_device *rbd_dev,
- struct parent_image_info *pii)
+static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev,
+ struct parent_image_info *pii)
{
struct page *req_page, *reply_page;
void *p;
@@ -5736,7 +5750,7 @@ static int get_parent_info(struct rbd_device *rbd_dev,
return ret;
}
-static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
+static int rbd_dev_setup_parent(struct rbd_device *rbd_dev)
{
struct rbd_spec *parent_spec;
struct parent_image_info pii = { 0 };
@@ -5746,37 +5760,12 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
if (!parent_spec)
return -ENOMEM;
- ret = get_parent_info(rbd_dev, &pii);
+ ret = rbd_dev_v2_parent_info(rbd_dev, &pii);
if (ret)
goto out_err;
- dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
- __func__, pii.pool_id, pii.pool_ns, pii.image_id, pii.snap_id,
- pii.has_overlap, pii.overlap);
-
- if (pii.pool_id == CEPH_NOPOOL || !pii.has_overlap) {
- /*
- * Either the parent never existed, or we have
- * record of it but the image got flattened so it no
- * longer has a parent. When the parent of a
- * layered image disappears we immediately set the
- * overlap to 0. The effect of this is that all new
- * requests will be treated as if the image had no
- * parent.
- *
- * If !pii.has_overlap, the parent image spec is not
- * applicable. It's there to avoid duplication in each
- * snapshot record.
- */
- if (rbd_dev->parent_overlap) {
- rbd_dev->parent_overlap = 0;
- rbd_dev_parent_put(rbd_dev);
- pr_info("%s: clone image has been flattened\n",
- rbd_dev->disk->disk_name);
- }
-
+ if (pii.pool_id == CEPH_NOPOOL || !pii.has_overlap)
goto out; /* No parent? No problem. */
- }
/* The ceph file layout needs to fit pool id in 32 bits */
@@ -5788,46 +5777,34 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
}
/*
- * The parent won't change (except when the clone is
- * flattened, already handled that). So we only need to
- * record the parent spec we have not already done so.
+ * The parent won't change except when the clone is flattened,
+ * so we only need to record the parent image spec once.
*/
- if (!rbd_dev->parent_spec) {
- parent_spec->pool_id = pii.pool_id;
- if (pii.pool_ns && *pii.pool_ns) {
- parent_spec->pool_ns = pii.pool_ns;
- pii.pool_ns = NULL;
- }
- parent_spec->image_id = pii.image_id;
- pii.image_id = NULL;
- parent_spec->snap_id = pii.snap_id;
-
- rbd_dev->parent_spec = parent_spec;
- parent_spec = NULL; /* rbd_dev now owns this */
+ parent_spec->pool_id = pii.pool_id;
+ if (pii.pool_ns && *pii.pool_ns) {
+ parent_spec->pool_ns = pii.pool_ns;
+ pii.pool_ns = NULL;
}
+ parent_spec->image_id = pii.image_id;
+ pii.image_id = NULL;
+ parent_spec->snap_id = pii.snap_id;
+
+ rbd_assert(!rbd_dev->parent_spec);
+ rbd_dev->parent_spec = parent_spec;
+ parent_spec = NULL; /* rbd_dev now owns this */
/*
- * We always update the parent overlap. If it's zero we issue
- * a warning, as we will proceed as if there was no parent.
+ * Record the parent overlap. If it's zero, issue a warning as
+ * we will proceed as if there is no parent.
*/
- if (!pii.overlap) {
- if (parent_spec) {
- /* refresh, careful to warn just once */
- if (rbd_dev->parent_overlap)
- rbd_warn(rbd_dev,
- "clone now standalone (overlap became 0)");
- } else {
- /* initial probe */
- rbd_warn(rbd_dev, "clone is standalone (overlap 0)");
- }
- }
+ if (!pii.overlap)
+ rbd_warn(rbd_dev, "clone is standalone (overlap 0)");
rbd_dev->parent_overlap = pii.overlap;
out:
ret = 0;
out_err:
- kfree(pii.pool_ns);
- kfree(pii.image_id);
+ rbd_parent_info_cleanup(&pii);
rbd_spec_put(parent_spec);
return ret;
}
@@ -6977,7 +6954,7 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth)
}
if (rbd_dev->header.features & RBD_FEATURE_LAYERING) {
- ret = rbd_dev_v2_parent_info(rbd_dev);
+ ret = rbd_dev_setup_parent(rbd_dev);
if (ret)
goto err_out_probe;
}
@@ -7026,9 +7003,47 @@ static void rbd_dev_update_header(struct rbd_device *rbd_dev,
}
}
+static void rbd_dev_update_parent(struct rbd_device *rbd_dev,
+ struct parent_image_info *pii)
+{
+ if (pii->pool_id == CEPH_NOPOOL || !pii->has_overlap) {
+ /*
+ * Either the parent never existed, or we have
+ * record of it but the image got flattened so it no
+ * longer has a parent. When the parent of a
+ * layered image disappears we immediately set the
+ * overlap to 0. The effect of this is that all new
+ * requests will be treated as if the image had no
+ * parent.
+ *
+ * If !pii.has_overlap, the parent image spec is not
+ * applicable. It's there to avoid duplication in each
+ * snapshot record.
+ */
+ if (rbd_dev->parent_overlap) {
+ rbd_dev->parent_overlap = 0;
+ rbd_dev_parent_put(rbd_dev);
+ pr_info("%s: clone has been flattened\n",
+ rbd_dev->disk->disk_name);
+ }
+ } else {
+ rbd_assert(rbd_dev->parent_spec);
+
+ /*
+ * Update the parent overlap. If it became zero, issue
+ * a warning as we will proceed as if there is no parent.
+ */
+ if (!pii->overlap && rbd_dev->parent_overlap)
+ rbd_warn(rbd_dev,
+ "clone has become standalone (overlap 0)");
+ rbd_dev->parent_overlap = pii->overlap;
+ }
+}
+
static int rbd_dev_refresh(struct rbd_device *rbd_dev)
{
struct rbd_image_header header = { 0 };
+ struct parent_image_info pii = { 0 };
u64 mapping_size;
int ret;
@@ -7044,12 +7059,14 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
* mapped image getting flattened.
*/
if (rbd_dev->parent) {
- ret = rbd_dev_v2_parent_info(rbd_dev);
+ ret = rbd_dev_v2_parent_info(rbd_dev, &pii);
if (ret)
goto out;
}
rbd_dev_update_header(rbd_dev, &header);
+ if (rbd_dev->parent)
+ rbd_dev_update_parent(rbd_dev, &pii);
rbd_assert(!rbd_is_snap(rbd_dev));
rbd_dev->mapping.size = rbd_dev->header.image_size;
@@ -7059,6 +7076,7 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
if (!ret && mapping_size != rbd_dev->mapping.size)
rbd_dev_update_size(rbd_dev);
+ rbd_parent_info_cleanup(&pii);
rbd_image_header_cleanup(&header);
return ret;
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x c10311776f0a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100448-email-overkill-3d70@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c10311776f0a8ddea2276df96e255625b07045a8 Mon Sep 17 00:00:00 2001
From: Ilya Dryomov <idryomov(a)gmail.com>
Date: Wed, 20 Sep 2023 18:38:26 +0200
Subject: [PATCH] rbd: decouple parent info read-in from updating rbd_dev
Unlike header read-in, parent info read-in is already decoupled in
get_parent_info(), but it's buried in rbd_dev_v2_parent_info() along
with the processing logic.
Separate the initial read-in and update read-in logic into
rbd_dev_setup_parent() and rbd_dev_update_parent() respectively and
have rbd_dev_v2_parent_info() just populate struct parent_image_info
(i.e. what get_parent_info() did). Some existing QoI issues, like
flatten of a standalone clone being disregarded on refresh, remain.
Signed-off-by: Ilya Dryomov <idryomov(a)gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang(a)easystack.cn>
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6ed5520ef303..d62a0298c890 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5594,6 +5594,14 @@ struct parent_image_info {
u64 overlap;
};
+static void rbd_parent_info_cleanup(struct parent_image_info *pii)
+{
+ kfree(pii->pool_ns);
+ kfree(pii->image_id);
+
+ memset(pii, 0, sizeof(*pii));
+}
+
/*
* The caller is responsible for @pii.
*/
@@ -5663,6 +5671,9 @@ static int __get_parent_info(struct rbd_device *rbd_dev,
if (pii->has_overlap)
ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
+ dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
+ __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
+ pii->has_overlap, pii->overlap);
return 0;
e_inval:
@@ -5701,14 +5712,17 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
pii->has_overlap = true;
ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
+ dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
+ __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
+ pii->has_overlap, pii->overlap);
return 0;
e_inval:
return -EINVAL;
}
-static int get_parent_info(struct rbd_device *rbd_dev,
- struct parent_image_info *pii)
+static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev,
+ struct parent_image_info *pii)
{
struct page *req_page, *reply_page;
void *p;
@@ -5736,7 +5750,7 @@ static int get_parent_info(struct rbd_device *rbd_dev,
return ret;
}
-static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
+static int rbd_dev_setup_parent(struct rbd_device *rbd_dev)
{
struct rbd_spec *parent_spec;
struct parent_image_info pii = { 0 };
@@ -5746,37 +5760,12 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
if (!parent_spec)
return -ENOMEM;
- ret = get_parent_info(rbd_dev, &pii);
+ ret = rbd_dev_v2_parent_info(rbd_dev, &pii);
if (ret)
goto out_err;
- dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
- __func__, pii.pool_id, pii.pool_ns, pii.image_id, pii.snap_id,
- pii.has_overlap, pii.overlap);
-
- if (pii.pool_id == CEPH_NOPOOL || !pii.has_overlap) {
- /*
- * Either the parent never existed, or we have
- * record of it but the image got flattened so it no
- * longer has a parent. When the parent of a
- * layered image disappears we immediately set the
- * overlap to 0. The effect of this is that all new
- * requests will be treated as if the image had no
- * parent.
- *
- * If !pii.has_overlap, the parent image spec is not
- * applicable. It's there to avoid duplication in each
- * snapshot record.
- */
- if (rbd_dev->parent_overlap) {
- rbd_dev->parent_overlap = 0;
- rbd_dev_parent_put(rbd_dev);
- pr_info("%s: clone image has been flattened\n",
- rbd_dev->disk->disk_name);
- }
-
+ if (pii.pool_id == CEPH_NOPOOL || !pii.has_overlap)
goto out; /* No parent? No problem. */
- }
/* The ceph file layout needs to fit pool id in 32 bits */
@@ -5788,46 +5777,34 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
}
/*
- * The parent won't change (except when the clone is
- * flattened, already handled that). So we only need to
- * record the parent spec we have not already done so.
+ * The parent won't change except when the clone is flattened,
+ * so we only need to record the parent image spec once.
*/
- if (!rbd_dev->parent_spec) {
- parent_spec->pool_id = pii.pool_id;
- if (pii.pool_ns && *pii.pool_ns) {
- parent_spec->pool_ns = pii.pool_ns;
- pii.pool_ns = NULL;
- }
- parent_spec->image_id = pii.image_id;
- pii.image_id = NULL;
- parent_spec->snap_id = pii.snap_id;
-
- rbd_dev->parent_spec = parent_spec;
- parent_spec = NULL; /* rbd_dev now owns this */
+ parent_spec->pool_id = pii.pool_id;
+ if (pii.pool_ns && *pii.pool_ns) {
+ parent_spec->pool_ns = pii.pool_ns;
+ pii.pool_ns = NULL;
}
+ parent_spec->image_id = pii.image_id;
+ pii.image_id = NULL;
+ parent_spec->snap_id = pii.snap_id;
+
+ rbd_assert(!rbd_dev->parent_spec);
+ rbd_dev->parent_spec = parent_spec;
+ parent_spec = NULL; /* rbd_dev now owns this */
/*
- * We always update the parent overlap. If it's zero we issue
- * a warning, as we will proceed as if there was no parent.
+ * Record the parent overlap. If it's zero, issue a warning as
+ * we will proceed as if there is no parent.
*/
- if (!pii.overlap) {
- if (parent_spec) {
- /* refresh, careful to warn just once */
- if (rbd_dev->parent_overlap)
- rbd_warn(rbd_dev,
- "clone now standalone (overlap became 0)");
- } else {
- /* initial probe */
- rbd_warn(rbd_dev, "clone is standalone (overlap 0)");
- }
- }
+ if (!pii.overlap)
+ rbd_warn(rbd_dev, "clone is standalone (overlap 0)");
rbd_dev->parent_overlap = pii.overlap;
out:
ret = 0;
out_err:
- kfree(pii.pool_ns);
- kfree(pii.image_id);
+ rbd_parent_info_cleanup(&pii);
rbd_spec_put(parent_spec);
return ret;
}
@@ -6977,7 +6954,7 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth)
}
if (rbd_dev->header.features & RBD_FEATURE_LAYERING) {
- ret = rbd_dev_v2_parent_info(rbd_dev);
+ ret = rbd_dev_setup_parent(rbd_dev);
if (ret)
goto err_out_probe;
}
@@ -7026,9 +7003,47 @@ static void rbd_dev_update_header(struct rbd_device *rbd_dev,
}
}
+static void rbd_dev_update_parent(struct rbd_device *rbd_dev,
+ struct parent_image_info *pii)
+{
+ if (pii->pool_id == CEPH_NOPOOL || !pii->has_overlap) {
+ /*
+ * Either the parent never existed, or we have
+ * record of it but the image got flattened so it no
+ * longer has a parent. When the parent of a
+ * layered image disappears we immediately set the
+ * overlap to 0. The effect of this is that all new
+ * requests will be treated as if the image had no
+ * parent.
+ *
+ * If !pii.has_overlap, the parent image spec is not
+ * applicable. It's there to avoid duplication in each
+ * snapshot record.
+ */
+ if (rbd_dev->parent_overlap) {
+ rbd_dev->parent_overlap = 0;
+ rbd_dev_parent_put(rbd_dev);
+ pr_info("%s: clone has been flattened\n",
+ rbd_dev->disk->disk_name);
+ }
+ } else {
+ rbd_assert(rbd_dev->parent_spec);
+
+ /*
+ * Update the parent overlap. If it became zero, issue
+ * a warning as we will proceed as if there is no parent.
+ */
+ if (!pii->overlap && rbd_dev->parent_overlap)
+ rbd_warn(rbd_dev,
+ "clone has become standalone (overlap 0)");
+ rbd_dev->parent_overlap = pii->overlap;
+ }
+}
+
static int rbd_dev_refresh(struct rbd_device *rbd_dev)
{
struct rbd_image_header header = { 0 };
+ struct parent_image_info pii = { 0 };
u64 mapping_size;
int ret;
@@ -7044,12 +7059,14 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
* mapped image getting flattened.
*/
if (rbd_dev->parent) {
- ret = rbd_dev_v2_parent_info(rbd_dev);
+ ret = rbd_dev_v2_parent_info(rbd_dev, &pii);
if (ret)
goto out;
}
rbd_dev_update_header(rbd_dev, &header);
+ if (rbd_dev->parent)
+ rbd_dev_update_parent(rbd_dev, &pii);
rbd_assert(!rbd_is_snap(rbd_dev));
rbd_dev->mapping.size = rbd_dev->header.image_size;
@@ -7059,6 +7076,7 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
if (!ret && mapping_size != rbd_dev->mapping.size)
rbd_dev_update_size(rbd_dev);
+ rbd_parent_info_cleanup(&pii);
rbd_image_header_cleanup(&header);
return ret;
}
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x c10311776f0a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100447-slate-onward-bd1f@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c10311776f0a8ddea2276df96e255625b07045a8 Mon Sep 17 00:00:00 2001
From: Ilya Dryomov <idryomov(a)gmail.com>
Date: Wed, 20 Sep 2023 18:38:26 +0200
Subject: [PATCH] rbd: decouple parent info read-in from updating rbd_dev
Unlike header read-in, parent info read-in is already decoupled in
get_parent_info(), but it's buried in rbd_dev_v2_parent_info() along
with the processing logic.
Separate the initial read-in and update read-in logic into
rbd_dev_setup_parent() and rbd_dev_update_parent() respectively and
have rbd_dev_v2_parent_info() just populate struct parent_image_info
(i.e. what get_parent_info() did). Some existing QoI issues, like
flatten of a standalone clone being disregarded on refresh, remain.
Signed-off-by: Ilya Dryomov <idryomov(a)gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang(a)easystack.cn>
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6ed5520ef303..d62a0298c890 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5594,6 +5594,14 @@ struct parent_image_info {
u64 overlap;
};
+static void rbd_parent_info_cleanup(struct parent_image_info *pii)
+{
+ kfree(pii->pool_ns);
+ kfree(pii->image_id);
+
+ memset(pii, 0, sizeof(*pii));
+}
+
/*
* The caller is responsible for @pii.
*/
@@ -5663,6 +5671,9 @@ static int __get_parent_info(struct rbd_device *rbd_dev,
if (pii->has_overlap)
ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
+ dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
+ __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
+ pii->has_overlap, pii->overlap);
return 0;
e_inval:
@@ -5701,14 +5712,17 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
pii->has_overlap = true;
ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
+ dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
+ __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
+ pii->has_overlap, pii->overlap);
return 0;
e_inval:
return -EINVAL;
}
-static int get_parent_info(struct rbd_device *rbd_dev,
- struct parent_image_info *pii)
+static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev,
+ struct parent_image_info *pii)
{
struct page *req_page, *reply_page;
void *p;
@@ -5736,7 +5750,7 @@ static int get_parent_info(struct rbd_device *rbd_dev,
return ret;
}
-static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
+static int rbd_dev_setup_parent(struct rbd_device *rbd_dev)
{
struct rbd_spec *parent_spec;
struct parent_image_info pii = { 0 };
@@ -5746,37 +5760,12 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
if (!parent_spec)
return -ENOMEM;
- ret = get_parent_info(rbd_dev, &pii);
+ ret = rbd_dev_v2_parent_info(rbd_dev, &pii);
if (ret)
goto out_err;
- dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
- __func__, pii.pool_id, pii.pool_ns, pii.image_id, pii.snap_id,
- pii.has_overlap, pii.overlap);
-
- if (pii.pool_id == CEPH_NOPOOL || !pii.has_overlap) {
- /*
- * Either the parent never existed, or we have
- * record of it but the image got flattened so it no
- * longer has a parent. When the parent of a
- * layered image disappears we immediately set the
- * overlap to 0. The effect of this is that all new
- * requests will be treated as if the image had no
- * parent.
- *
- * If !pii.has_overlap, the parent image spec is not
- * applicable. It's there to avoid duplication in each
- * snapshot record.
- */
- if (rbd_dev->parent_overlap) {
- rbd_dev->parent_overlap = 0;
- rbd_dev_parent_put(rbd_dev);
- pr_info("%s: clone image has been flattened\n",
- rbd_dev->disk->disk_name);
- }
-
+ if (pii.pool_id == CEPH_NOPOOL || !pii.has_overlap)
goto out; /* No parent? No problem. */
- }
/* The ceph file layout needs to fit pool id in 32 bits */
@@ -5788,46 +5777,34 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
}
/*
- * The parent won't change (except when the clone is
- * flattened, already handled that). So we only need to
- * record the parent spec we have not already done so.
+ * The parent won't change except when the clone is flattened,
+ * so we only need to record the parent image spec once.
*/
- if (!rbd_dev->parent_spec) {
- parent_spec->pool_id = pii.pool_id;
- if (pii.pool_ns && *pii.pool_ns) {
- parent_spec->pool_ns = pii.pool_ns;
- pii.pool_ns = NULL;
- }
- parent_spec->image_id = pii.image_id;
- pii.image_id = NULL;
- parent_spec->snap_id = pii.snap_id;
-
- rbd_dev->parent_spec = parent_spec;
- parent_spec = NULL; /* rbd_dev now owns this */
+ parent_spec->pool_id = pii.pool_id;
+ if (pii.pool_ns && *pii.pool_ns) {
+ parent_spec->pool_ns = pii.pool_ns;
+ pii.pool_ns = NULL;
}
+ parent_spec->image_id = pii.image_id;
+ pii.image_id = NULL;
+ parent_spec->snap_id = pii.snap_id;
+
+ rbd_assert(!rbd_dev->parent_spec);
+ rbd_dev->parent_spec = parent_spec;
+ parent_spec = NULL; /* rbd_dev now owns this */
/*
- * We always update the parent overlap. If it's zero we issue
- * a warning, as we will proceed as if there was no parent.
+ * Record the parent overlap. If it's zero, issue a warning as
+ * we will proceed as if there is no parent.
*/
- if (!pii.overlap) {
- if (parent_spec) {
- /* refresh, careful to warn just once */
- if (rbd_dev->parent_overlap)
- rbd_warn(rbd_dev,
- "clone now standalone (overlap became 0)");
- } else {
- /* initial probe */
- rbd_warn(rbd_dev, "clone is standalone (overlap 0)");
- }
- }
+ if (!pii.overlap)
+ rbd_warn(rbd_dev, "clone is standalone (overlap 0)");
rbd_dev->parent_overlap = pii.overlap;
out:
ret = 0;
out_err:
- kfree(pii.pool_ns);
- kfree(pii.image_id);
+ rbd_parent_info_cleanup(&pii);
rbd_spec_put(parent_spec);
return ret;
}
@@ -6977,7 +6954,7 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth)
}
if (rbd_dev->header.features & RBD_FEATURE_LAYERING) {
- ret = rbd_dev_v2_parent_info(rbd_dev);
+ ret = rbd_dev_setup_parent(rbd_dev);
if (ret)
goto err_out_probe;
}
@@ -7026,9 +7003,47 @@ static void rbd_dev_update_header(struct rbd_device *rbd_dev,
}
}
+static void rbd_dev_update_parent(struct rbd_device *rbd_dev,
+ struct parent_image_info *pii)
+{
+ if (pii->pool_id == CEPH_NOPOOL || !pii->has_overlap) {
+ /*
+ * Either the parent never existed, or we have
+ * record of it but the image got flattened so it no
+ * longer has a parent. When the parent of a
+ * layered image disappears we immediately set the
+ * overlap to 0. The effect of this is that all new
+ * requests will be treated as if the image had no
+ * parent.
+ *
+ * If !pii.has_overlap, the parent image spec is not
+ * applicable. It's there to avoid duplication in each
+ * snapshot record.
+ */
+ if (rbd_dev->parent_overlap) {
+ rbd_dev->parent_overlap = 0;
+ rbd_dev_parent_put(rbd_dev);
+ pr_info("%s: clone has been flattened\n",
+ rbd_dev->disk->disk_name);
+ }
+ } else {
+ rbd_assert(rbd_dev->parent_spec);
+
+ /*
+ * Update the parent overlap. If it became zero, issue
+ * a warning as we will proceed as if there is no parent.
+ */
+ if (!pii->overlap && rbd_dev->parent_overlap)
+ rbd_warn(rbd_dev,
+ "clone has become standalone (overlap 0)");
+ rbd_dev->parent_overlap = pii->overlap;
+ }
+}
+
static int rbd_dev_refresh(struct rbd_device *rbd_dev)
{
struct rbd_image_header header = { 0 };
+ struct parent_image_info pii = { 0 };
u64 mapping_size;
int ret;
@@ -7044,12 +7059,14 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
* mapped image getting flattened.
*/
if (rbd_dev->parent) {
- ret = rbd_dev_v2_parent_info(rbd_dev);
+ ret = rbd_dev_v2_parent_info(rbd_dev, &pii);
if (ret)
goto out;
}
rbd_dev_update_header(rbd_dev, &header);
+ if (rbd_dev->parent)
+ rbd_dev_update_parent(rbd_dev, &pii);
rbd_assert(!rbd_is_snap(rbd_dev));
rbd_dev->mapping.size = rbd_dev->header.image_size;
@@ -7059,6 +7076,7 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
if (!ret && mapping_size != rbd_dev->mapping.size)
rbd_dev_update_size(rbd_dev);
+ rbd_parent_info_cleanup(&pii);
rbd_image_header_cleanup(&header);
return ret;
}
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 9af86694fd5d387992699ec99007ed374966ce9a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100456-previous-tarantula-2517@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9af86694fd5d387992699ec99007ed374966ce9a Mon Sep 17 00:00:00 2001
From: Bernd Schubert <bschubert(a)ddn.com>
Date: Wed, 6 Sep 2023 17:59:03 +0200
Subject: [PATCH] btrfs: file_remove_privs needs an exclusive lock in direct io
write
This was noticed by Miklos that file_remove_privs might call into
notify_change(), which requires to hold an exclusive lock. The problem
exists in FUSE and btrfs. We can fix it without any additional helpers
from VFS, in case the privileges would need to be dropped, change the
lock type to be exclusive and redo the loop.
Fixes: e9adabb9712e ("btrfs: use shared lock for direct writes within EOF")
CC: Miklos Szeredi <miklos(a)szeredi.hu>
CC: stable(a)vger.kernel.org # 5.15+
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Bernd Schubert <bschubert(a)ddn.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 6edad7b9a5d3..e8726a83b649 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1466,8 +1466,13 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (iocb->ki_flags & IOCB_NOWAIT)
ilock_flags |= BTRFS_ILOCK_TRY;
- /* If the write DIO is within EOF, use a shared lock */
- if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode))
+ /*
+ * If the write DIO is within EOF, use a shared lock and also only if
+ * security bits will likely not be dropped by file_remove_privs() called
+ * from btrfs_write_check(). Either will need to be rechecked after the
+ * lock was acquired.
+ */
+ if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
ilock_flags |= BTRFS_ILOCK_SHARED;
relock:
@@ -1475,6 +1480,13 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (err < 0)
return err;
+ /* Shared lock cannot be used with security bits set. */
+ if ((ilock_flags & BTRFS_ILOCK_SHARED) && !IS_NOSEC(inode)) {
+ btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
+ ilock_flags &= ~BTRFS_ILOCK_SHARED;
+ goto relock;
+ }
+
err = generic_write_checks(iocb, from);
if (err <= 0) {
btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 9af86694fd5d387992699ec99007ed374966ce9a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100400-appraisal-idealize-3243@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9af86694fd5d387992699ec99007ed374966ce9a Mon Sep 17 00:00:00 2001
From: Bernd Schubert <bschubert(a)ddn.com>
Date: Wed, 6 Sep 2023 17:59:03 +0200
Subject: [PATCH] btrfs: file_remove_privs needs an exclusive lock in direct io
write
This was noticed by Miklos that file_remove_privs might call into
notify_change(), which requires to hold an exclusive lock. The problem
exists in FUSE and btrfs. We can fix it without any additional helpers
from VFS, in case the privileges would need to be dropped, change the
lock type to be exclusive and redo the loop.
Fixes: e9adabb9712e ("btrfs: use shared lock for direct writes within EOF")
CC: Miklos Szeredi <miklos(a)szeredi.hu>
CC: stable(a)vger.kernel.org # 5.15+
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Bernd Schubert <bschubert(a)ddn.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 6edad7b9a5d3..e8726a83b649 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1466,8 +1466,13 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (iocb->ki_flags & IOCB_NOWAIT)
ilock_flags |= BTRFS_ILOCK_TRY;
- /* If the write DIO is within EOF, use a shared lock */
- if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode))
+ /*
+ * If the write DIO is within EOF, use a shared lock and also only if
+ * security bits will likely not be dropped by file_remove_privs() called
+ * from btrfs_write_check(). Either will need to be rechecked after the
+ * lock was acquired.
+ */
+ if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
ilock_flags |= BTRFS_ILOCK_SHARED;
relock:
@@ -1475,6 +1480,13 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (err < 0)
return err;
+ /* Shared lock cannot be used with security bits set. */
+ if ((ilock_flags & BTRFS_ILOCK_SHARED) && !IS_NOSEC(inode)) {
+ btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
+ ilock_flags &= ~BTRFS_ILOCK_SHARED;
+ goto relock;
+ }
+
err = generic_write_checks(iocb, from);
if (err <= 0) {
btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
The patch below does not apply to the 6.5-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.5.y
git checkout FETCH_HEAD
git cherry-pick -x 479965a2b7ec481737df0cadf553331063b9c343
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100423-autistic-impaired-361f@gregkh' --subject-prefix 'PATCH 6.5.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 479965a2b7ec481737df0cadf553331063b9c343 Mon Sep 17 00:00:00 2001
From: Kristina Martsenko <kristina.martsenko(a)arm.com>
Date: Tue, 12 Sep 2023 14:34:29 +0100
Subject: [PATCH] arm64: cpufeature: Fix CLRBHB and BC detection
ClearBHB support is indicated by the CLRBHB field in ID_AA64ISAR2_EL1.
Following some refactoring the kernel incorrectly checks the BC field
instead. Fix the detection to use the right field.
(Note: The original ClearBHB support had it as FTR_HIGHER_SAFE, but this
patch uses FTR_LOWER_SAFE, which seems more correct.)
Also fix the detection of BC (hinted conditional branches) to use
FTR_LOWER_SAFE, so that it is not reported on mismatched systems.
Fixes: 356137e68a9f ("arm64/sysreg: Make BHB clear feature defines match the architecture")
Fixes: 8fcc8285c0e3 ("arm64/sysreg: Convert ID_AA64ISAR2_EL1 to automatic generation")
Cc: stable(a)vger.kernel.org
Signed-off-by: Kristina Martsenko <kristina.martsenko(a)arm.com>
Reviewed-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20230912133429.2606875-1-kristina.martsenko@arm.c…
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 96e50227f940..5bba39376055 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -663,7 +663,7 @@ static inline bool supports_clearbhb(int scope)
isar2 = read_sanitised_ftr_reg(SYS_ID_AA64ISAR2_EL1);
return cpuid_feature_extract_unsigned_field(isar2,
- ID_AA64ISAR2_EL1_BC_SHIFT);
+ ID_AA64ISAR2_EL1_CLRBHB_SHIFT);
}
const struct cpumask *system_32bit_el0_cpumask(void);
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index b018ae12ff5f..444a73c2e638 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -222,7 +222,8 @@ static const struct arm64_ftr_bits ftr_id_aa64isar1[] = {
static const struct arm64_ftr_bits ftr_id_aa64isar2[] = {
ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR2_EL1_CSSC_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR2_EL1_RPRFM_SHIFT, 4, 0),
- ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_HIGHER_SAFE, ID_AA64ISAR2_EL1_BC_SHIFT, 4, 0),
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64ISAR2_EL1_CLRBHB_SHIFT, 4, 0),
+ ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64ISAR2_EL1_BC_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64ISAR2_EL1_MOPS_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_VISIBLE_IF_IS_ENABLED(CONFIG_ARM64_PTR_AUTH),
FTR_STRICT, FTR_EXACT, ID_AA64ISAR2_EL1_APA3_SHIFT, 4, 0),
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 2517ef7c21cf..76ce150e7347 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1347,7 +1347,11 @@ UnsignedEnum 51:48 RPRFM
0b0000 NI
0b0001 IMP
EndEnum
-Res0 47:28
+Res0 47:32
+UnsignedEnum 31:28 CLRBHB
+ 0b0000 NI
+ 0b0001 IMP
+EndEnum
UnsignedEnum 27:24 PAC_frac
0b0000 NI
0b0001 IMP
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x b7599d241778d0b10cdf7a5c755aa7db9b83250c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100431-penalty-pegboard-c849@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b7599d241778d0b10cdf7a5c755aa7db9b83250c Mon Sep 17 00:00:00 2001
From: Javier Pello <devel(a)otheo.eu>
Date: Sat, 2 Sep 2023 17:10:39 +0200
Subject: [PATCH] drm/i915/gt: Fix reservation address in ggtt_reserve_guc_top
There is an assertion in ggtt_reserve_guc_top that the global GTT
is of size at least GUC_GGTT_TOP, which is not the case on a 32-bit
platform; see commit 562d55d991b39ce376c492df2f7890fd6a541ffc
("drm/i915/bdw: Only use 2g GGTT for 32b platforms"). If GEM_BUG_ON
is enabled, this triggers a BUG(); if GEM_BUG_ON is disabled, the
subsequent reservation fails and the driver fails to initialise
the device:
i915 0000:00:02.0: [drm:i915_init_ggtt [i915]] Failed to reserve top of GGTT for GuC
i915 0000:00:02.0: Device initialization failed (-28)
i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
i915: probe of 0000:00:02.0 failed with error -28
Make the reservation at the top of the available space, whatever
that is, instead of assuming that the top will be GUC_GGTT_TOP.
Fixes: 911800765ef6 ("drm/i915/uc: Reserve upper range of GGTT")
Link: https://gitlab.freedesktop.org/drm/intel/-/issues/9080
Signed-off-by: Javier Pello <devel(a)otheo.eu>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio(a)intel.com>
Cc: Fernando Pacheco <fernando.pacheco(a)intel.com>
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)linux.intel.com>
Cc: intel-gfx(a)lists.freedesktop.org
Cc: stable(a)vger.kernel.org # v5.3+
Signed-off-by: John Harrison <John.C.Harrison(a)Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230902171039.2229126186d697…
(cherry picked from commit 0f3fa942d91165c2702577e9274d2ee1c7212afc)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index dd0ed941441a..da21f2786b5d 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -511,20 +511,31 @@ void intel_ggtt_unbind_vma(struct i915_address_space *vm,
vm->clear_range(vm, vma_res->start, vma_res->vma_size);
}
+/*
+ * Reserve the top of the GuC address space for firmware images. Addresses
+ * beyond GUC_GGTT_TOP in the GuC address space are inaccessible by GuC,
+ * which makes for a suitable range to hold GuC/HuC firmware images if the
+ * size of the GGTT is 4G. However, on a 32-bit platform the size of the GGTT
+ * is limited to 2G, which is less than GUC_GGTT_TOP, but we reserve a chunk
+ * of the same size anyway, which is far more than needed, to keep the logic
+ * in uc_fw_ggtt_offset() simple.
+ */
+#define GUC_TOP_RESERVE_SIZE (SZ_4G - GUC_GGTT_TOP)
+
static int ggtt_reserve_guc_top(struct i915_ggtt *ggtt)
{
- u64 size;
+ u64 offset;
int ret;
if (!intel_uc_uses_guc(&ggtt->vm.gt->uc))
return 0;
- GEM_BUG_ON(ggtt->vm.total <= GUC_GGTT_TOP);
- size = ggtt->vm.total - GUC_GGTT_TOP;
+ GEM_BUG_ON(ggtt->vm.total <= GUC_TOP_RESERVE_SIZE);
+ offset = ggtt->vm.total - GUC_TOP_RESERVE_SIZE;
- ret = i915_gem_gtt_reserve(&ggtt->vm, NULL, &ggtt->uc_fw, size,
- GUC_GGTT_TOP, I915_COLOR_UNEVICTABLE,
- PIN_NOEVICT);
+ ret = i915_gem_gtt_reserve(&ggtt->vm, NULL, &ggtt->uc_fw,
+ GUC_TOP_RESERVE_SIZE, offset,
+ I915_COLOR_UNEVICTABLE, PIN_NOEVICT);
if (ret)
drm_dbg(&ggtt->vm.i915->drm,
"Failed to reserve top of GGTT for GuC\n");
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x b7599d241778d0b10cdf7a5c755aa7db9b83250c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100427-familiar-tabasco-df57@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b7599d241778d0b10cdf7a5c755aa7db9b83250c Mon Sep 17 00:00:00 2001
From: Javier Pello <devel(a)otheo.eu>
Date: Sat, 2 Sep 2023 17:10:39 +0200
Subject: [PATCH] drm/i915/gt: Fix reservation address in ggtt_reserve_guc_top
There is an assertion in ggtt_reserve_guc_top that the global GTT
is of size at least GUC_GGTT_TOP, which is not the case on a 32-bit
platform; see commit 562d55d991b39ce376c492df2f7890fd6a541ffc
("drm/i915/bdw: Only use 2g GGTT for 32b platforms"). If GEM_BUG_ON
is enabled, this triggers a BUG(); if GEM_BUG_ON is disabled, the
subsequent reservation fails and the driver fails to initialise
the device:
i915 0000:00:02.0: [drm:i915_init_ggtt [i915]] Failed to reserve top of GGTT for GuC
i915 0000:00:02.0: Device initialization failed (-28)
i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
i915: probe of 0000:00:02.0 failed with error -28
Make the reservation at the top of the available space, whatever
that is, instead of assuming that the top will be GUC_GGTT_TOP.
Fixes: 911800765ef6 ("drm/i915/uc: Reserve upper range of GGTT")
Link: https://gitlab.freedesktop.org/drm/intel/-/issues/9080
Signed-off-by: Javier Pello <devel(a)otheo.eu>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio(a)intel.com>
Cc: Fernando Pacheco <fernando.pacheco(a)intel.com>
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)linux.intel.com>
Cc: intel-gfx(a)lists.freedesktop.org
Cc: stable(a)vger.kernel.org # v5.3+
Signed-off-by: John Harrison <John.C.Harrison(a)Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230902171039.2229126186d697…
(cherry picked from commit 0f3fa942d91165c2702577e9274d2ee1c7212afc)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index dd0ed941441a..da21f2786b5d 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -511,20 +511,31 @@ void intel_ggtt_unbind_vma(struct i915_address_space *vm,
vm->clear_range(vm, vma_res->start, vma_res->vma_size);
}
+/*
+ * Reserve the top of the GuC address space for firmware images. Addresses
+ * beyond GUC_GGTT_TOP in the GuC address space are inaccessible by GuC,
+ * which makes for a suitable range to hold GuC/HuC firmware images if the
+ * size of the GGTT is 4G. However, on a 32-bit platform the size of the GGTT
+ * is limited to 2G, which is less than GUC_GGTT_TOP, but we reserve a chunk
+ * of the same size anyway, which is far more than needed, to keep the logic
+ * in uc_fw_ggtt_offset() simple.
+ */
+#define GUC_TOP_RESERVE_SIZE (SZ_4G - GUC_GGTT_TOP)
+
static int ggtt_reserve_guc_top(struct i915_ggtt *ggtt)
{
- u64 size;
+ u64 offset;
int ret;
if (!intel_uc_uses_guc(&ggtt->vm.gt->uc))
return 0;
- GEM_BUG_ON(ggtt->vm.total <= GUC_GGTT_TOP);
- size = ggtt->vm.total - GUC_GGTT_TOP;
+ GEM_BUG_ON(ggtt->vm.total <= GUC_TOP_RESERVE_SIZE);
+ offset = ggtt->vm.total - GUC_TOP_RESERVE_SIZE;
- ret = i915_gem_gtt_reserve(&ggtt->vm, NULL, &ggtt->uc_fw, size,
- GUC_GGTT_TOP, I915_COLOR_UNEVICTABLE,
- PIN_NOEVICT);
+ ret = i915_gem_gtt_reserve(&ggtt->vm, NULL, &ggtt->uc_fw,
+ GUC_TOP_RESERVE_SIZE, offset,
+ I915_COLOR_UNEVICTABLE, PIN_NOEVICT);
if (ret)
drm_dbg(&ggtt->vm.i915->drm,
"Failed to reserve top of GGTT for GuC\n");
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x b7599d241778d0b10cdf7a5c755aa7db9b83250c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100426-vocally-babbling-a2bc@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b7599d241778d0b10cdf7a5c755aa7db9b83250c Mon Sep 17 00:00:00 2001
From: Javier Pello <devel(a)otheo.eu>
Date: Sat, 2 Sep 2023 17:10:39 +0200
Subject: [PATCH] drm/i915/gt: Fix reservation address in ggtt_reserve_guc_top
There is an assertion in ggtt_reserve_guc_top that the global GTT
is of size at least GUC_GGTT_TOP, which is not the case on a 32-bit
platform; see commit 562d55d991b39ce376c492df2f7890fd6a541ffc
("drm/i915/bdw: Only use 2g GGTT for 32b platforms"). If GEM_BUG_ON
is enabled, this triggers a BUG(); if GEM_BUG_ON is disabled, the
subsequent reservation fails and the driver fails to initialise
the device:
i915 0000:00:02.0: [drm:i915_init_ggtt [i915]] Failed to reserve top of GGTT for GuC
i915 0000:00:02.0: Device initialization failed (-28)
i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
i915: probe of 0000:00:02.0 failed with error -28
Make the reservation at the top of the available space, whatever
that is, instead of assuming that the top will be GUC_GGTT_TOP.
Fixes: 911800765ef6 ("drm/i915/uc: Reserve upper range of GGTT")
Link: https://gitlab.freedesktop.org/drm/intel/-/issues/9080
Signed-off-by: Javier Pello <devel(a)otheo.eu>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio(a)intel.com>
Cc: Fernando Pacheco <fernando.pacheco(a)intel.com>
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)linux.intel.com>
Cc: intel-gfx(a)lists.freedesktop.org
Cc: stable(a)vger.kernel.org # v5.3+
Signed-off-by: John Harrison <John.C.Harrison(a)Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230902171039.2229126186d697…
(cherry picked from commit 0f3fa942d91165c2702577e9274d2ee1c7212afc)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index dd0ed941441a..da21f2786b5d 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -511,20 +511,31 @@ void intel_ggtt_unbind_vma(struct i915_address_space *vm,
vm->clear_range(vm, vma_res->start, vma_res->vma_size);
}
+/*
+ * Reserve the top of the GuC address space for firmware images. Addresses
+ * beyond GUC_GGTT_TOP in the GuC address space are inaccessible by GuC,
+ * which makes for a suitable range to hold GuC/HuC firmware images if the
+ * size of the GGTT is 4G. However, on a 32-bit platform the size of the GGTT
+ * is limited to 2G, which is less than GUC_GGTT_TOP, but we reserve a chunk
+ * of the same size anyway, which is far more than needed, to keep the logic
+ * in uc_fw_ggtt_offset() simple.
+ */
+#define GUC_TOP_RESERVE_SIZE (SZ_4G - GUC_GGTT_TOP)
+
static int ggtt_reserve_guc_top(struct i915_ggtt *ggtt)
{
- u64 size;
+ u64 offset;
int ret;
if (!intel_uc_uses_guc(&ggtt->vm.gt->uc))
return 0;
- GEM_BUG_ON(ggtt->vm.total <= GUC_GGTT_TOP);
- size = ggtt->vm.total - GUC_GGTT_TOP;
+ GEM_BUG_ON(ggtt->vm.total <= GUC_TOP_RESERVE_SIZE);
+ offset = ggtt->vm.total - GUC_TOP_RESERVE_SIZE;
- ret = i915_gem_gtt_reserve(&ggtt->vm, NULL, &ggtt->uc_fw, size,
- GUC_GGTT_TOP, I915_COLOR_UNEVICTABLE,
- PIN_NOEVICT);
+ ret = i915_gem_gtt_reserve(&ggtt->vm, NULL, &ggtt->uc_fw,
+ GUC_TOP_RESERVE_SIZE, offset,
+ I915_COLOR_UNEVICTABLE, PIN_NOEVICT);
if (ret)
drm_dbg(&ggtt->vm.i915->drm,
"Failed to reserve top of GGTT for GuC\n");
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.14.y
git checkout FETCH_HEAD
git cherry-pick -x 75e2bd5f1ede42a2bc88aa34b431e1ace8e0bea0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100441-finalist-fox-8bb6@gregkh' --subject-prefix 'PATCH 4.14.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 75e2bd5f1ede42a2bc88aa34b431e1ace8e0bea0 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <dlemoal(a)kernel.org>
Date: Fri, 8 Sep 2023 20:04:52 +0900
Subject: [PATCH] ata: libata-core: Do not register PM operations for SAS ports
libsas does its own domain based power management of ports. For such
ports, libata should not use a device type defining power management
operations as executing these operations for suspend/resume in addition
to libsas calls to ata_sas_port_suspend() and ata_sas_port_resume() is
not necessary (and likely dangerous to do, even though problems are not
seen currently).
Introduce the new ata_port_sas_type device_type for ports managed by
libsas. This new device type is used in ata_tport_add() and is defined
without power management operations.
Fixes: 2fcbdcb4c802 ("[SCSI] libata: export ata_port suspend/resume infrastructure for sas")
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal(a)kernel.org>
Reviewed-by: Hannes Reinecke <hare(a)suse.de>
Tested-by: Chia-Lin Kao (AceLan) <acelan.kao(a)canonical.com>
Tested-by: Geert Uytterhoeven <geert+renesas(a)glider.be>
Reviewed-by: John Garry <john.g.garry(a)oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 092372334e92..261445c1851b 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5335,7 +5335,7 @@ EXPORT_SYMBOL_GPL(ata_host_resume);
#endif
const struct device_type ata_port_type = {
- .name = "ata_port",
+ .name = ATA_PORT_TYPE_NAME,
#ifdef CONFIG_PM
.pm = &ata_port_pm_ops,
#endif
diff --git a/drivers/ata/libata-transport.c b/drivers/ata/libata-transport.c
index e4fb9d1b9b39..3e49a877500e 100644
--- a/drivers/ata/libata-transport.c
+++ b/drivers/ata/libata-transport.c
@@ -266,6 +266,10 @@ void ata_tport_delete(struct ata_port *ap)
put_device(dev);
}
+static const struct device_type ata_port_sas_type = {
+ .name = ATA_PORT_TYPE_NAME,
+};
+
/** ata_tport_add - initialize a transport ATA port structure
*
* @parent: parent device
@@ -283,7 +287,10 @@ int ata_tport_add(struct device *parent,
struct device *dev = &ap->tdev;
device_initialize(dev);
- dev->type = &ata_port_type;
+ if (ap->flags & ATA_FLAG_SAS_HOST)
+ dev->type = &ata_port_sas_type;
+ else
+ dev->type = &ata_port_type;
dev->parent = parent;
ata_host_get(ap->host);
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index 820299bd9d06..05ac80da8ebc 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -30,6 +30,8 @@ enum {
ATA_DNXFER_QUIET = (1 << 31),
};
+#define ATA_PORT_TYPE_NAME "ata_port"
+
extern atomic_t ata_print_id;
extern int atapi_passthru16;
extern int libata_fua;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x b595d25996329427b2c09d4b90395a165fb3ef8e
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100402-reprint-snugness-793f@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b595d25996329427b2c09d4b90395a165fb3ef8e Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Fri, 8 Sep 2023 15:31:39 -0400
Subject: [PATCH] btrfs: don't clear uptodate on write errors
We have been consistently seeing hangs with generic/648 in our subpage
GitHub CI setup. This is a classic deadlock, we are calling
btrfs_read_folio() on a folio, which requires holding the folio lock on
the folio, and then finding a ordered extent that overlaps that range
and calling btrfs_start_ordered_extent(), which then tries to write out
the dirty page, which requires taking the folio lock and then we
deadlock.
The hang happens because we're writing to range [1271750656, 1271767040),
page index [77621, 77622], and page 77621 is !Uptodate. It is also Dirty,
so we call btrfs_read_folio() for 77621 and which does
btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered
extent which is [1271644160, 1271746560), page index [77615, 77621].
The page indexes overlap, but the actual bytes don't overlap. We're
holding the page lock for 77621, then call
btrfs_lock_and_flush_ordered_range() which tries to flush the dirty
page, and tries to lock 77621 again and then we deadlock.
The byte ranges do not overlap, but with subpage support if we clear
uptodate on any portion of the page we mark the entire thing as not
uptodate.
We have been clearing page uptodate on write errors, but no other file
system does this, and is in fact incorrect. This doesn't hurt us in the
!subpage case because we can't end up with overlapped ranges that don't
also overlap on the page.
Fix this by not clearing uptodate when we have a write error. The only
thing we should be doing in this case is setting the mapping error and
carrying on. This makes it so we would no longer call
btrfs_read_folio() on the page as it's uptodate and eliminates the
deadlock.
With this patch we're now able to make it through a full fstests run on
our subpage blocksize VMs.
Note for stable backports: this probably goes beyond 6.1 but the code
has been cleaned up and clearing the uptodate bit must be verified on
each version independently.
CC: stable(a)vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ac3fca5a5e41..6954ae763b86 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -484,10 +484,8 @@ static void end_bio_extent_writepage(struct btrfs_bio *bbio)
bvec->bv_offset, bvec->bv_len);
btrfs_finish_ordered_extent(bbio->ordered, page, start, len, !error);
- if (error) {
- btrfs_page_clear_uptodate(fs_info, page, start, len);
+ if (error)
mapping_set_error(page->mapping, error);
- }
btrfs_page_clear_writeback(fs_info, page, start, len);
}
@@ -1456,8 +1454,6 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
if (ret) {
btrfs_mark_ordered_io_finished(BTRFS_I(inode), page, page_start,
PAGE_SIZE, !ret);
- btrfs_page_clear_uptodate(btrfs_sb(inode->i_sb), page,
- page_start, PAGE_SIZE);
mapping_set_error(page->mapping, ret);
}
unlock_page(page);
@@ -1624,8 +1620,6 @@ static void extent_buffer_write_end_io(struct btrfs_bio *bbio)
struct page *page = bvec->bv_page;
u32 len = bvec->bv_len;
- if (!uptodate)
- btrfs_page_clear_uptodate(fs_info, page, start, len);
btrfs_page_clear_writeback(fs_info, page, start, len);
bio_offset += len;
}
@@ -2201,7 +2195,6 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
if (ret) {
btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,
cur, cur_len, !ret);
- btrfs_page_clear_uptodate(fs_info, page, cur, cur_len);
mapping_set_error(page->mapping, ret);
}
btrfs_page_unlock_writer(fs_info, page, cur, cur_len);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e211e88c6545..616fdcf40467 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1085,9 +1085,6 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
btrfs_mark_ordered_io_finished(inode, locked_page,
page_start, PAGE_SIZE,
!ret);
- btrfs_page_clear_uptodate(inode->root->fs_info,
- locked_page, page_start,
- PAGE_SIZE);
mapping_set_error(locked_page->mapping, ret);
unlock_page(locked_page);
}
@@ -2791,7 +2788,6 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
mapping_set_error(page->mapping, ret);
btrfs_mark_ordered_io_finished(inode, page, page_start,
PAGE_SIZE, !ret);
- btrfs_page_clear_uptodate(fs_info, page, page_start, PAGE_SIZE);
clear_page_dirty_for_io(page);
}
btrfs_page_clear_checked(fs_info, page, page_start, PAGE_SIZE);
The patch below does not apply to the 6.5-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.5.y
git checkout FETCH_HEAD
git cherry-pick -x b595d25996329427b2c09d4b90395a165fb3ef8e
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100400-garbage-caboose-c859@gregkh' --subject-prefix 'PATCH 6.5.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b595d25996329427b2c09d4b90395a165fb3ef8e Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Fri, 8 Sep 2023 15:31:39 -0400
Subject: [PATCH] btrfs: don't clear uptodate on write errors
We have been consistently seeing hangs with generic/648 in our subpage
GitHub CI setup. This is a classic deadlock, we are calling
btrfs_read_folio() on a folio, which requires holding the folio lock on
the folio, and then finding a ordered extent that overlaps that range
and calling btrfs_start_ordered_extent(), which then tries to write out
the dirty page, which requires taking the folio lock and then we
deadlock.
The hang happens because we're writing to range [1271750656, 1271767040),
page index [77621, 77622], and page 77621 is !Uptodate. It is also Dirty,
so we call btrfs_read_folio() for 77621 and which does
btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered
extent which is [1271644160, 1271746560), page index [77615, 77621].
The page indexes overlap, but the actual bytes don't overlap. We're
holding the page lock for 77621, then call
btrfs_lock_and_flush_ordered_range() which tries to flush the dirty
page, and tries to lock 77621 again and then we deadlock.
The byte ranges do not overlap, but with subpage support if we clear
uptodate on any portion of the page we mark the entire thing as not
uptodate.
We have been clearing page uptodate on write errors, but no other file
system does this, and is in fact incorrect. This doesn't hurt us in the
!subpage case because we can't end up with overlapped ranges that don't
also overlap on the page.
Fix this by not clearing uptodate when we have a write error. The only
thing we should be doing in this case is setting the mapping error and
carrying on. This makes it so we would no longer call
btrfs_read_folio() on the page as it's uptodate and eliminates the
deadlock.
With this patch we're now able to make it through a full fstests run on
our subpage blocksize VMs.
Note for stable backports: this probably goes beyond 6.1 but the code
has been cleaned up and clearing the uptodate bit must be verified on
each version independently.
CC: stable(a)vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ac3fca5a5e41..6954ae763b86 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -484,10 +484,8 @@ static void end_bio_extent_writepage(struct btrfs_bio *bbio)
bvec->bv_offset, bvec->bv_len);
btrfs_finish_ordered_extent(bbio->ordered, page, start, len, !error);
- if (error) {
- btrfs_page_clear_uptodate(fs_info, page, start, len);
+ if (error)
mapping_set_error(page->mapping, error);
- }
btrfs_page_clear_writeback(fs_info, page, start, len);
}
@@ -1456,8 +1454,6 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
if (ret) {
btrfs_mark_ordered_io_finished(BTRFS_I(inode), page, page_start,
PAGE_SIZE, !ret);
- btrfs_page_clear_uptodate(btrfs_sb(inode->i_sb), page,
- page_start, PAGE_SIZE);
mapping_set_error(page->mapping, ret);
}
unlock_page(page);
@@ -1624,8 +1620,6 @@ static void extent_buffer_write_end_io(struct btrfs_bio *bbio)
struct page *page = bvec->bv_page;
u32 len = bvec->bv_len;
- if (!uptodate)
- btrfs_page_clear_uptodate(fs_info, page, start, len);
btrfs_page_clear_writeback(fs_info, page, start, len);
bio_offset += len;
}
@@ -2201,7 +2195,6 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
if (ret) {
btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,
cur, cur_len, !ret);
- btrfs_page_clear_uptodate(fs_info, page, cur, cur_len);
mapping_set_error(page->mapping, ret);
}
btrfs_page_unlock_writer(fs_info, page, cur, cur_len);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e211e88c6545..616fdcf40467 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1085,9 +1085,6 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
btrfs_mark_ordered_io_finished(inode, locked_page,
page_start, PAGE_SIZE,
!ret);
- btrfs_page_clear_uptodate(inode->root->fs_info,
- locked_page, page_start,
- PAGE_SIZE);
mapping_set_error(locked_page->mapping, ret);
unlock_page(locked_page);
}
@@ -2791,7 +2788,6 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
mapping_set_error(page->mapping, ret);
btrfs_mark_ordered_io_finished(inode, page, page_start,
PAGE_SIZE, !ret);
- btrfs_page_clear_uptodate(fs_info, page, page_start, PAGE_SIZE);
clear_page_dirty_for_io(page);
}
btrfs_page_clear_checked(fs_info, page, page_start, PAGE_SIZE);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.14.y
git checkout FETCH_HEAD
git cherry-pick -x 45d99ea451d0c30bfd4864f0fe485d7dac014902
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100440-armory-unbalance-bd33@gregkh' --subject-prefix 'PATCH 4.14.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45d99ea451d0c30bfd4864f0fe485d7dac014902 Mon Sep 17 00:00:00 2001
From: Zheng Yejian <zhengyejian1(a)huawei.com>
Date: Thu, 21 Sep 2023 20:54:25 +0800
Subject: [PATCH] ring-buffer: Fix bytes info in per_cpu buffer stats
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
bytes in cpu buffer that have not been consumed. However, currently
after consuming data by reading file 'trace_pipe', the 'bytes' info
was not changed as expected.
# cat per_cpu/cpu0/stats
entries: 0
overrun: 0
commit overrun: 0
bytes: 568 <--- 'bytes' is problematical !!!
oldest event ts: 8651.371479
now ts: 8653.912224
dropped events: 0
read events: 8
The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it:
1. When stat 'read_bytes', account consumed event in rb_advance_reader();
2. When stat 'entries_bytes', exclude the discarded padding event which
is smaller than minimum size because it is invisible to reader. Then
use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for
page-based read/remove/overrun.
Also correct the comments of ring_buffer_bytes_cpu() in this patch.
Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengye…
Cc: stable(a)vger.kernel.org
Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events")
Signed-off-by: Zheng Yejian <zhengyejian1(a)huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a1651edc48d5..28daf0ce95c5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -354,6 +354,11 @@ static void rb_init_page(struct buffer_data_page *bpage)
local_set(&bpage->commit, 0);
}
+static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
+{
+ return local_read(&bpage->page->commit);
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
free_page((unsigned long)bpage->page);
@@ -2003,7 +2008,7 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned long nr_pages)
* Increment overrun to account for the lost events.
*/
local_add(page_entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(to_remove_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
}
@@ -2367,11 +2372,6 @@ rb_reader_event(struct ring_buffer_per_cpu *cpu_buffer)
cpu_buffer->reader_page->read);
}
-static __always_inline unsigned rb_page_commit(struct buffer_page *bpage)
-{
- return local_read(&bpage->page->commit);
-}
-
static struct ring_buffer_event *
rb_iter_head_event(struct ring_buffer_iter *iter)
{
@@ -2517,7 +2517,7 @@ rb_handle_head_page(struct ring_buffer_per_cpu *cpu_buffer,
* the counters.
*/
local_add(entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(next_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
/*
@@ -2660,9 +2660,6 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
event = __rb_page_index(tail_page, tail);
- /* account for padding bytes */
- local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
-
/*
* Save the original length to the meta data.
* This will be used by the reader to add lost event
@@ -2676,7 +2673,8 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
* write counter enough to allow another writer to slip
* in on this page.
* We put in a discarded commit instead, to make sure
- * that this space is not used again.
+ * that this space is not used again, and this space will
+ * not be accounted into 'entries_bytes'.
*
* If we are less than the minimum size, we don't need to
* worry about it.
@@ -2701,6 +2699,9 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
/* time delta must be non zero */
event->time_delta = 1;
+ /* account for padding bytes */
+ local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
+
/* Make sure the padding is visible before the tail_page->write update */
smp_wmb();
@@ -4215,7 +4216,7 @@ u64 ring_buffer_oldest_event_ts(struct trace_buffer *buffer, int cpu)
EXPORT_SYMBOL_GPL(ring_buffer_oldest_event_ts);
/**
- * ring_buffer_bytes_cpu - get the number of bytes consumed in a cpu buffer
+ * ring_buffer_bytes_cpu - get the number of bytes unconsumed in a cpu buffer
* @buffer: The ring buffer
* @cpu: The per CPU buffer to read from.
*/
@@ -4723,6 +4724,7 @@ static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
length = rb_event_length(event);
cpu_buffer->reader_page->read += length;
+ cpu_buffer->read_bytes += length;
}
static void rb_advance_iter(struct ring_buffer_iter *iter)
@@ -5816,7 +5818,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
} else {
/* update the entry counter */
cpu_buffer->read += rb_page_entries(reader);
- cpu_buffer->read_bytes += BUF_PAGE_SIZE;
+ cpu_buffer->read_bytes += rb_page_commit(reader);
/* swap the pages */
rb_init_page(bpage);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 45d99ea451d0c30bfd4864f0fe485d7dac014902
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100438-palpable-marina-68c8@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45d99ea451d0c30bfd4864f0fe485d7dac014902 Mon Sep 17 00:00:00 2001
From: Zheng Yejian <zhengyejian1(a)huawei.com>
Date: Thu, 21 Sep 2023 20:54:25 +0800
Subject: [PATCH] ring-buffer: Fix bytes info in per_cpu buffer stats
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
bytes in cpu buffer that have not been consumed. However, currently
after consuming data by reading file 'trace_pipe', the 'bytes' info
was not changed as expected.
# cat per_cpu/cpu0/stats
entries: 0
overrun: 0
commit overrun: 0
bytes: 568 <--- 'bytes' is problematical !!!
oldest event ts: 8651.371479
now ts: 8653.912224
dropped events: 0
read events: 8
The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it:
1. When stat 'read_bytes', account consumed event in rb_advance_reader();
2. When stat 'entries_bytes', exclude the discarded padding event which
is smaller than minimum size because it is invisible to reader. Then
use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for
page-based read/remove/overrun.
Also correct the comments of ring_buffer_bytes_cpu() in this patch.
Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengye…
Cc: stable(a)vger.kernel.org
Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events")
Signed-off-by: Zheng Yejian <zhengyejian1(a)huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a1651edc48d5..28daf0ce95c5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -354,6 +354,11 @@ static void rb_init_page(struct buffer_data_page *bpage)
local_set(&bpage->commit, 0);
}
+static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
+{
+ return local_read(&bpage->page->commit);
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
free_page((unsigned long)bpage->page);
@@ -2003,7 +2008,7 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned long nr_pages)
* Increment overrun to account for the lost events.
*/
local_add(page_entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(to_remove_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
}
@@ -2367,11 +2372,6 @@ rb_reader_event(struct ring_buffer_per_cpu *cpu_buffer)
cpu_buffer->reader_page->read);
}
-static __always_inline unsigned rb_page_commit(struct buffer_page *bpage)
-{
- return local_read(&bpage->page->commit);
-}
-
static struct ring_buffer_event *
rb_iter_head_event(struct ring_buffer_iter *iter)
{
@@ -2517,7 +2517,7 @@ rb_handle_head_page(struct ring_buffer_per_cpu *cpu_buffer,
* the counters.
*/
local_add(entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(next_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
/*
@@ -2660,9 +2660,6 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
event = __rb_page_index(tail_page, tail);
- /* account for padding bytes */
- local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
-
/*
* Save the original length to the meta data.
* This will be used by the reader to add lost event
@@ -2676,7 +2673,8 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
* write counter enough to allow another writer to slip
* in on this page.
* We put in a discarded commit instead, to make sure
- * that this space is not used again.
+ * that this space is not used again, and this space will
+ * not be accounted into 'entries_bytes'.
*
* If we are less than the minimum size, we don't need to
* worry about it.
@@ -2701,6 +2699,9 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
/* time delta must be non zero */
event->time_delta = 1;
+ /* account for padding bytes */
+ local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
+
/* Make sure the padding is visible before the tail_page->write update */
smp_wmb();
@@ -4215,7 +4216,7 @@ u64 ring_buffer_oldest_event_ts(struct trace_buffer *buffer, int cpu)
EXPORT_SYMBOL_GPL(ring_buffer_oldest_event_ts);
/**
- * ring_buffer_bytes_cpu - get the number of bytes consumed in a cpu buffer
+ * ring_buffer_bytes_cpu - get the number of bytes unconsumed in a cpu buffer
* @buffer: The ring buffer
* @cpu: The per CPU buffer to read from.
*/
@@ -4723,6 +4724,7 @@ static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
length = rb_event_length(event);
cpu_buffer->reader_page->read += length;
+ cpu_buffer->read_bytes += length;
}
static void rb_advance_iter(struct ring_buffer_iter *iter)
@@ -5816,7 +5818,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
} else {
/* update the entry counter */
cpu_buffer->read += rb_page_entries(reader);
- cpu_buffer->read_bytes += BUF_PAGE_SIZE;
+ cpu_buffer->read_bytes += rb_page_commit(reader);
/* swap the pages */
rb_init_page(bpage);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 45d99ea451d0c30bfd4864f0fe485d7dac014902
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100437-backyard-cartwheel-4b2f@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45d99ea451d0c30bfd4864f0fe485d7dac014902 Mon Sep 17 00:00:00 2001
From: Zheng Yejian <zhengyejian1(a)huawei.com>
Date: Thu, 21 Sep 2023 20:54:25 +0800
Subject: [PATCH] ring-buffer: Fix bytes info in per_cpu buffer stats
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
bytes in cpu buffer that have not been consumed. However, currently
after consuming data by reading file 'trace_pipe', the 'bytes' info
was not changed as expected.
# cat per_cpu/cpu0/stats
entries: 0
overrun: 0
commit overrun: 0
bytes: 568 <--- 'bytes' is problematical !!!
oldest event ts: 8651.371479
now ts: 8653.912224
dropped events: 0
read events: 8
The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it:
1. When stat 'read_bytes', account consumed event in rb_advance_reader();
2. When stat 'entries_bytes', exclude the discarded padding event which
is smaller than minimum size because it is invisible to reader. Then
use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for
page-based read/remove/overrun.
Also correct the comments of ring_buffer_bytes_cpu() in this patch.
Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengye…
Cc: stable(a)vger.kernel.org
Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events")
Signed-off-by: Zheng Yejian <zhengyejian1(a)huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a1651edc48d5..28daf0ce95c5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -354,6 +354,11 @@ static void rb_init_page(struct buffer_data_page *bpage)
local_set(&bpage->commit, 0);
}
+static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
+{
+ return local_read(&bpage->page->commit);
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
free_page((unsigned long)bpage->page);
@@ -2003,7 +2008,7 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned long nr_pages)
* Increment overrun to account for the lost events.
*/
local_add(page_entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(to_remove_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
}
@@ -2367,11 +2372,6 @@ rb_reader_event(struct ring_buffer_per_cpu *cpu_buffer)
cpu_buffer->reader_page->read);
}
-static __always_inline unsigned rb_page_commit(struct buffer_page *bpage)
-{
- return local_read(&bpage->page->commit);
-}
-
static struct ring_buffer_event *
rb_iter_head_event(struct ring_buffer_iter *iter)
{
@@ -2517,7 +2517,7 @@ rb_handle_head_page(struct ring_buffer_per_cpu *cpu_buffer,
* the counters.
*/
local_add(entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(next_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
/*
@@ -2660,9 +2660,6 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
event = __rb_page_index(tail_page, tail);
- /* account for padding bytes */
- local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
-
/*
* Save the original length to the meta data.
* This will be used by the reader to add lost event
@@ -2676,7 +2673,8 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
* write counter enough to allow another writer to slip
* in on this page.
* We put in a discarded commit instead, to make sure
- * that this space is not used again.
+ * that this space is not used again, and this space will
+ * not be accounted into 'entries_bytes'.
*
* If we are less than the minimum size, we don't need to
* worry about it.
@@ -2701,6 +2699,9 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
/* time delta must be non zero */
event->time_delta = 1;
+ /* account for padding bytes */
+ local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
+
/* Make sure the padding is visible before the tail_page->write update */
smp_wmb();
@@ -4215,7 +4216,7 @@ u64 ring_buffer_oldest_event_ts(struct trace_buffer *buffer, int cpu)
EXPORT_SYMBOL_GPL(ring_buffer_oldest_event_ts);
/**
- * ring_buffer_bytes_cpu - get the number of bytes consumed in a cpu buffer
+ * ring_buffer_bytes_cpu - get the number of bytes unconsumed in a cpu buffer
* @buffer: The ring buffer
* @cpu: The per CPU buffer to read from.
*/
@@ -4723,6 +4724,7 @@ static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
length = rb_event_length(event);
cpu_buffer->reader_page->read += length;
+ cpu_buffer->read_bytes += length;
}
static void rb_advance_iter(struct ring_buffer_iter *iter)
@@ -5816,7 +5818,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
} else {
/* update the entry counter */
cpu_buffer->read += rb_page_entries(reader);
- cpu_buffer->read_bytes += BUF_PAGE_SIZE;
+ cpu_buffer->read_bytes += rb_page_commit(reader);
/* swap the pages */
rb_init_page(bpage);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 45d99ea451d0c30bfd4864f0fe485d7dac014902
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100436-ergonomic-enviably-bf14@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45d99ea451d0c30bfd4864f0fe485d7dac014902 Mon Sep 17 00:00:00 2001
From: Zheng Yejian <zhengyejian1(a)huawei.com>
Date: Thu, 21 Sep 2023 20:54:25 +0800
Subject: [PATCH] ring-buffer: Fix bytes info in per_cpu buffer stats
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
bytes in cpu buffer that have not been consumed. However, currently
after consuming data by reading file 'trace_pipe', the 'bytes' info
was not changed as expected.
# cat per_cpu/cpu0/stats
entries: 0
overrun: 0
commit overrun: 0
bytes: 568 <--- 'bytes' is problematical !!!
oldest event ts: 8651.371479
now ts: 8653.912224
dropped events: 0
read events: 8
The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it:
1. When stat 'read_bytes', account consumed event in rb_advance_reader();
2. When stat 'entries_bytes', exclude the discarded padding event which
is smaller than minimum size because it is invisible to reader. Then
use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for
page-based read/remove/overrun.
Also correct the comments of ring_buffer_bytes_cpu() in this patch.
Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengye…
Cc: stable(a)vger.kernel.org
Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events")
Signed-off-by: Zheng Yejian <zhengyejian1(a)huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a1651edc48d5..28daf0ce95c5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -354,6 +354,11 @@ static void rb_init_page(struct buffer_data_page *bpage)
local_set(&bpage->commit, 0);
}
+static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
+{
+ return local_read(&bpage->page->commit);
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
free_page((unsigned long)bpage->page);
@@ -2003,7 +2008,7 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned long nr_pages)
* Increment overrun to account for the lost events.
*/
local_add(page_entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(to_remove_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
}
@@ -2367,11 +2372,6 @@ rb_reader_event(struct ring_buffer_per_cpu *cpu_buffer)
cpu_buffer->reader_page->read);
}
-static __always_inline unsigned rb_page_commit(struct buffer_page *bpage)
-{
- return local_read(&bpage->page->commit);
-}
-
static struct ring_buffer_event *
rb_iter_head_event(struct ring_buffer_iter *iter)
{
@@ -2517,7 +2517,7 @@ rb_handle_head_page(struct ring_buffer_per_cpu *cpu_buffer,
* the counters.
*/
local_add(entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(next_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
/*
@@ -2660,9 +2660,6 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
event = __rb_page_index(tail_page, tail);
- /* account for padding bytes */
- local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
-
/*
* Save the original length to the meta data.
* This will be used by the reader to add lost event
@@ -2676,7 +2673,8 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
* write counter enough to allow another writer to slip
* in on this page.
* We put in a discarded commit instead, to make sure
- * that this space is not used again.
+ * that this space is not used again, and this space will
+ * not be accounted into 'entries_bytes'.
*
* If we are less than the minimum size, we don't need to
* worry about it.
@@ -2701,6 +2699,9 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
/* time delta must be non zero */
event->time_delta = 1;
+ /* account for padding bytes */
+ local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
+
/* Make sure the padding is visible before the tail_page->write update */
smp_wmb();
@@ -4215,7 +4216,7 @@ u64 ring_buffer_oldest_event_ts(struct trace_buffer *buffer, int cpu)
EXPORT_SYMBOL_GPL(ring_buffer_oldest_event_ts);
/**
- * ring_buffer_bytes_cpu - get the number of bytes consumed in a cpu buffer
+ * ring_buffer_bytes_cpu - get the number of bytes unconsumed in a cpu buffer
* @buffer: The ring buffer
* @cpu: The per CPU buffer to read from.
*/
@@ -4723,6 +4724,7 @@ static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
length = rb_event_length(event);
cpu_buffer->reader_page->read += length;
+ cpu_buffer->read_bytes += length;
}
static void rb_advance_iter(struct ring_buffer_iter *iter)
@@ -5816,7 +5818,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
} else {
/* update the entry counter */
cpu_buffer->read += rb_page_entries(reader);
- cpu_buffer->read_bytes += BUF_PAGE_SIZE;
+ cpu_buffer->read_bytes += rb_page_commit(reader);
/* swap the pages */
rb_init_page(bpage);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 45d99ea451d0c30bfd4864f0fe485d7dac014902
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100434-majestic-daisy-4a02@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45d99ea451d0c30bfd4864f0fe485d7dac014902 Mon Sep 17 00:00:00 2001
From: Zheng Yejian <zhengyejian1(a)huawei.com>
Date: Thu, 21 Sep 2023 20:54:25 +0800
Subject: [PATCH] ring-buffer: Fix bytes info in per_cpu buffer stats
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
bytes in cpu buffer that have not been consumed. However, currently
after consuming data by reading file 'trace_pipe', the 'bytes' info
was not changed as expected.
# cat per_cpu/cpu0/stats
entries: 0
overrun: 0
commit overrun: 0
bytes: 568 <--- 'bytes' is problematical !!!
oldest event ts: 8651.371479
now ts: 8653.912224
dropped events: 0
read events: 8
The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it:
1. When stat 'read_bytes', account consumed event in rb_advance_reader();
2. When stat 'entries_bytes', exclude the discarded padding event which
is smaller than minimum size because it is invisible to reader. Then
use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for
page-based read/remove/overrun.
Also correct the comments of ring_buffer_bytes_cpu() in this patch.
Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengye…
Cc: stable(a)vger.kernel.org
Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events")
Signed-off-by: Zheng Yejian <zhengyejian1(a)huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a1651edc48d5..28daf0ce95c5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -354,6 +354,11 @@ static void rb_init_page(struct buffer_data_page *bpage)
local_set(&bpage->commit, 0);
}
+static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
+{
+ return local_read(&bpage->page->commit);
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
free_page((unsigned long)bpage->page);
@@ -2003,7 +2008,7 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned long nr_pages)
* Increment overrun to account for the lost events.
*/
local_add(page_entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(to_remove_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
}
@@ -2367,11 +2372,6 @@ rb_reader_event(struct ring_buffer_per_cpu *cpu_buffer)
cpu_buffer->reader_page->read);
}
-static __always_inline unsigned rb_page_commit(struct buffer_page *bpage)
-{
- return local_read(&bpage->page->commit);
-}
-
static struct ring_buffer_event *
rb_iter_head_event(struct ring_buffer_iter *iter)
{
@@ -2517,7 +2517,7 @@ rb_handle_head_page(struct ring_buffer_per_cpu *cpu_buffer,
* the counters.
*/
local_add(entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(next_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
/*
@@ -2660,9 +2660,6 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
event = __rb_page_index(tail_page, tail);
- /* account for padding bytes */
- local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
-
/*
* Save the original length to the meta data.
* This will be used by the reader to add lost event
@@ -2676,7 +2673,8 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
* write counter enough to allow another writer to slip
* in on this page.
* We put in a discarded commit instead, to make sure
- * that this space is not used again.
+ * that this space is not used again, and this space will
+ * not be accounted into 'entries_bytes'.
*
* If we are less than the minimum size, we don't need to
* worry about it.
@@ -2701,6 +2699,9 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
/* time delta must be non zero */
event->time_delta = 1;
+ /* account for padding bytes */
+ local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
+
/* Make sure the padding is visible before the tail_page->write update */
smp_wmb();
@@ -4215,7 +4216,7 @@ u64 ring_buffer_oldest_event_ts(struct trace_buffer *buffer, int cpu)
EXPORT_SYMBOL_GPL(ring_buffer_oldest_event_ts);
/**
- * ring_buffer_bytes_cpu - get the number of bytes consumed in a cpu buffer
+ * ring_buffer_bytes_cpu - get the number of bytes unconsumed in a cpu buffer
* @buffer: The ring buffer
* @cpu: The per CPU buffer to read from.
*/
@@ -4723,6 +4724,7 @@ static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
length = rb_event_length(event);
cpu_buffer->reader_page->read += length;
+ cpu_buffer->read_bytes += length;
}
static void rb_advance_iter(struct ring_buffer_iter *iter)
@@ -5816,7 +5818,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
} else {
/* update the entry counter */
cpu_buffer->read += rb_page_entries(reader);
- cpu_buffer->read_bytes += BUF_PAGE_SIZE;
+ cpu_buffer->read_bytes += rb_page_commit(reader);
/* swap the pages */
rb_init_page(bpage);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 45d99ea451d0c30bfd4864f0fe485d7dac014902
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100433-pumice-despise-a596@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45d99ea451d0c30bfd4864f0fe485d7dac014902 Mon Sep 17 00:00:00 2001
From: Zheng Yejian <zhengyejian1(a)huawei.com>
Date: Thu, 21 Sep 2023 20:54:25 +0800
Subject: [PATCH] ring-buffer: Fix bytes info in per_cpu buffer stats
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
bytes in cpu buffer that have not been consumed. However, currently
after consuming data by reading file 'trace_pipe', the 'bytes' info
was not changed as expected.
# cat per_cpu/cpu0/stats
entries: 0
overrun: 0
commit overrun: 0
bytes: 568 <--- 'bytes' is problematical !!!
oldest event ts: 8651.371479
now ts: 8653.912224
dropped events: 0
read events: 8
The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it:
1. When stat 'read_bytes', account consumed event in rb_advance_reader();
2. When stat 'entries_bytes', exclude the discarded padding event which
is smaller than minimum size because it is invisible to reader. Then
use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for
page-based read/remove/overrun.
Also correct the comments of ring_buffer_bytes_cpu() in this patch.
Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengye…
Cc: stable(a)vger.kernel.org
Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events")
Signed-off-by: Zheng Yejian <zhengyejian1(a)huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a1651edc48d5..28daf0ce95c5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -354,6 +354,11 @@ static void rb_init_page(struct buffer_data_page *bpage)
local_set(&bpage->commit, 0);
}
+static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
+{
+ return local_read(&bpage->page->commit);
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
free_page((unsigned long)bpage->page);
@@ -2003,7 +2008,7 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned long nr_pages)
* Increment overrun to account for the lost events.
*/
local_add(page_entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(to_remove_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
}
@@ -2367,11 +2372,6 @@ rb_reader_event(struct ring_buffer_per_cpu *cpu_buffer)
cpu_buffer->reader_page->read);
}
-static __always_inline unsigned rb_page_commit(struct buffer_page *bpage)
-{
- return local_read(&bpage->page->commit);
-}
-
static struct ring_buffer_event *
rb_iter_head_event(struct ring_buffer_iter *iter)
{
@@ -2517,7 +2517,7 @@ rb_handle_head_page(struct ring_buffer_per_cpu *cpu_buffer,
* the counters.
*/
local_add(entries, &cpu_buffer->overrun);
- local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+ local_sub(rb_page_commit(next_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
/*
@@ -2660,9 +2660,6 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
event = __rb_page_index(tail_page, tail);
- /* account for padding bytes */
- local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
-
/*
* Save the original length to the meta data.
* This will be used by the reader to add lost event
@@ -2676,7 +2673,8 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
* write counter enough to allow another writer to slip
* in on this page.
* We put in a discarded commit instead, to make sure
- * that this space is not used again.
+ * that this space is not used again, and this space will
+ * not be accounted into 'entries_bytes'.
*
* If we are less than the minimum size, we don't need to
* worry about it.
@@ -2701,6 +2699,9 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
/* time delta must be non zero */
event->time_delta = 1;
+ /* account for padding bytes */
+ local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
+
/* Make sure the padding is visible before the tail_page->write update */
smp_wmb();
@@ -4215,7 +4216,7 @@ u64 ring_buffer_oldest_event_ts(struct trace_buffer *buffer, int cpu)
EXPORT_SYMBOL_GPL(ring_buffer_oldest_event_ts);
/**
- * ring_buffer_bytes_cpu - get the number of bytes consumed in a cpu buffer
+ * ring_buffer_bytes_cpu - get the number of bytes unconsumed in a cpu buffer
* @buffer: The ring buffer
* @cpu: The per CPU buffer to read from.
*/
@@ -4723,6 +4724,7 @@ static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
length = rb_event_length(event);
cpu_buffer->reader_page->read += length;
+ cpu_buffer->read_bytes += length;
}
static void rb_advance_iter(struct ring_buffer_iter *iter)
@@ -5816,7 +5818,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
} else {
/* update the entry counter */
cpu_buffer->read += rb_page_entries(reader);
- cpu_buffer->read_bytes += BUF_PAGE_SIZE;
+ cpu_buffer->read_bytes += rb_page_commit(reader);
/* swap the pages */
rb_init_page(bpage);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.14.y
git checkout FETCH_HEAD
git cherry-pick -x 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100431-ascertain-evade-5e53@gregkh' --subject-prefix 'PATCH 4.14.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47 Mon Sep 17 00:00:00 2001
From: Yang Shi <yang(a)os.amperecomputing.com>
Date: Wed, 20 Sep 2023 15:32:42 -0700
Subject: [PATCH] mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and
MPOL_MF_MOVE are specified
When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel
should attempt to migrate all existing pages, and return -EIO if there is
misplaced or unmovable page. Then commit 6f4576e3687b ("mempolicy: apply
page table walker on queue_pages_range()") messed up the return value and
didn't break VMA scan early ianymore when MPOL_MF_STRICT alone. The
return value problem was fixed by commit a7f40cfe3b7a ("mm: mempolicy:
make mbind() return -EIO when MPOL_MF_STRICT is specified"), but it broke
the VMA walk early if unmovable page is met, it may cause some pages are
not migrated as expected.
The code should conceptually do:
if (MPOL_MF_MOVE|MOVEALL)
scan all vmas
try to migrate the existing pages
return success
else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
scan all vmas
try to migrate the existing pages
return -EIO if unmovable or migration failed
else /* MPOL_MF_STRICT alone */
break early if meets unmovable and don't call mbind_range() at all
else /* none of those flags */
check the ranges in test_walk, EFAULT without mbind_range() if discontig.
Fixed the behavior.
Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.…
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
Signed-off-by: Yang Shi <yang(a)os.amperecomputing.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Rafael Aquini <aquini(a)redhat.com>
Cc: Kirill A. Shutemov <kirill(a)shutemov.name>
Cc: David Rientjes <rientjes(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..f1b00d6ac7ee 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -426,6 +426,7 @@ struct queue_pages {
unsigned long start;
unsigned long end;
struct vm_area_struct *first;
+ bool has_unmovable;
};
/*
@@ -446,9 +447,8 @@ static inline bool queue_folio_required(struct folio *folio,
/*
* queue_folios_pmd() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. huge zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
* existing folio was already on a node that does not follow the
* policy.
@@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
if (!vma_migratable(walk->vma) ||
migrate_folio_add(folio, qp->pagelist, flags)) {
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
} else
@@ -495,9 +495,8 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
*
* queue_folios_pte_range() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
* on a node that does not follow the policy.
*/
@@ -508,7 +507,6 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
struct folio *folio;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- bool has_unmovable = false;
pte_t *pte, *mapped_pte;
pte_t ptent;
spinlock_t *ptl;
@@ -538,11 +536,12 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
if (!queue_folio_required(folio, qp))
continue;
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- /* MPOL_MF_STRICT must be specified if we get here */
- if (!vma_migratable(vma)) {
- has_unmovable = true;
- break;
- }
+ /*
+ * MPOL_MF_STRICT must be specified if we get here.
+ * Continue walking vmas due to MPOL_MF_MOVE* flags.
+ */
+ if (!vma_migratable(vma))
+ qp->has_unmovable = true;
/*
* Do not abort immediately since there may be
@@ -550,16 +549,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
* need migrate other LRU pages.
*/
if (migrate_folio_add(folio, qp->pagelist, flags))
- has_unmovable = true;
+ qp->has_unmovable = true;
} else
break;
}
pte_unmap_unlock(mapped_pte, ptl);
cond_resched();
- if (has_unmovable)
- return 1;
-
return addr != end ? -EIO : 0;
}
@@ -599,7 +595,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Detecting misplaced folio but allow migrating folios which
* have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
@@ -620,7 +616,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Failed to isolate folio but allow migrating pages
* which have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
}
unlock:
spin_unlock(ptl);
@@ -756,12 +752,15 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
.start = start,
.end = end,
.first = NULL,
+ .has_unmovable = false,
};
const struct mm_walk_ops *ops = lock_vma ?
&queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
err = walk_page_range(mm, start, end, ops, &qp);
+ if (qp.has_unmovable)
+ err = 1;
if (!qp.first)
/* whole range in hole */
err = -EFAULT;
@@ -1358,7 +1357,7 @@ static long do_mbind(unsigned long start, unsigned long len,
putback_movable_pages(&pagelist);
}
- if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+ if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
err = -EIO;
} else {
up_out:
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100429-unwatched-blaspheme-2d9d@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47 Mon Sep 17 00:00:00 2001
From: Yang Shi <yang(a)os.amperecomputing.com>
Date: Wed, 20 Sep 2023 15:32:42 -0700
Subject: [PATCH] mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and
MPOL_MF_MOVE are specified
When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel
should attempt to migrate all existing pages, and return -EIO if there is
misplaced or unmovable page. Then commit 6f4576e3687b ("mempolicy: apply
page table walker on queue_pages_range()") messed up the return value and
didn't break VMA scan early ianymore when MPOL_MF_STRICT alone. The
return value problem was fixed by commit a7f40cfe3b7a ("mm: mempolicy:
make mbind() return -EIO when MPOL_MF_STRICT is specified"), but it broke
the VMA walk early if unmovable page is met, it may cause some pages are
not migrated as expected.
The code should conceptually do:
if (MPOL_MF_MOVE|MOVEALL)
scan all vmas
try to migrate the existing pages
return success
else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
scan all vmas
try to migrate the existing pages
return -EIO if unmovable or migration failed
else /* MPOL_MF_STRICT alone */
break early if meets unmovable and don't call mbind_range() at all
else /* none of those flags */
check the ranges in test_walk, EFAULT without mbind_range() if discontig.
Fixed the behavior.
Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.…
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
Signed-off-by: Yang Shi <yang(a)os.amperecomputing.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Rafael Aquini <aquini(a)redhat.com>
Cc: Kirill A. Shutemov <kirill(a)shutemov.name>
Cc: David Rientjes <rientjes(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..f1b00d6ac7ee 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -426,6 +426,7 @@ struct queue_pages {
unsigned long start;
unsigned long end;
struct vm_area_struct *first;
+ bool has_unmovable;
};
/*
@@ -446,9 +447,8 @@ static inline bool queue_folio_required(struct folio *folio,
/*
* queue_folios_pmd() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. huge zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
* existing folio was already on a node that does not follow the
* policy.
@@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
if (!vma_migratable(walk->vma) ||
migrate_folio_add(folio, qp->pagelist, flags)) {
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
} else
@@ -495,9 +495,8 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
*
* queue_folios_pte_range() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
* on a node that does not follow the policy.
*/
@@ -508,7 +507,6 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
struct folio *folio;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- bool has_unmovable = false;
pte_t *pte, *mapped_pte;
pte_t ptent;
spinlock_t *ptl;
@@ -538,11 +536,12 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
if (!queue_folio_required(folio, qp))
continue;
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- /* MPOL_MF_STRICT must be specified if we get here */
- if (!vma_migratable(vma)) {
- has_unmovable = true;
- break;
- }
+ /*
+ * MPOL_MF_STRICT must be specified if we get here.
+ * Continue walking vmas due to MPOL_MF_MOVE* flags.
+ */
+ if (!vma_migratable(vma))
+ qp->has_unmovable = true;
/*
* Do not abort immediately since there may be
@@ -550,16 +549,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
* need migrate other LRU pages.
*/
if (migrate_folio_add(folio, qp->pagelist, flags))
- has_unmovable = true;
+ qp->has_unmovable = true;
} else
break;
}
pte_unmap_unlock(mapped_pte, ptl);
cond_resched();
- if (has_unmovable)
- return 1;
-
return addr != end ? -EIO : 0;
}
@@ -599,7 +595,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Detecting misplaced folio but allow migrating folios which
* have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
@@ -620,7 +616,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Failed to isolate folio but allow migrating pages
* which have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
}
unlock:
spin_unlock(ptl);
@@ -756,12 +752,15 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
.start = start,
.end = end,
.first = NULL,
+ .has_unmovable = false,
};
const struct mm_walk_ops *ops = lock_vma ?
&queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
err = walk_page_range(mm, start, end, ops, &qp);
+ if (qp.has_unmovable)
+ err = 1;
if (!qp.first)
/* whole range in hole */
err = -EFAULT;
@@ -1358,7 +1357,7 @@ static long do_mbind(unsigned long start, unsigned long len,
putback_movable_pages(&pagelist);
}
- if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+ if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
err = -EIO;
} else {
up_out:
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100427-urban-cognition-691b@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47 Mon Sep 17 00:00:00 2001
From: Yang Shi <yang(a)os.amperecomputing.com>
Date: Wed, 20 Sep 2023 15:32:42 -0700
Subject: [PATCH] mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and
MPOL_MF_MOVE are specified
When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel
should attempt to migrate all existing pages, and return -EIO if there is
misplaced or unmovable page. Then commit 6f4576e3687b ("mempolicy: apply
page table walker on queue_pages_range()") messed up the return value and
didn't break VMA scan early ianymore when MPOL_MF_STRICT alone. The
return value problem was fixed by commit a7f40cfe3b7a ("mm: mempolicy:
make mbind() return -EIO when MPOL_MF_STRICT is specified"), but it broke
the VMA walk early if unmovable page is met, it may cause some pages are
not migrated as expected.
The code should conceptually do:
if (MPOL_MF_MOVE|MOVEALL)
scan all vmas
try to migrate the existing pages
return success
else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
scan all vmas
try to migrate the existing pages
return -EIO if unmovable or migration failed
else /* MPOL_MF_STRICT alone */
break early if meets unmovable and don't call mbind_range() at all
else /* none of those flags */
check the ranges in test_walk, EFAULT without mbind_range() if discontig.
Fixed the behavior.
Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.…
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
Signed-off-by: Yang Shi <yang(a)os.amperecomputing.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Rafael Aquini <aquini(a)redhat.com>
Cc: Kirill A. Shutemov <kirill(a)shutemov.name>
Cc: David Rientjes <rientjes(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..f1b00d6ac7ee 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -426,6 +426,7 @@ struct queue_pages {
unsigned long start;
unsigned long end;
struct vm_area_struct *first;
+ bool has_unmovable;
};
/*
@@ -446,9 +447,8 @@ static inline bool queue_folio_required(struct folio *folio,
/*
* queue_folios_pmd() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. huge zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
* existing folio was already on a node that does not follow the
* policy.
@@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
if (!vma_migratable(walk->vma) ||
migrate_folio_add(folio, qp->pagelist, flags)) {
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
} else
@@ -495,9 +495,8 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
*
* queue_folios_pte_range() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
* on a node that does not follow the policy.
*/
@@ -508,7 +507,6 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
struct folio *folio;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- bool has_unmovable = false;
pte_t *pte, *mapped_pte;
pte_t ptent;
spinlock_t *ptl;
@@ -538,11 +536,12 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
if (!queue_folio_required(folio, qp))
continue;
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- /* MPOL_MF_STRICT must be specified if we get here */
- if (!vma_migratable(vma)) {
- has_unmovable = true;
- break;
- }
+ /*
+ * MPOL_MF_STRICT must be specified if we get here.
+ * Continue walking vmas due to MPOL_MF_MOVE* flags.
+ */
+ if (!vma_migratable(vma))
+ qp->has_unmovable = true;
/*
* Do not abort immediately since there may be
@@ -550,16 +549,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
* need migrate other LRU pages.
*/
if (migrate_folio_add(folio, qp->pagelist, flags))
- has_unmovable = true;
+ qp->has_unmovable = true;
} else
break;
}
pte_unmap_unlock(mapped_pte, ptl);
cond_resched();
- if (has_unmovable)
- return 1;
-
return addr != end ? -EIO : 0;
}
@@ -599,7 +595,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Detecting misplaced folio but allow migrating folios which
* have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
@@ -620,7 +616,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Failed to isolate folio but allow migrating pages
* which have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
}
unlock:
spin_unlock(ptl);
@@ -756,12 +752,15 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
.start = start,
.end = end,
.first = NULL,
+ .has_unmovable = false,
};
const struct mm_walk_ops *ops = lock_vma ?
&queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
err = walk_page_range(mm, start, end, ops, &qp);
+ if (qp.has_unmovable)
+ err = 1;
if (!qp.first)
/* whole range in hole */
err = -EFAULT;
@@ -1358,7 +1357,7 @@ static long do_mbind(unsigned long start, unsigned long len,
putback_movable_pages(&pagelist);
}
- if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+ if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
err = -EIO;
} else {
up_out:
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100424-koala-ivory-c2e4@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47 Mon Sep 17 00:00:00 2001
From: Yang Shi <yang(a)os.amperecomputing.com>
Date: Wed, 20 Sep 2023 15:32:42 -0700
Subject: [PATCH] mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and
MPOL_MF_MOVE are specified
When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel
should attempt to migrate all existing pages, and return -EIO if there is
misplaced or unmovable page. Then commit 6f4576e3687b ("mempolicy: apply
page table walker on queue_pages_range()") messed up the return value and
didn't break VMA scan early ianymore when MPOL_MF_STRICT alone. The
return value problem was fixed by commit a7f40cfe3b7a ("mm: mempolicy:
make mbind() return -EIO when MPOL_MF_STRICT is specified"), but it broke
the VMA walk early if unmovable page is met, it may cause some pages are
not migrated as expected.
The code should conceptually do:
if (MPOL_MF_MOVE|MOVEALL)
scan all vmas
try to migrate the existing pages
return success
else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
scan all vmas
try to migrate the existing pages
return -EIO if unmovable or migration failed
else /* MPOL_MF_STRICT alone */
break early if meets unmovable and don't call mbind_range() at all
else /* none of those flags */
check the ranges in test_walk, EFAULT without mbind_range() if discontig.
Fixed the behavior.
Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.…
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
Signed-off-by: Yang Shi <yang(a)os.amperecomputing.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Rafael Aquini <aquini(a)redhat.com>
Cc: Kirill A. Shutemov <kirill(a)shutemov.name>
Cc: David Rientjes <rientjes(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..f1b00d6ac7ee 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -426,6 +426,7 @@ struct queue_pages {
unsigned long start;
unsigned long end;
struct vm_area_struct *first;
+ bool has_unmovable;
};
/*
@@ -446,9 +447,8 @@ static inline bool queue_folio_required(struct folio *folio,
/*
* queue_folios_pmd() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. huge zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
* existing folio was already on a node that does not follow the
* policy.
@@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
if (!vma_migratable(walk->vma) ||
migrate_folio_add(folio, qp->pagelist, flags)) {
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
} else
@@ -495,9 +495,8 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
*
* queue_folios_pte_range() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
* on a node that does not follow the policy.
*/
@@ -508,7 +507,6 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
struct folio *folio;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- bool has_unmovable = false;
pte_t *pte, *mapped_pte;
pte_t ptent;
spinlock_t *ptl;
@@ -538,11 +536,12 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
if (!queue_folio_required(folio, qp))
continue;
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- /* MPOL_MF_STRICT must be specified if we get here */
- if (!vma_migratable(vma)) {
- has_unmovable = true;
- break;
- }
+ /*
+ * MPOL_MF_STRICT must be specified if we get here.
+ * Continue walking vmas due to MPOL_MF_MOVE* flags.
+ */
+ if (!vma_migratable(vma))
+ qp->has_unmovable = true;
/*
* Do not abort immediately since there may be
@@ -550,16 +549,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
* need migrate other LRU pages.
*/
if (migrate_folio_add(folio, qp->pagelist, flags))
- has_unmovable = true;
+ qp->has_unmovable = true;
} else
break;
}
pte_unmap_unlock(mapped_pte, ptl);
cond_resched();
- if (has_unmovable)
- return 1;
-
return addr != end ? -EIO : 0;
}
@@ -599,7 +595,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Detecting misplaced folio but allow migrating folios which
* have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
@@ -620,7 +616,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Failed to isolate folio but allow migrating pages
* which have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
}
unlock:
spin_unlock(ptl);
@@ -756,12 +752,15 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
.start = start,
.end = end,
.first = NULL,
+ .has_unmovable = false,
};
const struct mm_walk_ops *ops = lock_vma ?
&queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
err = walk_page_range(mm, start, end, ops, &qp);
+ if (qp.has_unmovable)
+ err = 1;
if (!qp.first)
/* whole range in hole */
err = -EFAULT;
@@ -1358,7 +1357,7 @@ static long do_mbind(unsigned long start, unsigned long len,
putback_movable_pages(&pagelist);
}
- if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+ if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
err = -EIO;
} else {
up_out:
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100422-twistable-umbrella-a71d@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47 Mon Sep 17 00:00:00 2001
From: Yang Shi <yang(a)os.amperecomputing.com>
Date: Wed, 20 Sep 2023 15:32:42 -0700
Subject: [PATCH] mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and
MPOL_MF_MOVE are specified
When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel
should attempt to migrate all existing pages, and return -EIO if there is
misplaced or unmovable page. Then commit 6f4576e3687b ("mempolicy: apply
page table walker on queue_pages_range()") messed up the return value and
didn't break VMA scan early ianymore when MPOL_MF_STRICT alone. The
return value problem was fixed by commit a7f40cfe3b7a ("mm: mempolicy:
make mbind() return -EIO when MPOL_MF_STRICT is specified"), but it broke
the VMA walk early if unmovable page is met, it may cause some pages are
not migrated as expected.
The code should conceptually do:
if (MPOL_MF_MOVE|MOVEALL)
scan all vmas
try to migrate the existing pages
return success
else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
scan all vmas
try to migrate the existing pages
return -EIO if unmovable or migration failed
else /* MPOL_MF_STRICT alone */
break early if meets unmovable and don't call mbind_range() at all
else /* none of those flags */
check the ranges in test_walk, EFAULT without mbind_range() if discontig.
Fixed the behavior.
Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.…
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
Signed-off-by: Yang Shi <yang(a)os.amperecomputing.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Rafael Aquini <aquini(a)redhat.com>
Cc: Kirill A. Shutemov <kirill(a)shutemov.name>
Cc: David Rientjes <rientjes(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..f1b00d6ac7ee 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -426,6 +426,7 @@ struct queue_pages {
unsigned long start;
unsigned long end;
struct vm_area_struct *first;
+ bool has_unmovable;
};
/*
@@ -446,9 +447,8 @@ static inline bool queue_folio_required(struct folio *folio,
/*
* queue_folios_pmd() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. huge zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
* existing folio was already on a node that does not follow the
* policy.
@@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
if (!vma_migratable(walk->vma) ||
migrate_folio_add(folio, qp->pagelist, flags)) {
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
} else
@@ -495,9 +495,8 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
*
* queue_folios_pte_range() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
* on a node that does not follow the policy.
*/
@@ -508,7 +507,6 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
struct folio *folio;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- bool has_unmovable = false;
pte_t *pte, *mapped_pte;
pte_t ptent;
spinlock_t *ptl;
@@ -538,11 +536,12 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
if (!queue_folio_required(folio, qp))
continue;
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- /* MPOL_MF_STRICT must be specified if we get here */
- if (!vma_migratable(vma)) {
- has_unmovable = true;
- break;
- }
+ /*
+ * MPOL_MF_STRICT must be specified if we get here.
+ * Continue walking vmas due to MPOL_MF_MOVE* flags.
+ */
+ if (!vma_migratable(vma))
+ qp->has_unmovable = true;
/*
* Do not abort immediately since there may be
@@ -550,16 +549,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
* need migrate other LRU pages.
*/
if (migrate_folio_add(folio, qp->pagelist, flags))
- has_unmovable = true;
+ qp->has_unmovable = true;
} else
break;
}
pte_unmap_unlock(mapped_pte, ptl);
cond_resched();
- if (has_unmovable)
- return 1;
-
return addr != end ? -EIO : 0;
}
@@ -599,7 +595,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Detecting misplaced folio but allow migrating folios which
* have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
@@ -620,7 +616,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Failed to isolate folio but allow migrating pages
* which have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
}
unlock:
spin_unlock(ptl);
@@ -756,12 +752,15 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
.start = start,
.end = end,
.first = NULL,
+ .has_unmovable = false,
};
const struct mm_walk_ops *ops = lock_vma ?
&queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
err = walk_page_range(mm, start, end, ops, &qp);
+ if (qp.has_unmovable)
+ err = 1;
if (!qp.first)
/* whole range in hole */
err = -EFAULT;
@@ -1358,7 +1357,7 @@ static long do_mbind(unsigned long start, unsigned long len,
putback_movable_pages(&pagelist);
}
- if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+ if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
err = -EIO;
} else {
up_out:
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100420-grafting-raft-1e1b@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47 Mon Sep 17 00:00:00 2001
From: Yang Shi <yang(a)os.amperecomputing.com>
Date: Wed, 20 Sep 2023 15:32:42 -0700
Subject: [PATCH] mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and
MPOL_MF_MOVE are specified
When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel
should attempt to migrate all existing pages, and return -EIO if there is
misplaced or unmovable page. Then commit 6f4576e3687b ("mempolicy: apply
page table walker on queue_pages_range()") messed up the return value and
didn't break VMA scan early ianymore when MPOL_MF_STRICT alone. The
return value problem was fixed by commit a7f40cfe3b7a ("mm: mempolicy:
make mbind() return -EIO when MPOL_MF_STRICT is specified"), but it broke
the VMA walk early if unmovable page is met, it may cause some pages are
not migrated as expected.
The code should conceptually do:
if (MPOL_MF_MOVE|MOVEALL)
scan all vmas
try to migrate the existing pages
return success
else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
scan all vmas
try to migrate the existing pages
return -EIO if unmovable or migration failed
else /* MPOL_MF_STRICT alone */
break early if meets unmovable and don't call mbind_range() at all
else /* none of those flags */
check the ranges in test_walk, EFAULT without mbind_range() if discontig.
Fixed the behavior.
Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.…
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
Signed-off-by: Yang Shi <yang(a)os.amperecomputing.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Rafael Aquini <aquini(a)redhat.com>
Cc: Kirill A. Shutemov <kirill(a)shutemov.name>
Cc: David Rientjes <rientjes(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..f1b00d6ac7ee 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -426,6 +426,7 @@ struct queue_pages {
unsigned long start;
unsigned long end;
struct vm_area_struct *first;
+ bool has_unmovable;
};
/*
@@ -446,9 +447,8 @@ static inline bool queue_folio_required(struct folio *folio,
/*
* queue_folios_pmd() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. huge zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
* existing folio was already on a node that does not follow the
* policy.
@@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
if (!vma_migratable(walk->vma) ||
migrate_folio_add(folio, qp->pagelist, flags)) {
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
} else
@@ -495,9 +495,8 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
*
* queue_folios_pte_range() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
* on a node that does not follow the policy.
*/
@@ -508,7 +507,6 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
struct folio *folio;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- bool has_unmovable = false;
pte_t *pte, *mapped_pte;
pte_t ptent;
spinlock_t *ptl;
@@ -538,11 +536,12 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
if (!queue_folio_required(folio, qp))
continue;
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- /* MPOL_MF_STRICT must be specified if we get here */
- if (!vma_migratable(vma)) {
- has_unmovable = true;
- break;
- }
+ /*
+ * MPOL_MF_STRICT must be specified if we get here.
+ * Continue walking vmas due to MPOL_MF_MOVE* flags.
+ */
+ if (!vma_migratable(vma))
+ qp->has_unmovable = true;
/*
* Do not abort immediately since there may be
@@ -550,16 +549,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
* need migrate other LRU pages.
*/
if (migrate_folio_add(folio, qp->pagelist, flags))
- has_unmovable = true;
+ qp->has_unmovable = true;
} else
break;
}
pte_unmap_unlock(mapped_pte, ptl);
cond_resched();
- if (has_unmovable)
- return 1;
-
return addr != end ? -EIO : 0;
}
@@ -599,7 +595,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Detecting misplaced folio but allow migrating folios which
* have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
@@ -620,7 +616,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Failed to isolate folio but allow migrating pages
* which have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
}
unlock:
spin_unlock(ptl);
@@ -756,12 +752,15 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
.start = start,
.end = end,
.first = NULL,
+ .has_unmovable = false,
};
const struct mm_walk_ops *ops = lock_vma ?
&queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
err = walk_page_range(mm, start, end, ops, &qp);
+ if (qp.has_unmovable)
+ err = 1;
if (!qp.first)
/* whole range in hole */
err = -EFAULT;
@@ -1358,7 +1357,7 @@ static long do_mbind(unsigned long start, unsigned long len,
putback_movable_pages(&pagelist);
}
- if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+ if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
err = -EIO;
} else {
up_out:
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.14.y
git checkout FETCH_HEAD
git cherry-pick -x 578d7699e5c2add8c2e9549d9d75dfb56c460cb3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100409-shorten-demanding-15d2@gregkh' --subject-prefix 'PATCH 4.14.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 578d7699e5c2add8c2e9549d9d75dfb56c460cb3 Mon Sep 17 00:00:00 2001
From: Ben Wolsieffer <Ben.Wolsieffer(a)hefring.com>
Date: Thu, 14 Sep 2023 12:30:20 -0400
Subject: [PATCH] proc: nommu: /proc/<pid>/maps: release mmap read lock
The no-MMU implementation of /proc/<pid>/map doesn't normally release
the mmap read lock, because it uses !IS_ERR_OR_NULL(_vml) to determine
whether to release the lock. Since _vml is NULL when the end of the
mappings is reached, the lock is not released.
Reading /proc/1/maps twice doesn't cause a hang because it only
takes the read lock, which can be taken multiple times and therefore
doesn't show any problem if the lock isn't released. Instead, you need
to perform some operation that attempts to take the write lock after
reading /proc/<pid>/maps. To actually reproduce the bug, compile the
following code as 'proc_maps_bug':
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
int main(int argc, char *argv[]) {
void *buf;
sleep(1);
buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
puts("mmap returned");
return 0;
}
Then, run:
./proc_maps_bug &; cat /proc/$!/maps; fg
Without this patch, mmap() will hang and the command will never
complete.
This code was incorrectly adapted from the MMU implementation, which at
the time released the lock in m_next() before returning the last entry.
The MMU implementation has diverged further from the no-MMU version since
then, so this patch brings their locking and error handling into sync,
fixing the bug and hopefully avoiding similar issues in the future.
Link: https://lkml.kernel.org/r/20230914163019.4050530-2-ben.wolsieffer@hefring.c…
Fixes: 47fecca15c09 ("fs/proc/task_nommu.c: don't use priv->task->mm")
Signed-off-by: Ben Wolsieffer <ben.wolsieffer(a)hefring.com>
Acked-by: Oleg Nesterov <oleg(a)redhat.com>
Cc: Giulio Benetti <giulio.benetti(a)benettiengineering.com>
Cc: Greg Ungerer <gerg(a)uclinux.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index a8ac0dd8041e..bc2e843f4810 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -192,11 +192,16 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return ERR_PTR(-ESRCH);
mm = priv->mm;
- if (!mm || !mmget_not_zero(mm))
+ if (!mm || !mmget_not_zero(mm)) {
+ put_task_struct(priv->task);
+ priv->task = NULL;
return NULL;
+ }
if (mmap_read_lock_killable(mm)) {
mmput(mm);
+ put_task_struct(priv->task);
+ priv->task = NULL;
return ERR_PTR(-EINTR);
}
@@ -205,23 +210,21 @@ static void *m_start(struct seq_file *m, loff_t *pos)
if (vma)
return vma;
- mmap_read_unlock(mm);
- mmput(mm);
return NULL;
}
-static void m_stop(struct seq_file *m, void *_vml)
+static void m_stop(struct seq_file *m, void *v)
{
struct proc_maps_private *priv = m->private;
+ struct mm_struct *mm = priv->mm;
- if (!IS_ERR_OR_NULL(_vml)) {
- mmap_read_unlock(priv->mm);
- mmput(priv->mm);
- }
- if (priv->task) {
- put_task_struct(priv->task);
- priv->task = NULL;
- }
+ if (!priv->task)
+ return;
+
+ mmap_read_unlock(mm);
+ mmput(mm);
+ put_task_struct(priv->task);
+ priv->task = NULL;
}
static void *m_next(struct seq_file *m, void *_p, loff_t *pos)
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 578d7699e5c2add8c2e9549d9d75dfb56c460cb3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100408-swab-depict-d818@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 578d7699e5c2add8c2e9549d9d75dfb56c460cb3 Mon Sep 17 00:00:00 2001
From: Ben Wolsieffer <Ben.Wolsieffer(a)hefring.com>
Date: Thu, 14 Sep 2023 12:30:20 -0400
Subject: [PATCH] proc: nommu: /proc/<pid>/maps: release mmap read lock
The no-MMU implementation of /proc/<pid>/map doesn't normally release
the mmap read lock, because it uses !IS_ERR_OR_NULL(_vml) to determine
whether to release the lock. Since _vml is NULL when the end of the
mappings is reached, the lock is not released.
Reading /proc/1/maps twice doesn't cause a hang because it only
takes the read lock, which can be taken multiple times and therefore
doesn't show any problem if the lock isn't released. Instead, you need
to perform some operation that attempts to take the write lock after
reading /proc/<pid>/maps. To actually reproduce the bug, compile the
following code as 'proc_maps_bug':
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
int main(int argc, char *argv[]) {
void *buf;
sleep(1);
buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
puts("mmap returned");
return 0;
}
Then, run:
./proc_maps_bug &; cat /proc/$!/maps; fg
Without this patch, mmap() will hang and the command will never
complete.
This code was incorrectly adapted from the MMU implementation, which at
the time released the lock in m_next() before returning the last entry.
The MMU implementation has diverged further from the no-MMU version since
then, so this patch brings their locking and error handling into sync,
fixing the bug and hopefully avoiding similar issues in the future.
Link: https://lkml.kernel.org/r/20230914163019.4050530-2-ben.wolsieffer@hefring.c…
Fixes: 47fecca15c09 ("fs/proc/task_nommu.c: don't use priv->task->mm")
Signed-off-by: Ben Wolsieffer <ben.wolsieffer(a)hefring.com>
Acked-by: Oleg Nesterov <oleg(a)redhat.com>
Cc: Giulio Benetti <giulio.benetti(a)benettiengineering.com>
Cc: Greg Ungerer <gerg(a)uclinux.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index a8ac0dd8041e..bc2e843f4810 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -192,11 +192,16 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return ERR_PTR(-ESRCH);
mm = priv->mm;
- if (!mm || !mmget_not_zero(mm))
+ if (!mm || !mmget_not_zero(mm)) {
+ put_task_struct(priv->task);
+ priv->task = NULL;
return NULL;
+ }
if (mmap_read_lock_killable(mm)) {
mmput(mm);
+ put_task_struct(priv->task);
+ priv->task = NULL;
return ERR_PTR(-EINTR);
}
@@ -205,23 +210,21 @@ static void *m_start(struct seq_file *m, loff_t *pos)
if (vma)
return vma;
- mmap_read_unlock(mm);
- mmput(mm);
return NULL;
}
-static void m_stop(struct seq_file *m, void *_vml)
+static void m_stop(struct seq_file *m, void *v)
{
struct proc_maps_private *priv = m->private;
+ struct mm_struct *mm = priv->mm;
- if (!IS_ERR_OR_NULL(_vml)) {
- mmap_read_unlock(priv->mm);
- mmput(priv->mm);
- }
- if (priv->task) {
- put_task_struct(priv->task);
- priv->task = NULL;
- }
+ if (!priv->task)
+ return;
+
+ mmap_read_unlock(mm);
+ mmput(mm);
+ put_task_struct(priv->task);
+ priv->task = NULL;
}
static void *m_next(struct seq_file *m, void *_p, loff_t *pos)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 578d7699e5c2add8c2e9549d9d75dfb56c460cb3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100406-component-isolating-b700@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 578d7699e5c2add8c2e9549d9d75dfb56c460cb3 Mon Sep 17 00:00:00 2001
From: Ben Wolsieffer <Ben.Wolsieffer(a)hefring.com>
Date: Thu, 14 Sep 2023 12:30:20 -0400
Subject: [PATCH] proc: nommu: /proc/<pid>/maps: release mmap read lock
The no-MMU implementation of /proc/<pid>/map doesn't normally release
the mmap read lock, because it uses !IS_ERR_OR_NULL(_vml) to determine
whether to release the lock. Since _vml is NULL when the end of the
mappings is reached, the lock is not released.
Reading /proc/1/maps twice doesn't cause a hang because it only
takes the read lock, which can be taken multiple times and therefore
doesn't show any problem if the lock isn't released. Instead, you need
to perform some operation that attempts to take the write lock after
reading /proc/<pid>/maps. To actually reproduce the bug, compile the
following code as 'proc_maps_bug':
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
int main(int argc, char *argv[]) {
void *buf;
sleep(1);
buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
puts("mmap returned");
return 0;
}
Then, run:
./proc_maps_bug &; cat /proc/$!/maps; fg
Without this patch, mmap() will hang and the command will never
complete.
This code was incorrectly adapted from the MMU implementation, which at
the time released the lock in m_next() before returning the last entry.
The MMU implementation has diverged further from the no-MMU version since
then, so this patch brings their locking and error handling into sync,
fixing the bug and hopefully avoiding similar issues in the future.
Link: https://lkml.kernel.org/r/20230914163019.4050530-2-ben.wolsieffer@hefring.c…
Fixes: 47fecca15c09 ("fs/proc/task_nommu.c: don't use priv->task->mm")
Signed-off-by: Ben Wolsieffer <ben.wolsieffer(a)hefring.com>
Acked-by: Oleg Nesterov <oleg(a)redhat.com>
Cc: Giulio Benetti <giulio.benetti(a)benettiengineering.com>
Cc: Greg Ungerer <gerg(a)uclinux.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index a8ac0dd8041e..bc2e843f4810 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -192,11 +192,16 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return ERR_PTR(-ESRCH);
mm = priv->mm;
- if (!mm || !mmget_not_zero(mm))
+ if (!mm || !mmget_not_zero(mm)) {
+ put_task_struct(priv->task);
+ priv->task = NULL;
return NULL;
+ }
if (mmap_read_lock_killable(mm)) {
mmput(mm);
+ put_task_struct(priv->task);
+ priv->task = NULL;
return ERR_PTR(-EINTR);
}
@@ -205,23 +210,21 @@ static void *m_start(struct seq_file *m, loff_t *pos)
if (vma)
return vma;
- mmap_read_unlock(mm);
- mmput(mm);
return NULL;
}
-static void m_stop(struct seq_file *m, void *_vml)
+static void m_stop(struct seq_file *m, void *v)
{
struct proc_maps_private *priv = m->private;
+ struct mm_struct *mm = priv->mm;
- if (!IS_ERR_OR_NULL(_vml)) {
- mmap_read_unlock(priv->mm);
- mmput(priv->mm);
- }
- if (priv->task) {
- put_task_struct(priv->task);
- priv->task = NULL;
- }
+ if (!priv->task)
+ return;
+
+ mmap_read_unlock(mm);
+ mmput(mm);
+ put_task_struct(priv->task);
+ priv->task = NULL;
}
static void *m_next(struct seq_file *m, void *_p, loff_t *pos)
The patch below does not apply to the 6.5-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.5.y
git checkout FETCH_HEAD
git cherry-pick -x e765886249c533e1bb5cbc3cd741bad677417312
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100450-shone-catchy-89d1@gregkh' --subject-prefix 'PATCH 6.5.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e765886249c533e1bb5cbc3cd741bad677417312 Mon Sep 17 00:00:00 2001
From: Sameer Pujar <spujar(a)nvidia.com>
Date: Thu, 7 Sep 2023 20:32:25 +0530
Subject: [PATCH] ASoC: tegra: Fix redundant PLLA and PLLA_OUT0 updates
Tegra audio graph card has many DAI links which connects internal
AHUB modules and external audio codecs. Since these are DPCM links,
hw_params() call in the machine driver happens for each connected
BE link and PLLA is updated every time. This is not really needed
for all links as only I/O link DAIs derive respective clocks from
PLLA_OUT0 and thus from PLLA. Hence add checks to limit the clock
updates to DAIs over I/O links.
This found to be fixing a DMIC clock discrepancy which is suspected
to happen because of back to back quick PLLA and PLLA_OUT0 rate
updates. This was observed on Jetson TX2 platform where DMIC clock
ended up with unexpected value.
Fixes: 202e2f774543 ("ASoC: tegra: Add audio graph based card driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sameer Pujar <spujar(a)nvidia.com>
Link: https://lore.kernel.org/r/1694098945-32760-3-git-send-email-spujar@nvidia.c…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/sound/soc/tegra/tegra_audio_graph_card.c b/sound/soc/tegra/tegra_audio_graph_card.c
index 1f2c5018bf5a..4737e776d383 100644
--- a/sound/soc/tegra/tegra_audio_graph_card.c
+++ b/sound/soc/tegra/tegra_audio_graph_card.c
@@ -10,6 +10,7 @@
#include <linux/platform_device.h>
#include <sound/graph_card.h>
#include <sound/pcm_params.h>
+#include <sound/soc-dai.h>
#define MAX_PLLA_OUT0_DIV 128
@@ -44,6 +45,21 @@ struct tegra_audio_cdata {
unsigned int plla_out0_rates[NUM_RATE_TYPE];
};
+static bool need_clk_update(struct snd_soc_dai *dai)
+{
+ if (snd_soc_dai_is_dummy(dai) ||
+ !dai->driver->ops ||
+ !dai->driver->name)
+ return false;
+
+ if (strstr(dai->driver->name, "I2S") ||
+ strstr(dai->driver->name, "DMIC") ||
+ strstr(dai->driver->name, "DSPK"))
+ return true;
+
+ return false;
+}
+
/* Setup PLL clock as per the given sample rate */
static int tegra_audio_graph_update_pll(struct snd_pcm_substream *substream,
struct snd_pcm_hw_params *params)
@@ -140,19 +156,7 @@ static int tegra_audio_graph_hw_params(struct snd_pcm_substream *substream,
struct snd_soc_dai *cpu_dai = asoc_rtd_to_cpu(rtd, 0);
int err;
- /*
- * This gets called for each DAI link (FE or BE) when DPCM is used.
- * We may not want to update PLLA rate for each call. So PLLA update
- * must be restricted to external I/O links (I2S, DMIC or DSPK) since
- * they actually depend on it. I/O modules update their clocks in
- * hw_param() of their respective component driver and PLLA rate
- * update here helps them to derive appropriate rates.
- *
- * TODO: When more HW accelerators get added (like sample rate
- * converter, volume gain controller etc., which don't really
- * depend on PLLA) we need a better way to filter here.
- */
- if (cpu_dai->driver->ops && rtd->dai_link->no_pcm) {
+ if (need_clk_update(cpu_dai)) {
err = tegra_audio_graph_update_pll(substream, params);
if (err)
return err;
Hi,
Would you be interested in acquiring Money20/20 USA Email List?
Number of Contacts: 11,542
Cost: $ 1,687
Interested? Email me back; I would love to provide more information on the list.
Kind Regards,
Mary Liyana
Marketing Coordinator
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x a52d4f657568d6458e873f74a9602e022afe666f
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100446-broiler-liquid-20a4@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a52d4f657568d6458e873f74a9602e022afe666f Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Thu, 28 Sep 2023 09:23:27 -0600
Subject: [PATCH] io_uring/fs: remove sqe->rw_flags checking from LINKAT
This is unionized with the actual link flags, so they can of course be
set and they will be evaluated further down. If not we fail any LINKAT
that has to set option flags.
Fixes: cf30da90bc3a ("io_uring: add support for IORING_OP_LINKAT")
Cc: stable(a)vger.kernel.org
Reported-by: Thomas Leonard <talex5(a)gmail.com>
Link: https://github.com/axboe/liburing/issues/955
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/fs.c b/io_uring/fs.c
index f6a69a549fd4..08e3b175469c 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -243,7 +243,7 @@ int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
struct io_link *lnk = io_kiocb_to_cmd(req, struct io_link);
const char __user *oldf, *newf;
- if (sqe->rw_flags || sqe->buf_index || sqe->splice_fd_in)
+ if (sqe->buf_index || sqe->splice_fd_in)
return -EINVAL;
if (unlikely(req->flags & REQ_F_FIXED_FILE))
return -EBADF;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x e765886249c533e1bb5cbc3cd741bad677417312
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100451-matchless-aching-f16c@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e765886249c533e1bb5cbc3cd741bad677417312 Mon Sep 17 00:00:00 2001
From: Sameer Pujar <spujar(a)nvidia.com>
Date: Thu, 7 Sep 2023 20:32:25 +0530
Subject: [PATCH] ASoC: tegra: Fix redundant PLLA and PLLA_OUT0 updates
Tegra audio graph card has many DAI links which connects internal
AHUB modules and external audio codecs. Since these are DPCM links,
hw_params() call in the machine driver happens for each connected
BE link and PLLA is updated every time. This is not really needed
for all links as only I/O link DAIs derive respective clocks from
PLLA_OUT0 and thus from PLLA. Hence add checks to limit the clock
updates to DAIs over I/O links.
This found to be fixing a DMIC clock discrepancy which is suspected
to happen because of back to back quick PLLA and PLLA_OUT0 rate
updates. This was observed on Jetson TX2 platform where DMIC clock
ended up with unexpected value.
Fixes: 202e2f774543 ("ASoC: tegra: Add audio graph based card driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sameer Pujar <spujar(a)nvidia.com>
Link: https://lore.kernel.org/r/1694098945-32760-3-git-send-email-spujar@nvidia.c…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/sound/soc/tegra/tegra_audio_graph_card.c b/sound/soc/tegra/tegra_audio_graph_card.c
index 1f2c5018bf5a..4737e776d383 100644
--- a/sound/soc/tegra/tegra_audio_graph_card.c
+++ b/sound/soc/tegra/tegra_audio_graph_card.c
@@ -10,6 +10,7 @@
#include <linux/platform_device.h>
#include <sound/graph_card.h>
#include <sound/pcm_params.h>
+#include <sound/soc-dai.h>
#define MAX_PLLA_OUT0_DIV 128
@@ -44,6 +45,21 @@ struct tegra_audio_cdata {
unsigned int plla_out0_rates[NUM_RATE_TYPE];
};
+static bool need_clk_update(struct snd_soc_dai *dai)
+{
+ if (snd_soc_dai_is_dummy(dai) ||
+ !dai->driver->ops ||
+ !dai->driver->name)
+ return false;
+
+ if (strstr(dai->driver->name, "I2S") ||
+ strstr(dai->driver->name, "DMIC") ||
+ strstr(dai->driver->name, "DSPK"))
+ return true;
+
+ return false;
+}
+
/* Setup PLL clock as per the given sample rate */
static int tegra_audio_graph_update_pll(struct snd_pcm_substream *substream,
struct snd_pcm_hw_params *params)
@@ -140,19 +156,7 @@ static int tegra_audio_graph_hw_params(struct snd_pcm_substream *substream,
struct snd_soc_dai *cpu_dai = asoc_rtd_to_cpu(rtd, 0);
int err;
- /*
- * This gets called for each DAI link (FE or BE) when DPCM is used.
- * We may not want to update PLLA rate for each call. So PLLA update
- * must be restricted to external I/O links (I2S, DMIC or DSPK) since
- * they actually depend on it. I/O modules update their clocks in
- * hw_param() of their respective component driver and PLLA rate
- * update here helps them to derive appropriate rates.
- *
- * TODO: When more HW accelerators get added (like sample rate
- * converter, volume gain controller etc., which don't really
- * depend on PLLA) we need a better way to filter here.
- */
- if (cpu_dai->driver->ops && rtd->dai_link->no_pcm) {
+ if (need_clk_update(cpu_dai)) {
err = tegra_audio_graph_update_pll(substream, params);
if (err)
return err;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x e765886249c533e1bb5cbc3cd741bad677417312
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100420-corned-elm-de4b@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e765886249c533e1bb5cbc3cd741bad677417312 Mon Sep 17 00:00:00 2001
From: Sameer Pujar <spujar(a)nvidia.com>
Date: Thu, 7 Sep 2023 20:32:25 +0530
Subject: [PATCH] ASoC: tegra: Fix redundant PLLA and PLLA_OUT0 updates
Tegra audio graph card has many DAI links which connects internal
AHUB modules and external audio codecs. Since these are DPCM links,
hw_params() call in the machine driver happens for each connected
BE link and PLLA is updated every time. This is not really needed
for all links as only I/O link DAIs derive respective clocks from
PLLA_OUT0 and thus from PLLA. Hence add checks to limit the clock
updates to DAIs over I/O links.
This found to be fixing a DMIC clock discrepancy which is suspected
to happen because of back to back quick PLLA and PLLA_OUT0 rate
updates. This was observed on Jetson TX2 platform where DMIC clock
ended up with unexpected value.
Fixes: 202e2f774543 ("ASoC: tegra: Add audio graph based card driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sameer Pujar <spujar(a)nvidia.com>
Link: https://lore.kernel.org/r/1694098945-32760-3-git-send-email-spujar@nvidia.c…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/sound/soc/tegra/tegra_audio_graph_card.c b/sound/soc/tegra/tegra_audio_graph_card.c
index 1f2c5018bf5a..4737e776d383 100644
--- a/sound/soc/tegra/tegra_audio_graph_card.c
+++ b/sound/soc/tegra/tegra_audio_graph_card.c
@@ -10,6 +10,7 @@
#include <linux/platform_device.h>
#include <sound/graph_card.h>
#include <sound/pcm_params.h>
+#include <sound/soc-dai.h>
#define MAX_PLLA_OUT0_DIV 128
@@ -44,6 +45,21 @@ struct tegra_audio_cdata {
unsigned int plla_out0_rates[NUM_RATE_TYPE];
};
+static bool need_clk_update(struct snd_soc_dai *dai)
+{
+ if (snd_soc_dai_is_dummy(dai) ||
+ !dai->driver->ops ||
+ !dai->driver->name)
+ return false;
+
+ if (strstr(dai->driver->name, "I2S") ||
+ strstr(dai->driver->name, "DMIC") ||
+ strstr(dai->driver->name, "DSPK"))
+ return true;
+
+ return false;
+}
+
/* Setup PLL clock as per the given sample rate */
static int tegra_audio_graph_update_pll(struct snd_pcm_substream *substream,
struct snd_pcm_hw_params *params)
@@ -140,19 +156,7 @@ static int tegra_audio_graph_hw_params(struct snd_pcm_substream *substream,
struct snd_soc_dai *cpu_dai = asoc_rtd_to_cpu(rtd, 0);
int err;
- /*
- * This gets called for each DAI link (FE or BE) when DPCM is used.
- * We may not want to update PLLA rate for each call. So PLLA update
- * must be restricted to external I/O links (I2S, DMIC or DSPK) since
- * they actually depend on it. I/O modules update their clocks in
- * hw_param() of their respective component driver and PLLA rate
- * update here helps them to derive appropriate rates.
- *
- * TODO: When more HW accelerators get added (like sample rate
- * converter, volume gain controller etc., which don't really
- * depend on PLLA) we need a better way to filter here.
- */
- if (cpu_dai->driver->ops && rtd->dai_link->no_pcm) {
+ if (need_clk_update(cpu_dai)) {
err = tegra_audio_graph_update_pll(substream, params);
if (err)
return err;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 41b07476da38ac2878a14e5b8fe0312c41ea36e3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100456-acclaim-object-ca91@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 41b07476da38ac2878a14e5b8fe0312c41ea36e3 Mon Sep 17 00:00:00 2001
From: Kailang Yang <kailang(a)realtek.com>
Date: Tue, 19 Sep 2023 16:27:16 +0800
Subject: [PATCH] ALSA: hda/realtek - ALC287 Realtek I2S speaker platform
support
New platform SSID:0x231f.
0x17 was only speaker pin, DAC assigned will be 0x03. Headphone
assigned to 0x02.
Playback via headphone will get EQ filter processing.
So, it needs to swap DAC.
Signed-off-by: Kailang Yang <kailang(a)realtek.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/8d63c6e360124e3ea2523753050e6f05@realtek.com
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c
index 883a7e865bc5..751783f3a15c 100644
--- a/sound/pci/hda/patch_realtek.c
+++ b/sound/pci/hda/patch_realtek.c
@@ -10577,6 +10577,10 @@ static const struct snd_hda_pin_quirk alc269_pin_fixup_tbl[] = {
{0x17, 0x90170110},
{0x19, 0x03a11030},
{0x21, 0x03211020}),
+ SND_HDA_PIN_QUIRK(0x10ec0287, 0x17aa, "Lenovo", ALC287_FIXUP_THINKPAD_I2S_SPK,
+ {0x17, 0x90170110}, /* 0x231f with RTK I2S AMP */
+ {0x19, 0x04a11040},
+ {0x21, 0x04211020}),
SND_HDA_PIN_QUIRK(0x10ec0286, 0x1025, "Acer", ALC286_FIXUP_ACER_AIO_MIC_NO_PRESENCE,
{0x12, 0x90a60130},
{0x17, 0x90170110},
The patch below does not apply to the 6.5-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.5.y
git checkout FETCH_HEAD
git cherry-pick -x 41b07476da38ac2878a14e5b8fe0312c41ea36e3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100451-alarm-grasp-7872@gregkh' --subject-prefix 'PATCH 6.5.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 41b07476da38ac2878a14e5b8fe0312c41ea36e3 Mon Sep 17 00:00:00 2001
From: Kailang Yang <kailang(a)realtek.com>
Date: Tue, 19 Sep 2023 16:27:16 +0800
Subject: [PATCH] ALSA: hda/realtek - ALC287 Realtek I2S speaker platform
support
New platform SSID:0x231f.
0x17 was only speaker pin, DAC assigned will be 0x03. Headphone
assigned to 0x02.
Playback via headphone will get EQ filter processing.
So, it needs to swap DAC.
Signed-off-by: Kailang Yang <kailang(a)realtek.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/8d63c6e360124e3ea2523753050e6f05@realtek.com
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c
index 883a7e865bc5..751783f3a15c 100644
--- a/sound/pci/hda/patch_realtek.c
+++ b/sound/pci/hda/patch_realtek.c
@@ -10577,6 +10577,10 @@ static const struct snd_hda_pin_quirk alc269_pin_fixup_tbl[] = {
{0x17, 0x90170110},
{0x19, 0x03a11030},
{0x21, 0x03211020}),
+ SND_HDA_PIN_QUIRK(0x10ec0287, 0x17aa, "Lenovo", ALC287_FIXUP_THINKPAD_I2S_SPK,
+ {0x17, 0x90170110}, /* 0x231f with RTK I2S AMP */
+ {0x19, 0x04a11040},
+ {0x21, 0x04211020}),
SND_HDA_PIN_QUIRK(0x10ec0286, 0x1025, "Acer", ALC286_FIXUP_ACER_AIO_MIC_NO_PRESENCE,
{0x12, 0x90a60130},
{0x17, 0x90170110},
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.14.y
git checkout FETCH_HEAD
git cherry-pick -x cff9b2332ab762b7e0586c793c431a8f2ea4db04
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100455-astrology-overcrowd-10e9@gregkh' --subject-prefix 'PATCH 4.14.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cff9b2332ab762b7e0586c793c431a8f2ea4db04 Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Date: Fri, 15 Sep 2023 13:44:44 -0400
Subject: [PATCH] kernel/sched: Modify initial boot task idle setup
Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle(). Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set. Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed. Recent changes
have caused just this scenario and IRQs have been enabled early.
This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.
Fix this issue by setting the PF_IDLE flag later in the boot sequence.
Although the boot task was marked as idle since (at least) d80e4fda576d,
I am not sure that it is wrong to do so. The forced context-switch on
idle task was introduced in the tiny_rcu update, so I'm going to claim
this fixes 5f6130fa52ee.
Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+Zo…
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2299a5cfbfb9..802551e0009b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
* PF_KTHREAD should already be set at this point; regardless, make it
* look like a proper per-CPU kthread.
*/
- idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
+ idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY;
kthread_set_per_cpu(idle, cpu);
#ifdef CONFIG_SMP
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 342f58a329f5..5007b25c5bc6 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,6 +373,7 @@ EXPORT_SYMBOL_GPL(play_idle_precise);
void cpu_startup_entry(enum cpuhp_state state)
{
+ current->flags |= PF_IDLE;
arch_cpu_idle_prepare();
cpuhp_online_idle(state);
while (1)
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x cff9b2332ab762b7e0586c793c431a8f2ea4db04
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100454-crawlers-crinkle-fea4@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cff9b2332ab762b7e0586c793c431a8f2ea4db04 Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Date: Fri, 15 Sep 2023 13:44:44 -0400
Subject: [PATCH] kernel/sched: Modify initial boot task idle setup
Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle(). Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set. Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed. Recent changes
have caused just this scenario and IRQs have been enabled early.
This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.
Fix this issue by setting the PF_IDLE flag later in the boot sequence.
Although the boot task was marked as idle since (at least) d80e4fda576d,
I am not sure that it is wrong to do so. The forced context-switch on
idle task was introduced in the tiny_rcu update, so I'm going to claim
this fixes 5f6130fa52ee.
Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+Zo…
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2299a5cfbfb9..802551e0009b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
* PF_KTHREAD should already be set at this point; regardless, make it
* look like a proper per-CPU kthread.
*/
- idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
+ idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY;
kthread_set_per_cpu(idle, cpu);
#ifdef CONFIG_SMP
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 342f58a329f5..5007b25c5bc6 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,6 +373,7 @@ EXPORT_SYMBOL_GPL(play_idle_precise);
void cpu_startup_entry(enum cpuhp_state state)
{
+ current->flags |= PF_IDLE;
arch_cpu_idle_prepare();
cpuhp_online_idle(state);
while (1)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x cff9b2332ab762b7e0586c793c431a8f2ea4db04
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100453-company-budget-dc97@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cff9b2332ab762b7e0586c793c431a8f2ea4db04 Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Date: Fri, 15 Sep 2023 13:44:44 -0400
Subject: [PATCH] kernel/sched: Modify initial boot task idle setup
Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle(). Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set. Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed. Recent changes
have caused just this scenario and IRQs have been enabled early.
This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.
Fix this issue by setting the PF_IDLE flag later in the boot sequence.
Although the boot task was marked as idle since (at least) d80e4fda576d,
I am not sure that it is wrong to do so. The forced context-switch on
idle task was introduced in the tiny_rcu update, so I'm going to claim
this fixes 5f6130fa52ee.
Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+Zo…
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2299a5cfbfb9..802551e0009b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
* PF_KTHREAD should already be set at this point; regardless, make it
* look like a proper per-CPU kthread.
*/
- idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
+ idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY;
kthread_set_per_cpu(idle, cpu);
#ifdef CONFIG_SMP
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 342f58a329f5..5007b25c5bc6 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,6 +373,7 @@ EXPORT_SYMBOL_GPL(play_idle_precise);
void cpu_startup_entry(enum cpuhp_state state)
{
+ current->flags |= PF_IDLE;
arch_cpu_idle_prepare();
cpuhp_online_idle(state);
while (1)
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x cff9b2332ab762b7e0586c793c431a8f2ea4db04
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100451-cough-encrypt-1d2e@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cff9b2332ab762b7e0586c793c431a8f2ea4db04 Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Date: Fri, 15 Sep 2023 13:44:44 -0400
Subject: [PATCH] kernel/sched: Modify initial boot task idle setup
Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle(). Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set. Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed. Recent changes
have caused just this scenario and IRQs have been enabled early.
This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.
Fix this issue by setting the PF_IDLE flag later in the boot sequence.
Although the boot task was marked as idle since (at least) d80e4fda576d,
I am not sure that it is wrong to do so. The forced context-switch on
idle task was introduced in the tiny_rcu update, so I'm going to claim
this fixes 5f6130fa52ee.
Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+Zo…
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2299a5cfbfb9..802551e0009b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
* PF_KTHREAD should already be set at this point; regardless, make it
* look like a proper per-CPU kthread.
*/
- idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
+ idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY;
kthread_set_per_cpu(idle, cpu);
#ifdef CONFIG_SMP
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 342f58a329f5..5007b25c5bc6 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,6 +373,7 @@ EXPORT_SYMBOL_GPL(play_idle_precise);
void cpu_startup_entry(enum cpuhp_state state)
{
+ current->flags |= PF_IDLE;
arch_cpu_idle_prepare();
cpuhp_online_idle(state);
while (1)
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x ff48b37802e5c134e2dfc4d091f10b2eb5065a72
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100455-impulse-shucking-f603@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ff48b37802e5c134e2dfc4d091f10b2eb5065a72 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <dlemoal(a)kernel.org>
Date: Fri, 15 Sep 2023 15:00:13 +0900
Subject: [PATCH] scsi: Do not attempt to rescan suspended devices
scsi_rescan_device() takes a scsi device lock before executing a device
handler and device driver rescan methods. Waiting for the completion of
any command issued to the device by these methods will thus be done with
the device lock held. As a result, there is a risk of deadlocking within
the power management code if scsi_rescan_device() is called to handle a
device resume with the associated scsi device not yet resumed.
Avoid such situation by checking that the target scsi device is in the
running state, that is, fully capable of executing commands, before
proceeding with the rescan and bailout returning -EWOULDBLOCK otherwise.
With this error return, the caller can retry rescaning the device after
a delay.
The state check is done with the device lock held and is thus safe
against incoming suspend power management operations.
Fixes: 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume")
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal(a)kernel.org>
Reviewed-by: Hannes Reinecke <hare(a)suse.de>
Reviewed-by: Niklas Cassel <niklas.cassel(a)wdc.com>
Tested-by: Geert Uytterhoeven <geert+renesas(a)glider.be>
Reviewed-by: Martin K. Petersen <martin.petersen(a)oracle.com>
Reviewed-by: Bart Van Assche <bvanassche(a)acm.org>
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 52014b2d39e1..3db4d31a03a1 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1619,12 +1619,24 @@ int scsi_add_device(struct Scsi_Host *host, uint channel,
}
EXPORT_SYMBOL(scsi_add_device);
-void scsi_rescan_device(struct scsi_device *sdev)
+int scsi_rescan_device(struct scsi_device *sdev)
{
struct device *dev = &sdev->sdev_gendev;
+ int ret = 0;
device_lock(dev);
+ /*
+ * Bail out if the device is not running. Otherwise, the rescan may
+ * block waiting for commands to be executed, with us holding the
+ * device lock. This can result in a potential deadlock in the power
+ * management core code when system resume is on-going.
+ */
+ if (sdev->sdev_state != SDEV_RUNNING) {
+ ret = -EWOULDBLOCK;
+ goto unlock;
+ }
+
scsi_attach_vpd(sdev);
scsi_cdl_check(sdev);
@@ -1638,7 +1650,11 @@ void scsi_rescan_device(struct scsi_device *sdev)
drv->rescan(dev);
module_put(dev->driver->owner);
}
+
+unlock:
device_unlock(dev);
+
+ return ret;
}
EXPORT_SYMBOL(scsi_rescan_device);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 49f768d0ff37..4c2dc8150c6d 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -764,7 +764,7 @@ scsi_template_proc_dir(const struct scsi_host_template *sht);
#define scsi_template_proc_dir(sht) NULL
#endif
extern void scsi_scan_host(struct Scsi_Host *);
-extern void scsi_rescan_device(struct scsi_device *);
+extern int scsi_rescan_device(struct scsi_device *sdev);
extern void scsi_remove_host(struct Scsi_Host *);
extern struct Scsi_Host *scsi_host_get(struct Scsi_Host *);
extern int scsi_host_busy(struct Scsi_Host *shost);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 8b4d9469d0b0e553208ee6f62f2807111fde18b9
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100444-gown-footsore-2fd8@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8b4d9469d0b0e553208ee6f62f2807111fde18b9 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <dlemoal(a)kernel.org>
Date: Tue, 5 Sep 2023 09:06:23 +0900
Subject: [PATCH] ata: libata-scsi: Fix delayed scsi_rescan_device() execution
Commit 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after
device resume") modified ata_scsi_dev_rescan() to check the scsi device
"is_suspended" power field to ensure that the scsi device associated
with an ATA device is fully resumed when scsi_rescan_device() is
executed. However, this fix is problematic as:
1) It relies on a PM internal field that should not be used without PM
device locking protection.
2) The check for is_suspended and the call to scsi_rescan_device() are
not atomic and a suspend PM event may be triggered between them,
casuing scsi_rescan_device() to be called on a suspended device and
in that function blocking while holding the scsi device lock. This
would deadlock a following resume operation.
These problems can trigger PM deadlocks on resume, especially with
resume operations triggered quickly after or during suspend operations.
E.g., a simple bash script like:
for (( i=0; i<10; i++ )); do
echo "+2 > /sys/class/rtc/rtc0/wakealarm
echo mem > /sys/power/state
done
that triggers a resume 2 seconds after starting suspending a system can
quickly lead to a PM deadlock preventing the system from correctly
resuming.
Fix this by replacing the check on is_suspended with a check on the
return value given by scsi_rescan_device() as that function will fail if
called against a suspended device. Also make sure rescan tasks already
scheduled are first cancelled before suspending an ata port.
Fixes: 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume")
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal(a)kernel.org>
Reviewed-by: Hannes Reinecke <hare(a)suse.de>
Reviewed-by: Niklas Cassel <niklas.cassel(a)wdc.com>
Tested-by: Geert Uytterhoeven <geert+renesas(a)glider.be>
Reviewed-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index a0bc01606b30..092372334e92 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5168,11 +5168,27 @@ static const unsigned int ata_port_suspend_ehi = ATA_EHI_QUIET
static void ata_port_suspend(struct ata_port *ap, pm_message_t mesg)
{
+ /*
+ * We are about to suspend the port, so we do not care about
+ * scsi_rescan_device() calls scheduled by previous resume operations.
+ * The next resume will schedule the rescan again. So cancel any rescan
+ * that is not done yet.
+ */
+ cancel_delayed_work_sync(&ap->scsi_rescan_task);
+
ata_port_request_pm(ap, mesg, 0, ata_port_suspend_ehi, false);
}
static void ata_port_suspend_async(struct ata_port *ap, pm_message_t mesg)
{
+ /*
+ * We are about to suspend the port, so we do not care about
+ * scsi_rescan_device() calls scheduled by previous resume operations.
+ * The next resume will schedule the rescan again. So cancel any rescan
+ * that is not done yet.
+ */
+ cancel_delayed_work_sync(&ap->scsi_rescan_task);
+
ata_port_request_pm(ap, mesg, 0, ata_port_suspend_ehi, true);
}
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index a0e58d22d222..6850cac803c1 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -4756,7 +4756,7 @@ void ata_scsi_dev_rescan(struct work_struct *work)
struct ata_link *link;
struct ata_device *dev;
unsigned long flags;
- bool delay_rescan = false;
+ int ret = 0;
mutex_lock(&ap->scsi_scan_mutex);
spin_lock_irqsave(ap->lock, flags);
@@ -4765,37 +4765,34 @@ void ata_scsi_dev_rescan(struct work_struct *work)
ata_for_each_dev(dev, link, ENABLED) {
struct scsi_device *sdev = dev->sdev;
+ /*
+ * If the port was suspended before this was scheduled,
+ * bail out.
+ */
+ if (ap->pflags & ATA_PFLAG_SUSPENDED)
+ goto unlock;
+
if (!sdev)
continue;
if (scsi_device_get(sdev))
continue;
- /*
- * If the rescan work was scheduled because of a resume
- * event, the port is already fully resumed, but the
- * SCSI device may not yet be fully resumed. In such
- * case, executing scsi_rescan_device() may cause a
- * deadlock with the PM code on device_lock(). Prevent
- * this by giving up and retrying rescan after a short
- * delay.
- */
- delay_rescan = sdev->sdev_gendev.power.is_suspended;
- if (delay_rescan) {
- scsi_device_put(sdev);
- break;
- }
-
spin_unlock_irqrestore(ap->lock, flags);
- scsi_rescan_device(sdev);
+ ret = scsi_rescan_device(sdev);
scsi_device_put(sdev);
spin_lock_irqsave(ap->lock, flags);
+
+ if (ret)
+ goto unlock;
}
}
+unlock:
spin_unlock_irqrestore(ap->lock, flags);
mutex_unlock(&ap->scsi_scan_mutex);
- if (delay_rescan)
+ /* Reschedule with a delay if scsi_rescan_device() returned an error */
+ if (ret)
schedule_delayed_work(&ap->scsi_rescan_task,
msecs_to_jiffies(5));
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 99398d2070ab03d13f90b758ad397e19a65fffb0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100441-granddad-chewable-dad2@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 99398d2070ab03d13f90b758ad397e19a65fffb0 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <dlemoal(a)kernel.org>
Date: Fri, 8 Sep 2023 17:03:15 +0900
Subject: [PATCH] scsi: sd: Do not issue commands to suspended disks on
shutdown
If an error occurs when resuming a host adapter before the devices
attached to the adapter are resumed, the adapter low level driver may
remove the scsi host, resulting in a call to sd_remove() for the
disks of the host. This in turn results in a call to sd_shutdown() which
will issue a synchronize cache command and a start stop unit command to
spindown the disk. sd_shutdown() issues the commands only if the device
is not already runtime suspended but does not check the power state for
system-wide suspend/resume. That is, the commands may be issued with the
device in a suspended state, which causes PM resume to hang, forcing a
reset of the machine to recover.
Fix this by tracking the suspended state of a disk by introducing the
suspended boolean field in the scsi_disk structure. This flag is set to
true when the disk is suspended is sd_suspend_common() and resumed with
sd_resume(). When suspended is true, sd_shutdown() is not executed from
sd_remove().
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal(a)kernel.org>
Reviewed-by: Hannes Reinecke <hare(a)suse.de>
Reviewed-by: Bart Van Assche <bvanassche(a)acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 5a1b802d180f..83b6a3f3863b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3741,7 +3741,8 @@ static int sd_remove(struct device *dev)
device_del(&sdkp->disk_dev);
del_gendisk(sdkp->disk);
- sd_shutdown(dev);
+ if (!sdkp->suspended)
+ sd_shutdown(dev);
put_disk(sdkp->disk);
return 0;
@@ -3872,6 +3873,9 @@ static int sd_suspend_common(struct device *dev, bool runtime)
ret = 0;
}
+ if (!ret)
+ sdkp->suspended = true;
+
return ret;
}
@@ -3891,21 +3895,26 @@ static int sd_suspend_runtime(struct device *dev)
static int sd_resume(struct device *dev, bool runtime)
{
struct scsi_disk *sdkp = dev_get_drvdata(dev);
- int ret;
+ int ret = 0;
if (!sdkp) /* E.g.: runtime resume at the start of sd_probe() */
return 0;
- if (!sd_do_start_stop(sdkp->device, runtime))
+ if (!sd_do_start_stop(sdkp->device, runtime)) {
+ sdkp->suspended = false;
return 0;
+ }
if (!sdkp->device->no_start_on_resume) {
sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
ret = sd_start_stop_device(sdkp, 1);
}
- if (!ret)
+ if (!ret) {
opal_unlock_from_suspend(sdkp->opal_dev);
+ sdkp->suspended = false;
+ }
+
return ret;
}
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 5eea762f84d1..409dda5350d1 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -131,6 +131,7 @@ struct scsi_disk {
u8 provisioning_mode;
u8 zeroing_mode;
u8 nr_actuators; /* Number of actuators */
+ bool suspended; /* Disk is suspended (stopped) */
unsigned ATO : 1; /* state of disk ATO bit */
unsigned cache_override : 1; /* temp override of WCE,RCD */
unsigned WCE : 1; /* state of disk WCE bit */
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 99398d2070ab03d13f90b758ad397e19a65fffb0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100439-blighted-enjoyment-32b2@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 99398d2070ab03d13f90b758ad397e19a65fffb0 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <dlemoal(a)kernel.org>
Date: Fri, 8 Sep 2023 17:03:15 +0900
Subject: [PATCH] scsi: sd: Do not issue commands to suspended disks on
shutdown
If an error occurs when resuming a host adapter before the devices
attached to the adapter are resumed, the adapter low level driver may
remove the scsi host, resulting in a call to sd_remove() for the
disks of the host. This in turn results in a call to sd_shutdown() which
will issue a synchronize cache command and a start stop unit command to
spindown the disk. sd_shutdown() issues the commands only if the device
is not already runtime suspended but does not check the power state for
system-wide suspend/resume. That is, the commands may be issued with the
device in a suspended state, which causes PM resume to hang, forcing a
reset of the machine to recover.
Fix this by tracking the suspended state of a disk by introducing the
suspended boolean field in the scsi_disk structure. This flag is set to
true when the disk is suspended is sd_suspend_common() and resumed with
sd_resume(). When suspended is true, sd_shutdown() is not executed from
sd_remove().
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal(a)kernel.org>
Reviewed-by: Hannes Reinecke <hare(a)suse.de>
Reviewed-by: Bart Van Assche <bvanassche(a)acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 5a1b802d180f..83b6a3f3863b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3741,7 +3741,8 @@ static int sd_remove(struct device *dev)
device_del(&sdkp->disk_dev);
del_gendisk(sdkp->disk);
- sd_shutdown(dev);
+ if (!sdkp->suspended)
+ sd_shutdown(dev);
put_disk(sdkp->disk);
return 0;
@@ -3872,6 +3873,9 @@ static int sd_suspend_common(struct device *dev, bool runtime)
ret = 0;
}
+ if (!ret)
+ sdkp->suspended = true;
+
return ret;
}
@@ -3891,21 +3895,26 @@ static int sd_suspend_runtime(struct device *dev)
static int sd_resume(struct device *dev, bool runtime)
{
struct scsi_disk *sdkp = dev_get_drvdata(dev);
- int ret;
+ int ret = 0;
if (!sdkp) /* E.g.: runtime resume at the start of sd_probe() */
return 0;
- if (!sd_do_start_stop(sdkp->device, runtime))
+ if (!sd_do_start_stop(sdkp->device, runtime)) {
+ sdkp->suspended = false;
return 0;
+ }
if (!sdkp->device->no_start_on_resume) {
sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
ret = sd_start_stop_device(sdkp, 1);
}
- if (!ret)
+ if (!ret) {
opal_unlock_from_suspend(sdkp->opal_dev);
+ sdkp->suspended = false;
+ }
+
return ret;
}
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 5eea762f84d1..409dda5350d1 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -131,6 +131,7 @@ struct scsi_disk {
u8 provisioning_mode;
u8 zeroing_mode;
u8 nr_actuators; /* Number of actuators */
+ bool suspended; /* Disk is suspended (stopped) */
unsigned ATO : 1; /* state of disk ATO bit */
unsigned cache_override : 1; /* temp override of WCE,RCD */
unsigned WCE : 1; /* state of disk WCE bit */
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 27e5ccc2d5a50ed61bb73153edb1066104b108b3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100415-word-epidural-cdee@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 27e5ccc2d5a50ed61bb73153edb1066104b108b3 Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Sat, 16 Sep 2023 12:52:49 +0200
Subject: [PATCH] mptcp: fix dangling connection hang-up
According to RFC 8684 section 3.3:
A connection is not closed unless [...] or an implementation-specific
connection-level send timeout.
Currently the MPTCP protocol does not implement such timeout, and
connection timing-out at the TCP-level never move to close state.
Introduces a catch-up condition at subflow close time to move the
MPTCP socket to close, too.
That additionally allows removing similar existing inside the worker.
Finally, allow some additional timeout for plain ESTABLISHED mptcp
sockets, as the protocol allows creating new subflows even at that
point and making the connection functional again.
This issue is actually present since the beginning, but it is basically
impossible to solve without a long chain of functional pre-requisites
topped by commit bbd49d114d57 ("mptcp: consolidate transition to
TCP_CLOSE in mptcp_do_fastclose()"). When backporting this current
patch, please also backport this other commit as well.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/430
Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close")
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index c8f38f303a90..e252539b1e19 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -892,6 +892,7 @@ static bool __mptcp_finish_join(struct mptcp_sock *msk, struct sock *ssk)
mptcp_subflow_ctx(ssk)->subflow_id = msk->subflow_id++;
mptcp_sockopt_sync_locked(msk, ssk);
mptcp_subflow_joined(msk, ssk);
+ mptcp_stop_tout_timer(sk);
return true;
}
@@ -2369,18 +2370,14 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
bool dispose_it, need_push = false;
/* If the first subflow moved to a close state before accept, e.g. due
- * to an incoming reset, mptcp either:
- * - if either the subflow or the msk are dead, destroy the context
- * (the subflow socket is deleted by inet_child_forget) and the msk
- * - otherwise do nothing at the moment and take action at accept and/or
- * listener shutdown - user-space must be able to accept() the closed
- * socket.
+ * to an incoming reset or listener shutdown, the subflow socket is
+ * already deleted by inet_child_forget() and the mptcp socket can't
+ * survive too.
*/
- if (msk->in_accept_queue && msk->first == ssk) {
- if (!sock_flag(sk, SOCK_DEAD) && !sock_flag(ssk, SOCK_DEAD))
- return;
-
+ if (msk->in_accept_queue && msk->first == ssk &&
+ (sock_flag(sk, SOCK_DEAD) || sock_flag(ssk, SOCK_DEAD))) {
/* ensure later check in mptcp_worker() will dispose the msk */
+ mptcp_set_close_tout(sk, tcp_jiffies32 - (TCP_TIMEWAIT_LEN + 1));
sock_set_flag(sk, SOCK_DEAD);
lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
mptcp_subflow_drop_ctx(ssk);
@@ -2443,6 +2440,22 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
out:
if (need_push)
__mptcp_push_pending(sk, 0);
+
+ /* Catch every 'all subflows closed' scenario, including peers silently
+ * closing them, e.g. due to timeout.
+ * For established sockets, allow an additional timeout before closing,
+ * as the protocol can still create more subflows.
+ */
+ if (list_is_singular(&msk->conn_list) && msk->first &&
+ inet_sk_state_load(msk->first) == TCP_CLOSE) {
+ if (sk->sk_state != TCP_ESTABLISHED ||
+ msk->in_accept_queue || sock_flag(sk, SOCK_DEAD)) {
+ inet_sk_state_store(sk, TCP_CLOSE);
+ mptcp_close_wake_up(sk);
+ } else {
+ mptcp_start_tout_timer(sk);
+ }
+ }
}
void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
@@ -2486,23 +2499,14 @@ static void __mptcp_close_subflow(struct sock *sk)
}
-static bool mptcp_should_close(const struct sock *sk)
+static bool mptcp_close_tout_expired(const struct sock *sk)
{
- s32 delta = tcp_jiffies32 - inet_csk(sk)->icsk_mtup.probe_timestamp;
- struct mptcp_subflow_context *subflow;
+ if (!inet_csk(sk)->icsk_mtup.probe_timestamp ||
+ sk->sk_state == TCP_CLOSE)
+ return false;
- if (delta >= TCP_TIMEWAIT_LEN || mptcp_sk(sk)->in_accept_queue)
- return true;
-
- /* if all subflows are in closed status don't bother with additional
- * timeout
- */
- mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
- if (inet_sk_state_load(mptcp_subflow_tcp_sock(subflow)) !=
- TCP_CLOSE)
- return false;
- }
- return true;
+ return time_after32(tcp_jiffies32,
+ inet_csk(sk)->icsk_mtup.probe_timestamp + TCP_TIMEWAIT_LEN);
}
static void mptcp_check_fastclose(struct mptcp_sock *msk)
@@ -2641,15 +2645,16 @@ void mptcp_reset_tout_timer(struct mptcp_sock *msk, unsigned long fail_tout)
struct sock *sk = (struct sock *)msk;
unsigned long timeout, close_timeout;
- if (!fail_tout && !sock_flag(sk, SOCK_DEAD))
+ if (!fail_tout && !inet_csk(sk)->icsk_mtup.probe_timestamp)
return;
- close_timeout = inet_csk(sk)->icsk_mtup.probe_timestamp - tcp_jiffies32 + jiffies + TCP_TIMEWAIT_LEN;
+ close_timeout = inet_csk(sk)->icsk_mtup.probe_timestamp - tcp_jiffies32 + jiffies +
+ TCP_TIMEWAIT_LEN;
/* the close timeout takes precedence on the fail one, and here at least one of
* them is active
*/
- timeout = sock_flag(sk, SOCK_DEAD) ? close_timeout : fail_tout;
+ timeout = inet_csk(sk)->icsk_mtup.probe_timestamp ? close_timeout : fail_tout;
sk_reset_timer(sk, &sk->sk_timer, timeout);
}
@@ -2668,8 +2673,6 @@ static void mptcp_mp_fail_no_response(struct mptcp_sock *msk)
mptcp_subflow_reset(ssk);
WRITE_ONCE(mptcp_subflow_ctx(ssk)->fail_tout, 0);
unlock_sock_fast(ssk, slow);
-
- mptcp_reset_tout_timer(msk, 0);
}
static void mptcp_do_fastclose(struct sock *sk)
@@ -2706,18 +2709,14 @@ static void mptcp_worker(struct work_struct *work)
if (test_and_clear_bit(MPTCP_WORK_CLOSE_SUBFLOW, &msk->flags))
__mptcp_close_subflow(sk);
- /* There is no point in keeping around an orphaned sk timedout or
- * closed, but we need the msk around to reply to incoming DATA_FIN,
- * even if it is orphaned and in FIN_WAIT2 state
- */
- if (sock_flag(sk, SOCK_DEAD)) {
- if (mptcp_should_close(sk))
- mptcp_do_fastclose(sk);
+ if (mptcp_close_tout_expired(sk)) {
+ mptcp_do_fastclose(sk);
+ mptcp_close_wake_up(sk);
+ }
- if (sk->sk_state == TCP_CLOSE) {
- __mptcp_destroy_sock(sk);
- goto unlock;
- }
+ if (sock_flag(sk, SOCK_DEAD) && sk->sk_state == TCP_CLOSE) {
+ __mptcp_destroy_sock(sk);
+ goto unlock;
}
if (test_and_clear_bit(MPTCP_WORK_RTX, &msk->flags))
@@ -3016,7 +3015,6 @@ bool __mptcp_close(struct sock *sk, long timeout)
cleanup:
/* orphan all the subflows */
- inet_csk(sk)->icsk_mtup.probe_timestamp = tcp_jiffies32;
mptcp_for_each_subflow(msk, subflow) {
struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
bool slow = lock_sock_fast_nested(ssk);
@@ -3053,7 +3051,7 @@ bool __mptcp_close(struct sock *sk, long timeout)
__mptcp_destroy_sock(sk);
do_cancel_work = true;
} else {
- mptcp_reset_tout_timer(msk, 0);
+ mptcp_start_tout_timer(sk);
}
return do_cancel_work;
@@ -3117,7 +3115,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
inet_sk_state_store(sk, TCP_CLOSE);
mptcp_stop_rtx_timer(sk);
- sk_stop_timer(sk, &sk->sk_timer);
+ mptcp_stop_tout_timer(sk);
if (msk->token)
mptcp_event(MPTCP_EVENT_CLOSED, msk, NULL, GFP_KERNEL);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 5e2026815c8e..ed61d6850cce 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -719,6 +719,28 @@ void mptcp_get_options(const struct sk_buff *skb,
void mptcp_finish_connect(struct sock *sk);
void __mptcp_set_connected(struct sock *sk);
void mptcp_reset_tout_timer(struct mptcp_sock *msk, unsigned long fail_tout);
+
+static inline void mptcp_stop_tout_timer(struct sock *sk)
+{
+ if (!inet_csk(sk)->icsk_mtup.probe_timestamp)
+ return;
+
+ sk_stop_timer(sk, &sk->sk_timer);
+ inet_csk(sk)->icsk_mtup.probe_timestamp = 0;
+}
+
+static inline void mptcp_set_close_tout(struct sock *sk, unsigned long tout)
+{
+ /* avoid 0 timestamp, as that means no close timeout */
+ inet_csk(sk)->icsk_mtup.probe_timestamp = tout ? : 1;
+}
+
+static inline void mptcp_start_tout_timer(struct sock *sk)
+{
+ mptcp_set_close_tout(sk, tcp_jiffies32);
+ mptcp_reset_tout_timer(mptcp_sk(sk), 0);
+}
+
static inline bool mptcp_is_fully_established(struct sock *sk)
{
return inet_sk_state_load(sk) == TCP_ESTABLISHED &&
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 433f290984c8..918c1a235790 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1552,6 +1552,7 @@ int __mptcp_subflow_connect(struct sock *sk, const struct mptcp_addr_info *loc,
mptcp_sock_graft(ssk, sk->sk_socket);
iput(SOCK_INODE(sf));
WRITE_ONCE(msk->allow_infinite_fallback, false);
+ mptcp_stop_tout_timer(sk);
return 0;
failed_unlink:
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 27e5ccc2d5a50ed61bb73153edb1066104b108b3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100414-reverse-yippee-3d2e@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 27e5ccc2d5a50ed61bb73153edb1066104b108b3 Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Sat, 16 Sep 2023 12:52:49 +0200
Subject: [PATCH] mptcp: fix dangling connection hang-up
According to RFC 8684 section 3.3:
A connection is not closed unless [...] or an implementation-specific
connection-level send timeout.
Currently the MPTCP protocol does not implement such timeout, and
connection timing-out at the TCP-level never move to close state.
Introduces a catch-up condition at subflow close time to move the
MPTCP socket to close, too.
That additionally allows removing similar existing inside the worker.
Finally, allow some additional timeout for plain ESTABLISHED mptcp
sockets, as the protocol allows creating new subflows even at that
point and making the connection functional again.
This issue is actually present since the beginning, but it is basically
impossible to solve without a long chain of functional pre-requisites
topped by commit bbd49d114d57 ("mptcp: consolidate transition to
TCP_CLOSE in mptcp_do_fastclose()"). When backporting this current
patch, please also backport this other commit as well.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/430
Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close")
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index c8f38f303a90..e252539b1e19 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -892,6 +892,7 @@ static bool __mptcp_finish_join(struct mptcp_sock *msk, struct sock *ssk)
mptcp_subflow_ctx(ssk)->subflow_id = msk->subflow_id++;
mptcp_sockopt_sync_locked(msk, ssk);
mptcp_subflow_joined(msk, ssk);
+ mptcp_stop_tout_timer(sk);
return true;
}
@@ -2369,18 +2370,14 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
bool dispose_it, need_push = false;
/* If the first subflow moved to a close state before accept, e.g. due
- * to an incoming reset, mptcp either:
- * - if either the subflow or the msk are dead, destroy the context
- * (the subflow socket is deleted by inet_child_forget) and the msk
- * - otherwise do nothing at the moment and take action at accept and/or
- * listener shutdown - user-space must be able to accept() the closed
- * socket.
+ * to an incoming reset or listener shutdown, the subflow socket is
+ * already deleted by inet_child_forget() and the mptcp socket can't
+ * survive too.
*/
- if (msk->in_accept_queue && msk->first == ssk) {
- if (!sock_flag(sk, SOCK_DEAD) && !sock_flag(ssk, SOCK_DEAD))
- return;
-
+ if (msk->in_accept_queue && msk->first == ssk &&
+ (sock_flag(sk, SOCK_DEAD) || sock_flag(ssk, SOCK_DEAD))) {
/* ensure later check in mptcp_worker() will dispose the msk */
+ mptcp_set_close_tout(sk, tcp_jiffies32 - (TCP_TIMEWAIT_LEN + 1));
sock_set_flag(sk, SOCK_DEAD);
lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
mptcp_subflow_drop_ctx(ssk);
@@ -2443,6 +2440,22 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
out:
if (need_push)
__mptcp_push_pending(sk, 0);
+
+ /* Catch every 'all subflows closed' scenario, including peers silently
+ * closing them, e.g. due to timeout.
+ * For established sockets, allow an additional timeout before closing,
+ * as the protocol can still create more subflows.
+ */
+ if (list_is_singular(&msk->conn_list) && msk->first &&
+ inet_sk_state_load(msk->first) == TCP_CLOSE) {
+ if (sk->sk_state != TCP_ESTABLISHED ||
+ msk->in_accept_queue || sock_flag(sk, SOCK_DEAD)) {
+ inet_sk_state_store(sk, TCP_CLOSE);
+ mptcp_close_wake_up(sk);
+ } else {
+ mptcp_start_tout_timer(sk);
+ }
+ }
}
void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
@@ -2486,23 +2499,14 @@ static void __mptcp_close_subflow(struct sock *sk)
}
-static bool mptcp_should_close(const struct sock *sk)
+static bool mptcp_close_tout_expired(const struct sock *sk)
{
- s32 delta = tcp_jiffies32 - inet_csk(sk)->icsk_mtup.probe_timestamp;
- struct mptcp_subflow_context *subflow;
+ if (!inet_csk(sk)->icsk_mtup.probe_timestamp ||
+ sk->sk_state == TCP_CLOSE)
+ return false;
- if (delta >= TCP_TIMEWAIT_LEN || mptcp_sk(sk)->in_accept_queue)
- return true;
-
- /* if all subflows are in closed status don't bother with additional
- * timeout
- */
- mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
- if (inet_sk_state_load(mptcp_subflow_tcp_sock(subflow)) !=
- TCP_CLOSE)
- return false;
- }
- return true;
+ return time_after32(tcp_jiffies32,
+ inet_csk(sk)->icsk_mtup.probe_timestamp + TCP_TIMEWAIT_LEN);
}
static void mptcp_check_fastclose(struct mptcp_sock *msk)
@@ -2641,15 +2645,16 @@ void mptcp_reset_tout_timer(struct mptcp_sock *msk, unsigned long fail_tout)
struct sock *sk = (struct sock *)msk;
unsigned long timeout, close_timeout;
- if (!fail_tout && !sock_flag(sk, SOCK_DEAD))
+ if (!fail_tout && !inet_csk(sk)->icsk_mtup.probe_timestamp)
return;
- close_timeout = inet_csk(sk)->icsk_mtup.probe_timestamp - tcp_jiffies32 + jiffies + TCP_TIMEWAIT_LEN;
+ close_timeout = inet_csk(sk)->icsk_mtup.probe_timestamp - tcp_jiffies32 + jiffies +
+ TCP_TIMEWAIT_LEN;
/* the close timeout takes precedence on the fail one, and here at least one of
* them is active
*/
- timeout = sock_flag(sk, SOCK_DEAD) ? close_timeout : fail_tout;
+ timeout = inet_csk(sk)->icsk_mtup.probe_timestamp ? close_timeout : fail_tout;
sk_reset_timer(sk, &sk->sk_timer, timeout);
}
@@ -2668,8 +2673,6 @@ static void mptcp_mp_fail_no_response(struct mptcp_sock *msk)
mptcp_subflow_reset(ssk);
WRITE_ONCE(mptcp_subflow_ctx(ssk)->fail_tout, 0);
unlock_sock_fast(ssk, slow);
-
- mptcp_reset_tout_timer(msk, 0);
}
static void mptcp_do_fastclose(struct sock *sk)
@@ -2706,18 +2709,14 @@ static void mptcp_worker(struct work_struct *work)
if (test_and_clear_bit(MPTCP_WORK_CLOSE_SUBFLOW, &msk->flags))
__mptcp_close_subflow(sk);
- /* There is no point in keeping around an orphaned sk timedout or
- * closed, but we need the msk around to reply to incoming DATA_FIN,
- * even if it is orphaned and in FIN_WAIT2 state
- */
- if (sock_flag(sk, SOCK_DEAD)) {
- if (mptcp_should_close(sk))
- mptcp_do_fastclose(sk);
+ if (mptcp_close_tout_expired(sk)) {
+ mptcp_do_fastclose(sk);
+ mptcp_close_wake_up(sk);
+ }
- if (sk->sk_state == TCP_CLOSE) {
- __mptcp_destroy_sock(sk);
- goto unlock;
- }
+ if (sock_flag(sk, SOCK_DEAD) && sk->sk_state == TCP_CLOSE) {
+ __mptcp_destroy_sock(sk);
+ goto unlock;
}
if (test_and_clear_bit(MPTCP_WORK_RTX, &msk->flags))
@@ -3016,7 +3015,6 @@ bool __mptcp_close(struct sock *sk, long timeout)
cleanup:
/* orphan all the subflows */
- inet_csk(sk)->icsk_mtup.probe_timestamp = tcp_jiffies32;
mptcp_for_each_subflow(msk, subflow) {
struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
bool slow = lock_sock_fast_nested(ssk);
@@ -3053,7 +3051,7 @@ bool __mptcp_close(struct sock *sk, long timeout)
__mptcp_destroy_sock(sk);
do_cancel_work = true;
} else {
- mptcp_reset_tout_timer(msk, 0);
+ mptcp_start_tout_timer(sk);
}
return do_cancel_work;
@@ -3117,7 +3115,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
inet_sk_state_store(sk, TCP_CLOSE);
mptcp_stop_rtx_timer(sk);
- sk_stop_timer(sk, &sk->sk_timer);
+ mptcp_stop_tout_timer(sk);
if (msk->token)
mptcp_event(MPTCP_EVENT_CLOSED, msk, NULL, GFP_KERNEL);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 5e2026815c8e..ed61d6850cce 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -719,6 +719,28 @@ void mptcp_get_options(const struct sk_buff *skb,
void mptcp_finish_connect(struct sock *sk);
void __mptcp_set_connected(struct sock *sk);
void mptcp_reset_tout_timer(struct mptcp_sock *msk, unsigned long fail_tout);
+
+static inline void mptcp_stop_tout_timer(struct sock *sk)
+{
+ if (!inet_csk(sk)->icsk_mtup.probe_timestamp)
+ return;
+
+ sk_stop_timer(sk, &sk->sk_timer);
+ inet_csk(sk)->icsk_mtup.probe_timestamp = 0;
+}
+
+static inline void mptcp_set_close_tout(struct sock *sk, unsigned long tout)
+{
+ /* avoid 0 timestamp, as that means no close timeout */
+ inet_csk(sk)->icsk_mtup.probe_timestamp = tout ? : 1;
+}
+
+static inline void mptcp_start_tout_timer(struct sock *sk)
+{
+ mptcp_set_close_tout(sk, tcp_jiffies32);
+ mptcp_reset_tout_timer(mptcp_sk(sk), 0);
+}
+
static inline bool mptcp_is_fully_established(struct sock *sk)
{
return inet_sk_state_load(sk) == TCP_ESTABLISHED &&
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 433f290984c8..918c1a235790 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1552,6 +1552,7 @@ int __mptcp_subflow_connect(struct sock *sk, const struct mptcp_addr_info *loc,
mptcp_sock_graft(ssk, sk->sk_socket);
iput(SOCK_INODE(sf));
WRITE_ONCE(msk->allow_infinite_fallback, false);
+ mptcp_stop_tout_timer(sk);
return 0;
failed_unlink:
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 9f1a98813b4b686482e5ef3c9d998581cace0ba6
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100446-wasp-granny-2378@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9f1a98813b4b686482e5ef3c9d998581cace0ba6 Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Sat, 16 Sep 2023 12:52:47 +0200
Subject: [PATCH] mptcp: process pending subflow error on close
On incoming TCP reset, subflow closing could happen before error
propagation. That in turn could cause the socket error being ignored,
and a missing socket state transition, as reported by Daire-Byrne.
Address the issues explicitly checking for subflow socket error at
close time. To avoid code duplication, factor-out of __mptcp_error_report()
a new helper implementing the relevant bits.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/429
Fixes: 15cc10453398 ("mptcp: deliver ssk errors to msk")
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 915860027b1a..1c96b8da71df 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -770,40 +770,44 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
return moved;
}
+static bool __mptcp_subflow_error_report(struct sock *sk, struct sock *ssk)
+{
+ int err = sock_error(ssk);
+ int ssk_state;
+
+ if (!err)
+ return false;
+
+ /* only propagate errors on fallen-back sockets or
+ * on MPC connect
+ */
+ if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(mptcp_sk(sk)))
+ return false;
+
+ /* We need to propagate only transition to CLOSE state.
+ * Orphaned socket will see such state change via
+ * subflow_sched_work_if_closed() and that path will properly
+ * destroy the msk as needed.
+ */
+ ssk_state = inet_sk_state_load(ssk);
+ if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
+ inet_sk_state_store(sk, ssk_state);
+ WRITE_ONCE(sk->sk_err, -err);
+
+ /* This barrier is coupled with smp_rmb() in mptcp_poll() */
+ smp_wmb();
+ sk_error_report(sk);
+ return true;
+}
+
void __mptcp_error_report(struct sock *sk)
{
struct mptcp_subflow_context *subflow;
struct mptcp_sock *msk = mptcp_sk(sk);
- mptcp_for_each_subflow(msk, subflow) {
- struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
- int err = sock_error(ssk);
- int ssk_state;
-
- if (!err)
- continue;
-
- /* only propagate errors on fallen-back sockets or
- * on MPC connect
- */
- if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(msk))
- continue;
-
- /* We need to propagate only transition to CLOSE state.
- * Orphaned socket will see such state change via
- * subflow_sched_work_if_closed() and that path will properly
- * destroy the msk as needed.
- */
- ssk_state = inet_sk_state_load(ssk);
- if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
- inet_sk_state_store(sk, ssk_state);
- WRITE_ONCE(sk->sk_err, -err);
-
- /* This barrier is coupled with smp_rmb() in mptcp_poll() */
- smp_wmb();
- sk_error_report(sk);
- break;
- }
+ mptcp_for_each_subflow(msk, subflow)
+ if (__mptcp_subflow_error_report(sk, mptcp_subflow_tcp_sock(subflow)))
+ break;
}
/* In most cases we will be able to lock the mptcp socket. If its already
@@ -2428,6 +2432,7 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
}
out_release:
+ __mptcp_subflow_error_report(sk, ssk);
release_sock(ssk);
sock_put(ssk);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x d5fbeff1ab812b6c473b6924bee8748469462e2c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100423-modify-dreadlock-ee1f@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d5fbeff1ab812b6c473b6924bee8748469462e2c Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Sat, 16 Sep 2023 12:52:46 +0200
Subject: [PATCH] mptcp: move __mptcp_error_report in protocol.c
This will simplify the next patch ("mptcp: process pending subflow error
on close").
No functional change intended.
Cc: stable(a)vger.kernel.org # v5.12+
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index a7fc16f5175d..915860027b1a 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -770,6 +770,42 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
return moved;
}
+void __mptcp_error_report(struct sock *sk)
+{
+ struct mptcp_subflow_context *subflow;
+ struct mptcp_sock *msk = mptcp_sk(sk);
+
+ mptcp_for_each_subflow(msk, subflow) {
+ struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+ int err = sock_error(ssk);
+ int ssk_state;
+
+ if (!err)
+ continue;
+
+ /* only propagate errors on fallen-back sockets or
+ * on MPC connect
+ */
+ if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(msk))
+ continue;
+
+ /* We need to propagate only transition to CLOSE state.
+ * Orphaned socket will see such state change via
+ * subflow_sched_work_if_closed() and that path will properly
+ * destroy the msk as needed.
+ */
+ ssk_state = inet_sk_state_load(ssk);
+ if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
+ inet_sk_state_store(sk, ssk_state);
+ WRITE_ONCE(sk->sk_err, -err);
+
+ /* This barrier is coupled with smp_rmb() in mptcp_poll() */
+ smp_wmb();
+ sk_error_report(sk);
+ break;
+ }
+}
+
/* In most cases we will be able to lock the mptcp socket. If its already
* owned, we need to defer to the work queue to avoid ABBA deadlock.
*/
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 9bf3c7bc1762..2f40c23fdb0d 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1362,42 +1362,6 @@ void mptcp_space(const struct sock *ssk, int *space, int *full_space)
*full_space = mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf));
}
-void __mptcp_error_report(struct sock *sk)
-{
- struct mptcp_subflow_context *subflow;
- struct mptcp_sock *msk = mptcp_sk(sk);
-
- mptcp_for_each_subflow(msk, subflow) {
- struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
- int err = sock_error(ssk);
- int ssk_state;
-
- if (!err)
- continue;
-
- /* only propagate errors on fallen-back sockets or
- * on MPC connect
- */
- if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(msk))
- continue;
-
- /* We need to propagate only transition to CLOSE state.
- * Orphaned socket will see such state change via
- * subflow_sched_work_if_closed() and that path will properly
- * destroy the msk as needed.
- */
- ssk_state = inet_sk_state_load(ssk);
- if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
- inet_sk_state_store(sk, ssk_state);
- WRITE_ONCE(sk->sk_err, -err);
-
- /* This barrier is coupled with smp_rmb() in mptcp_poll() */
- smp_wmb();
- sk_error_report(sk);
- break;
- }
-}
-
static void subflow_error_report(struct sock *ssk)
{
struct sock *sk = mptcp_subflow_ctx(ssk)->conn;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x d5fbeff1ab812b6c473b6924bee8748469462e2c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100421-earthly-entail-0945@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d5fbeff1ab812b6c473b6924bee8748469462e2c Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Sat, 16 Sep 2023 12:52:46 +0200
Subject: [PATCH] mptcp: move __mptcp_error_report in protocol.c
This will simplify the next patch ("mptcp: process pending subflow error
on close").
No functional change intended.
Cc: stable(a)vger.kernel.org # v5.12+
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index a7fc16f5175d..915860027b1a 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -770,6 +770,42 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
return moved;
}
+void __mptcp_error_report(struct sock *sk)
+{
+ struct mptcp_subflow_context *subflow;
+ struct mptcp_sock *msk = mptcp_sk(sk);
+
+ mptcp_for_each_subflow(msk, subflow) {
+ struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+ int err = sock_error(ssk);
+ int ssk_state;
+
+ if (!err)
+ continue;
+
+ /* only propagate errors on fallen-back sockets or
+ * on MPC connect
+ */
+ if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(msk))
+ continue;
+
+ /* We need to propagate only transition to CLOSE state.
+ * Orphaned socket will see such state change via
+ * subflow_sched_work_if_closed() and that path will properly
+ * destroy the msk as needed.
+ */
+ ssk_state = inet_sk_state_load(ssk);
+ if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
+ inet_sk_state_store(sk, ssk_state);
+ WRITE_ONCE(sk->sk_err, -err);
+
+ /* This barrier is coupled with smp_rmb() in mptcp_poll() */
+ smp_wmb();
+ sk_error_report(sk);
+ break;
+ }
+}
+
/* In most cases we will be able to lock the mptcp socket. If its already
* owned, we need to defer to the work queue to avoid ABBA deadlock.
*/
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 9bf3c7bc1762..2f40c23fdb0d 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1362,42 +1362,6 @@ void mptcp_space(const struct sock *ssk, int *space, int *full_space)
*full_space = mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf));
}
-void __mptcp_error_report(struct sock *sk)
-{
- struct mptcp_subflow_context *subflow;
- struct mptcp_sock *msk = mptcp_sk(sk);
-
- mptcp_for_each_subflow(msk, subflow) {
- struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
- int err = sock_error(ssk);
- int ssk_state;
-
- if (!err)
- continue;
-
- /* only propagate errors on fallen-back sockets or
- * on MPC connect
- */
- if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(msk))
- continue;
-
- /* We need to propagate only transition to CLOSE state.
- * Orphaned socket will see such state change via
- * subflow_sched_work_if_closed() and that path will properly
- * destroy the msk as needed.
- */
- ssk_state = inet_sk_state_load(ssk);
- if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
- inet_sk_state_store(sk, ssk_state);
- WRITE_ONCE(sk->sk_err, -err);
-
- /* This barrier is coupled with smp_rmb() in mptcp_poll() */
- smp_wmb();
- sk_error_report(sk);
- break;
- }
-}
-
static void subflow_error_report(struct sock *ssk)
{
struct sock *sk = mptcp_subflow_ctx(ssk)->conn;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x d5fbeff1ab812b6c473b6924bee8748469462e2c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100422-unnerve-zodiac-03e0@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d5fbeff1ab812b6c473b6924bee8748469462e2c Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Sat, 16 Sep 2023 12:52:46 +0200
Subject: [PATCH] mptcp: move __mptcp_error_report in protocol.c
This will simplify the next patch ("mptcp: process pending subflow error
on close").
No functional change intended.
Cc: stable(a)vger.kernel.org # v5.12+
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index a7fc16f5175d..915860027b1a 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -770,6 +770,42 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
return moved;
}
+void __mptcp_error_report(struct sock *sk)
+{
+ struct mptcp_subflow_context *subflow;
+ struct mptcp_sock *msk = mptcp_sk(sk);
+
+ mptcp_for_each_subflow(msk, subflow) {
+ struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+ int err = sock_error(ssk);
+ int ssk_state;
+
+ if (!err)
+ continue;
+
+ /* only propagate errors on fallen-back sockets or
+ * on MPC connect
+ */
+ if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(msk))
+ continue;
+
+ /* We need to propagate only transition to CLOSE state.
+ * Orphaned socket will see such state change via
+ * subflow_sched_work_if_closed() and that path will properly
+ * destroy the msk as needed.
+ */
+ ssk_state = inet_sk_state_load(ssk);
+ if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
+ inet_sk_state_store(sk, ssk_state);
+ WRITE_ONCE(sk->sk_err, -err);
+
+ /* This barrier is coupled with smp_rmb() in mptcp_poll() */
+ smp_wmb();
+ sk_error_report(sk);
+ break;
+ }
+}
+
/* In most cases we will be able to lock the mptcp socket. If its already
* owned, we need to defer to the work queue to avoid ABBA deadlock.
*/
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 9bf3c7bc1762..2f40c23fdb0d 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1362,42 +1362,6 @@ void mptcp_space(const struct sock *ssk, int *space, int *full_space)
*full_space = mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf));
}
-void __mptcp_error_report(struct sock *sk)
-{
- struct mptcp_subflow_context *subflow;
- struct mptcp_sock *msk = mptcp_sk(sk);
-
- mptcp_for_each_subflow(msk, subflow) {
- struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
- int err = sock_error(ssk);
- int ssk_state;
-
- if (!err)
- continue;
-
- /* only propagate errors on fallen-back sockets or
- * on MPC connect
- */
- if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(msk))
- continue;
-
- /* We need to propagate only transition to CLOSE state.
- * Orphaned socket will see such state change via
- * subflow_sched_work_if_closed() and that path will properly
- * destroy the msk as needed.
- */
- ssk_state = inet_sk_state_load(ssk);
- if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
- inet_sk_state_store(sk, ssk_state);
- WRITE_ONCE(sk->sk_err, -err);
-
- /* This barrier is coupled with smp_rmb() in mptcp_poll() */
- smp_wmb();
- sk_error_report(sk);
- break;
- }
-}
-
static void subflow_error_report(struct sock *ssk)
{
struct sock *sk = mptcp_subflow_ctx(ssk)->conn;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x d5fbeff1ab812b6c473b6924bee8748469462e2c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100420-spirits-clumsy-63d9@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d5fbeff1ab812b6c473b6924bee8748469462e2c Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Sat, 16 Sep 2023 12:52:46 +0200
Subject: [PATCH] mptcp: move __mptcp_error_report in protocol.c
This will simplify the next patch ("mptcp: process pending subflow error
on close").
No functional change intended.
Cc: stable(a)vger.kernel.org # v5.12+
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index a7fc16f5175d..915860027b1a 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -770,6 +770,42 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
return moved;
}
+void __mptcp_error_report(struct sock *sk)
+{
+ struct mptcp_subflow_context *subflow;
+ struct mptcp_sock *msk = mptcp_sk(sk);
+
+ mptcp_for_each_subflow(msk, subflow) {
+ struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+ int err = sock_error(ssk);
+ int ssk_state;
+
+ if (!err)
+ continue;
+
+ /* only propagate errors on fallen-back sockets or
+ * on MPC connect
+ */
+ if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(msk))
+ continue;
+
+ /* We need to propagate only transition to CLOSE state.
+ * Orphaned socket will see such state change via
+ * subflow_sched_work_if_closed() and that path will properly
+ * destroy the msk as needed.
+ */
+ ssk_state = inet_sk_state_load(ssk);
+ if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
+ inet_sk_state_store(sk, ssk_state);
+ WRITE_ONCE(sk->sk_err, -err);
+
+ /* This barrier is coupled with smp_rmb() in mptcp_poll() */
+ smp_wmb();
+ sk_error_report(sk);
+ break;
+ }
+}
+
/* In most cases we will be able to lock the mptcp socket. If its already
* owned, we need to defer to the work queue to avoid ABBA deadlock.
*/
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 9bf3c7bc1762..2f40c23fdb0d 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1362,42 +1362,6 @@ void mptcp_space(const struct sock *ssk, int *space, int *full_space)
*full_space = mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf));
}
-void __mptcp_error_report(struct sock *sk)
-{
- struct mptcp_subflow_context *subflow;
- struct mptcp_sock *msk = mptcp_sk(sk);
-
- mptcp_for_each_subflow(msk, subflow) {
- struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
- int err = sock_error(ssk);
- int ssk_state;
-
- if (!err)
- continue;
-
- /* only propagate errors on fallen-back sockets or
- * on MPC connect
- */
- if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(msk))
- continue;
-
- /* We need to propagate only transition to CLOSE state.
- * Orphaned socket will see such state change via
- * subflow_sched_work_if_closed() and that path will properly
- * destroy the msk as needed.
- */
- ssk_state = inet_sk_state_load(ssk);
- if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
- inet_sk_state_store(sk, ssk_state);
- WRITE_ONCE(sk->sk_err, -err);
-
- /* This barrier is coupled with smp_rmb() in mptcp_poll() */
- smp_wmb();
- sk_error_report(sk);
- break;
- }
-}
-
static void subflow_error_report(struct sock *ssk)
{
struct sock *sk = mptcp_subflow_ctx(ssk)->conn;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 0df9dab891ff0d9b646d82e4fe038229e4c02451
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100425-avoid-alkaline-3fd9@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0df9dab891ff0d9b646d82e4fe038229e4c02451 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Fri, 15 Sep 2023 17:39:15 -0700
Subject: [PATCH] KVM: x86/mmu: Stop zapping invalidated TDP MMU roots
asynchronously
Stop zapping invalidate TDP MMU roots via work queue now that KVM
preserves TDP MMU roots until they are explicitly invalidated. Zapping
roots asynchronously was effectively a workaround to avoid stalling a vCPU
for an extended during if a vCPU unloaded a root, which at the time
happened whenever the guest toggled CR0.WP (a frequent operation for some
guest kernels).
While a clever hack, zapping roots via an unbound worker had subtle,
unintended consequences on host scheduling, especially when zapping
multiple roots, e.g. as part of a memslot. Because the work of zapping a
root is no longer bound to the task that initiated the zap, things like
the CPU affinity and priority of the original task get lost. Losing the
affinity and priority can be especially problematic if unbound workqueues
aren't affined to a small number of CPUs, as zapping multiple roots can
cause KVM to heavily utilize the majority of CPUs in the system, *beyond*
the CPUs KVM is already using to run vCPUs.
When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
zap can result in KVM occupying all logical CPUs for ~8ms, and result in
high priority tasks not being scheduled in in a timely manner. In v5.15,
which doesn't preserve unloaded roots, the issues were even more noticeable
as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.
Consuming all CPUs for an extended duration can lead to significant jitter
throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
is a semi-frequent operation as memslots are deleted and recreated with
different host virtual addresses to react to host GPU drivers allocating
and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips
during games due to the audio server's tasks not getting scheduled in
promptly, despite the tasks having a high realtime priority.
Deleting memslots isn't exactly a fast path and should be avoided when
possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
memslot shenanigans, but KVM is squarely in the wrong. Not to mention
that removing the async zapping eliminates a non-trivial amount of
complexity.
Note, one of the subtle behaviors hidden behind the async zapping is that
KVM would zap invalidated roots only once (ignoring partial zaps from
things like mmu_notifier events). Preserve this behavior by adding a flag
to identify roots that are scheduled to be zapped versus roots that have
already been zapped but not yet freed.
Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
encounter invalid roots, as it's not at all obvious why zapping
invalidated roots shouldn't simply zap all invalid roots.
Reported-by: Pattara Teerapong <pteerapong(a)google.com>
Cc: David Stevens <stevensd(a)google.com>
Cc: Yiwei Zhang<zzyiwei(a)google.com>
Cc: Paul Hsia <paulhsia(a)google.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20230916003916.2545000-4-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a4def36d5bb..17715cb8731d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1419,7 +1419,6 @@ struct kvm_arch {
* the thread holds the MMU lock in write mode.
*/
spinlock_t tdp_mmu_pages_lock;
- struct workqueue_struct *tdp_mmu_zap_wq;
#endif /* CONFIG_X86_64 */
/*
@@ -1835,7 +1834,7 @@ void kvm_mmu_vendor_module_exit(void);
void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
int kvm_mmu_create(struct kvm_vcpu *vcpu);
-int kvm_mmu_init_vm(struct kvm *kvm);
+void kvm_mmu_init_vm(struct kvm *kvm);
void kvm_mmu_uninit_vm(struct kvm *kvm);
void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 54f94f644b42..f7901cb4d2fa 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6167,20 +6167,15 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
}
-int kvm_mmu_init_vm(struct kvm *kvm)
+void kvm_mmu_init_vm(struct kvm *kvm)
{
- int r;
-
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
- if (tdp_mmu_enabled) {
- r = kvm_mmu_init_tdp_mmu(kvm);
- if (r < 0)
- return r;
- }
+ if (tdp_mmu_enabled)
+ kvm_mmu_init_tdp_mmu(kvm);
kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
@@ -6189,8 +6184,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
-
- return 0;
}
static void mmu_free_vm_memory_caches(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b102014e2c60..decc1f153669 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -58,7 +58,12 @@ struct kvm_mmu_page {
bool tdp_mmu_page;
bool unsync;
- u8 mmu_valid_gen;
+ union {
+ u8 mmu_valid_gen;
+
+ /* Only accessed under slots_lock. */
+ bool tdp_mmu_scheduled_root_to_zap;
+ };
/*
* The shadow page can't be replaced by an equivalent huge page
@@ -100,13 +105,7 @@ struct kvm_mmu_page {
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
tdp_ptep_t ptep;
};
- union {
- DECLARE_BITMAP(unsync_child_bitmap, 512);
- struct {
- struct work_struct tdp_mmu_async_work;
- void *tdp_mmu_async_data;
- };
- };
+ DECLARE_BITMAP(unsync_child_bitmap, 512);
/*
* Tracks shadow pages that, if zapped, would allow KVM to create an NX
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index aa90901d2871..6cd4dd631a2f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -12,18 +12,10 @@
#include <trace/events/kvm.h>
/* Initializes the TDP MMU for the VM, if enabled. */
-int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
+void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
{
- struct workqueue_struct *wq;
-
- wq = alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
- if (!wq)
- return -ENOMEM;
-
INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
- kvm->arch.tdp_mmu_zap_wq = wq;
- return 1;
}
/* Arbitrarily returns true so that this may be used in if statements. */
@@ -46,20 +38,15 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
* ultimately frees all roots.
*/
kvm_tdp_mmu_invalidate_all_roots(kvm);
-
- /*
- * Destroying a workqueue also first flushes the workqueue, i.e. no
- * need to invoke kvm_tdp_mmu_zap_invalidated_roots().
- */
- destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
+ kvm_tdp_mmu_zap_invalidated_roots(kvm);
WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
/*
* Ensure that all the outstanding RCU callbacks to free shadow pages
- * can run before the VM is torn down. Work items on tdp_mmu_zap_wq
- * can call kvm_tdp_mmu_put_root and create new callbacks.
+ * can run before the VM is torn down. Putting the last reference to
+ * zapped roots will create new callbacks.
*/
rcu_barrier();
}
@@ -86,46 +73,6 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
tdp_mmu_free_sp(sp);
}
-static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
- bool shared);
-
-static void tdp_mmu_zap_root_work(struct work_struct *work)
-{
- struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
- tdp_mmu_async_work);
- struct kvm *kvm = root->tdp_mmu_async_data;
-
- read_lock(&kvm->mmu_lock);
-
- /*
- * A TLB flush is not necessary as KVM performs a local TLB flush when
- * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
- * to a different pCPU. Note, the local TLB flush on reuse also
- * invalidates any paging-structure-cache entries, i.e. TLB entries for
- * intermediate paging structures, that may be zapped, as such entries
- * are associated with the ASID on both VMX and SVM.
- */
- tdp_mmu_zap_root(kvm, root, true);
-
- /*
- * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
- * avoiding an infinite loop. By design, the root is reachable while
- * it's being asynchronously zapped, thus a different task can put its
- * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
- * asynchronously zapped root is unavoidable.
- */
- kvm_tdp_mmu_put_root(kvm, root, true);
-
- read_unlock(&kvm->mmu_lock);
-}
-
-static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
-{
- root->tdp_mmu_async_data = kvm;
- INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
- queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
-}
-
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
@@ -211,11 +158,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)
-#define for_each_tdp_mmu_root_yield_safe(_kvm, _root) \
- for (_root = tdp_mmu_next_root(_kvm, NULL, false, false); \
+#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _shared) \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, false); \
_root; \
- _root = tdp_mmu_next_root(_kvm, _root, false, false)) \
- if (!kvm_lockdep_assert_mmu_lock_held(_kvm, false)) { \
+ _root = tdp_mmu_next_root(_kvm, _root, _shared, false)) \
+ if (!kvm_lockdep_assert_mmu_lock_held(_kvm, _shared)) { \
} else
/*
@@ -296,7 +243,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
* by a memslot update or by the destruction of the VM. Initialize the
* refcount to two; one reference for the vCPU, and one reference for
* the TDP MMU itself, which is held until the root is invalidated and
- * is ultimately put by tdp_mmu_zap_root_work().
+ * is ultimately put by kvm_tdp_mmu_zap_invalidated_roots().
*/
refcount_set(&root->tdp_mmu_root_count, 2);
@@ -885,7 +832,7 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
{
struct kvm_mmu_page *root;
- for_each_tdp_mmu_root_yield_safe(kvm, root)
+ for_each_tdp_mmu_root_yield_safe(kvm, root, false)
flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
return flush;
@@ -907,7 +854,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
* is being destroyed or the userspace VMM has exited. In both cases,
* KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
*/
- for_each_tdp_mmu_root_yield_safe(kvm, root)
+ for_each_tdp_mmu_root_yield_safe(kvm, root, false)
tdp_mmu_zap_root(kvm, root, false);
}
@@ -917,18 +864,47 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
*/
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
{
- flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
+ struct kvm_mmu_page *root;
+
+ read_lock(&kvm->mmu_lock);
+
+ for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
+ if (!root->tdp_mmu_scheduled_root_to_zap)
+ continue;
+
+ root->tdp_mmu_scheduled_root_to_zap = false;
+ KVM_BUG_ON(!root->role.invalid, kvm);
+
+ /*
+ * A TLB flush is not necessary as KVM performs a local TLB
+ * flush when allocating a new root (see kvm_mmu_load()), and
+ * when migrating a vCPU to a different pCPU. Note, the local
+ * TLB flush on reuse also invalidates paging-structure-cache
+ * entries, i.e. TLB entries for intermediate paging structures,
+ * that may be zapped, as such entries are associated with the
+ * ASID on both VMX and SVM.
+ */
+ tdp_mmu_zap_root(kvm, root, true);
+
+ /*
+ * The referenced needs to be put *after* zapping the root, as
+ * the root must be reachable by mmu_notifiers while it's being
+ * zapped
+ */
+ kvm_tdp_mmu_put_root(kvm, root, true);
+ }
+
+ read_unlock(&kvm->mmu_lock);
}
/*
* Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
* is about to be zapped, e.g. in response to a memslots update. The actual
- * zapping is performed asynchronously. Using a separate workqueue makes it
- * easy to ensure that the destruction is performed before the "fast zap"
- * completes, without keeping a separate list of invalidated roots; the list is
- * effectively the list of work items in the workqueue.
+ * zapping is done separately so that it happens with mmu_lock with read,
+ * whereas invalidating roots must be done with mmu_lock held for write (unless
+ * the VM is being destroyed).
*
- * Note, the asynchronous worker is gifted the TDP MMU's reference.
+ * Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
* See kvm_tdp_mmu_get_vcpu_root_hpa().
*/
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
@@ -953,19 +929,20 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
/*
* As above, mmu_lock isn't held when destroying the VM! There can't
* be other references to @kvm, i.e. nothing else can invalidate roots
- * or be consuming roots, but walking the list of roots does need to be
- * guarded against roots being deleted by the asynchronous zap worker.
+ * or get/put references to roots.
*/
- rcu_read_lock();
-
- list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+ list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+ /*
+ * Note, invalid roots can outlive a memslot update! Invalid
+ * roots must be *zapped* before the memslot update completes,
+ * but a different task can acquire a reference and keep the
+ * root alive after its been zapped.
+ */
if (!root->role.invalid) {
+ root->tdp_mmu_scheduled_root_to_zap = true;
root->role.invalid = true;
- tdp_mmu_schedule_zap_root(kvm, root);
}
}
-
- rcu_read_unlock();
}
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index bc088953f929..733a3aef3a96 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -7,7 +7,7 @@
#include "spte.h"
-int kvm_mmu_init_tdp_mmu(struct kvm *kvm);
+void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..9f18b06bbda6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12308,9 +12308,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
if (ret)
goto out;
- ret = kvm_mmu_init_vm(kvm);
- if (ret)
- goto out_page_track;
+ kvm_mmu_init_vm(kvm);
ret = static_call(kvm_x86_vm_init)(kvm);
if (ret)
@@ -12355,7 +12353,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
out_uninit_mmu:
kvm_mmu_uninit_vm(kvm);
-out_page_track:
kvm_page_track_cleanup(kvm);
out:
return ret;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 1527b076ae2cb6a9c590a02725ed39399fcad1cf
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100458-lullaby-relearn-726c@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
1527b076ae2c ("spi: zynqmp-gqspi: fix clock imbalance on probe failure")
3ffefa1d9c9e ("spi: zynqmp-gqspi: Convert to platform remove callback returning void")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1527b076ae2cb6a9c590a02725ed39399fcad1cf Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan+linaro(a)kernel.org>
Date: Thu, 22 Jun 2023 10:24:35 +0200
Subject: [PATCH] spi: zynqmp-gqspi: fix clock imbalance on probe failure
Make sure that the device is not runtime suspended before explicitly
disabling the clocks on probe failure and on driver unbind to avoid a
clock enable-count imbalance.
Fixes: 9e3a000362ae ("spi: zynqmp: Add pm runtime support")
Cc: stable(a)vger.kernel.org # 4.19
Cc: Naga Sureshkumar Relli <naga.sureshkumar.relli(a)xilinx.com>
Cc: Shubhrajyoti Datta <shubhrajyoti.datta(a)xilinx.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
Link: https://lore.kernel.org/r/Message-Id: <20230622082435.7873-1-johan+linaro(a)kernel.org>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-zynqmp-gqspi.c b/drivers/spi/spi-zynqmp-gqspi.c
index fb2ca9b90eab..c309dedfd602 100644
--- a/drivers/spi/spi-zynqmp-gqspi.c
+++ b/drivers/spi/spi-zynqmp-gqspi.c
@@ -1342,9 +1342,9 @@ static int zynqmp_qspi_probe(struct platform_device *pdev)
return 0;
clk_dis_all:
- pm_runtime_put_sync(&pdev->dev);
- pm_runtime_set_suspended(&pdev->dev);
pm_runtime_disable(&pdev->dev);
+ pm_runtime_put_noidle(&pdev->dev);
+ pm_runtime_set_suspended(&pdev->dev);
clk_disable_unprepare(xqspi->refclk);
clk_dis_pclk:
clk_disable_unprepare(xqspi->pclk);
@@ -1368,11 +1368,15 @@ static void zynqmp_qspi_remove(struct platform_device *pdev)
{
struct zynqmp_qspi *xqspi = platform_get_drvdata(pdev);
+ pm_runtime_get_sync(&pdev->dev);
+
zynqmp_gqspi_write(xqspi, GQSPI_EN_OFST, 0x0);
+
+ pm_runtime_disable(&pdev->dev);
+ pm_runtime_put_noidle(&pdev->dev);
+ pm_runtime_set_suspended(&pdev->dev);
clk_disable_unprepare(xqspi->refclk);
clk_disable_unprepare(xqspi->pclk);
- pm_runtime_set_suspended(&pdev->dev);
- pm_runtime_disable(&pdev->dev);
}
MODULE_DEVICE_TABLE(of, zynqmp_qspi_of_match);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 1527b076ae2cb6a9c590a02725ed39399fcad1cf
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023100450-unnamable-reemerge-dcef@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
1527b076ae2c ("spi: zynqmp-gqspi: fix clock imbalance on probe failure")
3ffefa1d9c9e ("spi: zynqmp-gqspi: Convert to platform remove callback returning void")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1527b076ae2cb6a9c590a02725ed39399fcad1cf Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan+linaro(a)kernel.org>
Date: Thu, 22 Jun 2023 10:24:35 +0200
Subject: [PATCH] spi: zynqmp-gqspi: fix clock imbalance on probe failure
Make sure that the device is not runtime suspended before explicitly
disabling the clocks on probe failure and on driver unbind to avoid a
clock enable-count imbalance.
Fixes: 9e3a000362ae ("spi: zynqmp: Add pm runtime support")
Cc: stable(a)vger.kernel.org # 4.19
Cc: Naga Sureshkumar Relli <naga.sureshkumar.relli(a)xilinx.com>
Cc: Shubhrajyoti Datta <shubhrajyoti.datta(a)xilinx.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
Link: https://lore.kernel.org/r/Message-Id: <20230622082435.7873-1-johan+linaro(a)kernel.org>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/drivers/spi/spi-zynqmp-gqspi.c b/drivers/spi/spi-zynqmp-gqspi.c
index fb2ca9b90eab..c309dedfd602 100644
--- a/drivers/spi/spi-zynqmp-gqspi.c
+++ b/drivers/spi/spi-zynqmp-gqspi.c
@@ -1342,9 +1342,9 @@ static int zynqmp_qspi_probe(struct platform_device *pdev)
return 0;
clk_dis_all:
- pm_runtime_put_sync(&pdev->dev);
- pm_runtime_set_suspended(&pdev->dev);
pm_runtime_disable(&pdev->dev);
+ pm_runtime_put_noidle(&pdev->dev);
+ pm_runtime_set_suspended(&pdev->dev);
clk_disable_unprepare(xqspi->refclk);
clk_dis_pclk:
clk_disable_unprepare(xqspi->pclk);
@@ -1368,11 +1368,15 @@ static void zynqmp_qspi_remove(struct platform_device *pdev)
{
struct zynqmp_qspi *xqspi = platform_get_drvdata(pdev);
+ pm_runtime_get_sync(&pdev->dev);
+
zynqmp_gqspi_write(xqspi, GQSPI_EN_OFST, 0x0);
+
+ pm_runtime_disable(&pdev->dev);
+ pm_runtime_put_noidle(&pdev->dev);
+ pm_runtime_set_suspended(&pdev->dev);
clk_disable_unprepare(xqspi->refclk);
clk_disable_unprepare(xqspi->pclk);
- pm_runtime_set_suspended(&pdev->dev);
- pm_runtime_disable(&pdev->dev);
}
MODULE_DEVICE_TABLE(of, zynqmp_qspi_of_match);
This sets SB_POSIXACL only if ACL support is really enabled, instead
of always setting SB_POSIXACL if the NFS protocol version
theoretically supports ACL.
The code comment says "We will [apply the umask] ourselves", but that
happens in posix_acl_create() only if the kernel has POSIX ACL
support. Without it, posix_acl_create() is an empty dummy function.
So let's not pretend we will apply the umask if we can already know
that we will never.
This fixes a problem where the umask is always ignored in the NFS
client when compiled without CONFIG_FS_POSIX_ACL. This is a 4 year
old regression caused by commit 013cdf1088d723 which itself was not
completely wrong, but failed to consider all the side effects by
misdesigned VFS code.
Reviewed-by: J. Bruce Fields <bfields(a)redhat.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann(a)ionos.com>
---
fs/nfs/super.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 0d6473cb00cb..051986b422b0 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -1064,14 +1064,19 @@ static void nfs_fill_super(struct super_block *sb, struct nfs_fs_context *ctx)
* The VFS shouldn't apply the umask to mode bits.
* We will do so ourselves when necessary.
*/
- sb->s_flags |= SB_POSIXACL;
+ if (NFS_SB(sb)->caps & NFS_CAP_ACLS) {
+ sb->s_flags |= SB_POSIXACL;
+ }
+
sb->s_time_gran = 1;
sb->s_time_min = 0;
sb->s_time_max = U32_MAX;
sb->s_export_op = &nfs_export_ops;
break;
case 4:
- sb->s_flags |= SB_POSIXACL;
+ if (NFS_SB(sb)->caps & NFS_CAP_ACLS) {
+ sb->s_flags |= SB_POSIXACL;
+ }
sb->s_time_gran = 1;
sb->s_time_min = S64_MIN;
sb->s_time_max = S64_MAX;
--
2.39.2
Hi stable maintainers,
The following upstream commits fix a couple of overlayfs handling problems
in Smack. Could you please cherry-pick them for 4.19.y, 5.4.y, 5.10.y,
5.15.y and 6.1.y?
#3 2c085f3a8f23 smack: Record transmuting in smk_transmuted
#2 3a3d8fce31a4 smack: Retrieve transmuting information in smack_inode_getsecurity()
#1 387ef964460f Smack:- Use overlay inode label in smack_inode_copy_up() # not needed for 6.1.y
Note that 4.19.y needs some adjustments. I'll send backported patches
separately.
I'm not an author of these commits, but have been hitting the problems with
multiple kernels based on the trees and it seems worth cherry-picking.
Perhaps, it's better to add a tag in cherry-picked commits:
Fixes: d6d80cb57be4 ("Smack: Base support for overlayfs")
Regards,
Munehisa
Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with
a soft reset on Renesas SoCs") fixes an issue with Renesas own SJA1000
CAN controller reception: the Rx buffer is only 5 messages long, so when
the bus loaded (eg. a message every 50us), overrun may easily
happen. Upon an overrun situation, due to a possible internal crosstalk
situation, the controller enters a frozen state which only can be
unlocked with a soft reset (experimentally). The solution was to offload
a call to sja1000_start() in a threaded handler. This needs to happen in
process context as this operation requires to sleep. sja1000_start()
basically enters "reset mode", performs a proper software reset and
returns back into "normal mode".
Since this fix was introduced, we no longer observe any stalls in
reception. However it was sporadically observed that the transmit path
would now freeze. Further investigation blamed the fix mentioned above,
and especially the reset operation. Reproducing the reset in a loop
helped identifying what could possibly go wrong. The sja1000 is a single
Tx queue device, which leverages the netdev helpers to process one Tx
message at a time. The logic is: the queue is stopped, the message sent
to the transceiver, once properly transmitted the controller sets a
status bit which triggers an interrupt, in the interrupt handler the
transmission status is checked and the queue woken up. Unfortunately, if
an overrun happens, we might perform the soft reset precisely between
the transmission of the buffer to the transceiver and the advent of the
transmission status bit. We would then stop the transmission operation
without re-enabling the queue, leading to all further transmissions to
be ignored.
The reset interrupt can only happen while the device is "open", and
after a reset we anyway want to resume normal operations, no matter if a
packet to transmit got dropped in the process, so we shall wake up the
queue. Restarting the device and waking-up the queue is exactly what
sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about
the queue state, we must acquire a lock both in the reset handler and in
the transmit path to ensure serialization of both operations. It turns
out, a lock is already held when entering the transmit path, so we can
just acquire/release it as well with the regular net helpers inside the
threaded interrupt handler and this way we should be safe. As the
reset handler might still be called after the transmission of a frame to
the transceiver but before it actually gets transmitted, we must ensure
we don't leak the skb, so we free it (the behavior is consistent, no
matter if there was an skb on the stack or not).
Fixes: 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs")
Cc: stable(a)vger.kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
---
Changes in v3:
* Fix new implementation by just acquiring the tx lock when required.
Changes in v2:
* As Marc sugested, use netif_tx_{,un}lock() instead of our own
spin_lock.
drivers/net/can/sja1000/sja1000.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
index ae47fc72aa96..9531684d47cd 100644
--- a/drivers/net/can/sja1000/sja1000.c
+++ b/drivers/net/can/sja1000/sja1000.c
@@ -396,7 +396,13 @@ static irqreturn_t sja1000_reset_interrupt(int irq, void *dev_id)
struct net_device *dev = (struct net_device *)dev_id;
netdev_dbg(dev, "performing a soft reset upon overrun\n");
- sja1000_start(dev);
+
+ netif_tx_lock(dev);
+
+ can_free_echo_skb(dev, 0);
+ sja1000_set_mode(dev, CAN_MODE_START);
+
+ netif_tx_unlock(dev);
return IRQ_HANDLED;
}
--
2.34.1
Hi,
Could we please backport commit:
6d2779ecaeb56f92d7105c56772346c71c88c278
("locking/atomic: scripts: fix fallback ifdeffery")
... to the 6.5.y stable tree?
I forgot to Cc stable when I submitted the original patch, and had (mistakenly)
assumed that the Fixes tag was sufficient.
The patch fixes a dentry cache corruption issue observed on arm64 and which is
in theory possible on other architectures. I've recevied an off-list report
from someone who's hit the issue on the v6.5.y tree specifically.
Thanks,
Mark
On Wed, Oct 04, 2023 at 03:26:42PM +0900, Woo-kwang Lee wrote:
> This issue occurs when connecting Galaxy S22 and abnormal SEC Dex Adapter.
> When the abnormal adapter is connected, kernel panic always occurs after a
> few seconds.
> This occurs due to unable to get BOS descriptor, usb_release_bos_descriptor
> set dev->bos = NULL.
>
> - usb_reset_and_verify_device
> - hub_port_init
> - usb_release_bos_descriptor
> - dev->bos = NULL;
>
> hub_port_connect_change() calls portspeed(), and portspeed() calls hub_is_s
> uperspeedplus().
> Finally, hub_is_superspeedplus() calls hdev->bos->ssp_cap.
> It needs to check hdev->bos is NULL to prevent a kernel panic.
>
> usb 3-1: new SuperSpeed Gen 1 USB device number 16 using xhci-hcd-exynos
> usb 3-1: unable to get BOS descriptor set
> usb 3-1: Product: USB3.0 Hub
> Unable to handle kernel NULL pointer dereference at virtual address 0000018
>
> Call trace:
> hub_port_connect_change+0x8c/0x538
> port_event+0x244/0x764
> hub_event+0x158/0x474
> process_one_work+0x204/0x550
> worker_thread+0x28c/0x580
> kthread+0x13c/0x178
> ret_from_fork+0x10/0x30
>
> - hub_port_connect_change
> - portspeed
> - hub_is_superspeedplus
>
> Fixes: 0cdd49a1d1a4 ("usb: Support USB 3.1 extended port status request")
> Signed-off-by: Woo-kwang Lee <wookwang.lee(a)samsung.com>
> ---
> drivers/usb/core/hub.h | 2 ++
> 1 file changed, 2 insertions(+)
Are you sure this isn't already fixed by commit f74a7afc224a ("usb: hub:
Guard against accesses to uninitialized BOS descriptors") in linux-next?
thanks,
greg k-h
The MediaTek DRM driver implements GEM PRIME vmap by fetching the
sg_table for the object, iterating through the pages, and then
vmapping them. In essence, unlike the GEM DMA helpers which vmap
when the object is first created or imported, the MediaTek version
does it on request.
Unfortunately, the code never correctly frees the sg_table contents.
This results in a kernel memory leak. On a Hayato device with a text
console on the internal display, this results in the system running
out of memory in a few days from all the console screen cursor updates.
Add sg_free_table() to correctly free the contents of the sg_table. This
was missing despite explicitly required by mtk_gem_prime_get_sg_table().
Fixes: 3df64d7b0a4f ("drm/mediatek: Implement gem prime vmap/vunmap function")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Chen-Yu Tsai <wenst(a)chromium.org>
---
Please merge for v6.6 fixes.
Also, I was wondering why the MediaTek DRM driver implements a lot of
the GEM functionality itself, instead of using the GEM DMA helpers.
From what I could tell, the code closely follows the DMA helpers, except
that it vmaps the buffers only upon request.
drivers/gpu/drm/mediatek/mtk_drm_gem.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/mediatek/mtk_drm_gem.c b/drivers/gpu/drm/mediatek/mtk_drm_gem.c
index 9f364df52478..297ee090e02e 100644
--- a/drivers/gpu/drm/mediatek/mtk_drm_gem.c
+++ b/drivers/gpu/drm/mediatek/mtk_drm_gem.c
@@ -239,6 +239,7 @@ int mtk_drm_gem_prime_vmap(struct drm_gem_object *obj, struct iosys_map *map)
npages = obj->size >> PAGE_SHIFT;
mtk_gem->pages = kcalloc(npages, sizeof(*mtk_gem->pages), GFP_KERNEL);
if (!mtk_gem->pages) {
+ sg_free_table(sgt);
kfree(sgt);
return -ENOMEM;
}
@@ -248,11 +249,13 @@ int mtk_drm_gem_prime_vmap(struct drm_gem_object *obj, struct iosys_map *map)
mtk_gem->kvaddr = vmap(mtk_gem->pages, npages, VM_MAP,
pgprot_writecombine(PAGE_KERNEL));
if (!mtk_gem->kvaddr) {
+ sg_free_table(sgt);
kfree(sgt);
kfree(mtk_gem->pages);
return -ENOMEM;
}
out:
+ sg_free_table(sgt);
kfree(sgt);
iosys_map_set_vaddr(map, mtk_gem->kvaddr);
--
2.42.0.582.g8ccd20d70d-goog
Hello!
I bought a new laptop fujitsu life book and everything is going fine
on artix just the bt makes trouble:
/var/log/error.log
Sep 30 18:43:48 nexus bluetoothd[2266]:
src/adapter.c:reset_adv_monitors_complete() Failed to reset Adv
Monitors: Failed (0x03)
Sep 30 18:43:48 nexus bluetoothd[2266]: Failed to clear UUIDs: Failed (0x03)
Sep 30 18:43:48 nexus bluetoothd[2266]: Failed to add UUID: Failed (0x03)
Sep 30 18:43:48 nexus bluetoothd[2266]: Failed to add UUID: Failed (0x03)
i searched a bit the webs and found a new commit at kernel org that
does do the trouble:
https://bugs.archlinux.org/task/78980
follow the linkeys inside the commits there or read this one:
---------------before------------------------------------
/* interface numbers are hardcoded in the spec */
if (intf->cur_altsetting->desc.bInterfaceNumber != 0) {
if (!(id->driver_info & BTUSB_IFNUM_2))
return -ENODEV;
if (intf->cur_altsetting->desc.bInterfaceNumber != 2)
return -ENODEV;
}
-----------after----------------------------------------------------
if ((id->driver_info & BTUSB_IFNUM_2) &&
(intf->cur_altsetting->desc.bInterfaceNumber != 0) &&
(intf->cur_altsetting->desc.bInterfaceNumber != 2))
return -ENODEV;
--------------------------------------------------------
the dude just hooked up 3 conditions in a row with && where before it
was 2 conditions in 1 condition. + the comment was removed.
please reconsider this commit.
Yours
E
The patch titled
Subject: hugetlbfs: close race between MADV_DONTNEED and page fault
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
hugetlbfs-close-race-between-madv_dontneed-and-page-fault.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Rik van Riel <riel(a)surriel.com>
Subject: hugetlbfs: close race between MADV_DONTNEED and page fault
Date: Tue, 3 Oct 2023 23:25:04 -0400
Malloc libraries, like jemalloc and tcalloc, take decisions on when to
call madvise independently from the code in the main application.
This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.
Usually this is harmless, because we always have some 4kB pages sitting
around to satisfy a page fault. However, with hugetlbfs systems often
allocate only the exact number of huge pages that the application wants.
Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:
CPU 1 CPU 2
MADV_DONTNEED
unmap page
shoot down TLB entry
page fault
fail to allocate a huge page
killed with SIGBUS
free page
Fix that race by pulling the locking from __unmap_hugepage_final_range
into helper functions called from zap_page_range_single. This ensures
page faults stay locked out of the MADV_DONTNEED VMA until the huge pages
have actually been freed.
Link: https://lkml.kernel.org/r/20231004032814.3108383-3-riel@surriel.com
Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
Signed-off-by: Rik van Riel <riel(a)surriel.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb.h | 35 +++++++++++++++++++++++++++++++++--
mm/hugetlb.c | 28 ++++++++++++++++------------
mm/memory.c | 13 ++++++++-----
3 files changed, 57 insertions(+), 19 deletions(-)
--- a/include/linux/hugetlb.h~hugetlbfs-close-race-between-madv_dontneed-and-page-fault
+++ a/include/linux/hugetlb.h
@@ -139,7 +139,7 @@ struct page *hugetlb_follow_page_mask(st
void unmap_hugepage_range(struct vm_area_struct *,
unsigned long, unsigned long, struct page *,
zap_flags_t);
-void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+void __unmap_hugepage_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct page *ref_page, zap_flags_t zap_flags);
@@ -246,6 +246,25 @@ int huge_pmd_unshare(struct mm_struct *m
void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end);
+extern void __hugetlb_zap_begin(struct vm_area_struct *vma,
+ unsigned long *begin, unsigned long *end);
+extern void __hugetlb_zap_end(struct vm_area_struct *vma,
+ struct zap_details *details);
+
+static inline void hugetlb_zap_begin(struct vm_area_struct *vma,
+ unsigned long *start, unsigned long *end)
+{
+ if (is_vm_hugetlb_page(vma))
+ __hugetlb_zap_begin(vma, start, end);
+}
+
+static inline void hugetlb_zap_end(struct vm_area_struct *vma,
+ struct zap_details *details)
+{
+ if (is_vm_hugetlb_page(vma))
+ __hugetlb_zap_end(vma, details);
+}
+
void hugetlb_vma_lock_read(struct vm_area_struct *vma);
void hugetlb_vma_unlock_read(struct vm_area_struct *vma);
void hugetlb_vma_lock_write(struct vm_area_struct *vma);
@@ -298,6 +317,18 @@ static inline void adjust_range_if_pmd_s
{
}
+static inline void hugetlb_zap_begin(
+ struct vm_area_struct *vma,
+ unsigned long *start, unsigned long *end)
+{
+}
+
+static inline void hugetlb_zap_end(
+ struct vm_area_struct *vma,
+ struct zap_details *details)
+{
+}
+
static inline struct page *hugetlb_follow_page_mask(
struct vm_area_struct *vma, unsigned long address, unsigned int flags,
unsigned int *page_mask)
@@ -443,7 +474,7 @@ static inline long hugetlb_change_protec
return 0;
}
-static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
unsigned long end, struct page *ref_page,
zap_flags_t zap_flags)
--- a/mm/hugetlb.c~hugetlbfs-close-race-between-madv_dontneed-and-page-fault
+++ a/mm/hugetlb.c
@@ -5348,9 +5348,9 @@ int move_hugetlb_page_tables(struct vm_a
return len + old_addr - old_end;
}
-static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- struct page *ref_page, zap_flags_t zap_flags)
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ struct page *ref_page, zap_flags_t zap_flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -5479,16 +5479,19 @@ static void __unmap_hugepage_range(struc
tlb_flush_mmu_tlbonly(tlb);
}
-void __unmap_hugepage_range_final(struct mmu_gather *tlb,
- struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page,
- zap_flags_t zap_flags)
+void __hugetlb_zap_begin(struct vm_area_struct *vma,
+ unsigned long *start, unsigned long *end)
{
+ adjust_range_if_pmd_sharing_possible(vma, start, end);
hugetlb_vma_lock_write(vma);
- i_mmap_lock_write(vma->vm_file->f_mapping);
+ if (vma->vm_file)
+ i_mmap_lock_write(vma->vm_file->f_mapping);
+}
- /* mmu notification performed in caller */
- __unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
+void __hugetlb_zap_end(struct vm_area_struct *vma,
+ struct zap_details *details)
+{
+ zap_flags_t zap_flags = details ? details->zap_flags : 0;
if (zap_flags & ZAP_FLAG_UNMAP) { /* final unmap */
/*
@@ -5501,11 +5504,12 @@ void __unmap_hugepage_range_final(struct
* someone else.
*/
__hugetlb_vma_unlock_write_free(vma);
- i_mmap_unlock_write(vma->vm_file->f_mapping);
} else {
- i_mmap_unlock_write(vma->vm_file->f_mapping);
hugetlb_vma_unlock_write(vma);
}
+
+ if (vma->vm_file)
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
}
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
--- a/mm/memory.c~hugetlbfs-close-race-between-madv_dontneed-and-page-fault
+++ a/mm/memory.c
@@ -1692,7 +1692,7 @@ static void unmap_single_vma(struct mmu_
if (vma->vm_file) {
zap_flags_t zap_flags = details ?
details->zap_flags : 0;
- __unmap_hugepage_range_final(tlb, vma, start, end,
+ __unmap_hugepage_range(tlb, vma, start, end,
NULL, zap_flags);
}
} else
@@ -1737,8 +1737,12 @@ void unmap_vmas(struct mmu_gather *tlb,
start_addr, end_addr);
mmu_notifier_invalidate_range_start(&range);
do {
- unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
+ unsigned long start = start_addr;
+ unsigned long end = end_addr;
+ hugetlb_zap_begin(vma, &start, &end);
+ unmap_single_vma(tlb, vma, start, end, &details,
mm_wr_locked);
+ hugetlb_zap_end(vma, &details);
} while ((vma = mas_find(mas, tree_end - 1)) != NULL);
mmu_notifier_invalidate_range_end(&range);
}
@@ -1762,9 +1766,7 @@ void zap_page_range_single(struct vm_are
lru_add_drain();
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
address, end);
- if (is_vm_hugetlb_page(vma))
- adjust_range_if_pmd_sharing_possible(vma, &range.start,
- &range.end);
+ hugetlb_zap_begin(vma, &range.start, &range.end);
tlb_gather_mmu(&tlb, vma->vm_mm);
update_hiwater_rss(vma->vm_mm);
mmu_notifier_invalidate_range_start(&range);
@@ -1775,6 +1777,7 @@ void zap_page_range_single(struct vm_are
unmap_single_vma(&tlb, vma, address, end, details, false);
mmu_notifier_invalidate_range_end(&range);
tlb_finish_mmu(&tlb);
+ hugetlb_zap_end(vma, details);
}
/**
_
Patches currently in -mm which might be from riel(a)surriel.com are
hugetlbfs-extend-hugetlb_vma_lock-to-private-vmas.patch
hugetlbfs-close-race-between-madv_dontneed-and-page-fault.patch
The patch titled
Subject: hugetlbfs: extend hugetlb_vma_lock to private VMAs
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
hugetlbfs-extend-hugetlb_vma_lock-to-private-vmas.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Rik van Riel <riel(a)surriel.com>
Subject: hugetlbfs: extend hugetlb_vma_lock to private VMAs
Date: Tue, 3 Oct 2023 23:25:03 -0400
Extend the locking scheme used to protect shared hugetlb mappings from
truncate vs page fault races, in order to protect private hugetlb mappings
(with resv_map) against MADV_DONTNEED.
Add a read-write semaphore to the resv_map data structure, and use that
from the hugetlb_vma_(un)lock_* functions, in preparation for closing the
race between MADV_DONTNEED and page faults.
Link: https://lkml.kernel.org/r/20231004032814.3108383-2-riel@surriel.com
Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
Signed-off-by: Rik van Riel <riel(a)surriel.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb.h | 6 +++++
mm/hugetlb.c | 41 ++++++++++++++++++++++++++++++++++----
2 files changed, 43 insertions(+), 4 deletions(-)
--- a/include/linux/hugetlb.h~hugetlbfs-extend-hugetlb_vma_lock-to-private-vmas
+++ a/include/linux/hugetlb.h
@@ -60,6 +60,7 @@ struct resv_map {
long adds_in_progress;
struct list_head region_cache;
long region_cache_count;
+ struct rw_semaphore rw_sema;
#ifdef CONFIG_CGROUP_HUGETLB
/*
* On private mappings, the counter to uncharge reservations is stored
@@ -1246,6 +1247,11 @@ static inline bool __vma_shareable_lock(
return (vma->vm_flags & VM_MAYSHARE) && vma->vm_private_data;
}
+static inline bool __vma_private_lock(struct vm_area_struct *vma)
+{
+ return (!(vma->vm_flags & VM_MAYSHARE)) && vma->vm_private_data;
+}
+
/*
* Safe version of huge_pte_offset() to check the locks. See comments
* above huge_pte_offset().
--- a/mm/hugetlb.c~hugetlbfs-extend-hugetlb_vma_lock-to-private-vmas
+++ a/mm/hugetlb.c
@@ -97,6 +97,7 @@ static void hugetlb_vma_lock_alloc(struc
static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
+static struct resv_map *vma_resv_map(struct vm_area_struct *vma);
static inline bool subpool_is_free(struct hugepage_subpool *spool)
{
@@ -267,6 +268,10 @@ void hugetlb_vma_lock_read(struct vm_are
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
down_read(&vma_lock->rw_sema);
+ } else if (__vma_private_lock(vma)) {
+ struct resv_map *resv_map = vma_resv_map(vma);
+
+ down_read(&resv_map->rw_sema);
}
}
@@ -276,6 +281,10 @@ void hugetlb_vma_unlock_read(struct vm_a
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
up_read(&vma_lock->rw_sema);
+ } else if (__vma_private_lock(vma)) {
+ struct resv_map *resv_map = vma_resv_map(vma);
+
+ up_read(&resv_map->rw_sema);
}
}
@@ -285,6 +294,10 @@ void hugetlb_vma_lock_write(struct vm_ar
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
down_write(&vma_lock->rw_sema);
+ } else if (__vma_private_lock(vma)) {
+ struct resv_map *resv_map = vma_resv_map(vma);
+
+ down_write(&resv_map->rw_sema);
}
}
@@ -294,17 +307,27 @@ void hugetlb_vma_unlock_write(struct vm_
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
up_write(&vma_lock->rw_sema);
+ } else if (__vma_private_lock(vma)) {
+ struct resv_map *resv_map = vma_resv_map(vma);
+
+ up_write(&resv_map->rw_sema);
}
}
int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
{
- struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
- if (!__vma_shareable_lock(vma))
- return 1;
+ if (__vma_shareable_lock(vma)) {
+ struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
- return down_write_trylock(&vma_lock->rw_sema);
+ return down_write_trylock(&vma_lock->rw_sema);
+ } else if (__vma_private_lock(vma)) {
+ struct resv_map *resv_map = vma_resv_map(vma);
+
+ return down_write_trylock(&resv_map->rw_sema);
+ }
+
+ return 1;
}
void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
@@ -313,6 +336,10 @@ void hugetlb_vma_assert_locked(struct vm
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
lockdep_assert_held(&vma_lock->rw_sema);
+ } else if (__vma_private_lock(vma)) {
+ struct resv_map *resv_map = vma_resv_map(vma);
+
+ lockdep_assert_held(&resv_map->rw_sema);
}
}
@@ -345,6 +372,11 @@ static void __hugetlb_vma_unlock_write_f
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
__hugetlb_vma_unlock_write_put(vma_lock);
+ } else if (__vma_private_lock(vma)) {
+ struct resv_map *resv_map = vma_resv_map(vma);
+
+ /* no free for anon vmas, but still need to unlock */
+ up_write(&resv_map->rw_sema);
}
}
@@ -1061,6 +1093,7 @@ struct resv_map *resv_map_alloc(void)
kref_init(&resv_map->refs);
spin_lock_init(&resv_map->lock);
INIT_LIST_HEAD(&resv_map->regions);
+ init_rwsem(&resv_map->rw_sema);
resv_map->adds_in_progress = 0;
/*
_
Patches currently in -mm which might be from riel(a)surriel.com are
hugetlbfs-extend-hugetlb_vma_lock-to-private-vmas.patch
hugetlbfs-close-race-between-madv_dontneed-and-page-fault.patch
Iain reports that USB devices can't be used to wake a Lenovo Z13 from
suspend. This occurs because on some AMD platforms, even though the Root
Ports advertise PME_Support for D3hot and D3cold, they don't handle PME
messages and generate wakeup interrupts from those states when amd-pmc has
put the platform in a hardware sleep state.
Iain reported this on an AMD Rembrandt platform, but it also affects
Phoenix SoCs. On Iain's system, a USB4 router below the affected Root Port
generates the PME. To avoid this issue, disable D3 for the root port
associated with USB4 controllers at suspend time.
Restore D3 support at resume so that it can be used by runtime suspend.
The amd-pmc driver doesn't put the platform in a hardware sleep state for
runtime suspend, so PMEs work as advertised.
Cc: stable(a)vger.kernel.org # 6.1.y: 70b70a4: PCI/sysfs: Protect driver's D3cold preference from user space
Cc: stable(a)vger.kernel.org # 6.5.y: 70b70a4: PCI/sysfs: Protect driver's D3cold preference from user space
Cc: stable(a)vger.kernel.org # 6.6.y: 70b70a4: PCI/sysfs: Protect driver's D3cold preference from user space
Link: https://learn.microsoft.com/en-us/windows-hardware/design/device-experience… [1]
Fixes: 9d26d3a8f1b0 ("PCI: Put PCIe ports into D3 during suspend")
Reported-by: Iain Lane <iain(a)orangesquash.org.uk>
Closes: https://forums.lenovo.com/t5/Ubuntu/Z13-can-t-resume-from-suspend-with-exte…
Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com>
---
v20-v21:
* Rewrite commit message, lifting most of what Bjorn clipped down to on v20.
* Use pci_d3cold_disable()/pci_d3cold_enable() instead
* Do the quirk on the USB4 controller instead of RP->USB->RP
---
drivers/pci/quirks.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index eeec1d6f9023..5674065011e7 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6188,3 +6188,47 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a31, dpc_log_size);
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_XILINX, 0x5020, of_pci_make_dev_node);
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_XILINX, 0x5021, of_pci_make_dev_node);
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REDHAT, 0x0005, of_pci_make_dev_node);
+
+#ifdef CONFIG_SUSPEND
+/*
+ * Root Ports on some AMD SoCs advertise PME_Support for D3hot and D3cold, but
+ * if the SoC is put into a hardware sleep state by the amd-pmc driver, the
+ * Root Ports don't generate wakeup interrupts for USB devices.
+ *
+ * When suspending, disable D3 support for the Root Port so we don't use it.
+ * Restore D3 support when resuming.
+ */
+static void quirk_enable_rp_d3cold(struct pci_dev *dev)
+{
+ pci_d3cold_enable(pcie_find_root_port(dev));
+}
+
+static void quirk_disable_rp_d3cold_suspend(struct pci_dev *dev)
+{
+ struct pci_dev *rp;
+
+ /*
+ * PM_SUSPEND_ON means we're doing runtime suspend, which means
+ * amd-pmc will not be involved so PMEs during D3 work as advertised.
+ *
+ * The PMEs *do* work if amd-pmc doesn't put the SoC in the hardware
+ * sleep state, but we assume amd-pmc is always present.
+ */
+ if (pm_suspend_target_state == PM_SUSPEND_ON)
+ return;
+
+ rp = pcie_find_root_port(dev);
+ pci_d3cold_disable(rp);
+ dev_info_once(&rp->dev, "quirk: disabling D3cold for suspend\n");
+}
+/* Rembrandt (yellow_carp) */
+DECLARE_PCI_FIXUP_SUSPEND(PCI_VENDOR_ID_AMD, 0x162e, quirk_disable_rp_d3cold_suspend);
+DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_AMD, 0x162e, quirk_enable_rp_d3cold);
+DECLARE_PCI_FIXUP_SUSPEND(PCI_VENDOR_ID_AMD, 0x162f, quirk_disable_rp_d3cold_suspend);
+DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_AMD, 0x162f, quirk_enable_rp_d3cold);
+/* Phoenix (pink_sardine) */
+DECLARE_PCI_FIXUP_SUSPEND(PCI_VENDOR_ID_AMD, 0x1668, quirk_disable_rp_d3cold_suspend);
+DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_AMD, 0x1668, quirk_enable_rp_d3cold);
+DECLARE_PCI_FIXUP_SUSPEND(PCI_VENDOR_ID_AMD, 0x1669, quirk_disable_rp_d3cold_suspend);
+DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_AMD, 0x1669, quirk_enable_rp_d3cold);
+#endif /* CONFIG_SUSPEND */
--
2.34.1
There is a potential race condition in amdtee_close_session that may
cause use-after-free in amdtee_open_session. For instance, if a session
has refcount == 1, and one thread tries to free this session via:
kref_put(&sess->refcount, destroy_session);
the reference count will get decremented, and the next step would be to
call destroy_session(). However, if in another thread,
amdtee_open_session() is called before destroy_session() has completed
execution, alloc_session() may return 'sess' that will be freed up
later in destroy_session() leading to use-after-free in
amdtee_open_session.
To fix this issue, treat decrement of sess->refcount and removal of
'sess' from session list in destroy_session() as a critical section, so
that it is executed atomically.
Fixes: 757cc3e9ff1d ("tee: add AMD-TEE driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Rijo Thomas <Rijo-john.Thomas(a)amd.com>
---
v2:
* Introduced kref_put_mutex() as suggested by Sumit Garg.
drivers/tee/amdtee/core.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/tee/amdtee/core.c b/drivers/tee/amdtee/core.c
index 372d64756ed6..3c15f6a9e91c 100644
--- a/drivers/tee/amdtee/core.c
+++ b/drivers/tee/amdtee/core.c
@@ -217,12 +217,12 @@ static int copy_ta_binary(struct tee_context *ctx, void *ptr, void **ta,
return rc;
}
+/* mutex must be held by caller */
static void destroy_session(struct kref *ref)
{
struct amdtee_session *sess = container_of(ref, struct amdtee_session,
refcount);
- mutex_lock(&session_list_mutex);
list_del(&sess->list_node);
mutex_unlock(&session_list_mutex);
kfree(sess);
@@ -272,7 +272,8 @@ int amdtee_open_session(struct tee_context *ctx,
if (arg->ret != TEEC_SUCCESS) {
pr_err("open_session failed %d\n", arg->ret);
handle_unload_ta(ta_handle);
- kref_put(&sess->refcount, destroy_session);
+ kref_put_mutex(&sess->refcount, destroy_session,
+ &session_list_mutex);
goto out;
}
@@ -290,7 +291,8 @@ int amdtee_open_session(struct tee_context *ctx,
pr_err("reached maximum session count %d\n", TEE_NUM_SESSIONS);
handle_close_session(ta_handle, session_info);
handle_unload_ta(ta_handle);
- kref_put(&sess->refcount, destroy_session);
+ kref_put_mutex(&sess->refcount, destroy_session,
+ &session_list_mutex);
rc = -ENOMEM;
goto out;
}
@@ -331,7 +333,7 @@ int amdtee_close_session(struct tee_context *ctx, u32 session)
handle_close_session(ta_handle, session_info);
handle_unload_ta(ta_handle);
- kref_put(&sess->refcount, destroy_session);
+ kref_put_mutex(&sess->refcount, destroy_session, &session_list_mutex);
return 0;
}
--
2.25.1
This suggests a commit (and a follow-up fix for it) from 5.16+ for stable
because it fixes the usage of watchdog.handle_boot_enabled=0 for iTCO,
closing a monitoring gap in OTA update scenarios.
These patches are applicable to and have been tested with 5.15, 5.10 and
5.4 stable heads.
Jan
Cc: Malin Jonsson <malin.jonsson(a)ericsson.com>
Mika Westerberg (2):
watchdog: iTCO_wdt: No need to stop the timer in probe
watchdog: iTCO_wdt: Set NO_REBOOT if the watchdog is not already
running
drivers/watchdog/iTCO_wdt.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)
--
2.35.3
Hi,
Please backport commit dad651b2a44e ("nvme-pci: do not set the NUMA node
of device if it has none") to stable branches 6.1, 5.15, 5.10, 5.4,
4.19, and 4.14. The commit fixes a4aea5623d4a ("NVMe: Convert to
blk-mq") which was introduced in v3.19 but I forgot to add this when
sending the patch.
--
Regards,
Pratyush Yadav
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
The Qualcomm SPMI PMIC revid implementation is broken in multiple ways.
First, it assumes that just because the sibling base device has been
registered that means that it is also bound to a driver, which may not
be the case (e.g. due to probe deferral or asynchronous probe). This
could trigger a NULL-pointer dereference when attempting to access the
driver data of the unbound device.
Second, it accesses driver data of a sibling device directly and without
any locking, which means that the driver data may be freed while it is
being accessed (e.g. on driver unbind).
Third, it leaks a struct device reference to the sibling device which is
looked up using the spmi_device_from_of() every time a function (child)
device is calling the revid function (e.g. on probe).
Fix this mess by reimplementing the revid lookup so that it is done only
at probe of the PMIC device; the base device fetches the revid info from
the hardware, while any secondary SPMI device fetches the information
from the base device and caches it so that it can be accessed safely
from its children. If the base device has not been probed yet then probe
of a secondary device is deferred.
Fixes: e9c11c6e3a0e ("mfd: qcom-spmi-pmic: expose the PMIC revid information to clients")
Cc: stable(a)vger.kernel.org # 6.0
Cc: Caleb Connolly <caleb.connolly(a)linaro.org>
Cc: Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/mfd/qcom-spmi-pmic.c | 69 +++++++++++++++++++++++++++---------
1 file changed, 53 insertions(+), 16 deletions(-)
diff --git a/drivers/mfd/qcom-spmi-pmic.c b/drivers/mfd/qcom-spmi-pmic.c
index 47738f7e492c..8e449cff5cec 100644
--- a/drivers/mfd/qcom-spmi-pmic.c
+++ b/drivers/mfd/qcom-spmi-pmic.c
@@ -30,6 +30,8 @@ struct qcom_spmi_dev {
struct qcom_spmi_pmic pmic;
};
+static DEFINE_MUTEX(pmic_spmi_revid_lock);
+
#define N_USIDS(n) ((void *)n)
static const struct of_device_id pmic_spmi_id_table[] = {
@@ -76,24 +78,21 @@ static const struct of_device_id pmic_spmi_id_table[] = {
*
* This only supports PMICs with 1 or 2 USIDs.
*/
-static struct spmi_device *qcom_pmic_get_base_usid(struct device *dev)
+static struct spmi_device *qcom_pmic_get_base_usid(struct spmi_device *sdev, struct qcom_spmi_dev *ctx)
{
- struct spmi_device *sdev;
- struct qcom_spmi_dev *ctx;
struct device_node *spmi_bus;
struct device_node *child;
int function_parent_usid, ret;
u32 pmic_addr;
- sdev = to_spmi_device(dev);
- ctx = dev_get_drvdata(&sdev->dev);
-
/*
* Quick return if the function device is already in the base
* USID. This will always be hit for PMICs with only 1 USID.
*/
- if (sdev->usid % ctx->num_usids == 0)
+ if (sdev->usid % ctx->num_usids == 0) {
+ get_device(&sdev->dev);
return sdev;
+ }
function_parent_usid = sdev->usid;
@@ -118,10 +117,8 @@ static struct spmi_device *qcom_pmic_get_base_usid(struct device *dev)
sdev = spmi_device_from_of(child);
if (!sdev) {
/*
- * If the base USID for this PMIC hasn't probed yet
- * but the secondary USID has, then we need to defer
- * the function driver so that it will attempt to
- * probe again when the base USID is ready.
+ * If the base USID for this PMIC hasn't been
+ * registered yet then we need to defer.
*/
sdev = ERR_PTR(-EPROBE_DEFER);
}
@@ -135,6 +132,35 @@ static struct spmi_device *qcom_pmic_get_base_usid(struct device *dev)
return sdev;
}
+static int pmic_spmi_get_base_revid(struct spmi_device *sdev, struct qcom_spmi_dev *ctx)
+{
+ struct qcom_spmi_dev *base_ctx;
+ struct spmi_device *base;
+ int ret = 0;
+
+ base = qcom_pmic_get_base_usid(sdev, ctx);
+ if (IS_ERR(base))
+ return PTR_ERR(base);
+
+ /*
+ * Copy revid info from base device if it has probed and is still
+ * bound to its driver.
+ */
+ mutex_lock(&pmic_spmi_revid_lock);
+ base_ctx = spmi_device_get_drvdata(base);
+ if (!base_ctx) {
+ ret = -EPROBE_DEFER;
+ goto out_unlock;
+ }
+ memcpy(&ctx->pmic, &base_ctx->pmic, sizeof(ctx->pmic));
+out_unlock:
+ mutex_unlock(&pmic_spmi_revid_lock);
+
+ put_device(&base->dev);
+
+ return ret;
+}
+
static int pmic_spmi_load_revid(struct regmap *map, struct device *dev,
struct qcom_spmi_pmic *pmic)
{
@@ -210,11 +236,7 @@ const struct qcom_spmi_pmic *qcom_pmic_get(struct device *dev)
if (!of_match_device(pmic_spmi_id_table, dev->parent))
return ERR_PTR(-EINVAL);
- sdev = qcom_pmic_get_base_usid(dev->parent);
-
- if (IS_ERR(sdev))
- return ERR_CAST(sdev);
-
+ sdev = to_spmi_device(dev->parent);
spmi = dev_get_drvdata(&sdev->dev);
return &spmi->pmic;
@@ -249,16 +271,31 @@ static int pmic_spmi_probe(struct spmi_device *sdev)
ret = pmic_spmi_load_revid(regmap, &sdev->dev, &ctx->pmic);
if (ret < 0)
return ret;
+ } else {
+ ret = pmic_spmi_get_base_revid(sdev, ctx);
+ if (ret)
+ return ret;
}
+
+ mutex_lock(&pmic_spmi_revid_lock);
spmi_device_set_drvdata(sdev, ctx);
+ mutex_unlock(&pmic_spmi_revid_lock);
return devm_of_platform_populate(&sdev->dev);
}
+static void pmic_spmi_remove(struct spmi_device *sdev)
+{
+ mutex_lock(&pmic_spmi_revid_lock);
+ spmi_device_set_drvdata(sdev, NULL);
+ mutex_unlock(&pmic_spmi_revid_lock);
+}
+
MODULE_DEVICE_TABLE(of, pmic_spmi_id_table);
static struct spmi_driver pmic_spmi_driver = {
.probe = pmic_spmi_probe,
+ .remove = pmic_spmi_remove,
.driver = {
.name = "pmic-spmi",
.of_match_table = pmic_spmi_id_table,
--
2.41.0
The Qualcomm SPMI PMIC revid implementation is broken in multiple ways.
First, it totally ignores struct device_node reference counting and
leaks references to the parent bus node as well as each child it
iterates over using an open-coded for_each_child_of_node().
Second, it leaks references to each spmi device on the bus that it
iterates over by failing to drop the reference taken by the
spmi_device_from_of() helper.
Fix the struct device_node leaks by reimplementing the lookup using
for_each_child_of_node() and adding the missing reference count
decrements. Fix the sibling struct device leaks by dropping the
unnecessary lookups of devices with the wrong USID.
Note that this still leaves one struct device reference leak in case a
base device is found but it is not the parent of the device used for the
lookup. This will be addressed in a follow-on patch.
Fixes: e9c11c6e3a0e ("mfd: qcom-spmi-pmic: expose the PMIC revid information to clients")
Cc: stable(a)vger.kernel.org # 6.0
Cc: Caleb Connolly <caleb.connolly(a)linaro.org>
Cc: Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/mfd/qcom-spmi-pmic.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/drivers/mfd/qcom-spmi-pmic.c b/drivers/mfd/qcom-spmi-pmic.c
index 7e2cd79d17eb..47738f7e492c 100644
--- a/drivers/mfd/qcom-spmi-pmic.c
+++ b/drivers/mfd/qcom-spmi-pmic.c
@@ -81,7 +81,7 @@ static struct spmi_device *qcom_pmic_get_base_usid(struct device *dev)
struct spmi_device *sdev;
struct qcom_spmi_dev *ctx;
struct device_node *spmi_bus;
- struct device_node *other_usid = NULL;
+ struct device_node *child;
int function_parent_usid, ret;
u32 pmic_addr;
@@ -105,28 +105,34 @@ static struct spmi_device *qcom_pmic_get_base_usid(struct device *dev)
* device for USID 2.
*/
spmi_bus = of_get_parent(sdev->dev.of_node);
- do {
- other_usid = of_get_next_child(spmi_bus, other_usid);
-
- ret = of_property_read_u32_index(other_usid, "reg", 0, &pmic_addr);
- if (ret)
- return ERR_PTR(ret);
+ sdev = ERR_PTR(-ENODATA);
+ for_each_child_of_node(spmi_bus, child) {
+ ret = of_property_read_u32_index(child, "reg", 0, &pmic_addr);
+ if (ret) {
+ of_node_put(child);
+ sdev = ERR_PTR(ret);
+ break;
+ }
- sdev = spmi_device_from_of(other_usid);
if (pmic_addr == function_parent_usid - (ctx->num_usids - 1)) {
- if (!sdev)
+ sdev = spmi_device_from_of(child);
+ if (!sdev) {
/*
* If the base USID for this PMIC hasn't probed yet
* but the secondary USID has, then we need to defer
* the function driver so that it will attempt to
* probe again when the base USID is ready.
*/
- return ERR_PTR(-EPROBE_DEFER);
- return sdev;
+ sdev = ERR_PTR(-EPROBE_DEFER);
+ }
+ of_node_put(child);
+ break;
}
- } while (other_usid->sibling);
+ }
+
+ of_node_put(spmi_bus);
- return ERR_PTR(-ENODATA);
+ return sdev;
}
static int pmic_spmi_load_revid(struct regmap *map, struct device *dev,
--
2.41.0
commit ec03804552e9a723569e14d2512f36a8e70dc640 upstream
Driver remove() must call pm_runtime_dont_use_autosuspend().
Drivers that call pm_runtime_use_autosuspend() must disable
it in driver remove(). Unfortunately until recently this was
only mentioned in 1 line in a 900+ line document so most
people hadn't noticed this. It has only recently been added
to the kerneldoc of pm_runtime_use_autosuspend().
Backport note: This is the same change as the upstream
commit but the cs35l56->base.dev argument in the upstream
code is cs35l56->dev in older releases.
Signed-off-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
Link: https://lore.kernel.org/r/20230908101716.2658582-1-rf@opensource.cirrus.com
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Please apply to 6.5.
---
sound/soc/codecs/cs35l56.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/sound/soc/codecs/cs35l56.c b/sound/soc/codecs/cs35l56.c
index fd06b9f9d496..7e241908b5f1 100644
--- a/sound/soc/codecs/cs35l56.c
+++ b/sound/soc/codecs/cs35l56.c
@@ -1594,6 +1594,7 @@ void cs35l56_remove(struct cs35l56_private *cs35l56)
flush_workqueue(cs35l56->dsp_wq);
destroy_workqueue(cs35l56->dsp_wq);
+ pm_runtime_dont_use_autosuspend(cs35l56->dev);
pm_runtime_suspend(cs35l56->dev);
pm_runtime_disable(cs35l56->dev);
--
2.30.2
Il 03/10/23 14:45, Sasha Levin ha scritto:
> On Tue, Oct 03, 2023 at 01:13:42PM +0200, AngeloGioacchino Del Regno wrote:
>> Hello SoC maintainers,
>>
>> As suggested by Arnd in [1], I am sending an update to the MAINTAINERS
>> file, adding myself to the MediaTek SoC maintainers, and some urgent
>> bugfixes for MediaTek devicetrees.
>
> Just curious: if it's urgent, and there are patches cc'ed to stable, why
> wait for 6.7 rather than one of the 6.6 -rc cycles?
>
I may have misunderstood what Arnd said - in that case, I'm sorry.
As for the fixes, I would be happier if those could land in v6.6-rcX.
Regards,
Angelo
03/10/23 15:39, Anastasia Belova пишет:
>
> 14/09/23 12:45, Anastasia Belova пишет:
>> count_bitmap_extents was deleted in version 5.11, but
>> there is possible mistake in versions 5.6-5.10.
>>
>> Region size should be calculated by subtracting
>> the end from the beginning.
> Do I understand correctly that bytes should decrease and
> (rs - re) is always negative and should be replaced by (re - rs)?
>> Found by Linux Verification Center (linuxtesting.org) with SVACE.
>>
>> Fixes: dfb79ddb130e ("btrfs: track discardable extents for async discard")
>> Signed-off-by: Anastasia Belova<abelova(a)astralinux.ru>
>> ---
>> fs/btrfs/free-space-cache.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index 4989c60b1df9..a34e266a0969 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -1930,7 +1930,7 @@ static int count_bitmap_extents(struct btrfs_free_space_ctl *ctl,
>>
>> bitmap_for_each_set_region(bitmap_info->bitmap, rs, re, 0,
>> BITS_PER_BITMAP) {
>> - bytes -= (rs - re) * ctl->unit;
>> + bytes -= (re - rs) * ctl->unit;
>> count++;
>>
>> if (!bytes)
> Anastasia
Sorry for sending html accidentally
Hello SoC maintainers,
As suggested by Arnd in [1], I am sending an update to the MAINTAINERS
file, adding myself to the MediaTek SoC maintainers, and some urgent
bugfixes for MediaTek devicetrees.
mt6358 PMIC
- Bogus property dropped
mt7622/mt7986a
- Fix dtbs_check for t-phy nodes
mt8195:
- Fix early boot failure when DSU PMU enabled
- Fix memory layout for demo board
Regards,
Angelo
---
AngeloGioacchino Del Regno (1):
MAINTAINERS: Add Angelo as MediaTek SoC co-maintainer
Eugen Hristev (1):
arm64: dts: mediatek: fix t-phy unit name
Macpaul Lin (2):
arm64: dts: mediatek: mt8195-demo: fix the memory size to 8GB
arm64: dts: mediatek: mt8195-demo: update and reorder reserved memory regions
Nícolas F. R. A. Prado (1):
arm64: dts: mediatek: mt8195: Set DSU PMU status to fail
MAINTAINERS | 2 +-
arch/arm64/boot/dts/mediatek/mt7622.dtsi | 2 +-
arch/arm64/boot/dts/mediatek/mt7986a.dtsi | 2 +-
arch/arm64/boot/dts/mediatek/mt8195-demo.dts | 39 ++++++++++++++++++++++------
arch/arm64/boot/dts/mediatek/mt8195.dtsi | 1 +
5 files changed, 35 insertions(+), 11 deletions(-)
---
base-commit: 0bb80ecc33a8fb5a682236443c1e740d5c917d1d
change-id: 20231003-mediatek-fixes-v6-7-341b5cfdc07a
Best regards,
--
AngeloGioacchino Del Regno <angelogioacchino.delregno(a)collabora.com>
In BMC environments with concurrent access to multiple registers, certain
registers occasionally yield a value of 0 even after 3 retries due to
hardware errata. As a fix, we have extended the retry count from 3 to 30.
The same errata applies to the mpt3sas driver, and a similar patch has
been accepted. Please find more details in the mpt3sas patch reference
link.
Link: https://lore.kernel.org/r/20230829090020.5417-2-ranjan.kumar@broadcom.com
Fixes: 272652fcbf1a ("scsi: megaraid_sas: add retry logic in megasas_readl")
Cc: stable(a)vger.kernel.org
Signed-off-by: Chandrakanth patil <chandrakanth.patil(a)broadcom.com>
Signed-off-by: Sumit Saxena <sumit.saxena(a)broadcom.com>
---
drivers/scsi/megaraid/megaraid_sas_base.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index e1aa667dae66..3d4f13da1ae8 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -263,13 +263,13 @@ u32 megasas_readl(struct megasas_instance *instance,
* Fusion registers could intermittently return all zeroes.
* This behavior is transient in nature and subsequent reads will
* return valid value. As a workaround in driver, retry readl for
- * upto three times until a non-zero value is read.
+ * up to thirty times until a non-zero value is read.
*/
if (instance->adapter_type == AERO_SERIES) {
do {
ret_val = readl(addr);
i++;
- } while (ret_val == 0 && i < 3);
+ } while (ret_val == 0 && i < 30);
return ret_val;
} else {
return readl(addr);
--
2.39.3
For an unknown TX CQE error type (probably from a newer hardware),
still free the SKB, update the queue tail, etc., otherwise the
accounting will be wrong.
Also, TX errors can be triggered by injecting corrupted packets, so
replace the WARN_ONCE to ratelimited error logging.
Cc: stable(a)vger.kernel.org
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang <haiyangz(a)microsoft.com>
Reviewed-by: Simon Horman <horms(a)kernel.org>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 4a16ebff3d1d..5cdcf7561b38 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1317,19 +1317,23 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
case CQE_TX_VPORT_IDX_OUT_OF_RANGE:
case CQE_TX_VPORT_DISABLED:
case CQE_TX_VLAN_TAGGING_VIOLATION:
- WARN_ONCE(1, "TX: CQE error %d: ignored.\n",
- cqe_oob->cqe_hdr.cqe_type);
+ if (net_ratelimit())
+ netdev_err(ndev, "TX: CQE error %d\n",
+ cqe_oob->cqe_hdr.cqe_type);
+
apc->eth_stats.tx_cqe_err++;
break;
default:
- /* If the CQE type is unexpected, log an error, assert,
- * and go through the error path.
+ /* If the CQE type is unknown, log an error,
+ * and still free the SKB, update tail, etc.
*/
- WARN_ONCE(1, "TX: Unexpected CQE type %d: HW BUG?\n",
- cqe_oob->cqe_hdr.cqe_type);
+ if (net_ratelimit())
+ netdev_err(ndev, "TX: unknown CQE type %d\n",
+ cqe_oob->cqe_hdr.cqe_type);
+
apc->eth_stats.tx_cqe_unknown_type++;
- return;
+ break;
}
if (WARN_ON_ONCE(txq->gdma_txq_id != completions[i].wq_num))
--
2.25.1
From: Fabiano Rosas <farosas(a)suse.de>
Our linker script for optionroms specifies only the placement of the
.text section, leaving the linker free to place the remaining sections
at arbitrary places in the file.
Since at least binutils 2.39, the .note.gnu.build-id section is now
being placed at the start of the file, which causes label addresses to
be shifted. For linuxboot_dma.bin that means that the PnP header
(among others) will not be found when determining the type of ROM at
optionrom_setup():
(0x1c is the label _pnph, where the magic "PnP" is)
$ xxd /usr/share/qemu/linuxboot_dma.bin | grep "PnP"
00000010: 0000 0000 0000 0000 0000 1c00 2450 6e50 ............$PnP
$ xxd pc-bios/optionrom/linuxboot_dma.bin | grep "PnP"
00000010: 0000 0000 0000 0000 0000 4c00 2450 6e50 ............$PnP
^bad
Using a freshly built linuxboot_dma.bin ROM results in a broken boot:
SeaBIOS (version rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org)
Booting from Hard Disk...
Boot failed: could not read the boot disk
Booting from Floppy...
Boot failed: could not read the boot disk
No bootable device.
We're not using the build-id section, so pass the --build-id=none
option to the linker to remove it entirely.
Note: In theory, this same issue could happen with any other
section. The ideal solution would be to have all unused sections
discarded in the linker script. However that would be a larger change,
specially for the pvh rom which uses the .bss and COMMON sections so
I'm addressing only the immediate issue here.
Reported-by: Vasiliy Ulyanov <vulyanov(a)suse.de>
Signed-off-by: Fabiano Rosas <farosas(a)suse.de>
Reviewed-by: Thomas Huth <thuth(a)redhat.com>
Message-ID: <20230926192502.15986-1-farosas(a)suse.de>
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
---
pc-bios/optionrom/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/pc-bios/optionrom/Makefile b/pc-bios/optionrom/Makefile
index b1fff0ba6c8..30d07026c79 100644
--- a/pc-bios/optionrom/Makefile
+++ b/pc-bios/optionrom/Makefile
@@ -36,7 +36,7 @@ config-cc.mak: Makefile
$(call cc-option,-Wno-array-bounds)) 3> config-cc.mak
-include config-cc.mak
-override LDFLAGS = -nostdlib -Wl,-T,$(SRC_DIR)/flat.lds
+override LDFLAGS = -nostdlib -Wl,--build-id=none,-T,$(SRC_DIR)/flat.lds
pvh.img: pvh.o pvh_main.o
--
2.41.0
Attn: stable(a)vger.kernel.org
Date: 03-10-2023
Subject: Letter of Intent (LOI) (03-10-2023)
This is a Letter of Intent concerning my Board of investors intent to fund some of your available investment projects.
If this is of any interest to you then kindly advise.
Yours Faithfully,
Ahmad R. Deeb
Head of Investment Team
The following problem exists since x2avic was enabled in the KVM:
svm_set_x2apic_msr_interception is called to enable the interception of
the x2apic msrs.
In particular it is called at the moment the guest resets its apic.
Assuming that the guest's apic was in x2apic mode, the reset will bring
it back to the xapic mode.
The svm_set_x2apic_msr_interception however has an erroneous check for
'!apic_x2apic_mode()' which prevents it from doing anything in this case.
As a result of this, all x2apic msrs are left unintercepted, and that
exposes the bare metal x2apic (if enabled) to the guest.
Oops.
Remove the erroneous '!apic_x2apic_mode()' check to fix that.
This fixes CVE-2023-5090
Fixes: 4d1d7942e36a ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode")
Cc: stable(a)vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
---
arch/x86/kvm/svm/svm.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9507df93f410a63..acdd0b89e4715a3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -913,8 +913,7 @@ void svm_set_x2apic_msr_interception(struct vcpu_svm *svm, bool intercept)
if (intercept == svm->x2avic_msrs_intercepted)
return;
- if (!x2avic_enabled ||
- !apic_x2apic_mode(svm->vcpu.arch.apic))
+ if (!x2avic_enabled)
return;
for (i = 0; i < MAX_DIRECT_ACCESS_MSRS; i++) {
--
2.26.3
From: Catalin Marinas <catalin.marinas(a)arm.com>
Since the actual slab freeing is deferred when calling kvfree_rcu(), so
is the kmemleak_free() callback informing kmemleak of the object
deletion. From the perspective of the kvfree_rcu() caller, the object is
freed and it may remove any references to it. Since kmemleak does not
scan RCU internal data storing the pointer, it will report such objects
as leaks during the grace period.
Tell kmemleak to ignore such objects on the kvfree_call_rcu() path. Note
that the tiny RCU implementation does not have such issue since the
objects can be tracked from the rcu_ctrlblk structure.
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Reported-by: Christoph Paasch <cpaasch(a)apple.com>
Closes: https://lore.kernel.org/all/F903A825-F05F-4B77-A2B5-7356282FBA2C@apple.com/
Cc: <stable(a)vger.kernel.org>
Tested-by: Christoph Paasch <cpaasch(a)apple.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
kernel/rcu/tree.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cb1caefa8bd0..24423877962c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -31,6 +31,7 @@
#include <linux/bitops.h>
#include <linux/export.h>
#include <linux/completion.h>
+#include <linux/kmemleak.h>
#include <linux/moduleparam.h>
#include <linux/panic.h>
#include <linux/panic_notifier.h>
@@ -3388,6 +3389,14 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
success = true;
}
+ /*
+ * The kvfree_rcu() caller considers the pointer freed at this point
+ * and likely removes any references to it. Since the actual slab
+ * freeing (and kmemleak_free()) is deferred, tell kmemleak to ignore
+ * this object (no scanning or false positives reporting).
+ */
+ kmemleak_ignore(ptr);
+
// Set timer to drain after KFREE_DRAIN_JIFFIES.
if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
schedule_delayed_monitor_work(krcp);
--
2.42.0.582.g8ccd20d70d-goog
Hi Thinh,
I can confirm your patch fixed the issue on RK3399 when I was running
on Linux 6.1.54.
I'm not on the ML for this so I'm sorry if this email causes any issue
as I'm not sure how to reply to a thread from a ML I am not on.
Best,
Da
Like the Lenovo 82TL, 82V2, 82QF and 82UG, the 82YM (Yoga 7 14ARP8)
requires an entry in the quirk list to enable the internal microphone.
The latter two received similar fixes in commit 1263cc0f414d
("ASoC: amd: yc: Fix non-functional mic on Lenovo 82QF and 82UG").
Fixes: c008323fe361 ("ASoC: amd: yc: Fix a non-functional mic on Lenovo 82SJ")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sven Frotscher <sven.frotscher(a)gmail.com>
---
v3->v4 changes:
* re-add blank line between commit message title and body
---
v2->v3 changes:
* add message title of referenced commit to commit message
* make whitespace consistent with surrounding code
* use a patch-friendly e-mail client
---
v1->v2 changes:
* add Fixes and Cc tags to commit message
* remove redundant LKML link from commit message
* fix mangled diff
---
sound/soc/amd/yc/acp6x-mach.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/sound/soc/amd/yc/acp6x-mach.c b/sound/soc/amd/yc/acp6x-mach.c
index 94e9eb8e73f2..15a864dcd7bd 100644
--- a/sound/soc/amd/yc/acp6x-mach.c
+++ b/sound/soc/amd/yc/acp6x-mach.c
@@ -241,6 +241,13 @@ static const struct dmi_system_id yc_acp_quirk_table[] = {
DMI_MATCH(DMI_PRODUCT_NAME, "82V2"),
}
},
+ {
+ .driver_data = &acp6x_card,
+ .matches = {
+ DMI_MATCH(DMI_BOARD_VENDOR, "LENOVO"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "82YM"),
+ }
+ },
{
.driver_data = &acp6x_card,
.matches = {
--
2.42.0
The quilt patch titled
Subject: i915: limit the length of an sg list to the requested length
has been removed from the -mm tree. Its filename was
i915-limit-the-length-of-an-sg-list-to-the-requested-length.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Subject: i915: limit the length of an sg list to the requested length
Date: Tue, 19 Sep 2023 20:48:55 +0100
The folio conversion changed the behaviour of shmem_sg_alloc_table() to
put the entire length of the last folio into the sg list, even if the sg
list should have been shorter. gen8_ggtt_insert_entries() relied on the
list being the right length and would overrun the end of the page tables.
Other functions may also have been affected.
Clamp the length of the last entry in the sg list to be the expected
length.
Link: https://lkml.kernel.org/r/20230919194855.347582-1-willy@infradead.org
Link: https://gitlab.freedesktop.org/drm/intel/-/issues/9256
Fixes: 0b62af28f249 ("i915: convert shmem_sg_free_table() to use a folio_batch")
Signed-off-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Reported-by: Oleksandr Natalenko <oleksandr(a)natalenko.name>
Closes: https://lore.kernel.org/lkml/6287208.lOV4Wx5bFT@natalenko.name/
Tested-by: Oleksandr Natalenko <oleksandr(a)natalenko.name>
Reviewed-by: Andrzej Hajda <andrzej.hajda(a)intel.com>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org> [6.5.x]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c~i915-limit-the-length-of-an-sg-list-to-the-requested-length
+++ a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -100,6 +100,7 @@ int shmem_sg_alloc_table(struct drm_i915
st->nents = 0;
for (i = 0; i < page_count; i++) {
struct folio *folio;
+ unsigned long nr_pages;
const unsigned int shrink[] = {
I915_SHRINK_BOUND | I915_SHRINK_UNBOUND,
0,
@@ -150,6 +151,8 @@ int shmem_sg_alloc_table(struct drm_i915
}
} while (1);
+ nr_pages = min_t(unsigned long,
+ folio_nr_pages(folio), page_count - i);
if (!i ||
sg->length >= max_segment ||
folio_pfn(folio) != next_pfn) {
@@ -157,13 +160,13 @@ int shmem_sg_alloc_table(struct drm_i915
sg = sg_next(sg);
st->nents++;
- sg_set_folio(sg, folio, folio_size(folio), 0);
+ sg_set_folio(sg, folio, nr_pages * PAGE_SIZE, 0);
} else {
/* XXX: could overflow? */
- sg->length += folio_size(folio);
+ sg->length += nr_pages * PAGE_SIZE;
}
- next_pfn = folio_pfn(folio) + folio_nr_pages(folio);
- i += folio_nr_pages(folio) - 1;
+ next_pfn = folio_pfn(folio) + nr_pages;
+ i += nr_pages - 1;
/* Check that the i965g/gm workaround works. */
GEM_BUG_ON(gfp & __GFP_DMA32 && next_pfn >= 0x00100000UL);
_
Patches currently in -mm which might be from willy(a)infradead.org are
mm-convert-dax-lock-unlock-page-to-lock-unlock-folio.patch
buffer-pass-gfp-flags-to-folio_alloc_buffers.patch
buffer-hoist-gfp-flags-from-grow_dev_page-to-__getblk_gfp.patch
buffer-hoist-gfp-flags-from-grow_dev_page-to-__getblk_gfp-fix.patch
ext4-use-bdev_getblk-to-avoid-memory-reclaim-in-readahead-path.patch
buffer-use-bdev_getblk-to-avoid-memory-reclaim-in-readahead-path.patch
buffer-convert-getblk_unmovable-and-__getblk-to-use-bdev_getblk.patch
buffer-convert-sb_getblk-to-call-__getblk.patch
ext4-call-bdev_getblk-from-sb_getblk_gfp.patch
buffer-remove-__getblk_gfp.patch
hugetlb-use-a-folio-in-free_hpage_workfn.patch
hugetlb-remove-a-few-calls-to-page_folio.patch
hugetlb-convert-remove_pool_huge_page-to-remove_pool_hugetlb_folio.patch
mm-make-lock_folio_maybe_drop_mmap-vma-lock-aware.patch
mm-call-wp_page_copy-under-the-vma-lock.patch
mm-handle-shared-faults-under-the-vma-lock.patch
mm-handle-cow-faults-under-the-vma-lock.patch
mm-handle-read-faults-under-the-vma-lock.patch
mm-handle-write-faults-to-ro-pages-under-the-vma-lock.patch
Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with
a soft reset on Renesas SoCs") fixes an issue with Renesas own SJA1000
CAN controller reception: the Rx buffer is only 5 messages long, so when
the bus loaded (eg. a message every 50us), overrun may easily
happen. Upon an overrun situation, due to a possible internal crosstalk
situation, the controller enters a frozen state which only can be
unlocked with a soft reset (experimentally). The solution was to offload
a call to sja1000_start() in a threaded handler. This needs to happen in
process context as this operation requires to sleep. sja1000_start()
basically enters "reset mode", performs a proper software reset and
returns back into "normal mode".
Since this fix was introduced, we no longer observe any stalls in
reception. However it was sporadically observed that the transmit path
would now freeze. Further investigation blamed the fix mentioned above,
and especially the reset operation. Reproducing the reset in a loop
helped identifying what could possibly go wrong. The sja1000 is a single
Tx queue device, which leverages the netdev helpers to process one Tx
message at a time. The logic is: the queue is stopped, the message sent
to the transceiver, once properly transmitted the controller sets a
status bit which triggers an interrupt, in the interrupt handler the
transmission status is checked and the queue woken up. Unfortunately, if
an overrun happens, we might perform the soft reset precisely between
the transmission of the buffer to the transceiver and the advent of the
transmission status bit. We would then stop the transmission operation
without re-enabling the queue, leading to all further transmissions to
be ignored.
The reset interrupt can only happen while the device is "open", and
after a reset we anyway want to resume normal operations, no matter if a
packet to transmit got dropped in the process, so we shall wake up the
queue. Restarting the device and waking-up the queue is exactly what
sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about
the queue state, we must acquire a lock both in the reset handler and in
the transmit path to ensure serialization of both operations. As the
reset handler might still be called after the transmission of a frame to
the transceiver but before it actually gets transmitted, we must ensure
we don't leak the skb, so we free it (the behavior is consistent, no
matter if there was an skb on the stack or not).
Fixes: 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs")
Cc: stable(a)vger.kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
---
Changes in v2:
* As Marc sugested, use netif_tx_{,un}lock() instead of our own
spin_lock.
drivers/net/can/sja1000/sja1000.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
index ae47fc72aa96..91e3fb3eed20 100644
--- a/drivers/net/can/sja1000/sja1000.c
+++ b/drivers/net/can/sja1000/sja1000.c
@@ -297,6 +297,7 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
if (can_dropped_invalid_skb(dev, skb))
return NETDEV_TX_OK;
+ netif_tx_lock(dev);
netif_stop_queue(dev);
fi = dlc = cf->can_dlc;
@@ -335,6 +336,8 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
sja1000_write_cmdreg(priv, cmd_reg_val);
+ netif_tx_unlock(dev);
+
return NETDEV_TX_OK;
}
@@ -396,7 +399,13 @@ static irqreturn_t sja1000_reset_interrupt(int irq, void *dev_id)
struct net_device *dev = (struct net_device *)dev_id;
netdev_dbg(dev, "performing a soft reset upon overrun\n");
- sja1000_start(dev);
+
+ netif_tx_lock(dev);
+
+ can_free_echo_skb(dev, 0);
+ sja1000_set_mode(dev, CAN_MODE_START);
+
+ netif_tx_unlock(dev);
return IRQ_HANDLED;
}
--
2.34.1
This is a note to let you know that I've just added the patch titled
usb: hub: Guard against accesses to uninitialized BOS descriptors
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From f74a7afc224acd5e922c7a2e52244d891bbe44ee Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ricardo=20Ca=C3=B1uelo?= <ricardo.canuelo(a)collabora.com>
Date: Wed, 30 Aug 2023 12:04:18 +0200
Subject: usb: hub: Guard against accesses to uninitialized BOS descriptors
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Many functions in drivers/usb/core/hub.c and drivers/usb/core/hub.h
access fields inside udev->bos without checking if it was allocated and
initialized. If usb_get_bos_descriptor() fails for whatever
reason, udev->bos will be NULL and those accesses will result in a
crash:
BUG: kernel NULL pointer dereference, address: 0000000000000018
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 5 PID: 17818 Comm: kworker/5:1 Tainted: G W 5.15.108-18910-gab0e1cb584e1 #1 <HASH:1f9e 1>
Hardware name: Google Kindred/Kindred, BIOS Google_Kindred.12672.413.0 02/03/2021
Workqueue: usb_hub_wq hub_event
RIP: 0010:hub_port_reset+0x193/0x788
Code: 89 f7 e8 20 f7 15 00 48 8b 43 08 80 b8 96 03 00 00 03 75 36 0f b7 88 92 03 00 00 81 f9 10 03 00 00 72 27 48 8b 80 a8 03 00 00 <48> 83 78 18 00 74 19 48 89 df 48 8b 75 b0 ba 02 00 00 00 4c 89 e9
RSP: 0018:ffffab740c53fcf8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffa1bc5f678000 RCX: 0000000000000310
RDX: fffffffffffffdff RSI: 0000000000000286 RDI: ffffa1be9655b840
RBP: ffffab740c53fd70 R08: 00001b7d5edaa20c R09: ffffffffb005e060
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: ffffab740c53fd3e R14: 0000000000000032 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffa1be96540000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000018 CR3: 000000022e80c005 CR4: 00000000003706e0
Call Trace:
hub_event+0x73f/0x156e
? hub_activate+0x5b7/0x68f
process_one_work+0x1a2/0x487
worker_thread+0x11a/0x288
kthread+0x13a/0x152
? process_one_work+0x487/0x487
? kthread_associate_blkcg+0x70/0x70
ret_from_fork+0x1f/0x30
Fall back to a default behavior if the BOS descriptor isn't accessible
and skip all the functionalities that depend on it: LPM support checks,
Super Speed capabilitiy checks, U1/U2 states setup.
Signed-off-by: Ricardo Cañuelo <ricardo.canuelo(a)collabora.com>
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20230830100418.1952143-1-ricardo.canuelo@collabor…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/core/hub.c | 25 ++++++++++++++++++++++---
drivers/usb/core/hub.h | 2 +-
2 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 3c54b218301c..0ff47eeffb49 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -151,6 +151,10 @@ int usb_device_supports_lpm(struct usb_device *udev)
if (udev->quirks & USB_QUIRK_NO_LPM)
return 0;
+ /* Skip if the device BOS descriptor couldn't be read */
+ if (!udev->bos)
+ return 0;
+
/* USB 2.1 (and greater) devices indicate LPM support through
* their USB 2.0 Extended Capabilities BOS descriptor.
*/
@@ -327,6 +331,10 @@ static void usb_set_lpm_parameters(struct usb_device *udev)
if (!udev->lpm_capable || udev->speed < USB_SPEED_SUPER)
return;
+ /* Skip if the device BOS descriptor couldn't be read */
+ if (!udev->bos)
+ return;
+
hub = usb_hub_to_struct_hub(udev->parent);
/* It doesn't take time to transition the roothub into U0, since it
* doesn't have an upstream link.
@@ -2704,13 +2712,17 @@ int usb_authorize_device(struct usb_device *usb_dev)
static enum usb_ssp_rate get_port_ssp_rate(struct usb_device *hdev,
u32 ext_portstatus)
{
- struct usb_ssp_cap_descriptor *ssp_cap = hdev->bos->ssp_cap;
+ struct usb_ssp_cap_descriptor *ssp_cap;
u32 attr;
u8 speed_id;
u8 ssac;
u8 lanes;
int i;
+ if (!hdev->bos)
+ goto out;
+
+ ssp_cap = hdev->bos->ssp_cap;
if (!ssp_cap)
goto out;
@@ -4215,8 +4227,15 @@ static void usb_enable_link_state(struct usb_hcd *hcd, struct usb_device *udev,
enum usb3_link_state state)
{
int timeout;
- __u8 u1_mel = udev->bos->ss_cap->bU1devExitLat;
- __le16 u2_mel = udev->bos->ss_cap->bU2DevExitLat;
+ __u8 u1_mel;
+ __le16 u2_mel;
+
+ /* Skip if the device BOS descriptor couldn't be read */
+ if (!udev->bos)
+ return;
+
+ u1_mel = udev->bos->ss_cap->bU1devExitLat;
+ u2_mel = udev->bos->ss_cap->bU2DevExitLat;
/* If the device says it doesn't have *any* exit latency to come out of
* U1 or U2, it's probably lying. Assume it doesn't implement that link
diff --git a/drivers/usb/core/hub.h b/drivers/usb/core/hub.h
index 37897afd1b64..d44dd7f6623e 100644
--- a/drivers/usb/core/hub.h
+++ b/drivers/usb/core/hub.h
@@ -153,7 +153,7 @@ static inline int hub_is_superspeedplus(struct usb_device *hdev)
{
return (hdev->descriptor.bDeviceProtocol == USB_HUB_PR_SS &&
le16_to_cpu(hdev->descriptor.bcdUSB) >= 0x0310 &&
- hdev->bos->ssp_cap);
+ hdev->bos && hdev->bos->ssp_cap);
}
static inline unsigned hub_power_on_good_delay(struct usb_hub *hub)
--
2.42.0
Fix four issues with resctrl selftests.
The signal handling fix became necessary after the mount/umount fixes
and the uninitialized member bug was discovered during the review.
The other two came up when I ran resctrl selftests across the server
fleet in our lab to validate the upcoming CAT test rewrite (the rewrite
is not part of this series).
These are developed and should apply cleanly at least on top the
benchmark cleanup series (might apply cleanly also w/o the benchmark
series, I didn't test).
v3:
- Add fix to uninitialized sa_flags
- Handle ksft_exit_fail_msg() in per test functions
- Make signal handler register fails to also exit
- Improve changelogs
v2:
- Include patch to move _GNU_SOURCE to Makefile to allow normal #include
placement
- Rework the signal register/unregister into patch to use helpers
- Fixed incorrect function parameter description
- Use return !!res to avoid confusing implicit boolean conversion
- Improve MBA/MBM success bound patch's changelog
- Tweak Cc: stable dependencies (make it a chain).
Ilpo Järvinen (7):
selftests/resctrl: Fix uninitialized .sa_flags
selftests/resctrl: Extend signal handler coverage to unmount on
receiving signal
selftests/resctrl: Remove duplicate feature check from CMT test
selftests/resctrl: Move _GNU_SOURCE define into Makefile
selftests/resctrl: Refactor feature check to use resource and feature
name
selftests/resctrl: Fix feature checks
selftests/resctrl: Reduce failures due to outliers in MBA/MBM tests
tools/testing/selftests/resctrl/Makefile | 2 +-
tools/testing/selftests/resctrl/cat_test.c | 8 --
tools/testing/selftests/resctrl/cmt_test.c | 3 -
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 2 +-
tools/testing/selftests/resctrl/resctrl.h | 7 +-
.../testing/selftests/resctrl/resctrl_tests.c | 82 ++++++++++++-------
tools/testing/selftests/resctrl/resctrl_val.c | 24 +++---
tools/testing/selftests/resctrl/resctrlfs.c | 69 ++++++----------
9 files changed, 96 insertions(+), 103 deletions(-)
--
2.30.2
During the error path, the vma iterator may not be correctly positioned
or set to the correct range. Undo the vma_prev() call by resetting to
the passed in address. Re-walking to the same range will fix the range
to the area previously passed in.
Users would notice increased cycles as vma_merge() would be called an
extra time with vma == prev, and thus would fail to merge and return.
Link: https://lore.kernel.org/linux-mm/CAG48ez12VN1JAOtTNMY+Y2YnsU45yL5giS-Qn=ejt…
Closes: https://lore.kernel.org/linux-mm/CAG48ez12VN1JAOtTNMY+Y2YnsU45yL5giS-Qn=ejt…
Fixes: 18b098af2890 ("vma_merge: set vma iterator to correct position.")
Cc: stable(a)vger.kernel.org
Cc: Jann Horn <jannh(a)google.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
---
mm/mmap.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index b56a7f0c9f85..acb7dea49e23 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -975,7 +975,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
/* Error in anon_vma clone. */
if (err)
- return NULL;
+ goto anon_vma_fail;
if (vma_start < vma->vm_start || vma_end > vma->vm_end)
vma_expanded = true;
@@ -988,7 +988,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
}
if (vma_iter_prealloc(vmi, vma))
- return NULL;
+ goto prealloc_fail;
init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma &&
@@ -1016,6 +1016,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
vma_complete(&vp, vmi, mm);
khugepaged_enter_vma(res, vm_flags);
return res;
+
+prealloc_fail:
+anon_vma_fail:
+ vma_iter_set(vmi, addr);
+ vma_iter_load(vmi);
+ return NULL;
}
/*
--
2.40.1
This suggests a commit (and a follow-up fix for it) from 5.16+ for stable
because it fixes the usage of watchdog.handle_boot_enabled=0 for iTCO,
closing a monitoring gap in OTA update scenarios.
These patches are applicable to and have been tested with 4.19, 4.14
stable heads. The second patch required rebasing, the first one applied
as-is from upstream.
Jan
Cc: Malin Jonsson <malin.jonsson(a)ericsson.com>
Mika Westerberg (2):
watchdog: iTCO_wdt: No need to stop the timer in probe
watchdog: iTCO_wdt: Set NO_REBOOT if the watchdog is not already
running
drivers/watchdog/iTCO_wdt.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)
--
2.35.3
commit 0bdf399342c5 ("net: Avoid address overwrite in kernel_connect")
ensured that kernel_connect() will not overwrite the address parameter
in cases where BPF connect hooks perform an address rewrite. This change
replaces direct calls to sock->ops->connect() in net with kernel_connect()
to make these call safe.
Link: https://lore.kernel.org/netdev/20230912013332.2048422-1-jrife@google.com/
Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect")
Cc: stable(a)vger.kernel.org
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Signed-off-by: Jordan Rife <jrife(a)google.com>
---
v4->v5: Remove non-net changes.
v3->v4: Remove precondition check for addrlen.
v2->v3: Add "Fixes" tag. Check for positivity in addrlen sanity check.
v1->v2: Split up original patch into patch series. Insulate calls with
kernel_connect() instead of pushing address copy deeper into
sock->ops->connect().
net/netfilter/ipvs/ip_vs_sync.c | 4 ++--
net/rds/tcp_connect.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index da5af28ff57b5..6e4ed1e11a3b7 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -1505,8 +1505,8 @@ static int make_send_sock(struct netns_ipvs *ipvs, int id,
}
get_mcast_sockaddr(&mcast_addr, &salen, &ipvs->mcfg, id);
- result = sock->ops->connect(sock, (struct sockaddr *) &mcast_addr,
- salen, 0);
+ result = kernel_connect(sock, (struct sockaddr *)&mcast_addr,
+ salen, 0);
if (result < 0) {
pr_err("Error connecting to the multicast addr\n");
goto error;
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index f0c477c5d1db4..d788c6d28986f 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -173,7 +173,7 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
* own the socket
*/
rds_tcp_set_callbacks(sock, cp);
- ret = sock->ops->connect(sock, addr, addrlen, O_NONBLOCK);
+ ret = kernel_connect(sock, addr, addrlen, O_NONBLOCK);
rdsdebug("connect to address %pI6c returned %d\n", &conn->c_faddr, ret);
if (ret == -EINPROGRESS)
--
2.42.0.515.g380fc7ccd1-goog
Hello guys,
During the last few months, I felt a performance regression when using
read() and write() on my high-speed Nvme SSD (about 7GB/s).
To get more precise information about it I quickly developed benchmark
tool basically running read() or write() in a loop to simulate a
sequential file read or write. The tool also measures the real time
consumed by the loop. Finally, the tool can call open() with or without
O_DIRECT.
I ran the tests on EXT4 and Exfat with following settings (buffer
values have been set for best result):
- Write settings: buffer 400mb * 100
- Read settings: buffer 200mb
- Drop caches before non-direct read/write test
With this hardware:
- CPU AMD Ryzen 7600X
- RAM DDR5 5200 32GB
- SSD Kingston Fury Renegade 4TB with 4K LBA
Here are some results I got with last upstream kernels (default
config):
+------------------+----------+------------------+------------------+--
----------------+------------------+------------------+
| ~42GB | O_DIRECT | Linux 6.2.0 | Linux 6.3.0 |
Linux 6.4.0 | Linux 6.5.0 | Linux 6.5.5 |
+------------------+----------+------------------+------------------+--
----------------+------------------+------------------+
| Ext4 (sector 4k) | | | |
| | |
| Read | no | 7.2s (5800MB/s) | 7.1s (5890MB/s) |
8.3s (5050MB/s) | 13.2s (3180MB/s) | 13.2s (3180MB/s) |
| Write | no | 12.0s (3500MB/s) | 12.6s (3340MB/s) |
12.2s (3440MB/s) | 28.9s (1450MB/s) | 28.9s (1450MB/s) |
| Read | yes | 6.0s (7000MB/s) | 6.0s (7020MB/s) |
5.9s (7170MB/s) | 5.9s (7100MB/s) | 5.9s (7100MB/s) |
| Write | yes | 6.7s (6220MB/s) | 6.7s (6290MB/s) |
6.9s (6080MB/s) | 6.9s (6080MB/s) | 6.9s (6970MB/s) |
| Exfat (sector ?) | | | |
| | |
| Read | no | 7.3s (5770MB/s) | 7.2s (5830MB/s) |
9s (4620MB/s) | 13.3s (3150MB/s) | 13.2s (3180MB/s) |
| Write | no | 8.3s (5040MB/s) | 8.9s (4750MB/s) |
8.3s (5040MB/s) | 18.3s (2290MB/s) | 18.5s (2260MB/s) |
| Read | yes | 6.2s (6760MB/s) | 6.1s (6870MB/s) |
6.0s (6980MB/s) | 6.5s (6440MB/s) | 6.6s (6320MB/s) |
| Write | yes | 16.1s (2610MB/s) | 16.0s (2620MB/s) |
18.7s (2240MB/s) | 34.1s (1230MB/s) | 34.5s (1220MB/s) |
+------------------+----------+------------------+------------------+--
----------------+------------------+------------------+
Please note that I rounded some values to clarify readiness. Small
variations can be considered as margin error.
Ext4 results: cached reads/writes time have increased of almost 100%
from 6.2.0 to 6.5.0 with a first increase with 6.4.0. Direct access
times have stayed similar though.
Exfat results: performance decrease too with and without direct access
this time.
I realize there are thousands of commits between, plus the issue can
come from multiple kernel parts such as the page cache, the file system
implementation (especially for Exfat), the IO engine, a driver, etc.
The results also showed that there is not only a specific version
impacted. Anyway, at the end the performance have highly decreased.
If you want to verify my benchmark tool source code, please ask.
PS: sending again as only text body is accepted
Regards
Florent DELAHAYE
Hi,
I have a good news for you.Please contact me for more details. Sorry
if
you received this letter in your spam, Due to recent connection error
here in my country.a
Looking forward for your immediate response to me through my private
e-mail id: (jackthomsom7(a)gmail.com)
Best Regards,
Regards,
Dr. Thomsom Jack.
My Telephone number
+226-67 -44 - 42- 36
[ Replying to show this patch that got dropped from the mailing list ]
On Sat, 30 Sep 2023 16:32:16 -0400
Steven Rostedt <rostedt(a)goodmis.org> wrote:
> From: Beau Belgrave <beaub(a)linux.microsoft.com>
>
> All architectures should use a long aligned address passed to set_bit().
> User processes can pass either a 32-bit or 64-bit sized value to be
> updated when tracing is enabled when on a 64-bit kernel. Both cases are
> ensured to be naturally aligned, however, that is not enough. The
> address must be long aligned without affecting checks on the value
> within the user process which require different adjustments for the bit
> for little and big endian CPUs.
>
> Add a compat flag to user_event_enabler that indicates when a 32-bit
> value is being used on a 64-bit kernel. Long align addresses and correct
> the bit to be used by set_bit() to account for this alignment. Ensure
> compat flags are copied during forks and used during deletion clears.
>
> Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux…
> Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@r…
>
> Cc: stable(a)vger.kernel.org
> Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement")
> Reported-by: Clément Léger <cleger(a)rivosinc.com>
> Suggested-by: Clément Léger <cleger(a)rivosinc.com>
> Signed-off-by: Beau Belgrave <beaub(a)linux.microsoft.com>
> Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
> ---
> kernel/trace/trace_events_user.c | 58 ++++++++++++++++++++++++++++----
> 1 file changed, 51 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> index 6f046650e527..b87f41187c6a 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -127,8 +127,13 @@ struct user_event_enabler {
> /* Bit 7 is for freeing status of enablement */
> #define ENABLE_VAL_FREEING_BIT 7
>
> -/* Only duplicate the bit value */
> -#define ENABLE_VAL_DUP_MASK ENABLE_VAL_BIT_MASK
> +/* Bit 8 is for marking 32-bit on 64-bit */
> +#define ENABLE_VAL_32_ON_64_BIT 8
> +
> +#define ENABLE_VAL_COMPAT_MASK (1 << ENABLE_VAL_32_ON_64_BIT)
> +
> +/* Only duplicate the bit and compat values */
> +#define ENABLE_VAL_DUP_MASK (ENABLE_VAL_BIT_MASK | ENABLE_VAL_COMPAT_MASK)
>
> #define ENABLE_BITOPS(e) (&(e)->values)
>
> @@ -174,6 +179,30 @@ struct user_event_validator {
> int flags;
> };
>
> +static inline void align_addr_bit(unsigned long *addr, int *bit,
> + unsigned long *flags)
> +{
> + if (IS_ALIGNED(*addr, sizeof(long))) {
> +#ifdef __BIG_ENDIAN
> + /* 32 bit on BE 64 bit requires a 32 bit offset when aligned. */
> + if (test_bit(ENABLE_VAL_32_ON_64_BIT, flags))
> + *bit += 32;
> +#endif
> + return;
> + }
> +
> + *addr = ALIGN_DOWN(*addr, sizeof(long));
> +
> + /*
> + * We only support 32 and 64 bit values. The only time we need
> + * to align is a 32 bit value on a 64 bit kernel, which on LE
> + * is always 32 bits, and on BE requires no change when unaligned.
> + */
> +#ifdef __LITTLE_ENDIAN
> + *bit += 32;
> +#endif
> +}
> +
> typedef void (*user_event_func_t) (struct user_event *user, struct iov_iter *i,
> void *tpdata, bool *faulted);
>
> @@ -482,6 +511,7 @@ static int user_event_enabler_write(struct user_event_mm *mm,
> unsigned long *ptr;
> struct page *page;
> void *kaddr;
> + int bit = ENABLE_BIT(enabler);
> int ret;
>
> lockdep_assert_held(&event_mutex);
> @@ -497,6 +527,8 @@ static int user_event_enabler_write(struct user_event_mm *mm,
> test_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler))))
> return -EBUSY;
>
> + align_addr_bit(&uaddr, &bit, ENABLE_BITOPS(enabler));
> +
> ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
> &page, NULL);
>
> @@ -515,9 +547,9 @@ static int user_event_enabler_write(struct user_event_mm *mm,
>
> /* Update bit atomically, user tracers must be atomic as well */
> if (enabler->event && enabler->event->status)
> - set_bit(ENABLE_BIT(enabler), ptr);
> + set_bit(bit, ptr);
> else
> - clear_bit(ENABLE_BIT(enabler), ptr);
> + clear_bit(bit, ptr);
>
> kunmap_local(kaddr);
> unpin_user_pages_dirty_lock(&page, 1, true);
> @@ -849,6 +881,12 @@ static struct user_event_enabler
> enabler->event = user;
> enabler->addr = uaddr;
> enabler->values = reg->enable_bit;
> +
> +#if BITS_PER_LONG >= 64
> + if (reg->enable_size == 4)
> + set_bit(ENABLE_VAL_32_ON_64_BIT, ENABLE_BITOPS(enabler));
> +#endif
> +
> retry:
> /* Prevents state changes from racing with new enablers */
> mutex_lock(&event_mutex);
> @@ -2377,7 +2415,8 @@ static long user_unreg_get(struct user_unreg __user *ureg,
> }
>
> static int user_event_mm_clear_bit(struct user_event_mm *user_mm,
> - unsigned long uaddr, unsigned char bit)
> + unsigned long uaddr, unsigned char bit,
> + unsigned long flags)
> {
> struct user_event_enabler enabler;
> int result;
> @@ -2385,7 +2424,7 @@ static int user_event_mm_clear_bit(struct user_event_mm *user_mm,
>
> memset(&enabler, 0, sizeof(enabler));
> enabler.addr = uaddr;
> - enabler.values = bit;
> + enabler.values = bit | flags;
> retry:
> /* Prevents state changes from racing with new enablers */
> mutex_lock(&event_mutex);
> @@ -2415,6 +2454,7 @@ static long user_events_ioctl_unreg(unsigned long uarg)
> struct user_event_mm *mm = current->user_event_mm;
> struct user_event_enabler *enabler, *next;
> struct user_unreg reg;
> + unsigned long flags;
> long ret;
>
> ret = user_unreg_get(ureg, ®);
> @@ -2425,6 +2465,7 @@ static long user_events_ioctl_unreg(unsigned long uarg)
> if (!mm)
> return -ENOENT;
>
> + flags = 0;
> ret = -ENOENT;
>
> /*
> @@ -2441,6 +2482,9 @@ static long user_events_ioctl_unreg(unsigned long uarg)
> ENABLE_BIT(enabler) == reg.disable_bit) {
> set_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler));
>
> + /* We must keep compat flags for the clear */
> + flags |= enabler->values & ENABLE_VAL_COMPAT_MASK;
> +
> if (!test_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler)))
> user_event_enabler_destroy(enabler, true);
>
> @@ -2454,7 +2498,7 @@ static long user_events_ioctl_unreg(unsigned long uarg)
> /* Ensure bit is now cleared for user, regardless of event status */
> if (!ret)
> ret = user_event_mm_clear_bit(mm, reg.disable_addr,
> - reg.disable_bit);
> + reg.disable_bit, flags);
>
> return ret;
> }
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
It was discovered that the ring buffer polling was incorrectly stating
that read would not block, but that's because polling did not take into
account that reads will block if the "buffer-percent" was set. Instead,
the ring buffer polling would say reads would not block if there was any
data in the ring buffer. This was incorrect behavior from a user space
point of view. This was fixed by commit 42fb0a1e84ff by having the polling
code check if the ring buffer had more data than what the user specified
"buffer percent" had.
The problem now is that the polling code did not register itself to the
writer that it wanted to wait for a specific "full" value of the ring
buffer. The result was that the writer would wake the polling waiter
whenever there was a new event. The polling waiter would then wake up, see
that there's not enough data in the ring buffer to notify user space and
then go back to sleep. The next event would wake it up again.
Before the polling fix was added, the code would wake up around 100 times
for a hackbench 30 benchmark. After the "fix", due to the constant waking
of the writer, it would wake up over 11,0000 times! It would never leave
the kernel, so the user space behavior was still "correct", but this
definitely is not the desired effect.
To fix this, have the polling code add what it's waiting for to the
"shortest_full" variable, to tell the writer not to wake it up if the
buffer is not as full as it expects to be.
Note, after this fix, it appears that the waiter is now woken up around 2x
the times it was before (~200). This is a tremendous improvement from the
11,000 times, but I will need to spend some time to see why polling is
more aggressive in its wakeups than the read blocking code.
Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschac…
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark")
Reported-by: Julia Lawall <julia.lawall(a)inria.fr>
Tested-by: Julia Lawall <julia.lawall(a)inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/ring_buffer.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 28daf0ce95c5..515cafdb18d9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1137,6 +1137,9 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
if (full) {
poll_wait(filp, &work->full_waiters, poll_table);
work->full_waiters_pending = true;
+ if (!cpu_buffer->shortest_full ||
+ cpu_buffer->shortest_full > full)
+ cpu_buffer->shortest_full = full;
} else {
poll_wait(filp, &work->waiters, poll_table);
work->waiters_pending = true;
--
2.40.1
For an unknown TX CQE error type (probably from a newer hardware),
still free the SKB, update the queue tail, etc., otherwise the
accounting will be wrong.
Also, TX errors can be triggered by injecting corrupted packets, so
replace the WARN_ONCE to ratelimited error logging, because we don't
need stack trace here.
Cc: stable(a)vger.kernel.org
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang <haiyangz(a)microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 4a16ebff3d1d..5cdcf7561b38 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1317,19 +1317,23 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
case CQE_TX_VPORT_IDX_OUT_OF_RANGE:
case CQE_TX_VPORT_DISABLED:
case CQE_TX_VLAN_TAGGING_VIOLATION:
- WARN_ONCE(1, "TX: CQE error %d: ignored.\n",
- cqe_oob->cqe_hdr.cqe_type);
+ if (net_ratelimit())
+ netdev_err(ndev, "TX: CQE error %d\n",
+ cqe_oob->cqe_hdr.cqe_type);
+
apc->eth_stats.tx_cqe_err++;
break;
default:
- /* If the CQE type is unexpected, log an error, assert,
- * and go through the error path.
+ /* If the CQE type is unknown, log an error,
+ * and still free the SKB, update tail, etc.
*/
- WARN_ONCE(1, "TX: Unexpected CQE type %d: HW BUG?\n",
- cqe_oob->cqe_hdr.cqe_type);
+ if (net_ratelimit())
+ netdev_err(ndev, "TX: unknown CQE type %d\n",
+ cqe_oob->cqe_hdr.cqe_type);
+
apc->eth_stats.tx_cqe_unknown_type++;
- return;
+ break;
}
if (WARN_ON_ONCE(txq->gdma_txq_id != completions[i].wq_num))
--
2.25.1
It is possible for multiple vCPUs to fault on the same IPA and attempt
to resolve the fault. One of the page table walks will actually update
the PTE and the rest will return -EAGAIN per our race detection scheme.
KVM elides the TLB invalidation on the racing threads as the return
value is nonzero.
Before commit a12ab1378a88 ("KVM: arm64: Use local TLBI on permission
relaxation") KVM always used broadcast TLB invalidations when handling
permission faults, which had the convenient property of making the
stage-2 updates visible to all CPUs in the system. However now we do a
local invalidation, and TLBI elision leads to vCPUs getting stuck in a
permission fault loop. Remember that the architecture permits the TLB to
cache translations that precipitate a permission fault.
Invalidate the TLB entry responsible for the permission fault if the
stage-2 descriptor has been relaxed, regardless of which thread actually
did the job.
Cc: stable(a)vger.kernel.org
Fixes: a12ab1378a88 ("KVM: arm64: Use local TLBI on permission relaxation")
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
---
arch/arm64/kvm/hyp/pgtable.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index f155b8c9e98c..286888751793 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1314,7 +1314,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level,
KVM_PGTABLE_WALK_HANDLE_FAULT |
KVM_PGTABLE_WALK_SHARED);
- if (!ret)
+ if (!ret || ret == -EAGAIN)
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level);
return ret;
}
base-commit: ce9ecca0238b140b88f43859b211c9fdfd8e5b70
--
2.42.0.515.g380fc7ccd1-goog
The quilt patch titled
Subject: Crash: add lock to serialize crash hotplug handling
has been removed from the -mm tree. Its filename was
crash-add-lock-to-serialize-crash-hotplug-handling.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Baoquan He <bhe(a)redhat.com>
Subject: Crash: add lock to serialize crash hotplug handling
Date: Tue, 26 Sep 2023 20:09:05 +0800
Eric reported that handling corresponding crash hotplug event can be
failed easily when many memory hotplug event are notified in a short
period. They failed because failing to take __kexec_lock.
=======
[ 78.714569] Fallback order for Node 0: 0
[ 78.714575] Built 1 zonelists, mobility grouping on. Total pages: 1817886
[ 78.717133] Policy zone: Normal
[ 78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[ 78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[ 80.056643] PEFILE: Unsigned PE binary
=======
The memory hotplug events are notified very quickly and very many, while
the handling of crash hotplug is much slower relatively. So the atomic
variable __kexec_lock and kexec_trylock() can't guarantee the
serialization of crash hotplug handling.
Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug
handling specifically. This doesn't impact the usage of __kexec_lock.
Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com
Fixes: 247262756121 ("crash: add generic infrastructure for crash hotplug support")
Signed-off-by: Baoquan He <bhe(a)redhat.com>
Tested-by: Eric DeVolder <eric.devolder(a)oracle.com>
Reviewed-by: Eric DeVolder <eric.devolder(a)oracle.com>
Reviewed-by: Valentin Schneider <vschneid(a)redhat.com>
Cc: Sourabh Jain <sourabhjain(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/crash_core.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
--- a/kernel/crash_core.c~crash-add-lock-to-serialize-crash-hotplug-handling
+++ a/kernel/crash_core.c
@@ -740,6 +740,17 @@ subsys_initcall(crash_notes_memory_init)
#define pr_fmt(fmt) "crash hp: " fmt
/*
+ * Different than kexec/kdump loading/unloading/jumping/shrinking which
+ * usually rarely happen, there will be many crash hotplug events notified
+ * during one short period, e.g one memory board is hot added and memory
+ * regions are online. So mutex lock __crash_hotplug_lock is used to
+ * serialize the crash hotplug handling specifically.
+ */
+DEFINE_MUTEX(__crash_hotplug_lock);
+#define crash_hotplug_lock() mutex_lock(&__crash_hotplug_lock)
+#define crash_hotplug_unlock() mutex_unlock(&__crash_hotplug_lock)
+
+/*
* This routine utilized when the crash_hotplug sysfs node is read.
* It reflects the kernel's ability/permission to update the crash
* elfcorehdr directly.
@@ -748,9 +759,11 @@ int crash_check_update_elfcorehdr(void)
{
int rc = 0;
+ crash_hotplug_lock();
/* Obtain lock while reading crash information */
if (!kexec_trylock()) {
pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
+ crash_hotplug_unlock();
return 0;
}
if (kexec_crash_image) {
@@ -761,6 +774,7 @@ int crash_check_update_elfcorehdr(void)
}
/* Release lock now that update complete */
kexec_unlock();
+ crash_hotplug_unlock();
return rc;
}
@@ -783,9 +797,11 @@ static void crash_handle_hotplug_event(u
{
struct kimage *image;
+ crash_hotplug_lock();
/* Obtain lock while changing crash information */
if (!kexec_trylock()) {
pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
+ crash_hotplug_unlock();
return;
}
@@ -852,6 +868,7 @@ static void crash_handle_hotplug_event(u
out:
/* Release lock now that update complete */
kexec_unlock();
+ crash_hotplug_unlock();
}
static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *v)
_
Patches currently in -mm which might be from bhe(a)redhat.com are
crash_corec-remove-unnecessary-parameter-of-function.patch
crash_core-change-the-prototype-of-function-parse_crashkernel.patch
crash_core-change-parse_crashkernel-to-support-crashkernel=highlow-parsing.patch
crash_core-add-generic-function-to-do-reservation.patch
crash_core-move-crashk_res-definition-into-crash_corec.patch
x86-kdump-use-generic-interface-to-simplify-crashkernel-reservation-code.patch
x86-kdump-use-generic-interface-to-simplify-crashkernel-reservation-code-fix.patch
arm64-kdump-use-generic-interface-to-simplify-crashkernel-reservation.patch
riscv-kdump-use-generic-interface-to-simplify-crashkernel-reservation.patch
crash_corec-remove-unneeded-functions.patch
The quilt patch titled
Subject: selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
has been removed from the -mm tree. Its filename was
selftests-mm-fix-awk-usage-in-charge_reserved_hugetlbsh-and-hugetlb_reparenting_testsh-that-may-cause-error.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Juntong Deng <juntong.deng(a)outlook.com>
Subject: selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
Date: Wed, 27 Sep 2023 02:19:44 +0800
According to the awk manual, the -e option does not need to be specified
in front of 'program' (unless you need to mix program-file).
The redundant -e option can cause error when users use awk tools other
than gawk (for example, mawk does not support the -e option).
Error Example:
awk: not an option: -e
Link: https://lkml.kernel.org/r/VI1P193MB075228810591AF2FDD7D42C599C3A@VI1P193MB0…
Signed-off-by: Juntong Deng <juntong.deng(a)outlook.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/mm/charge_reserved_hugetlb.sh | 4 ++--
tools/testing/selftests/mm/hugetlb_reparenting_test.sh | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
--- a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh~selftests-mm-fix-awk-usage-in-charge_reserved_hugetlbsh-and-hugetlb_reparenting_testsh-that-may-cause-error
+++ a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh
@@ -25,7 +25,7 @@ if [[ "$1" == "-cgroup-v2" ]]; then
fi
if [[ $cgroup2 ]]; then
- cgroup_path=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+ cgroup_path=$(mount -t cgroup2 | head -1 | awk '{print $3}')
if [[ -z "$cgroup_path" ]]; then
cgroup_path=/dev/cgroup/memory
mount -t cgroup2 none $cgroup_path
@@ -33,7 +33,7 @@ if [[ $cgroup2 ]]; then
fi
echo "+hugetlb" >$cgroup_path/cgroup.subtree_control
else
- cgroup_path=$(mount -t cgroup | grep ",hugetlb" | awk -e '{print $3}')
+ cgroup_path=$(mount -t cgroup | grep ",hugetlb" | awk '{print $3}')
if [[ -z "$cgroup_path" ]]; then
cgroup_path=/dev/cgroup/memory
mount -t cgroup memory,hugetlb $cgroup_path
--- a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh~selftests-mm-fix-awk-usage-in-charge_reserved_hugetlbsh-and-hugetlb_reparenting_testsh-that-may-cause-error
+++ a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh
@@ -20,7 +20,7 @@ fi
if [[ $cgroup2 ]]; then
- CGROUP_ROOT=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+ CGROUP_ROOT=$(mount -t cgroup2 | head -1 | awk '{print $3}')
if [[ -z "$CGROUP_ROOT" ]]; then
CGROUP_ROOT=/dev/cgroup/memory
mount -t cgroup2 none $CGROUP_ROOT
@@ -28,7 +28,7 @@ if [[ $cgroup2 ]]; then
fi
echo "+hugetlb +memory" >$CGROUP_ROOT/cgroup.subtree_control
else
- CGROUP_ROOT=$(mount -t cgroup | grep ",hugetlb" | awk -e '{print $3}')
+ CGROUP_ROOT=$(mount -t cgroup | grep ",hugetlb" | awk '{print $3}')
if [[ -z "$CGROUP_ROOT" ]]; then
CGROUP_ROOT=/dev/cgroup/memory
mount -t cgroup memory,hugetlb $CGROUP_ROOT
_
Patches currently in -mm which might be from juntong.deng(a)outlook.com are
The quilt patch titled
Subject: mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
has been removed from the -mm tree. Its filename was
mm-mempolicy-keep-vma-walk-if-both-mpol_mf_strict-and-mpol_mf_move-are-specified.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Yang Shi <yang(a)os.amperecomputing.com>
Subject: mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
Date: Wed, 20 Sep 2023 15:32:42 -0700
When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel
should attempt to migrate all existing pages, and return -EIO if there is
misplaced or unmovable page. Then commit 6f4576e3687b ("mempolicy: apply
page table walker on queue_pages_range()") messed up the return value and
didn't break VMA scan early ianymore when MPOL_MF_STRICT alone. The
return value problem was fixed by commit a7f40cfe3b7a ("mm: mempolicy:
make mbind() return -EIO when MPOL_MF_STRICT is specified"), but it broke
the VMA walk early if unmovable page is met, it may cause some pages are
not migrated as expected.
The code should conceptually do:
if (MPOL_MF_MOVE|MOVEALL)
scan all vmas
try to migrate the existing pages
return success
else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
scan all vmas
try to migrate the existing pages
return -EIO if unmovable or migration failed
else /* MPOL_MF_STRICT alone */
break early if meets unmovable and don't call mbind_range() at all
else /* none of those flags */
check the ranges in test_walk, EFAULT without mbind_range() if discontig.
Fixed the behavior.
Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.…
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
Signed-off-by: Yang Shi <yang(a)os.amperecomputing.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Rafael Aquini <aquini(a)redhat.com>
Cc: Kirill A. Shutemov <kirill(a)shutemov.name>
Cc: David Rientjes <rientjes(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mempolicy.c | 39 +++++++++++++++++++--------------------
1 file changed, 19 insertions(+), 20 deletions(-)
--- a/mm/mempolicy.c~mm-mempolicy-keep-vma-walk-if-both-mpol_mf_strict-and-mpol_mf_move-are-specified
+++ a/mm/mempolicy.c
@@ -426,6 +426,7 @@ struct queue_pages {
unsigned long start;
unsigned long end;
struct vm_area_struct *first;
+ bool has_unmovable;
};
/*
@@ -446,9 +447,8 @@ static inline bool queue_folio_required(
/*
* queue_folios_pmd() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. huge zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
* existing folio was already on a node that does not follow the
* policy.
@@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd,
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
if (!vma_migratable(walk->vma) ||
migrate_folio_add(folio, qp->pagelist, flags)) {
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
} else
@@ -495,9 +495,8 @@ unlock:
*
* queue_folios_pte_range() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
* on a node that does not follow the policy.
*/
@@ -508,7 +507,6 @@ static int queue_folios_pte_range(pmd_t
struct folio *folio;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- bool has_unmovable = false;
pte_t *pte, *mapped_pte;
pte_t ptent;
spinlock_t *ptl;
@@ -538,11 +536,12 @@ static int queue_folios_pte_range(pmd_t
if (!queue_folio_required(folio, qp))
continue;
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- /* MPOL_MF_STRICT must be specified if we get here */
- if (!vma_migratable(vma)) {
- has_unmovable = true;
- break;
- }
+ /*
+ * MPOL_MF_STRICT must be specified if we get here.
+ * Continue walking vmas due to MPOL_MF_MOVE* flags.
+ */
+ if (!vma_migratable(vma))
+ qp->has_unmovable = true;
/*
* Do not abort immediately since there may be
@@ -550,16 +549,13 @@ static int queue_folios_pte_range(pmd_t
* need migrate other LRU pages.
*/
if (migrate_folio_add(folio, qp->pagelist, flags))
- has_unmovable = true;
+ qp->has_unmovable = true;
} else
break;
}
pte_unmap_unlock(mapped_pte, ptl);
cond_resched();
- if (has_unmovable)
- return 1;
-
return addr != end ? -EIO : 0;
}
@@ -599,7 +595,7 @@ static int queue_folios_hugetlb(pte_t *p
* Detecting misplaced folio but allow migrating folios which
* have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
@@ -620,7 +616,7 @@ static int queue_folios_hugetlb(pte_t *p
* Failed to isolate folio but allow migrating pages
* which have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
}
unlock:
spin_unlock(ptl);
@@ -756,12 +752,15 @@ queue_pages_range(struct mm_struct *mm,
.start = start,
.end = end,
.first = NULL,
+ .has_unmovable = false,
};
const struct mm_walk_ops *ops = lock_vma ?
&queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
err = walk_page_range(mm, start, end, ops, &qp);
+ if (qp.has_unmovable)
+ err = 1;
if (!qp.first)
/* whole range in hole */
err = -EFAULT;
@@ -1358,7 +1357,7 @@ static long do_mbind(unsigned long start
putback_movable_pages(&pagelist);
}
- if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+ if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
err = -EIO;
} else {
up_out:
_
Patches currently in -mm which might be from yang(a)os.amperecomputing.com are
The quilt patch titled
Subject: mm, memcg: reconsider kmem.limit_in_bytes deprecation
has been removed from the -mm tree. Its filename was
mm-memcg-reconsider-kmemlimit_in_bytes-deprecation.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Michal Hocko <mhocko(a)suse.com>
Subject: mm, memcg: reconsider kmem.limit_in_bytes deprecation
Date: Thu, 21 Sep 2023 09:38:29 +0200
This reverts commits 86327e8eb94c ("memcg: drop kmem.limit_in_bytes") and
partially reverts 58056f77502f ("memcg, kmem: further deprecate
kmem.limit_in_bytes") which have incrementally removed support for the
kernel memory accounting hard limit. Unfortunately it has turned out that
there is still userspace depending on the existence of
memory.kmem.limit_in_bytes [1]. The underlying functionality is not
really required but the non-existent file just confuses the userspace
which fails in the result. The patch to fix this on the userspace side
has been submitted but it is hard to predict how it will propagate through
the maze of 3rd party consumers of the software.
Now, reverting alone 86327e8eb94c is not an option because there is
another set of userspace which cannot cope with ENOTSUPP returned when
writing to the file. Therefore we have to go and revisit 58056f77502f as
well. There are two ways to go ahead. Either we give up on the
deprecation and fully revert 58056f77502f as well or we can keep
kmem.limit_in_bytes but make the write a noop and warn about the fact.
This should work for both known breaking workloads which depend on the
existence but do not depend on the hard limit enforcement.
Note to backporters to stable trees. a8c49af3be5f ("memcg: add per-memcg
total kernel memory stat") introduced in 4.18 has added memcg_account_kmem
so the accounting is not done by obj_cgroup_charge_pages directly for v1
anymore. Prior kernels need to add it explicitly (thanks to Johannes for
pointing this out).
[akpm(a)linux-foundation.org: fix build - remove unused local]
Link: http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1e… [1]
Link: https://lkml.kernel.org/r/ZRE5VJozPZt9bRPy@dhcp22.suse.cz
Fixes: 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
Fixes: 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Shakeel Butt <shakeelb(a)google.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Jeremi Piotrowski <jpiotrowski(a)linux.microsoft.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Tejun heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
Documentation/admin-guide/cgroup-v1/memory.rst | 7 +++++++
mm/memcontrol.c | 13 +++++++++++++
2 files changed, 20 insertions(+)
--- a/Documentation/admin-guide/cgroup-v1/memory.rst~mm-memcg-reconsider-kmemlimit_in_bytes-deprecation
+++ a/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -92,6 +92,13 @@ Brief summary of control files.
memory.oom_control set/show oom controls.
memory.numa_stat show the number of memory usage per numa
node
+ memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel
+ memory hard limit. Kernel hard limit is not
+ supported since 5.16. Writing any value to
+ do file will not have any effect same as if
+ nokmem kernel parameter was specified.
+ Kernel memory is still charged and reported
+ by memory.kmem.usage_in_bytes.
memory.kmem.usage_in_bytes show current kernel memory allocation
memory.kmem.failcnt show the number of kernel memory usage
hits limits
--- a/mm/memcontrol.c~mm-memcg-reconsider-kmemlimit_in_bytes-deprecation
+++ a/mm/memcontrol.c
@@ -3867,6 +3867,13 @@ static ssize_t mem_cgroup_write(struct k
case _MEMSWAP:
ret = mem_cgroup_resize_max(memcg, nr_pages, true);
break;
+ case _KMEM:
+ pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
+ "Writing any value to this file has no effect. "
+ "Please report your usecase to linux-mm(a)kvack.org if you "
+ "depend on this functionality.\n");
+ ret = 0;
+ break;
case _TCP:
ret = memcg_update_tcp_max(memcg, nr_pages);
break;
@@ -5078,6 +5085,12 @@ static struct cftype mem_cgroup_legacy_f
},
#endif
{
+ .name = "kmem.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+ .write = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
.name = "kmem.usage_in_bytes",
.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
.read_u64 = mem_cgroup_read_u64,
_
Patches currently in -mm which might be from mhocko(a)suse.com are
The quilt patch titled
Subject: arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries
has been removed from the -mm tree. Its filename was
arm64-hugetlb-fix-set_huge_pte_at-to-work-with-all-swap-entries.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ryan Roberts <ryan.roberts(a)arm.com>
Subject: arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries
Date: Fri, 22 Sep 2023 12:58:04 +0100
When called with a swap entry that does not embed a PFN (e.g.
PTE_MARKER_POISONED or PTE_MARKER_UFFD_WP), the previous implementation of
set_huge_pte_at() would either cause a BUG() to fire (if CONFIG_DEBUG_VM
is enabled) or cause a dereference of an invalid address and subsequent
panic.
arm64's huge pte implementation supports multiple huge page sizes, some of
which are implemented in the page table with multiple contiguous entries.
So set_huge_pte_at() needs to work out how big the logical pte is, so that
it can also work out how many physical ptes (or pmds) need to be written.
It previously did this by grabbing the folio out of the pte and querying
its size.
However, there are cases when the pte being set is actually a swap entry.
But this also used to work fine, because for huge ptes, we only ever saw
migration entries and hwpoison entries. And both of these types of swap
entries have a PFN embedded, so the code would grab that and everything
still worked out.
But over time, more calls to set_huge_pte_at() have been added that set
swap entry types that do not embed a PFN. And this causes the code to go
bang. The triggering case is for the uffd poison test, commit
99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit
8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
added in v6.5-rc7. Although review shows that there are other call sites
that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
on arm64 because arm64 doesn't support UFFD WP.
Arguably, the root cause is really due to commit 18f3962953e4 ("mm:
hugetlb: kill set_huge_swap_pte_at()"), which aimed to simplify the
interface to the core code by removing set_huge_swap_pte_at() (which took
a page size parameter) and replacing it with calls to set_huge_pte_at()
where the size was inferred from the folio, as descibed above. While that
commit didn't break anything at the time, it did break the interface
because it couldn't handle swap entries without PFNs. And since then new
callers have come along which rely on this working. But given the
brokeness is only observable after commit 8a13897fb0da ("mm: userfaultfd:
support UFFDIO_POISON for hugetlbfs"), that one gets the Fixes tag.
Now that we have modified the set_huge_pte_at() interface to pass the huge
page size in the previous patch, we can trivially fix this issue.
Link: https://lkml.kernel.org/r/20230922115804.2043771-3-ryan.roberts@arm.com
Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
Reviewed-by: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Albert Ou <aou(a)eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Alexandre Ghiti <alex(a)ghiti.fr>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Christian Borntraeger <borntraeger(a)linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: Helge Deller <deller(a)gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley(a)HansenPartnership.com>
Cc: Lorenzo Stoakes <lstoakes(a)gmail.com>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: Palmer Dabbelt <palmer(a)dabbelt.com>
Cc: Paul Walmsley <paul.walmsley(a)sifive.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: SeongJae Park <sj(a)kernel.org>
Cc: Sven Schnelle <svens(a)linux.ibm.com>
Cc: Uladzislau Rezki (Sony) <urezki(a)gmail.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org> [6.5+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/arm64/mm/hugetlbpage.c | 17 +++--------------
1 file changed, 3 insertions(+), 14 deletions(-)
--- a/arch/arm64/mm/hugetlbpage.c~arm64-hugetlb-fix-set_huge_pte_at-to-work-with-all-swap-entries
+++ a/arch/arm64/mm/hugetlbpage.c
@@ -241,13 +241,6 @@ static void clear_flush(struct mm_struct
flush_tlb_range(&vma, saddr, addr);
}
-static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
-{
- VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
-
- return page_folio(pfn_to_page(swp_offset_pfn(entry)));
-}
-
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned long sz)
{
@@ -257,13 +250,10 @@ void set_huge_pte_at(struct mm_struct *m
unsigned long pfn, dpfn;
pgprot_t hugeprot;
- if (!pte_present(pte)) {
- struct folio *folio;
-
- folio = hugetlb_swap_entry_to_folio(pte_to_swp_entry(pte));
- ncontig = num_contig_ptes(folio_size(folio), &pgsize);
+ ncontig = num_contig_ptes(sz, &pgsize);
- for (i = 0; i < ncontig; i++, ptep++)
+ if (!pte_present(pte)) {
+ for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
set_pte_at(mm, addr, ptep, pte);
return;
}
@@ -273,7 +263,6 @@ void set_huge_pte_at(struct mm_struct *m
return;
}
- ncontig = find_num_contig(mm, addr, ptep, &pgsize);
pfn = pte_pfn(pte);
dpfn = pgsize >> PAGE_SHIFT;
hugeprot = pte_pgprot(pte);
_
Patches currently in -mm which might be from ryan.roberts(a)arm.com are
mm-allow-deferred-splitting-of-arbitrary-anon-large-folios.patch
mm-non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap.patch
mm-thp-account-pte-mapped-anonymous-thp-usage.patch
mm-thp-introduce-anon_orders-and-anon_always_mask-sysfs-files.patch
mm-thp-extend-thp-to-allocate-anonymous-large-folios.patch
mm-thp-add-recommend-option-for-anon_orders.patch
arm64-mm-override-arch_wants_pte_order.patch
selftests-mm-cow-generalize-do_run_with_thp-helper.patch
selftests-mm-cow-add-tests-for-small-order-anon-thp.patch
The quilt patch titled
Subject: maple_tree: add mas_is_active() to detect in-tree walks
has been removed from the -mm tree. Its filename was
maple_tree-add-mas_active-to-detect-in-tree-walks.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Subject: maple_tree: add mas_is_active() to detect in-tree walks
Date: Thu, 21 Sep 2023 14:12:35 -0400
Patch series "maple_tree: Fix mas_prev() state regression".
Pedro Falcato retported an mprotect regression [1] which was bisected back
to the iterator changes for maple tree. Root cause analysis showed the
mas_prev() running off the end of the VMA space (previous from 0) followed
by mas_find(), would skip the first value.
This patchset introduces maple state underflow/overflow so the sequence of
calls on the maple state will return what the user expects.
Users who encounter this bug may see mprotect(), userfaultfd_register(),
and mlock() fail on VMAs mapped with address 0.
This patch (of 2):
Instead of constantly checking each possibility of the maple state,
create a fast path that will skip over checking unlikely states.
Link: https://lkml.kernel.org/r/20230921181236.509072-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20230921181236.509072-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Pedro Falcato <pedro.falcato(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/maple_tree.h | 9 +++++++++
1 file changed, 9 insertions(+)
--- a/include/linux/maple_tree.h~maple_tree-add-mas_active-to-detect-in-tree-walks
+++ a/include/linux/maple_tree.h
@@ -511,6 +511,15 @@ static inline bool mas_is_paused(const s
return mas->node == MAS_PAUSE;
}
+/* Check if the mas is pointing to a node or not */
+static inline bool mas_is_active(struct ma_state *mas)
+{
+ if ((unsigned long)mas->node >= MAPLE_RESERVED_RANGE)
+ return true;
+
+ return false;
+}
+
/**
* mas_reset() - Reset a Maple Tree operation state.
* @mas: Maple Tree operation state.
_
Patches currently in -mm which might be from Liam.Howlett(a)oracle.com are
maple_tree-add-mas_underflow-and-mas_overflow-states.patch
mm-mempolicy-fix-set_mempolicy_home_node-previous-vma-pointer.patch
mmap-fix-vma_iterator-in-error-path-of-vma_merge.patch
mmap-fix-error-paths-with-dup_anon_vma.patch
mmap-add-clarifying-comment-to-vma_merge-code.patch
radix-tree-test-suite-fix-allocation-calculation-in-kmem_cache_alloc_bulk.patch
The quilt patch titled
Subject: nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
has been removed from the -mm tree. Its filename was
nilfs2-fix-potential-use-after-free-in-nilfs_gccache_submit_read_data.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Pan Bian <bianpan2016(a)163.com>
Subject: nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
Date: Thu, 21 Sep 2023 23:17:31 +0900
In nilfs_gccache_submit_read_data(), brelse(bh) is called to drop the
reference count of bh when the call to nilfs_dat_translate() fails. If
the reference count hits 0 and its owner page gets unlocked, bh may be
freed. However, bh->b_page is dereferenced to put the page after that,
which may result in a use-after-free bug. This patch moves the release
operation after unlocking and putting the page.
NOTE: The function in question is only called in GC, and in combination
with current userland tools, address translation using DAT does not occur
in that function, so the code path that causes this issue will not be
executed. However, it is possible to run that code path by intentionally
modifying the userland GC library or by calling the GC ioctl directly.
[konishi.ryusuke(a)gmail.com: NOTE added to the commit log]
Link: https://lkml.kernel.org/r/1543201709-53191-1-git-send-email-bianpan2016@163…
Link: https://lkml.kernel.org/r/20230921141731.10073-1-konishi.ryusuke@gmail.com
Fixes: a3d93f709e89 ("nilfs2: block cache for garbage collection")
Signed-off-by: Pan Bian <bianpan2016(a)163.com>
Reported-by: Ferry Meng <mengferry(a)linux.alibaba.com>
Closes: https://lkml.kernel.org/r/20230818092022.111054-1-mengferry@linux.alibaba.c…
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/gcinode.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/fs/nilfs2/gcinode.c~nilfs2-fix-potential-use-after-free-in-nilfs_gccache_submit_read_data
+++ a/fs/nilfs2/gcinode.c
@@ -73,10 +73,8 @@ int nilfs_gccache_submit_read_data(struc
struct the_nilfs *nilfs = inode->i_sb->s_fs_info;
err = nilfs_dat_translate(nilfs->ns_dat, vbn, &pbn);
- if (unlikely(err)) { /* -EIO, -ENOMEM, -ENOENT */
- brelse(bh);
+ if (unlikely(err)) /* -EIO, -ENOMEM, -ENOENT */
goto failed;
- }
}
lock_buffer(bh);
@@ -102,6 +100,8 @@ int nilfs_gccache_submit_read_data(struc
failed:
unlock_page(bh->b_page);
put_page(bh->b_page);
+ if (unlikely(err))
+ brelse(bh);
return err;
}
_
Patches currently in -mm which might be from bianpan2016(a)163.com are
The quilt patch titled
Subject: fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
has been removed from the -mm tree. Its filename was
fs-binfmt_elf_efpic-fix-personality-for-elf-fdpic.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Greg Ungerer <gerg(a)kernel.org>
Subject: fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
Date: Thu, 7 Sep 2023 11:18:08 +1000
The elf-fdpic loader hard sets the process personality to either
PER_LINUX_FDPIC for true elf-fdpic binaries or to PER_LINUX for normal ELF
binaries (in this case they would be constant displacement compiled with
-pie for example). The problem with that is that it will lose any other
bits that may be in the ELF header personality (such as the "bug
emulation" bits).
On the ARM architecture the ADDR_LIMIT_32BIT flag is used to signify a
normal 32bit binary - as opposed to a legacy 26bit address binary. This
matters since start_thread() will set the ARM CPSR register as required
based on this flag. If the elf-fdpic loader loses this bit the process
will be mis-configured and crash out pretty quickly.
Modify elf-fdpic loader personality setting so that it preserves the upper
three bytes by using the SET_PERSONALITY macro to set it. This macro in
the generic case sets PER_LINUX and preserves the upper bytes.
Architectures can override this for their specific use case, and ARM does
exactly this.
The problem shows up quite easily running under qemu using the ARM
architecture, but not necessarily on all types of real ARM hardware. If
the underlying ARM processor does not support the legacy 26-bit addressing
mode then everything will work as expected.
Link: https://lkml.kernel.org/r/20230907011808.2985083-1-gerg@kernel.org
Fixes: 1bde925d23547 ("fs/binfmt_elf_fdpic.c: provide NOMMU loader for regular ELF binaries")
Signed-off-by: Greg Ungerer <gerg(a)kernel.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Eric W. Biederman <ebiederm(a)xmission.com>
Cc: Greg Ungerer <gerg(a)kernel.org>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/binfmt_elf_fdpic.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
--- a/fs/binfmt_elf_fdpic.c~fs-binfmt_elf_efpic-fix-personality-for-elf-fdpic
+++ a/fs/binfmt_elf_fdpic.c
@@ -345,10 +345,9 @@ static int load_elf_fdpic_binary(struct
/* there's now no turning back... the old userspace image is dead,
* defunct, deceased, etc.
*/
+ SET_PERSONALITY(exec_params.hdr);
if (elf_check_fdpic(&exec_params.hdr))
- set_personality(PER_LINUX_FDPIC);
- else
- set_personality(PER_LINUX);
+ current->personality |= PER_LINUX_FDPIC;
if (elf_read_implies_exec(&exec_params.hdr, executable_stack))
current->personality |= READ_IMPLIES_EXEC;
_
Patches currently in -mm which might be from gerg(a)kernel.org are
Hi All,
This series fixes a bug in arm64's implementation of set_huge_pte_at(), which
can result in an unprivileged user causing a kernel panic. The problem was
triggered when running the new uffd poison mm selftest for HUGETLB memory. This
test (and the uffd poison feature) was merged for v6.5-rc7.
Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable (correctly
this time) to get it backported to v6.5, where the issue first showed up.
Description of Bug
------------------
arm64's huge pte implementation supports multiple huge page sizes, some of which
are implemented in the page table with multiple contiguous entries. So
set_huge_pte_at() needs to work out how big the logical pte is, so that it can
also work out how many physical ptes (or pmds) need to be written. It previously
did this by grabbing the folio out of the pte and querying its size.
However, there are cases when the pte being set is actually a swap entry. But
this also used to work fine, because for huge ptes, we only ever saw migration
entries and hwpoison entries. And both of these types of swap entries have a PFN
embedded, so the code would grab that and everything still worked out.
But over time, more calls to set_huge_pte_at() have been added that set swap
entry types that do not embed a PFN. And this causes the code to go bang. The
triggering case is for the uffd poison test, commit 99aa77215ad0 ("selftests/mm:
add uffd unit test for UFFDIO_POISON"), which causes a PTE_MARKER_POISONED swap
entry to be set, coutesey of commit 8a13897fb0da ("mm: userfaultfd: support
UFFDIO_POISON for hugetlbfs") - added in v6.5-rc7. Although review shows that
there are other call sites that set PTE_MARKER_UFFD_WP (which also has no PFN),
these don't trigger on arm64 because arm64 doesn't support UFFD WP.
If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise, it
will dereference a bad pointer in page_folio():
static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
{
VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
return page_folio(pfn_to_page(swp_offset_pfn(entry)));
}
Fix
---
The simplest fix would have been to revert the dodgy cleanup commit 18f3962953e4
("mm: hugetlb: kill set_huge_swap_pte_at()"), but since things have moved on,
this would have required an audit of all the new set_huge_pte_at() call sites to
see if they should be converted to set_huge_swap_pte_at(). As per the original
intent of the change, it would also leave us open to future bugs when people
invariably get it wrong and call the wrong helper.
So instead, I've added a huge page size parameter to set_huge_pte_at(). This
means that the arm64 code has the size in all cases. It's a bigger change, due
to needing to touch the arches that implement the function, but it is entirely
mechanical, so in my view, low risk.
I've compile-tested all touched arches; arm64, parisc, powerpc, riscv, s390,
sparc (and additionally x86_64). I've additionally booted and run mm selftests
against arm64, where I observe the uffd poison test is fixed, and there are no
other regressions.
Patches
-------
patch 1: Convert core mm and arches to pass extra param (no behavioral change)
patch 8: Fix the arm64 bug
Patches based on v6.6-rc2.
Changes since v1 [1]
--------------------
- Pass extra size param instead of converting mm to vma.
- Passing vma was problematic for kernel mapping case without vma
- Squash all interface changes to single patch
- Simplify powerpc so that is doesn't require __set_huge_page_at()
- Added Reviewed-bys
[1] https://lore.kernel.org/linux-arm-kernel/20230921162007.1630149-1-ryan.robe…
Thanks,
Ryan
Ryan Roberts (2):
mm: hugetlb: Add huge page size param to set_huge_pte_at()
arm64: hugetlb: Fix set_huge_pte_at() to work with all swap entries
arch/arm64/include/asm/hugetlb.h | 2 +-
arch/arm64/mm/hugetlbpage.c | 23 +++-------
arch/parisc/include/asm/hugetlb.h | 2 +-
arch/parisc/mm/hugetlbpage.c | 2 +-
.../include/asm/nohash/32/hugetlb-8xx.h | 3 +-
arch/powerpc/mm/book3s64/hugetlbpage.c | 5 ++-
arch/powerpc/mm/book3s64/radix_hugetlbpage.c | 3 +-
arch/powerpc/mm/nohash/8xx.c | 3 +-
arch/powerpc/mm/pgtable.c | 3 +-
arch/riscv/include/asm/hugetlb.h | 3 +-
arch/riscv/mm/hugetlbpage.c | 3 +-
arch/s390/include/asm/hugetlb.h | 6 ++-
arch/s390/mm/hugetlbpage.c | 8 +++-
arch/sparc/include/asm/hugetlb.h | 6 ++-
arch/sparc/mm/hugetlbpage.c | 8 +++-
include/asm-generic/hugetlb.h | 2 +-
include/linux/hugetlb.h | 6 ++-
mm/damon/vaddr.c | 3 +-
mm/hugetlb.c | 43 +++++++++++--------
mm/migrate.c | 7 ++-
mm/rmap.c | 23 +++++++---
mm/vmalloc.c | 2 +-
22 files changed, 103 insertions(+), 63 deletions(-)
--
2.25.1
The two users of mbind_range() are expecting that mbind_range() will
update the pointer to the previous VMA, or return an error. However,
set_mempolicy_home_node() does not call mbind_range() if there is no VMA
policy. The fix is to update the pointer to the previous VMA prior to
continuing iterating the VMAs when there is no policy.
Users may experience a WARN_ON() during VMA policy updates when updating
a range of VMAs on the home node.
Reported-by: Yikebaer Aizezi <yikebaer61(a)gmail.com>
Closes: https://lore.kernel.org/linux-mm/CALcu4rbT+fMVNaO_F2izaCT+e7jzcAciFkOvk21HG…
Link: https://lore.kernel.org/linux-mm/CALcu4rbT+fMVNaO_F2izaCT+e7jzcAciFkOvk21HG…
Fixes: f4e9e0e69468 ("mm/mempolicy: fix use-after-free of VMA iterator")
Cc: stable(a)vger.kernel.org
Cc: Lorenzo Stoakes <lstoakes(a)gmail.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
---
mm/mempolicy.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
For completeness, here is the syzbot reproducer so that it is available
from the mailing list:
#define _GNU_SOURCE
#include <endian.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#ifndef __NR_set_mempolicy_home_node
#define __NR_set_mempolicy_home_node 450
#endif
int main(void)
{
syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul, /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul, /*prot=*/7ul, /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul, /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
*(uint64_t*)0x20000000 = 0xffffffffffffff81;
syscall(__NR_mbind, /*addr=*/0x20ffa000ul, /*len=*/0x4000ul, /*mode=*/2ul, /*nodemask=*/0x20000000ul, /*maxnode=*/7ul, /*flags=*/0ul);
syscall(__NR_mbind, /*addr=*/0x20ff9000ul, /*len=*/0x3000ul, /*mode=*/0ul, /*nodemask=*/0ul, /*maxnode=*/0ul, /*flags=*/0ul);
syscall(__NR_set_mempolicy_home_node, /*addr=*/0x20ffa000ul, /*len=*/0x4000ul, /*home_node=*/0ul, /*flags=*/0ul);
return 0;
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..717d93c175f2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1544,8 +1544,10 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
* the home node for vmas we already updated before.
*/
old = vma_policy(vma);
- if (!old)
+ if (!old) {
+ prev = vma;
continue;
+ }
if (old->mode != MPOL_BIND && old->mode != MPOL_PREFERRED_MANY) {
err = -EOPNOTSUPP;
break;
--
2.40.1
PARF_SLV_ADDR_SPACE_SIZE_2_3_3 macro is used for qcom_pcie_post_init_2_3_3.
PCIe slave address space size register offset is 0x358, but was wrongly
changed to 0x16c as a part of commit 39171b33f652 ("PCI: qcom: Remove
PCIE20_ prefix from register definitions"). Fixing it, by using the right
macro and remove the unused PARF_SLV_ADDR_SPACE_SIZE_2_3_3.
Without this access to the registers of slave addr space like iATU etc
are broken leading to PCIe enumeration failure on IPQ8074.
Fixes: 39171b33f652 ("PCI: qcom: Remove PCIE20_ prefix from register definitions")
Cc: <Stable(a)vger.kernel.org>
Reviewed-by: Manivannan Sadhasivam <mani(a)kernel.org>
Reviewed-by: Konrad Dybcio <konrad.dybcio(a)linaro.org>
Tested-by: Robert Marko <robimarko(a)gmail.com>
Signed-off-by: Sricharan Ramabadhran <quic_srichara(a)quicinc.com>
---
[V6] Fixed subject and commit text as per Bjorn Helgaas
drivers/pci/controller/dwc/pcie-qcom.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/pci/controller/dwc/pcie-qcom.c b/drivers/pci/controller/dwc/pcie-qcom.c
index e2f29404c84e..64420ecc24d1 100644
--- a/drivers/pci/controller/dwc/pcie-qcom.c
+++ b/drivers/pci/controller/dwc/pcie-qcom.c
@@ -43,7 +43,6 @@
#define PARF_PHY_REFCLK 0x4c
#define PARF_CONFIG_BITS 0x50
#define PARF_DBI_BASE_ADDR 0x168
-#define PARF_SLV_ADDR_SPACE_SIZE_2_3_3 0x16c /* Register offset specific to IP ver 2.3.3 */
#define PARF_MHI_CLOCK_RESET_CTRL 0x174
#define PARF_AXI_MSTR_WR_ADDR_HALT 0x178
#define PARF_AXI_MSTR_WR_ADDR_HALT_V2 0x1a8
@@ -797,8 +796,7 @@ static int qcom_pcie_post_init_2_3_3(struct qcom_pcie *pcie)
u16 offset = dw_pcie_find_capability(pci, PCI_CAP_ID_EXP);
u32 val;
- writel(SLV_ADDR_SPACE_SZ,
- pcie->parf + PARF_SLV_ADDR_SPACE_SIZE_2_3_3);
+ writel(SLV_ADDR_SPACE_SZ, pcie->parf + PARF_SLV_ADDR_SPACE_SIZE);
val = readl(pcie->parf + PARF_PHY_CTRL);
val &= ~PHY_TEST_PWR_DOWN;
--
2.34.1
The patch titled
Subject: mmap: fix error paths with dup_anon_vma()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mmap-fix-error-paths-with-dup_anon_vma.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Subject: mmap: fix error paths with dup_anon_vma()
Date: Fri, 29 Sep 2023 14:30:40 -0400
When the calling function fails after the dup_anon_vma(), the
duplication of the anon_vma is not being undone. Add the necessary
unlink_anon_vma() call to the error paths that are missing them.
This issue showed up during inspection of the error path in vma_merge()
for an unrelated vma iterator issue.
Users may experience increased memory usage, which may be problematic as
the failure would likely be caused by a low memory situation.
Link: https://lkml.kernel.org/r/20230929183041.2835469-3-Liam.Howlett@oracle.com
Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Lorenzo Stoakes <lstoakes(a)gmail.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmap.c | 30 ++++++++++++++++++++++--------
1 file changed, 22 insertions(+), 8 deletions(-)
--- a/mm/mmap.c~mmap-fix-error-paths-with-dup_anon_vma
+++ a/mm/mmap.c
@@ -583,11 +583,12 @@ again:
* dup_anon_vma() - Helper function to duplicate anon_vma
* @dst: The destination VMA
* @src: The source VMA
+ * @dup: Pointer to the destination VMA when successful.
*
* Returns: 0 on success.
*/
static inline int dup_anon_vma(struct vm_area_struct *dst,
- struct vm_area_struct *src)
+ struct vm_area_struct *src, struct vm_area_struct **dup)
{
/*
* Easily overlooked: when mprotect shifts the boundary, make sure the
@@ -595,9 +596,15 @@ static inline int dup_anon_vma(struct vm
* anon pages imported.
*/
if (src->anon_vma && !dst->anon_vma) {
+ int ret;
+
vma_assert_write_locked(dst);
dst->anon_vma = src->anon_vma;
- return anon_vma_clone(dst, src);
+ ret = anon_vma_clone(dst, src);
+ if (ret)
+ return ret;
+
+ *dup = dst;
}
return 0;
@@ -624,6 +631,7 @@ int vma_expand(struct vma_iterator *vmi,
unsigned long start, unsigned long end, pgoff_t pgoff,
struct vm_area_struct *next)
{
+ struct vm_area_struct *anon_dup = NULL;
bool remove_next = false;
struct vma_prepare vp;
@@ -633,7 +641,7 @@ int vma_expand(struct vma_iterator *vmi,
remove_next = true;
vma_start_write(next);
- ret = dup_anon_vma(vma, next);
+ ret = dup_anon_vma(vma, next, &anon_dup);
if (ret)
return ret;
}
@@ -661,6 +669,8 @@ int vma_expand(struct vma_iterator *vmi,
return 0;
nomem:
+ if (anon_dup)
+ unlink_anon_vmas(anon_dup);
return -ENOMEM;
}
@@ -860,6 +870,7 @@ struct vm_area_struct *vma_merge(struct
{
struct vm_area_struct *curr, *next, *res;
struct vm_area_struct *vma, *adjust, *remove, *remove2;
+ struct vm_area_struct *anon_dup = NULL;
struct vma_prepare vp;
pgoff_t vma_pgoff;
int err = 0;
@@ -927,18 +938,18 @@ struct vm_area_struct *vma_merge(struct
vma_start_write(next);
remove = next; /* case 1 */
vma_end = next->vm_end;
- err = dup_anon_vma(prev, next);
+ err = dup_anon_vma(prev, next, &anon_dup);
if (curr) { /* case 6 */
vma_start_write(curr);
remove = curr;
remove2 = next;
if (!next->anon_vma)
- err = dup_anon_vma(prev, curr);
+ err = dup_anon_vma(prev, curr, &anon_dup);
}
} else if (merge_prev) { /* case 2 */
if (curr) {
vma_start_write(curr);
- err = dup_anon_vma(prev, curr);
+ err = dup_anon_vma(prev, curr, &anon_dup);
if (end == curr->vm_end) { /* case 7 */
remove = curr;
} else { /* case 5 */
@@ -954,7 +965,7 @@ struct vm_area_struct *vma_merge(struct
vma_end = addr;
adjust = next;
adj_start = -(prev->vm_end - addr);
- err = dup_anon_vma(next, prev);
+ err = dup_anon_vma(next, prev, &anon_dup);
} else {
/*
* Note that cases 3 and 8 are the ONLY ones where prev
@@ -968,7 +979,7 @@ struct vm_area_struct *vma_merge(struct
vma_pgoff = curr->vm_pgoff;
vma_start_write(curr);
remove = curr;
- err = dup_anon_vma(next, curr);
+ err = dup_anon_vma(next, curr, &anon_dup);
}
}
}
@@ -1018,6 +1029,9 @@ struct vm_area_struct *vma_merge(struct
return res;
prealloc_fail:
+ if (anon_dup)
+ unlink_anon_vmas(anon_dup);
+
anon_vma_fail:
vma_iter_set(vmi, addr);
vma_iter_load(vmi);
_
Patches currently in -mm which might be from Liam.Howlett(a)oracle.com are
maple_tree-add-mas_active-to-detect-in-tree-walks.patch
maple_tree-add-mas_underflow-and-mas_overflow-states.patch
mm-mempolicy-fix-set_mempolicy_home_node-previous-vma-pointer.patch
mmap-fix-vma_iterator-in-error-path-of-vma_merge.patch
mmap-fix-error-paths-with-dup_anon_vma.patch
The patch titled
Subject: mmap: fix vma_iterator in error path of vma_merge()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mmap-fix-vma_iterator-in-error-path-of-vma_merge.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Subject: mmap: fix vma_iterator in error path of vma_merge()
Date: Fri, 29 Sep 2023 14:30:39 -0400
During the error path, the vma iterator may not be correctly positioned or
set to the correct range. Undo the vma_prev() call by resetting to the
passed in address. Re-walking to the same range will fix the range to the
area previously passed in.
Users would notice increased cycles as vma_merge() would be called an
extra time with vma == prev, and thus would fail to merge and return.
Link: https://lore.kernel.org/linux-mm/CAG48ez12VN1JAOtTNMY+Y2YnsU45yL5giS-Qn=ejt…
Link: https://lkml.kernel.org/r/20230929183041.2835469-2-Liam.Howlett@oracle.com
Fixes: 18b098af2890 ("vma_merge: set vma iterator to correct position.")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reported-by: Jann Horn <jannh(a)google.com>
Closes: https://lore.kernel.org/linux-mm/CAG48ez12VN1JAOtTNMY+Y2YnsU45yL5giS-Qn=ejt…
Cc: Lorenzo Stoakes <lstoakes(a)gmail.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmap.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
--- a/mm/mmap.c~mmap-fix-vma_iterator-in-error-path-of-vma_merge
+++ a/mm/mmap.c
@@ -975,7 +975,7 @@ struct vm_area_struct *vma_merge(struct
/* Error in anon_vma clone. */
if (err)
- return NULL;
+ goto anon_vma_fail;
if (vma_start < vma->vm_start || vma_end > vma->vm_end)
vma_expanded = true;
@@ -988,7 +988,7 @@ struct vm_area_struct *vma_merge(struct
}
if (vma_iter_prealloc(vmi, vma))
- return NULL;
+ goto prealloc_fail;
init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma &&
@@ -1016,6 +1016,12 @@ struct vm_area_struct *vma_merge(struct
vma_complete(&vp, vmi, mm);
khugepaged_enter_vma(res, vm_flags);
return res;
+
+prealloc_fail:
+anon_vma_fail:
+ vma_iter_set(vmi, addr);
+ vma_iter_load(vmi);
+ return NULL;
}
/*
_
Patches currently in -mm which might be from Liam.Howlett(a)oracle.com are
maple_tree-add-mas_active-to-detect-in-tree-walks.patch
maple_tree-add-mas_underflow-and-mas_overflow-states.patch
mm-mempolicy-fix-set_mempolicy_home_node-previous-vma-pointer.patch
mmap-fix-vma_iterator-in-error-path-of-vma_merge.patch
mmap-fix-error-paths-with-dup_anon_vma.patch
Hi,
I see the following build failure with v5.15.133.
Build reference: v5.15.133
Compiler version: aarch64-linux-gcc (GCC) 11.4.0
Assembler version: GNU assembler (GNU Binutils) 2.40
Building arm64:allnoconfig ... passed
Building arm64:tinyconfig ... passed
Building arm64:defconfig ... failed
--------------
Error log:
drivers/interconnect/core.c: In function 'icc_init':
drivers/interconnect/core.c:1148:9: error: implicit declaration of function 'fs_reclaim_acquire' [-Werror=implicit-function-declaration]
1148 | fs_reclaim_acquire(GFP_KERNEL);
| ^~~~~~~~~~~~~~~~~~
drivers/interconnect/core.c:1150:9: error: implicit declaration of function 'fs_reclaim_release' [-Werror=implicit-function-declaration]
1150 | fs_reclaim_release(GFP_KERNEL);
| ^~~~~~~~~~~~~~~~~~
This also affects alpha:allmodconfig and m68k:allmodconfig. The problem
was introduced with 'interconnect: Teach lockdep about icc_bw_lock order'.
#include <linux/sched/mm.h> is missing. Presumably that is included
indirectly in the upstream kernel, but I wasn't able to determine which
commit added it.
Guenter
From: Lino Sanfilippo <l.sanfilippo(a)kunbus.com>
Some uart drivers specify a rs485_config() function and then decide later
to disable RS485 support for some reason (e.g. imx and ar933).
In these cases userspace may be able to activate RS485 via TIOCSRS485
nevertheless, since in uart_set_rs485_config() an existing rs485_config()
function indicates that RS485 is supported.
Make sure that this is not longer possible by checking the uarts
rs485_supported.flags instead and bailing out if SER_RS485_ENABLED is not
set.
Furthermore instead of returning an empty structure return -ENOTTY if the
RS485 configuration is requested via TIOCGRS485 but RS485 is not supported.
Cc: stable(a)vger.kernel.org
Signed-off-by: Lino Sanfilippo <l.sanfilippo(a)kunbus.com>
---
drivers/tty/serial/serial_core.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index f4feebf8200f..dca09877fabc 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -1432,6 +1432,9 @@ static int uart_get_rs485_config(struct uart_port *port,
unsigned long flags;
struct serial_rs485 aux;
+ if (!(port->rs485_supported.flags & SER_RS485_ENABLED))
+ return -ENOTTY;
+
spin_lock_irqsave(&port->lock, flags);
aux = port->rs485;
spin_unlock_irqrestore(&port->lock, flags);
@@ -1449,7 +1452,7 @@ static int uart_set_rs485_config(struct tty_struct *tty, struct uart_port *port,
int ret;
unsigned long flags;
- if (!port->rs485_config)
+ if (!(port->rs485_supported.flags & SER_RS485_ENABLED))
return -ENOTTY;
if (copy_from_user(&rs485, rs485_user, sizeof(*rs485_user)))
--
2.40.1
From: Lino Sanfilippo <l.sanfilippo(a)kunbus.com>
Both the imx and stm32 driver set the rx-during-tx GPIO in the
rs485_config() function by means of gpiod_set_value(). Since rs485_config()
is called with the port lock held, this can be an problem in case that
setting the GPIO line can sleep (e.g. if a GPIO expander is used which is
connected via SPI or I2C).
Avoid this issue by setting the GPIO outside of the port lock in the serial
core and by using gpiod_set_value_cansleep() instead of gpiod_set_value().
Since now both the term and the rx-during-tx GPIO are set within the serial
core use a common function uart_set_rs485_gpios() to set both.
With moving it into the serial core setting the rx-during-tx GPIO is now
automatically done for all drivers that support such a GPIO.
Cc: stable(a)vger.kernel.org
Signed-off-by: Lino Sanfilippo <l.sanfilippo(a)kunbus.com>
---
drivers/tty/serial/imx.c | 4 ----
drivers/tty/serial/serial_core.c | 10 ++++++----
drivers/tty/serial/stm32-usart.c | 5 +----
3 files changed, 7 insertions(+), 12 deletions(-)
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index 13cb78340709..edb2ec6a5567 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -1947,10 +1947,6 @@ static int imx_uart_rs485_config(struct uart_port *port, struct ktermios *termio
rs485conf->flags & SER_RS485_RX_DURING_TX)
imx_uart_start_rx(port);
- if (port->rs485_rx_during_tx_gpio)
- gpiod_set_value_cansleep(port->rs485_rx_during_tx_gpio,
- !!(rs485conf->flags & SER_RS485_RX_DURING_TX));
-
return 0;
}
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index 7bdc21d5e13b..ef0500be3553 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -1391,14 +1391,16 @@ static void uart_sanitize_serial_rs485(struct uart_port *port, struct serial_rs4
memset(rs485->padding1, 0, sizeof(rs485->padding1));
}
-static void uart_set_rs485_termination(struct uart_port *port,
- const struct serial_rs485 *rs485)
+static void uart_set_rs485_gpios(struct uart_port *port,
+ const struct serial_rs485 *rs485)
{
if (!(rs485->flags & SER_RS485_ENABLED))
return;
gpiod_set_value_cansleep(port->rs485_term_gpio,
!!(rs485->flags & SER_RS485_TERMINATE_BUS));
+ gpiod_set_value_cansleep(port->rs485_rx_during_tx_gpio,
+ !!(rs485->flags & SER_RS485_RX_DURING_TX));
}
static int uart_rs485_config(struct uart_port *port)
@@ -1407,7 +1409,7 @@ static int uart_rs485_config(struct uart_port *port)
int ret;
uart_sanitize_serial_rs485(port, rs485);
- uart_set_rs485_termination(port, rs485);
+ uart_set_rs485_gpios(port, rs485);
ret = port->rs485_config(port, NULL, rs485);
if (ret)
@@ -1449,7 +1451,7 @@ static int uart_set_rs485_config(struct tty_struct *tty, struct uart_port *port,
if (ret)
return ret;
uart_sanitize_serial_rs485(port, &rs485);
- uart_set_rs485_termination(port, &rs485);
+ uart_set_rs485_gpios(port, &rs485);
spin_lock_irqsave(&port->lock, flags);
ret = port->rs485_config(port, &tty->termios, &rs485);
diff --git a/drivers/tty/serial/stm32-usart.c b/drivers/tty/serial/stm32-usart.c
index 5e9cf0c48813..8eb13bf055f2 100644
--- a/drivers/tty/serial/stm32-usart.c
+++ b/drivers/tty/serial/stm32-usart.c
@@ -226,10 +226,7 @@ static int stm32_usart_config_rs485(struct uart_port *port, struct ktermios *ter
stm32_usart_clr_bits(port, ofs->cr1, BIT(cfg->uart_enable_bit));
- if (port->rs485_rx_during_tx_gpio)
- gpiod_set_value_cansleep(port->rs485_rx_during_tx_gpio,
- !!(rs485conf->flags & SER_RS485_RX_DURING_TX));
- else
+ if (!port->rs485_rx_during_tx_gpio)
rs485conf->flags |= SER_RS485_RX_DURING_TX;
if (rs485conf->flags & SER_RS485_ENABLED) {
--
2.40.1
Hi,
recently, Debian updated their stable kernel from 6.1.38 to 6.1.52 which broke nftables ruleset loading on one of my machines with lots of "Operation not supported" errors. I've reported this to the Debian project (see link below) and Salvatore Bonaccorso and I identified "netfilter: nf_tables: disallow rule addition to bound chain via NFTA_RULE_CHAIN_ID" (0ebc1064e487) as the offending commit that introduced the regression. Salvatore also found that this issue affects the 5.10 stable tree as well (observed in 5.10.191), but he cannot reproduce it on 6.4.13 and 6.5.2.
The issue only occurs with some rulesets. While I can't trigger it with simple/minimal rulesets that I use on some machines, it does occur with a more complex ruleset that has been in use for months (if not years, for large parts of it). I'm attaching a somewhat stripped down version of the ruleset from the machine I originally observed this issue on. It's still not a small or simple ruleset, but I'll try to reduce it further when I have more time.
The error messages shown when trying to load the ruleset don't seem to be helpful. Just two simple examples:
Just to give two simple examples from the log when nftables fails to start:
/etc/nftables.conf:99:4-44: Error: Could not process rule: Operation not supported
tcp option maxseg size 1-500 counter drop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/etc/nftables.conf:308:4-27: Error: Could not process rule: Operation not supported
tcp dport sip-tls accept
^^^^^^^^^^^^^^^^^^^^^^^^
Since the issue only affects some stable trees, Salvatore thought it might be an incomplete backport that causes this.
If you need further information, please let me know.
Thanks and kind regards,
Timo
#regzbot introduced: 0ebc1064e487
#regzbot monitor: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1051592
Jann Horn reported a potential vma iterator issue in the failure path of
the vma_merge() code. After examining the interface, it seemed the best
course of action is to simply add an undo path in the unlikely case of
an error.
On examining the vma iterator issue, another issue was discovered that
would increase the memory usage during failure scenarios, so this is
addressed in patch 2.
Since it is unclear in the code, another patch adds comments to the
vma_merge() function on why dup_anon_vma() is safe in 'case 6'.
Changes since v1:
- Moved dup_anon_vma() call update from patch 1 to patch 2 - Thanks
Matthew Wilcox
- Changed comment on patch 1 - Thanks Andrew Morton
v1: https://lore.kernel.org/linux-mm/20230927160746.1928098-1-Liam.Howlett@orac…
Liam R. Howlett (3):
mmap: Fix vma_iterator in error path of vma_merge()
mmap: Fix error paths with dup_anon_vma()
mmap: Add clarifying comment to vma_merge() code
mm/mmap.c | 37 ++++++++++++++++++++++++++++---------
1 file changed, 28 insertions(+), 9 deletions(-)
--
2.40.1
There is a potential race condition in amdtee_close_session that may
cause use-after-free in amdtee_open_session. For instance, if a session
has refcount == 1, and one thread tries to free this session via:
kref_put(&sess->refcount, destroy_session);
the reference count will get decremented, and the next step would be to
call destroy_session(). However, if in another thread,
amdtee_open_session() is called before destroy_session() has completed
execution, alloc_session() may return 'sess' that will be freed up
later in destroy_session() leading to use-after-free in
amdtee_open_session.
To fix this issue, treat decrement of sess->refcount and invocation of
destroy_session() as a single critical section, so that it is executed
atomically.
Fixes: 757cc3e9ff1d ("tee: add AMD-TEE driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Rijo Thomas <Rijo-john.Thomas(a)amd.com>
---
drivers/tee/amdtee/core.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/tee/amdtee/core.c b/drivers/tee/amdtee/core.c
index 372d64756ed6..04cee03bec9d 100644
--- a/drivers/tee/amdtee/core.c
+++ b/drivers/tee/amdtee/core.c
@@ -217,14 +217,13 @@ static int copy_ta_binary(struct tee_context *ctx, void *ptr, void **ta,
return rc;
}
+/* mutex must be held by caller */
static void destroy_session(struct kref *ref)
{
struct amdtee_session *sess = container_of(ref, struct amdtee_session,
refcount);
- mutex_lock(&session_list_mutex);
list_del(&sess->list_node);
- mutex_unlock(&session_list_mutex);
kfree(sess);
}
@@ -272,7 +271,9 @@ int amdtee_open_session(struct tee_context *ctx,
if (arg->ret != TEEC_SUCCESS) {
pr_err("open_session failed %d\n", arg->ret);
handle_unload_ta(ta_handle);
+ mutex_lock(&session_list_mutex);
kref_put(&sess->refcount, destroy_session);
+ mutex_unlock(&session_list_mutex);
goto out;
}
@@ -290,7 +291,9 @@ int amdtee_open_session(struct tee_context *ctx,
pr_err("reached maximum session count %d\n", TEE_NUM_SESSIONS);
handle_close_session(ta_handle, session_info);
handle_unload_ta(ta_handle);
+ mutex_lock(&session_list_mutex);
kref_put(&sess->refcount, destroy_session);
+ mutex_unlock(&session_list_mutex);
rc = -ENOMEM;
goto out;
}
@@ -331,7 +334,9 @@ int amdtee_close_session(struct tee_context *ctx, u32 session)
handle_close_session(ta_handle, session_info);
handle_unload_ta(ta_handle);
+ mutex_lock(&session_list_mutex);
kref_put(&sess->refcount, destroy_session);
+ mutex_unlock(&session_list_mutex);
return 0;
}
--
2.25.1
During SMBus block data read process, we have seen high interrupt rate
because of TX_EMPTY irq status while waiting for block length byte (the
first data byte after the address phase). The interrupt handler does not
do anything because the internal state is kept as STATUS_WRITE_IN_PROGRESS.
Hence, we should disable TX_EMPTY irq until I2C DW receives first data
byte from I2C device, then re-enable it.
It takes 0.789 ms for host to receive data length from slave.
Without the patch, i2c_dw_isr is called 99 times by TX_EMPTY interrupt.
And it is none after applying the patch.
Cc: stable(a)vger.kernel.org
Signed-off-by: Chuong Tran <chuong(a)os.amperecomputing.com>
Signed-off-by: Tam Nguyen <tamnguyenchi(a)os.amperecomputing.com>
---
drivers/i2c/busses/i2c-designware-master.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
index 55ea91a63382..2152b1f9b27c 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -462,6 +462,13 @@ i2c_dw_xfer_msg(struct dw_i2c_dev *dev)
if (buf_len > 0 || flags & I2C_M_RECV_LEN) {
/* more bytes to be written */
dev->status |= STATUS_WRITE_IN_PROGRESS;
+ /*
+ * In I2C_FUNC_SMBUS_BLOCK_DATA case, there is no data
+ * to send before receiving data length from slave.
+ * Disable TX_EMPTY while waiting for data length byte
+ */
+ if (flags & I2C_M_RECV_LEN)
+ intr_mask &= ~DW_IC_INTR_TX_EMPTY;
break;
} else
dev->status &= ~STATUS_WRITE_IN_PROGRESS;
@@ -485,6 +492,7 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
{
struct i2c_msg *msgs = dev->msgs;
u32 flags = msgs[dev->msg_read_idx].flags;
+ u32 intr_mask;
/*
* Adjust the buffer length and mask the flag
@@ -495,6 +503,11 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
msgs[dev->msg_read_idx].len = len;
msgs[dev->msg_read_idx].flags &= ~I2C_M_RECV_LEN;
+ /* Re-enable TX_EMPTY interrupt. */
+ regmap_read(dev->map, DW_IC_INTR_MASK, &intr_mask);
+ intr_mask |= DW_IC_INTR_TX_EMPTY;
+ regmap_write(dev->map, DW_IC_INTR_MASK, intr_mask);
+
return len;
}
--
2.25.1
svm_leave_nested() similar to a nested VM exit, get the vCPU out of nested
mode and thus should end the local inhibition of AVIC on this vCPU.
Failure to do so, can lead to hangs on guest reboot.
Raise the KVM_REQ_APICV_UPDATE request to refresh the AVIC state of the
current vCPU in this case.
Fixes: f44509f849fe ("KVM: x86: SVM: allow AVIC to co-exist with a nested guest running")
Cc: stable(a)vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
---
arch/x86/kvm/svm/nested.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index dd496c9e5f91f28..3fea8c47679e689 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1253,6 +1253,9 @@ void svm_leave_nested(struct kvm_vcpu *vcpu)
nested_svm_uninit_mmu_context(vcpu);
vmcb_mark_all_dirty(svm->vmcb);
+
+ if (kvm_apicv_activated(vcpu->kvm))
+ kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
}
kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
--
2.26.3
In later revisions of AMD's APM, there is a new 'incomplete IPI' exit code:
"Invalid IPI Vector - The vector for the specified IPI was set to an
illegal value (VEC < 16)"
Note that tests on Zen2 machine show that this VM exit doesn't happen and
instead AVIC just does nothing.
Add support for this exit code by doing nothing, instead of filling
the kernel log with errors.
Also replace an unthrottled 'pr_err()' if another unknown incomplete
IPI exit happens with vcpu_unimpl()
(e.g in case AMD adds yet another 'Invalid IPI' exit reason)
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
---
arch/x86/include/asm/svm.h | 1 +
arch/x86/kvm/svm/avic.c | 5 ++++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 19bf955b67e0da0..3ac0ffc4f3e202b 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -268,6 +268,7 @@ enum avic_ipi_failure_cause {
AVIC_IPI_FAILURE_TARGET_NOT_RUNNING,
AVIC_IPI_FAILURE_INVALID_TARGET,
AVIC_IPI_FAILURE_INVALID_BACKING_PAGE,
+ AVIC_IPI_FAILURE_INVALID_IPI_VECTOR,
};
#define AVIC_PHYSICAL_MAX_INDEX_MASK GENMASK_ULL(8, 0)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 2092db892d7d052..4b74ea91f4e6bb6 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -529,8 +529,11 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
case AVIC_IPI_FAILURE_INVALID_BACKING_PAGE:
WARN_ONCE(1, "Invalid backing page\n");
break;
+ case AVIC_IPI_FAILURE_INVALID_IPI_VECTOR:
+ /* Invalid IPI with vector < 16 */
+ break;
default:
- pr_err("Unknown IPI interception\n");
+ vcpu_unimpl(vcpu, "Unknown avic incomplete IPI interception\n");
}
return 1;
--
2.26.3
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: c6c2adcba50c2622ed25ba5d5e7f05f584711358
Gitweb: https://git.kernel.org/tip/c6c2adcba50c2622ed25ba5d5e7f05f584711358
Author: Haitao Huang <haitao.huang(a)linux.intel.com>
AuthorDate: Thu, 27 Jul 2023 22:10:24 -07:00
Committer: Dave Hansen <dave.hansen(a)linux.intel.com>
CommitterDate: Thu, 28 Sep 2023 16:16:40 -07:00
x86/sgx: Resolves SECS reclaim vs. page fault for EAUG race
The SGX EPC reclaimer (ksgxd) may reclaim the SECS EPC page for an
enclave and set secs.epc_page to NULL. The SECS page is used for EAUG
and ELDU in the SGX page fault handler. However, the NULL check for
secs.epc_page is only done for ELDU, not EAUG before being used.
Fix this by doing the same NULL check and reloading of the SECS page as
needed for both EAUG and ELDU.
The SECS page holds global enclave metadata. It can only be reclaimed
when there are no other enclave pages remaining. At that point,
virtually nothing can be done with the enclave until the SECS page is
paged back in.
An enclave can not run nor generate page faults without a resident SECS
page. But it is still possible for a #PF for a non-SECS page to race
with paging out the SECS page: when the last resident non-SECS page A
triggers a #PF in a non-resident page B, and then page A and the SECS
both are paged out before the #PF on B is handled.
Hitting this bug requires that race triggered with a #PF for EAUG.
Following is a trace when it happens.
BUG: kernel NULL pointer dereference, address: 0000000000000000
RIP: 0010:sgx_encl_eaug_page+0xc7/0x210
Call Trace:
? __kmem_cache_alloc_node+0x16a/0x440
? xa_load+0x6e/0xa0
sgx_vma_fault+0x119/0x230
__do_fault+0x36/0x140
do_fault+0x12f/0x400
__handle_mm_fault+0x728/0x1110
handle_mm_fault+0x105/0x310
do_user_addr_fault+0x1ee/0x750
? __this_cpu_preempt_check+0x13/0x20
exc_page_fault+0x76/0x180
asm_exc_page_fault+0x27/0x30
Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
Signed-off-by: Haitao Huang <haitao.huang(a)linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko(a)kernel.org>
Reviewed-by: Kai Huang <kai.huang(a)intel.com>
Acked-by: Reinette Chatre <reinette.chatre(a)intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20230728051024.33063-1-haitao.huang%40linux.int…
---
arch/x86/kernel/cpu/sgx/encl.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 91fa70e..279148e 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -235,6 +235,21 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
return epc_page;
}
+/*
+ * Ensure the SECS page is not swapped out. Must be called with encl->lock
+ * to protect the enclave states including SECS and ensure the SECS page is
+ * not swapped out again while being used.
+ */
+static struct sgx_epc_page *sgx_encl_load_secs(struct sgx_encl *encl)
+{
+ struct sgx_epc_page *epc_page = encl->secs.epc_page;
+
+ if (!epc_page)
+ epc_page = sgx_encl_eldu(&encl->secs, NULL);
+
+ return epc_page;
+}
+
static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
struct sgx_encl_page *entry)
{
@@ -248,11 +263,9 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
return entry;
}
- if (!(encl->secs.epc_page)) {
- epc_page = sgx_encl_eldu(&encl->secs, NULL);
- if (IS_ERR(epc_page))
- return ERR_CAST(epc_page);
- }
+ epc_page = sgx_encl_load_secs(encl);
+ if (IS_ERR(epc_page))
+ return ERR_CAST(epc_page);
epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
if (IS_ERR(epc_page))
@@ -339,6 +352,13 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
mutex_lock(&encl->lock);
+ epc_page = sgx_encl_load_secs(encl);
+ if (IS_ERR(epc_page)) {
+ if (PTR_ERR(epc_page) == -EBUSY)
+ vmret = VM_FAULT_NOPAGE;
+ goto err_out_unlock;
+ }
+
epc_page = sgx_alloc_epc_page(encl_page, false);
if (IS_ERR(epc_page)) {
if (PTR_ERR(epc_page) == -EBUSY)
From: Lino Sanfilippo <l.sanfilippo(a)kunbus.com>
Among other things uart_sanitize_serial_rs485() tests the sanity of the RTS
settings in a RS485 configuration that has been passed by userspace.
If RTS-on-send and RTS-after-send are both set or unset the configuration
is adjusted and RTS-after-send is disabled and RTS-on-send enabled.
This however makes only sense if both RTS modes are actually supported by
the driver.
With commit be2e2cb1d281 ("serial: Sanitize rs485_struct") the code does
take the driver support into account but only checks if one of both RTS
modes are supported. This may lead to the errorneous result of RTS-on-send
being set even if only RTS-after-send is supported.
Fix this by changing the implemented logic: First clear all unsupported
flags in the RS485 configuration, then adjust an invalid RTS setting by
taking into account which RTS mode is supported.
Cc: stable(a)vger.kernel.org
Fixes: be2e2cb1d281 ("serial: Sanitize rs485_struct")
Signed-off-by: Lino Sanfilippo <l.sanfilippo(a)kunbus.com>
---
drivers/tty/serial/serial_core.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index 697c36dc7ec8..f4feebf8200f 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -1370,19 +1370,27 @@ static void uart_sanitize_serial_rs485(struct uart_port *port, struct serial_rs4
return;
}
+ rs485->flags &= supported_flags;
+
/* Pick sane settings if the user hasn't */
- if ((supported_flags & (SER_RS485_RTS_ON_SEND|SER_RS485_RTS_AFTER_SEND)) &&
- !(rs485->flags & SER_RS485_RTS_ON_SEND) ==
+ if (!(rs485->flags & SER_RS485_RTS_ON_SEND) ==
!(rs485->flags & SER_RS485_RTS_AFTER_SEND)) {
- dev_warn_ratelimited(port->dev,
- "%s (%d): invalid RTS setting, using RTS_ON_SEND instead\n",
- port->name, port->line);
- rs485->flags |= SER_RS485_RTS_ON_SEND;
- rs485->flags &= ~SER_RS485_RTS_AFTER_SEND;
- supported_flags |= SER_RS485_RTS_ON_SEND|SER_RS485_RTS_AFTER_SEND;
- }
+ if (supported_flags & SER_RS485_RTS_ON_SEND) {
+ rs485->flags |= SER_RS485_RTS_ON_SEND;
+ rs485->flags &= ~SER_RS485_RTS_AFTER_SEND;
- rs485->flags &= supported_flags;
+ dev_warn_ratelimited(port->dev,
+ "%s (%d): invalid RTS setting, using RTS_ON_SEND instead\n",
+ port->name, port->line);
+ } else {
+ rs485->flags |= SER_RS485_RTS_AFTER_SEND;
+ rs485->flags &= ~SER_RS485_RTS_ON_SEND;
+
+ dev_warn_ratelimited(port->dev,
+ "%s (%d): invalid RTS setting, using RTS_AFTER_SEND instead\n",
+ port->name, port->line);
+ }
+ }
uart_sanitize_serial_rs485_delays(port, rs485);
--
2.40.1
From: Lino Sanfilippo <l.sanfilippo(a)kunbus.com>
If the RS485 feature RX-during-TX is supported by means of a GPIO set the
according supported flag. Otherwise setting this feature from userspace may
not be possible, since in uart_sanitize_serial_rs485() the passed RS485
configuration is matched against the supported features and unsupported
settings are thereby removed and thus take no effect.
Cc: stable(a)vger.kernel.org
Fixes: 163f080eb717 ("serial: core: Add option to output RS485 RX_DURING_TX state via GPIO")
Signed-off-by: Lino Sanfilippo <l.sanfilippo(a)kunbus.com>
---
drivers/tty/serial/serial_core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index ef0500be3553..697c36dc7ec8 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -3622,6 +3622,8 @@ int uart_get_rs485_mode(struct uart_port *port)
port->rs485_rx_during_tx_gpio = NULL;
return dev_err_probe(dev, ret, "Cannot get rs485-rx-during-tx-gpios\n");
}
+ if (port->rs485_rx_during_tx_gpio)
+ port->rs485_supported.flags |= SER_RS485_RX_DURING_TX;
return 0;
}
--
2.40.1
During RCU-boost testing with the TREE03 rcutorture config, I found that
after a few hours, the machine locks up.
On tracing, I found that there is a live lock happening between 2 CPUs.
One CPU has an RT task running, while another CPU is being offlined
which also has an RT task running. During this offlining, all threads
are migrated. The migration thread is repeatedly scheduled to migrate
actively running tasks on the CPU being offlined. This results in a live
lock because select_fallback_rq() keeps picking the CPU that an RT task
is already running on only to get pushed back to the CPU being offlined.
It is anyway pointless to pick CPUs for pushing tasks to if they are
being offlined only to get migrated away to somewhere else. This could
also add unwanted latency to this task.
Fix these issues by not selecting CPUs in RT if they are not 'active'
for scheduling, using the cpu_active_mask. Other parts in core.c already
use cpu_active_mask to prevent tasks from being put on CPUs going
offline.
Tested-by: Paul E. McKenney <paulmck(a)kernel.org>
Cc: stable(a)vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
kernel/sched/cpupri.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index a286e726eb4b..42c40cfdf836 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -101,6 +101,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
if (lowest_mask) {
cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+ cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
/*
* We have to ensure that we have at least one bit
--
2.42.0.515.g380fc7ccd1-goog